Data quality is a major issue in scientific, government and business settings. In some instances, significant resources are devoted to extremely labor-intensive, and therefore expensive, strategies designed to detect and correct data quality problems. Data quality problems are exacerbated by at least three factors:
- The ubiquity of data, not only in traditional settings such as scientific and medical research, but in such new contexts as homeland security and electronic commerce;
- The dramatically increasing scale and complexity of data, and the increasing dependence on "automatic" means of data collection or assembly;
- The burgeoning use of data for purposes other than those for which they were originally intended.
Despite widespread recognition that data quality is a problem, to date there is no clear evidence that the strategies to improve it work, or that they are necessary, let alone that they are cost-effective. Nor is it clear whether, given how problem- and domain-specific the issues are, general approaches and tools are feasible (Karr et al., 2004). Indeed, some have asked whether there is or can be a science of data quality. To help answer this question, DQRI has submitted a prospectus for exploratory research aimed at assessing the feasibility of, and understanding the issues associated with, constructing a decision-theoretic framework for data quality.
The research has three objectives:
- Initiate construction of a decision-theoretic framework for addressing questions such as: To what extent do clean-up strategies actually improve data quality? Are there lesser or alternative strategies, including modification of the techniques used to perform analyses, that are as effective as a given strategy but less costly?
- Apply the framework to specific testbed databases of clinical data, answering questions such as those listed above.
- Consider additional but similar databases, to gain an initial understanding of the generalizability of the framework.
The research team, drawn from the National Institute of Statistical Sciences (NISS) and the Data Quality Research Institute (DQRI), includes expertise in statistics, computation and data quality, as well as domain knowledge pertinent to the testbed databases.
Because of the exploratory nature of the research, we expect its main product to be a deeper understanding of the issues, which is not attainable without research and experimentation. In quite a real sense, the success of the project will be measured by the extent to which the problems at the end of the research differ from those at the beginning. Among the longer-run questions on which the research sheds light, but which it does not answer, are whether the models developed are scientifically generalizable, and what abstractions and methods are appropriate for incorporating domain knowledge into data quality analyses.
The research will have four major components. The first is to identify a specific data context and testbed database to which one or more clean-up processes have been applied. We plan to use clinical data, which exemplify a number of important characteristics. Second, we will focus on one set of data quality strategies. We plan this to be clean-up processes intended to detect and correct errors in databases. Third, we will address evaluation, using measures of the effectiveness of data cleaning strategies. Specifically, building on NISS research on data quality and data confidentiality, we will employ measures of effectiveness derived from data quality metrics that measure the extent to which statistical inferences from the cleaned-up data differ from those from the data prior to clean-up. The rationale is that, independent of cost considerations, if the same conclusions would have been reached without the clean-up, then application of the strategies did not add value to the data. Fourth, we will develop predictive statistical models that can be used to select optimal clean-up strategies. Many clean-up strategies are in some sense parameterized by a level of intensity, which is of course a decision variable. A full cost-effectiveness formulation of the problem of selecting an optimal level is beyond the scope of this exploratory research, but we do plan to address some of the crucial underlying issues. The most important of these is the development of statistical models for data quality as a function of intensity of effort. We will also examine how these could be combined with cost models in order to optimize the level of effort.
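To make the third component concrete, the following is a minimal sketch of one way to measure how far an inference moves when a clean-up step is applied. It is an illustration only, not the metric used in the NISS research: the choice of the sample mean as the inference of interest, the standard-error scaling, the range-check clean-up rule, the function names and the blood pressure readings are all assumptions introduced here.

```python
import math
import statistics
from typing import List, Sequence


def standard_error(values: Sequence[float]) -> float:
    """Standard error of the sample mean."""
    return statistics.stdev(values) / math.sqrt(len(values))


def inference_shift(raw: Sequence[float], cleaned: Sequence[float]) -> float:
    """Change in the estimated mean caused by clean-up, expressed in units
    of the raw data's standard error (an assumed, illustrative metric)."""
    difference = abs(statistics.fmean(cleaned) - statistics.fmean(raw))
    return difference / standard_error(raw)


def range_check(values: Sequence[float], lower: float, upper: float) -> List[float]:
    """Hypothetical clean-up rule: drop values outside a plausible range."""
    return [v for v in values if lower <= v <= upper]


# Made-up systolic blood pressure readings; 1200 mimics a data-entry error.
raw = [118.0, 122.0, 130.0, 125.0, 1200.0, 119.0, 127.0, 121.0]
cleaned = range_check(raw, lower=50.0, upper=300.0)

shift = inference_shift(raw, cleaned)
print(f"records removed: {len(raw) - len(cleaned)}")
print(f"shift in estimated mean: {shift:.2f} raw standard errors")
```

Under this sketch, a shift near zero would suggest that the clean-up did not change the conclusion drawn from the data, which is the situation the rationale above describes as adding no value. A real evaluation would presumably compare richer inferences (regression coefficients, test decisions) rather than a single mean.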