Curation-DB: Curating Scientific Data in LIFE

Introduction

LIFE is an epidemiological study that aims to discover causes of common disorders as well as therapy and diagnostic possibilities. It applies a large set (currently more than 400) of complex instruments, including different kinds of interviews, questionnaires, and technical examinations, to thousands of Leipzig inhabitants. Finding correlations in the data, e.g., between diseases on the one hand and a combination of life conditions on the other, requires high-quality data. Data errors reduce this quality, yet avoiding every error is nearly impossible. Therefore, the captured data needs to be validated routinely and revised (curated) whenever an error is found.

Methods

From the data perspective, we differentiate between two main types of data errors: syntax (or format) errors and semantic errors. Syntax errors mostly occur when data must be converted to another data type, e.g., from text to a number or from text to a date/time field. This is often the case when data is captured as text by the data input system but should be centrally managed and analyzed in a different format. The conversion therefore succeeds only if the input value is in the expected format. Data conversion is applied when the data is transferred from the data input systems to the central research database, which collects all captured data in an integrated and harmonized form. Corrupted data that cannot be converted to the target data type is replaced by a missing value (also called null value, nil, etc.). Defining a default value instead is not sufficient, since the default usually depends on the corresponding question or measurement input field and can distort analysis results when it is not taken into account. Moreover, defining a default for every question or input field would be too time-consuming.

Semantic errors are much harder to detect than syntax errors. Typically, they are semantically implausible outliers or are part of other artefacts, e.g., when the data of two input fields is mixed up. Currently, we leave the detection of semantic errors to an epidemiological quality analysis that is performed by several statisticians. Syntax errors, in contrast, can easily be detected technically; they are logged when they occur while data is transferred from the data input systems to the central research database.

For both types of errors, we designed and implemented a software application called Curation-DB that allows users to curate (adapt and change) data. In particular, the system lists the logged syntax errors that occurred during the nightly data conversion step. A user can adapt the current input value by specifying a new (target) value for a listed syntax problem. With this specification, the corresponding input value is replaced by the specified value before the next conversion step is started. The specification can be repeated for the same input value if the current specification does not solve the syntax problem. Semantic errors first need to be detected separately. A user can then specify value changes that replace an existing value with the newly specified one.

Results & Discussion

The Curation-DB application is already in use. Currently, selected quality managers routinely check the listed syntax errors. More than 2000 such errors have been curated so far. In the near future, we will extend the software to manage rules that validate research data semantically, so that obvious semantic errors can be detected automatically.
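The conversion and curation loop described under Methods can be illustrated with a minimal Python sketch. It is not the Curation-DB implementation: the function `convert_value`, the `CONVERTERS` table, the key format, and the field name `weight_kg` are all hypothetical names chosen for illustration. The sketch only shows the three behaviours the text describes: a value that cannot be converted becomes a missing value, the failure is logged, and a user-specified replacement is applied before the next conversion run.

```python
from datetime import datetime

# Hypothetical converters from captured text to the target data type.
CONVERTERS = {
    "number": float,
    "date": lambda text: datetime.strptime(text, "%Y-%m-%d").date(),
}


def convert_value(key, raw, target_type, curations, error_log):
    """Convert one captured text value; return None (missing) on failure.

    key        -- hypothetical (participant, field) identifier of the input value
    curations  -- dict mapping keys to user-specified replacement values
    error_log  -- list that collects syntax errors for later curation
    """
    # A curated value, if present, replaces the input value before conversion.
    text = curations.get(key, raw)
    try:
        return CONVERTERS[target_type](text.strip())
    except ValueError:
        # No default value is guessed: log the problem and store a missing value.
        error_log.append({"key": key, "value": text, "target_type": target_type})
        return None


# First nightly run: the decimal comma cannot be parsed as a number.
curations, error_log = {}, []
key = ("participant-1", "weight_kg")
first = convert_value(key, "72,5", "number", curations, error_log)

# A quality manager specifies a new value for the logged error.
curations[key] = "72.5"

# Next run: the curated value is converted instead of the original input.
next_log = []
second = convert_value(key, "72,5", "number", curations, next_log)
```

In this sketch the original input is never overwritten; the curation is stored separately and applied at conversion time, which is one possible way to allow the specification to be revised again if it still fails to convert.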

Health Atlas - Local Data Hub/Leipzig

Registered Repository