Case study

Data Cleaning in the Evaluation of a MultiSite Intervention Project

Authors: Gavin Welch, Friedrich von Recklinghausen, Andreas Taenzer, Lucy Savitz, Lisa Weiss


Context: The High Value Healthcare Collaborative (HVHC) sepsis project was a two-year, multi-site project in which Member health care delivery systems worked on improving sepsis care using a dissemination and implementation framework designed by HVHC. As part of the project evaluation, participating Members provided five data submissions over the project period. Members created data files using a uniform specification, but the data sources and methods used to create the data sets differed. Extensive data cleaning was necessary to produce a data set usable for the evaluation analysis.

Case Description: HVHC was the coordinating center for the project and received and cleaned all data submissions. Each submission underwent three successively more detailed levels of checking by HVHC. The most detailed level evaluated validity by comparing values within each Member over time and between Members. For a subset of episodes, Member-submitted data were compared to matched Medicare claims data.
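
As an illustration only, the within-Member and between-Member comparisons described above might be sketched as follows. This is not the procedure HVHC used: the function names, the 50% relative tolerance, and the mean length-of-stay figures are all hypothetical, chosen only to show the shape of such a check.

```python
from statistics import median


def flag_within_member(values, rel_tol=0.5):
    """Return indices of submissions whose value differs from the
    Member's own median by more than rel_tol (a fraction)."""
    center = median(values)
    if center == 0:
        return [i for i, v in enumerate(values) if v != 0]
    return [i for i, v in enumerate(values)
            if abs(v - center) / abs(center) > rel_tol]


def flag_between_members(values_by_member, rel_tol=0.5):
    """Return Members whose median differs from the median of all
    Member medians by more than rel_tol (a fraction)."""
    medians = {m: median(v) for m, v in values_by_member.items()}
    overall = median(medians.values())
    if overall == 0:
        return [m for m, v in medians.items() if v != 0]
    return [m for m, v in medians.items()
            if abs(v - overall) / abs(overall) > rel_tol]


# Hypothetical mean length of stay (days) per Member, one value per submission.
mean_los = {
    "A": [6.1, 5.9, 6.3, 6.0, 6.2],
    "B": [5.8, 58.0, 6.1, 5.9, 6.0],   # second submission looks like a unit or decimal error
    "C": [14.0, 13.5, 14.2, 13.8, 14.1],  # consistent over time, but unlike other Members
}

for member, values in mean_los.items():
    print(member, "suspect submissions:", flag_within_member(values))
print("Members differing from peers:", flag_between_members(mean_los))
```

A flag from either check is a prompt to ask the Member how the value was produced, not proof of an error; a Member that is internally consistent but unlike its peers may simply be using a different definition.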

Findings: Inconsistencies in data submissions, particularly for length-of-stay variables, were common in early submissions and decreased with subsequent submissions. Multiple resubmissions were sometimes required to get clean data. Data checking also uncovered a systematic difference in the way Medicare and some Members defined an intensive care unit stay.

Conclusions: Data checking is critical for ensuring valid analytic results in projects using electronic health record data. It is important to budget sufficient resources for data checking. Interim data submissions and checks help find anomalies early. Data resubmissions should be checked, as fixes can introduce new errors. Communicating with those responsible for creating the data set provides critical information.

Keywords: data quality, routinely collected health data, electronic health record, data validity, data error, data completeness
