
How incremental ETL can help address data quality issues in clinical research and beyond

Tags: clinical data, healthcare

My team at the First Line Healthcare Practice has conducted numerous clinical data extraction, transformation, and loading (ETL) projects over the years, pulling data from electronic health records (EHRs) and other systems. We do this work in the context of clinical research, operational analytics, and other purposes. It has always been a revelation for our clients, healthcare institutions and research organizations alike, how many quality issues their clinical data reveals once it is extracted from the source EHR systems.

Of course, data quality should be viewed in the context of specific usage scenarios. What may be sufficient for one system to function could be entirely inadequate for other systems and workflows.

Common Data Models (CDMs), such as OHDSI OMOP, which is used in clinical research and clinical trials, have strict quality rules and require the data to be well structured. Since the data extracted to OMOP (and other CDM) repositories must comply with these strict rules, the data in these repositories is often of better quality than in the source systems.
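
To make this concrete, here is a minimal sketch of the kind of check a CDM repository typically relies on: every clinical fact should map to a standard vocabulary concept, and unmapped records (a concept id of 0 in OMOP) are flagged for review. The field names follow the OMOP condition_occurrence table; the sample rows and the check itself are illustrative, not part of any official tooling.

```python
# Minimal sketch of an OMOP-style mapping-completeness check.
# Field names follow the OMOP CDM condition_occurrence table;
# the sample rows and concept ids are illustrative.

condition_occurrence = [
    {"condition_occurrence_id": 1, "person_id": 101,
     "condition_concept_id": 201826,        # mapped to a standard concept (illustrative id)
     "condition_source_value": "E11.9"},
    {"condition_occurrence_id": 2, "person_id": 102,
     "condition_concept_id": 0,             # unmapped: only free text came from the source
     "condition_source_value": "pt reports 'sugar problems'"},
]

def unmapped_conditions(rows):
    """Return rows whose source value failed to map to a standard concept."""
    return [r for r in rows if r["condition_concept_id"] == 0]

for row in unmapped_conditions(condition_occurrence):
    print(f"Review needed: person {row['person_id']} -> "
          f"'{row['condition_source_value']}'")
```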

Electronic clinical quality measures (eCQMs) defined by the Centers for Medicare and Medicaid Services (CMS) are intended to measure quality by tracking various evidence-based elements of structure, process, and outcomes using clinical data recorded in EHR systems. Each year, CMS updates the electronic specifications of the eCQMs. Many of these version changes are related to terminology design and evolution, while others reflect actual changes in the evidence. However, eCQMs primarily focus on the quality of care, not the quality of data; these are related but not the same topics.

There are numerous reasons why clinical data may exhibit poor quality. In this article, you will learn about the five most important ones, based on more than 20 years of experience performing systems integration and building data pipelines. While we focus on data extracted from EHR systems, the same quality issues are often present in other systems as well.

Reason 1 – Organic growth of the EHR data

Organic data growth means that information accumulates ad hoc over time, without overall data governance.

Data in EHR systems is updated through a variety of means. Users update it manually; the EHR system updates itself in response to specific clinical events; external interfaces and connected devices also feed data into the system. Without continuous monitoring for consistency, uniformity, reconciliation, normalization, and harmonization, data quality progressively deteriorates.

Here are a few examples of the quality issues that reflect the organic growth of the data:

  • Lack of consistency in identifiers and terminology concepts. 
  • Uncoded, unstructured textual values alongside coded values for the same types of data elements.
  • Semantic inconsistencies, such as in lab results. For example, one lab may return “incomplete” results while another returns “partial” results; both have the same meaning but are expressed differently (see the normalization sketch below).
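
As an illustration of the last point, harmonizing such semantically equivalent values usually comes down to maintaining an explicit normalization table in the transformation layer. The sketch below is a minimal example; the value lists and canonical targets are assumptions, not a standard vocabulary.

```python
# Hypothetical normalization of semantically equivalent lab result statuses.
# The source values and the canonical targets are illustrative assumptions.

RESULT_STATUS_SYNONYMS = {
    "incomplete": "partial",
    "partial": "partial",
    "prelim": "preliminary",
    "preliminary": "preliminary",
    "final report": "final",
    "final": "final",
}

def normalize_result_status(raw_value: str) -> str:
    """Map a lab-specific status string to a single canonical value."""
    key = raw_value.strip().lower()
    return RESULT_STATUS_SYNONYMS.get(key, "unknown")

assert normalize_result_status("Incomplete") == "partial"
assert normalize_result_status(" PARTIAL ") == "partial"
```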

Reason 2 – Systems evolve over time

Patient data is often collected over many years, maybe even a lifetime. Over time, the functionality of the systems evolves, and so do the data structure, the vocabulary used, and the mechanisms to collect the data. This often results in inconsistencies in the same data generated at different points of care and at different times.

Changing clinical guidelines and best practices, updates to coded medical terminologies, and new functionality and workflows all contribute to inconsistencies and degradation of clinical data quality.

What is especially important is that such longitudinal inconsistencies may occur within the same patient record. Uncoded allergy reactions of the past may be coded today. The system may collect different and additional metadata and properties as the functionality evolves. Such data variability, especially within the same patient record, makes it difficult to define reliable mapping rules for data extraction to external data sets.
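
One way to cope with this variability is to make the mapping rules themselves date-aware, so records created before a functionality change are handled differently from records created after it. The sketch below illustrates the idea; the cutoff date, field names, and lookup codes are purely hypothetical.

```python
from datetime import date

# Hypothetical cutoff: before this date the EHR stored allergy reactions as free text;
# after it, reactions were coded. Both eras can coexist in the same patient record.
REACTION_CODING_CUTOFF = date(2015, 6, 1)

# Illustrative lookup for legacy free-text entries. A real pipeline would map these
# to a standard terminology such as SNOMED CT; the codes here are placeholders.
LEGACY_REACTION_LOOKUP = {
    "hives": "REACT-001",
    "rash": "REACT-002",
}

def map_allergy_reaction(record):
    """Pick a mapping rule based on when the entry was recorded."""
    if record["recorded_on"] >= REACTION_CODING_CUTOFF:
        return record.get("reaction_code")  # already coded by the EHR
    # Legacy era: attempt to code the free-text reaction.
    return LEGACY_REACTION_LOOKUP.get(record.get("reaction_text", "").lower())

print(map_allergy_reaction({"recorded_on": date(2012, 3, 4), "reaction_text": "Hives"}))
print(map_allergy_reaction({"recorded_on": date(2021, 9, 9), "reaction_code": "REACT-001"}))
```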

Reason 3 – Variability in EHR implementation and configuration within the same institution 

Most larger EHR systems contain modules and components serving different departments and units in a healthcare organization. Different modules are used for pediatric care, ICUs, general medicine, surgical units, and other areas.

EHR implementation and deployment is usually an effort of many analysts, consultants, and IT specialists working independently for different organizational units. These teams build and configure the EHR system from the start of the implementation and over time, and they don’t always coordinate and align their changes.

This results in duplication and contradictions. Such variability is especially significant when multiple heterogeneous systems are used within the same organization. The data extracted and aggregated from various systems nearly always exhibits substantial quality issues.

Reason 4 – EHR data is inherently not designed for interoperability

How their data looks outside of their environments is not the primary concern for EHR vendors; their main goal is to ensure their system operates correctly and efficiently. They (often reluctantly) expose their data to the outside world via standard interfaces such as FHIR and HL7, or via reporting infrastructures. The data extracted from the same EHR at two different institutions using the same standard interfaces exhibits significant variability.

At a recent AMIA Annual Symposium, we heard several similar presentations focused on cross-institutional decision support using the CDS Hooks standard. All teams reported difficulties extracting data from the same EHR system at different hospitals using the same FHIR API endpoints. These identical endpoints returned somewhat different results at each participating institution.

Reason 5 – Usage of interfaces leads to data transformation and quality loss

The source data is interpreted and transformed whenever we invoke an outbound interface, regardless of the protocol used: HL7, FHIR, CDA, or others. For example, a FHIR resource (demographics, medications, observations) has a distinct structure for its payload. The source data in the EHR must be transformed to comply with this structure, which inevitably leads to potential loss of data and metadata and to other quality issues.
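
A small sketch illustrates the point. Suppose the EHR stores a lab result with internal workflow and device metadata (the field names below are hypothetical); a minimal mapping to a FHIR Observation keeps the clinically essential elements, while the remaining metadata is lost unless the implementer maps it to additional elements or custom extensions.

```python
# Hypothetical internal EHR row for a lab result. The field names are assumptions.
ehr_row = {
    "result_id": "LAB-998877",
    "patient_id": "12345",
    "loinc_code": "718-7",              # hemoglobin (illustrative)
    "value": 13.2,
    "unit": "g/dL",
    "verified_by_user_id": "jsmith",    # workflow metadata
    "instrument_id": "ANALYZER-07",     # device-level metadata
    "local_result_flag": "RPT-CORR",    # site-specific correction flag
}

# A minimal mapping to the FHIR Observation structure keeps the essential fields...
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": ehr_row["loinc_code"]}]},
    "subject": {"reference": f"Patient/{ehr_row['patient_id']}"},
    "valueQuantity": {"value": ehr_row["value"], "unit": ehr_row["unit"]},
}

# ...while the workflow and device metadata is not carried over by this mapping and is
# lost unless the implementer maps it to additional elements or custom extensions.
dropped = {k: v for k, v in ehr_row.items()
           if k in ("verified_by_user_id", "instrument_id", "local_result_flag")}
print(dropped)
```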

Moreover, this data transformation and interpretation happens twice during ETL processes: first, when the data is exposed through the interfaces, and second, when it is mapped to the destination data format and saved in the destination repository.

Is there a solution?

In the clinical research world, data latency is, in most cases, not a critical factor. The primary purpose of CDM repositories is to facilitate eligibility queries, that is, finding patients eligible for clinical studies or trials. The data ETL is performed infrequently, often quarterly. If quality issues are observed, they are addressed across the entire data set, and the mapping rules are updated accordingly.

There are several downsides to this approach. First, every time an ETL is performed, the quality analysis and remediation must be conducted all over again, for every patient in the destination repository. This is an extensive and lengthy effort. Second, by the time the data has been committed to the destination repository, meaningful content and metadata may have been irreversibly lost or distorted, and no quality remediation effort can restore them. And finally, data that is several months old is hardly useful for anything other than eligibility queries.

Of course, in an ideal scenario, the quality issues should be discovered and addressed in the source data. This is rarely an available option: tinkering with operational EHR data is rarely feasible and may lead to functional breakdowns of EHR workflows.

One possible compromise is implementing a partial (near) real-time ETL from the source system and performing localized, incremental quality analysis and remediation during data transformation. With this approach, mapping rules are adjusted and refined on individual patient records when issues are identified. The mapping rules may even differ longitudinally across the timeline of individual patients or the population of patients. Partial ETL can be efficiently executed with incremental refinements without the cost and complexity of refactoring the entire data set.  
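
A minimal sketch of what one such incremental step might look like is shown below. It assumes a hypothetical change feed from the source system, a per-record quality check, and a quarantine list for records that fail; none of the names refer to a specific product or standard.

```python
from datetime import datetime, timedelta

# Minimal sketch of an incremental ETL step with localized quality remediation.
# fetch_changed_records, map_record, and the quarantine list are hypothetical names.

def fetch_changed_records(since):
    """Stand-in for a change-data-capture query against the source EHR."""
    return [
        {"patient_id": "A-1", "element": "lab_result_status", "value": "Incomplete"},
        {"patient_id": "B-2", "element": "lab_result_status", "value": ""},
    ]

def quality_issues(record):
    """Per-record quality check; returns a list of problems (empty = clean)."""
    issues = []
    if not record["value"].strip():
        issues.append("empty value")
    return issues

def map_record(record):
    """Apply the current mapping rules to a single clean record."""
    return {**record, "value": record["value"].strip().lower()}

def incremental_etl(last_run):
    clean, quarantined = [], []
    for record in fetch_changed_records(since=last_run):
        problems = quality_issues(record)
        if problems:
            # Only this record (and its patient) needs remediation, not the whole data set.
            quarantined.append({**record, "problems": problems})
        else:
            clean.append(map_record(record))
    return clean, quarantined

clean, quarantined = incremental_etl(datetime.now() - timedelta(minutes=15))
print(f"loaded {len(clean)} records, quarantined {len(quarantined)}")
```

The point of such a design is that remediation is scoped to the quarantined records and their patients, so refining a mapping rule does not require reprocessing the entire repository.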

This seems like an obvious, common-sense approach, yet, to the best of our knowledge, it is rarely done with popular CDMs. One of the reasons is that the mapping tools and infrastructures used by the research communities are not sophisticated enough to support the rich data pipeline workflows and the throughput necessary for real-time and partial ETLs. These ETLs are often implemented as SQL scripts that map one relational database to another, with very limited capabilities.

A modern commercial integration engine offers many features that can help with incremental and partial ETL, including visual data mapping, parallel processing, scheduling, quarantining of failed records and patients, monitoring, and alerting. We have been using HealthConnect from InterSystems to build a flexible data pipeline and have implemented a comprehensive quality framework with excellent results. Other integration and workflow orchestration engines provide similar capabilities.

There are many other compelling benefits of partial, near-real-time ETLs to CDMs. At the core, they produce high-quality, clean data repositories in a standard format with minimal latency. If multiple institutions implement such ETLs to a single or federated data store in the same CDM representation, cross-institutional decision support can be performed without integrating each EHR system individually. Organizations can perform operational analytics, train machine learning algorithms, conduct drug surveillance, and support several other valuable functions, all without the need to integrate with EHR systems individually for each of these functions.

Anatoly Postilnik

Head of the Healthcare IT Practice at First Line Software

Anatoly has more than 30 years of technology, product development, and solutions delivery experience, including over 20 years in the Healthcare Industry. Anatoly resides in Boston, MA. He is an avid hiker and has reached numerous mountain tops in Europe, Eastern and Western United States, and Asia.

