How Incremental Data Extraction into Common Data Models Can Translate EHR into Real-World Evidence

Health records become research evidence

One of healthcare's persistent challenges is the lack of interoperability and standardization of clinical data. Hospitals and other healthcare organizations store patient information in many different ways, which makes it difficult for researchers and clinicians to access that information for research purposes.

Therefore, organizations participating in research networks commit to using specific Common Data Models (CDMs). OHDSI (Observational Health Data Sciences and Informatics) led the industry in evolving and adopting the OMOP Common Data Model.

CDMs emerged to support clinical research. The primary purpose of extracting data into a CDM is to run eligibility queries that find patients who qualify for clinical trials or studies. Research built on this kind of routinely collected data is known as real-world evidence research.

A broader goal of Common Data Models is to expand the pool of patients who may be eligible for studies across multiple institutions. Research partners can include academic medical centers, private hospitals, and other facilities with access to large volumes of patient data. All of this helps generate reliable evidence and make it accessible. CDMs can also be helpful for drug discovery, research collaboration, and other areas that require sharing patient information between institutions.
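To make the idea of a cross-institution eligibility query concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two-table schema, the column names, and the sample patients are all invented for illustration and are far simpler than any real CDM; the point is only that one query can run unchanged at every site that shares the same structure.

```python
import sqlite3

# Hypothetical, heavily simplified CDM-style schema (not the real OMOP tables).
SCHEMA = """
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE diagnosis (person_id INTEGER, diagnosis_code TEXT);
"""

# The same query text is used at every site because the structure is shared.
ELIGIBILITY_QUERY = """
SELECT COUNT(DISTINCT p.person_id)
FROM person AS p
JOIN diagnosis AS d ON d.person_id = p.person_id
WHERE d.diagnosis_code = ? AND p.year_of_birth <= ?
"""

def build_site(persons, diagnoses):
    """Create an in-memory database standing in for one institution's CDM."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO person VALUES (?, ?)", persons)
    conn.executemany("INSERT INTO diagnosis VALUES (?, ?)", diagnoses)
    return conn

def count_eligible(conn, diagnosis_code, born_on_or_before):
    """Count patients with the given diagnosis born in or before the given year."""
    row = conn.execute(ELIGIBILITY_QUERY, (diagnosis_code, born_on_or_before)).fetchone()
    return row[0]

# Invented sample data for two sites.
site_a = build_site([(1, 1950), (2, 1990)], [(1, "E11.9"), (2, "E11.9")])
site_b = build_site([(10, 1945), (11, 1960)], [(10, "E11.9"), (11, "I10")])

total = sum(count_eligible(site, "E11.9", 1960) for site in (site_a, site_b))
print("Eligible patients across both sites:", total)
```

Because each site answers with a count rather than with patient rows, a sketch like this also hints at how networks can estimate study feasibility without moving the underlying records.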

Two priorities for Data Quality

Different people use data differently. Even when EHR (Electronic Health Record) data works well for staff and clinicians, data quality issues arise once that data leaves the organization’s doors, because it has to make sense to the outside world. For example, EHRs can differ greatly between two hospitals, but both have to follow the same CDM standards to make their data available to the healthcare community. There are two main priorities:

Creating a common structure

Health record data can take the form of tables. For the system to work, every participant must build on the same tables and structures.
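As a rough illustration of a common structure, the sketch below maps two differently shaped source rows onto one shared record type. The `CommonPerson` fields and both hospital layouts are hypothetical and chosen only for this example; an actual CDM specifies its own tables and columns.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CommonPerson:
    """Hypothetical shared record shape; a real CDM defines far more fields."""
    person_id: str
    year_of_birth: int
    sex: str

def from_hospital_a(row: dict) -> CommonPerson:
    # Hospital A (invented layout) stores an ISO birth date and "M"/"F".
    return CommonPerson(
        person_id=f"A-{row['mrn']}",
        year_of_birth=int(row["dob"][:4]),
        sex={"M": "male", "F": "female"}.get(row["sex"], "unknown"),
    )

def from_hospital_b(row: dict) -> CommonPerson:
    # Hospital B (invented layout) stores the birth year directly and spells out the sex.
    return CommonPerson(
        person_id=f"B-{row['patient_no']}",
        year_of_birth=row["birth_year"],
        sex=row["gender"].lower(),
    )

people = [
    from_hospital_a({"mrn": "1001", "dob": "1958-03-14", "sex": "F"}),
    from_hospital_b({"patient_no": 77, "birth_year": 1971, "gender": "Male"}),
]
for person in people:
    print(person)
```

Once both sources produce the same record type, downstream code no longer needs to know which hospital a row came from.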

Creating a common vocabulary

Make sure the ontology you create makes sense so the community can use the same identifiers. For example:

  • LOINC codes for the names and ID codes of laboratory results;
  • International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) to classify and code all diagnoses, symptoms, and procedures in the United States.
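A common vocabulary usually comes down to lookup tables that translate local codes into shared identifiers. The sketch below assumes two small hypothetical mappings: the local codes are invented, and the LOINC and ICD-10-CM codes are shown for illustration only and should be verified against the official releases before any real use.

```python
# Hypothetical local-to-standard lookup tables (local codes are invented).
LOCAL_LAB_TO_LOINC = {
    "GLU_SER": "2345-7",  # glucose, serum or plasma
    "HGB": "718-7",       # hemoglobin, blood
}
LOCAL_DX_TO_ICD10CM = {
    "DM2": "E11.9",       # type 2 diabetes mellitus without complications
    "HTN": "I10",         # essential (primary) hypertension
}

def to_standard(local_code, mapping):
    """Return (code, mapped_flag); unmapped codes are flagged rather than guessed."""
    standard = mapping.get(local_code)
    if standard is None:
        return local_code, False
    return standard, True

print(to_standard("GLU_SER", LOCAL_LAB_TO_LOINC))
print(to_standard("XYZ", LOCAL_DX_TO_ICD10CM))
```

Returning an explicit flag for unmapped codes is one possible design choice: it keeps gaps in the vocabulary visible so they can be reviewed instead of silently dropped.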

Keeping data fresh in CDM through partial updates

The quality of data in research repositories is likely to be better than that of the source data. The goal is to transform EHR data into reliable data.

Traditionally, CDM repositories are updated just once a quarter. During such an update, the entire CDM is refreshed, upgraded, and reconciled, including patient records, vocabulary, and the version of the CDM. Updates happen this infrequently because this type of research does not necessarily need real-time data.

However, in terms of data quality, if only a small part of the system needs an update (say, 20% of patients require an update to their secondary diagnosis), it would be helpful to address it without refreshing the entire system. Over time, these minor updates improve the overall quality of data, which can eventually allow the CDM to be updated right away.

This is a crucial point. Every time a CDM needs an update, there is a choice between two options:

  • You can update the entire database at once;
  • Or you can extract data in increments.

Both approaches have their time and place, but this article focuses on the incremental approach.
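The difference between the two options can be sketched in a few lines. This example assumes every source row carries a `modified_at` timestamp and that the time of the previous sync is known; both are assumptions for illustration, and the field names and sample rows are invented.

```python
from datetime import datetime

def full_refresh(source_rows):
    """Rebuild the whole target from scratch."""
    return {row["patient_id"]: row for row in source_rows}

def incremental_update(target, source_rows, last_sync):
    """Apply only rows modified after the previous sync; return how many were applied."""
    applied = 0
    for row in source_rows:
        if row["modified_at"] > last_sync:
            target[row["patient_id"]] = row
            applied += 1
    return applied

# Invented sample data.
source = [
    {"patient_id": 1, "secondary_dx": "I10", "modified_at": datetime(2024, 1, 5)},
    {"patient_id": 2, "secondary_dx": None, "modified_at": datetime(2024, 1, 5)},
]
cdm = full_refresh(source)

# Later, only patient 2 changes in the source system.
source[1] = {"patient_id": 2, "secondary_dx": "E11.9", "modified_at": datetime(2024, 2, 1)}
changed = incremental_update(cdm, source, last_sync=datetime(2024, 1, 31))
print("Rows applied:", changed)
print("Patient 2 secondary diagnosis:", cdm[2]["secondary_dx"])
```

In this sketch the incremental path still scans every source row; in practice the filter would normally be pushed into the source query so that only changed rows are read at all.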

ETL (extract, transform, and load) processes are complex efforts, typically run by experienced organizations, that result in more and better-structured content, standard concept identifiers, vocabularies, and valuable metadata. This is accomplished through a series of steps:

  • Extracting – pulling all relevant raw source data that must be transformed into something more useful, leaving out non-essential information (e.g., unneeded dates or lines);
  • Transforming – converting this raw source data into a clean format (e.g., changing how dates appear);
  • Loading – loading the clean data into a system where users can access it.

An ETL process takes data from many sources, cleans it, and loads it into your enterprise system. This requires significant expertise on the part of the organization because many different types of information are involved (e.g., clinical data from hospitals and research articles from journals).
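The three steps can be sketched with the Python standard library alone. The CSV export, its column names, and its US-style date format are invented for this example, and the "warehouse" is just a list standing in for a real target system.

```python
import csv
import io
from datetime import datetime

# Invented sample export; column names and date format are assumptions.
RAW_EXPORT = """patient_id,visit_date,free_text_note
1,03/14/2024,follow-up
2,04/02/2024,new patient
"""

def extract(raw_text):
    """Read the raw export and keep only the columns the target needs."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{"patient_id": r["patient_id"], "visit_date": r["visit_date"]} for r in reader]

def transform(rows):
    """Normalize types and convert US-style dates to ISO 8601."""
    return [
        {
            "patient_id": int(r["patient_id"]),
            "visit_date": datetime.strptime(r["visit_date"], "%m/%d/%Y").date().isoformat(),
        }
        for r in rows
    ]

def load(rows, target):
    """Append the clean rows to the target store and return how many were loaded."""
    target.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract(RAW_EXPORT)), warehouse)
print("Rows loaded:", loaded)
print("First row:", warehouse[0])
```

Keeping each step as a separate function makes it easier to test the transformation rules on their own, which matters once the same pipeline is reused for small increments.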

The benefits of using partial increments of data:

  1. Data quality improves in small batches;
  2. The ETL process and reconciliation are faster and more efficient;
  3. You can perform operational analytics and decision support on extracted data that is already sorted and clean, rather than on the source EHR data;
  4. Cross-organizational decision support becomes easier to perform.

The challenges of using partial increments of data:

  1. Change triggers. Partial data updates require change triggers that are not easy to implement. In general, detecting that something has changed is not a simple task.
  2. Synchronization. The CDM version and vocabularies still need to be synchronized periodically, in addition to aligning the patient data. This requires considerable effort from CTOs and their teams, but it is an achievable goal if they work together as one team with one vision.
  3. Performance. Partial updates that touch many patients may sometimes be slower than a complete refresh.
  4. Infrastructure. Someone has to sort through the mess: incremental data extraction requires an infrastructure that supports uninterrupted data ingestion and processing. At its heart is a robust data pipeline that includes managing failed records, monitoring, and alerting.
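Two of the challenges above, detecting changes and managing failed records, can be illustrated together. The sketch below is one possible approach, not a prescribed one: it assumes the source offers no reliable change flag, so it compares a content hash of each record with the hash stored from the previous run, and it sets aside records that fail a hypothetical validation rule instead of stopping the whole batch.

```python
import hashlib
import json

def fingerprint(record):
    """Stable hash of a record's content; key order does not affect the result."""
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def process_increment(records, seen, transform):
    """Transform only new or changed records; quarantine the ones that fail."""
    loaded, failed = [], []
    for record in records:
        key, digest = record["patient_id"], fingerprint(record)
        if seen.get(key) == digest:
            continue  # unchanged since the last successful run
        try:
            loaded.append(transform(record))
            seen[key] = digest  # remember only successfully processed versions
        except (KeyError, ValueError) as exc:
            failed.append({"record": record, "error": repr(exc)})
    return loaded, failed

def require_diagnosis(record):
    # Hypothetical validation rule used only for this sketch.
    if not record.get("dx"):
        raise ValueError("missing diagnosis")
    return {"patient_id": record["patient_id"], "dx": record["dx"].upper()}

seen = {}
batch = [{"patient_id": 1, "dx": "e11.9"}, {"patient_id": 2, "dx": ""}]  # invented data
first_loaded, first_failed = process_increment(batch, seen, require_diagnosis)
second_loaded, second_failed = process_increment(batch, seen, require_diagnosis)
print(len(first_loaded), len(first_failed), len(second_loaded), len(second_failed))
```

Because a hash is stored only after a record is processed successfully, a failed record is retried on every run until it is fixed at the source, and the size of the failed list is a natural number to feed into monitoring and alerting.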


Incremental data extraction is a valuable tool that can save time and resources and lead to more accurate research results and, eventually, better care. Implementing best practices in healthcare tech is essential so that people in clinical research can focus on what matters most: integrity and innovation. Of course, as with any solution, there is a time and place for it, but it is a practice that can improve research by producing more reliable evidence and eventually influencing people’s lives.
