RESEARCH REPORT
Incremental Lifecycle Management for OHDSI Standardized Vocabularies
Lena Aljehane, MS4
Alexander Davydov, MD1
Margaret Dobbins4
Thomas Kim4
Gregory Klebanov, MS1
Vlad Korsik, MD1
Sarah Manglicmot, MSN, RN4
Yaroslav Molodkov, MS1
Anna Ostropolets, MD, PhD2
John Philip, MS4
Christian Reich, MD, PhD3
Tatyana Sandler, MS, RN4
1. Odysseus Data Services, Inc., an EPAM company, Cambridge, MA
2. Columbia University, New York, NY
3. Northeastern University, Boston, MA
4. Memorial Sloan Kettering Cancer Center, New York, NY
Introduction
The Observational Health Data Sciences and Informatics (OHDSI) Standardized Vocabularies constitute a curated repository of medical vocabularies, coding systems and ontologies used across diverse countries, regions, healthcare systems and institutions, and harmonized to a global standard for observational research (Reich et al., 2024). Vocabularies in this repository are consolidated into a common schema, while the meanings, concept names and original relationships from each vocabulary are preserved. Vocabularies are also integrated: one standard concept is chosen per semantic meaning, concepts of equivalent meaning in different vocabularies are consolidated to that standard concept, and the others are treated as source concepts and mapped to the standard through relationships. For some purposes, and if no suitable public standard is available, vocabularies and relationships are authored de novo by the OHDSI Vocabulary Team. This entire system serves as a cornerstone of the OHDSI initiative, and its use is mandatory across all data sources in the network, which total more than 500 databases. It provides reference data for all information encoded in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) and enables a global research network with federated analytics.
Problem
Vocabularies are dynamic systems that undergo constant evolution: additions, deprecations and modifications are introduced by both the original authors and the Vocabulary Team. These changes often have ripple effects, because a change to a concept in one vocabulary might affect or disrupt the mappings or other types of relationships to concepts in other vocabularies. As the OHDSI Standardized Vocabularies have grown significantly over time, each release contains more of these changes, even though a large portion of the content stays constant between releases.
Data managers and researchers are the main users of the OHDSI Vocabularies: they must apply the Vocabularies when transforming data to the OMOP CDM schema and when querying the data for cohort definition, covariate construction, large-scale analytics and result reporting. A new release of the Vocabularies might disrupt these use cases, forcing users to manually assess the granular details of the changes and their impact.
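The ripple effect described above can be illustrated with a minimal sketch. It assumes that the set of changed concept IDs is already known and that relationships are available as dictionaries using the CONCEPT_RELATIONSHIP column names `concept_id_1`, `concept_id_2` and `relationship_id`. The function name `find_affected_relationships` and all IDs are hypothetical and chosen for illustration only.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def find_affected_relationships(
    relationships: List[Row], changed_concept_ids: Iterable[int]
) -> List[Row]:
    """Return every relationship that touches a changed concept on either side."""
    changed = set(changed_concept_ids)
    return [
        rel
        for rel in relationships
        if rel["concept_id_1"] in changed or rel["concept_id_2"] in changed
    ]


# Toy data: the concept IDs are made up, not real OHDSI concepts.
relationships = [
    {"concept_id_1": 101, "concept_id_2": 1, "relationship_id": "Maps to"},
    {"concept_id_1": 102, "concept_id_2": 2, "relationship_id": "Maps to"},
    {"concept_id_1": 103, "concept_id_2": 2, "relationship_id": "Maps to"},
]

# Suppose concept 2 was deprecated in the new release.
affected = find_affected_relationships(relationships, changed_concept_ids=[2])
affected_sources = [rel["concept_id_1"] for rel in affected]
```

In this toy case, the two source concepts that map to the deprecated concept are the ones a data manager would need to review.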
Current Approach to Vocabularies Distribution
The traditional approach applies the OHDSI Vocabularies with each release as a batch. Once a new version of the Vocabularies is released, it is pushed to the distribution service for users to download. Users then select the vocabularies they need and download the entire corpus, which must then be loaded into the vocabulary tables in their environment. Only the current version of the Vocabularies is available for distribution, and it overwrites the previously released version. Users therefore cannot a) browse or download different versions of the Vocabularies or b) download the delta between versions.
A New Solution from Memorial Sloan Kettering Cancer Center (MSKCC) and Odysseus, an EPAM Company
A new solution, designed and implemented by Odysseus, in collaboration with and sponsored by MSKCC, allows users to generate a delta between versions for all vocabulary tables, including CONCEPT, CONCEPT_RELATIONSHIP, CONCEPT_ANCESTOR, CONCEPT_SYNONYM, DOMAIN and VOCABULARY. This approach significantly reduces the infrastructure and resource requirements for updates. Figure 1 demonstrates a comparison of the new solution and the traditional approach.
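As a rough illustration of what a delta between two versions of a table means, the following sketch classifies rows as added, deleted or modified. It is not the implementation used by the solution described here. It assumes that each version is available as a list of dictionaries keyed by `concept_id`. The function name `compute_delta`, the output layout and the toy rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def compute_delta(
    old_rows: List[Row], new_rows: List[Row], key: str = "concept_id"
) -> Dict[str, List[Row]]:
    """Classify rows as added, deleted or modified between two versions."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}

    # Present only in the new version.
    added = [row for k, row in new_by_key.items() if k not in old_by_key]
    # Present only in the old version.
    deleted = [row for k, row in old_by_key.items() if k not in new_by_key]
    # Present in both versions, but with at least one differing field.
    modified = [
        row
        for k, row in new_by_key.items()
        if k in old_by_key and row != old_by_key[k]
    ]
    return {"added": added, "deleted": deleted, "modified": modified}


# Toy data: IDs and names are made up for illustration.
old_version = [
    {"concept_id": 1, "concept_name": "Concept A", "invalid_reason": None},
    {"concept_id": 2, "concept_name": "Concept B", "invalid_reason": None},
    {"concept_id": 3, "concept_name": "Concept C", "invalid_reason": None},
]
new_version = [
    {"concept_id": 1, "concept_name": "Concept A", "invalid_reason": None},
    {"concept_id": 2, "concept_name": "Concept B", "invalid_reason": "D"},
    {"concept_id": 4, "concept_name": "Concept D", "invalid_reason": None},
]

delta = compute_delta(old_version, new_version)
```

Unchanged rows (concept 1 here) do not appear in the delta, which is why a delta can be much smaller than a full release. A table without a single-column key, such as CONCEPT_RELATIONSHIP, would need a composite key instead of `concept_id`.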
Upon downloading these delta files, the users can do the following:
- Review the updates between releases to assess impact. These updates encompass groups of changes, including additions, deletions and modifications to existing content, all documented in unified CSV files.
- Create the new version by applying a SQL script that directly updates the respective tables of the existing database.
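The second step can be sketched as follows. The contents of the distributed SQL script are not described here, so this is only an assumed shape: a delta holding added, modified and deleted rows is applied to a simplified CONCEPT table inside a single transaction. The example uses SQLite through Python's `sqlite3` module. The function name `apply_concept_delta`, the reduced column set and the toy rows are hypothetical.

```python
import sqlite3
from typing import Dict, List

Row = Dict[str, object]


def apply_concept_delta(conn: sqlite3.Connection, delta: Dict[str, List[Row]]) -> None:
    """Apply added, modified and deleted CONCEPT rows in one transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "INSERT INTO concept (concept_id, concept_name, vocabulary_id, invalid_reason) "
            "VALUES (:concept_id, :concept_name, :vocabulary_id, :invalid_reason)",
            delta.get("added", []),
        )
        conn.executemany(
            "UPDATE concept SET concept_name = :concept_name, "
            "vocabulary_id = :vocabulary_id, invalid_reason = :invalid_reason "
            "WHERE concept_id = :concept_id",
            delta.get("modified", []),
        )
        conn.executemany(
            "DELETE FROM concept WHERE concept_id = :concept_id",
            [{"concept_id": row["concept_id"]} for row in delta.get("deleted", [])],
        )


# Toy database standing in for an existing vocabulary installation.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT, "
    "vocabulary_id TEXT, invalid_reason TEXT)"
)
conn.executemany(
    "INSERT INTO concept VALUES (?, ?, ?, ?)",
    [
        (1, "Concept A", "ExampleVocab", None),
        (2, "Concept B", "ExampleVocab", None),
        (3, "Concept C", "ExampleVocab", None),
    ],
)
conn.commit()

# Hypothetical delta: one addition, one modification, one deletion.
example_delta = {
    "added": [
        {"concept_id": 4, "concept_name": "Concept D",
         "vocabulary_id": "ExampleVocab", "invalid_reason": None},
    ],
    "modified": [
        {"concept_id": 2, "concept_name": "Concept B",
         "vocabulary_id": "ExampleVocab", "invalid_reason": "D"},
    ],
    "deleted": [{"concept_id": 3}],
}

apply_concept_delta(conn, example_delta)
rows = conn.execute(
    "SELECT concept_id, invalid_reason FROM concept ORDER BY concept_id"
).fetchall()
```

Wrapping all three statements in one transaction means a failed update leaves the existing tables in their previous state rather than partially updated.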
In addition, the new system introduces multiple-version support for traditional batches, so users can start from any available version and generate a delta between any two versions. Both the CSV and the SQL files are created only for the vocabularies the user has subscribed to in the Athena application, which is powered by EPAM.
Conclusion
Incremental lifecycle management for OHDSI Standardized Vocabularies offers a more efficient and scalable approach to vocabulary updates, addressing the challenges posed by increasing resource requirements and use case dependencies. By adopting an incremental methodology, stakeholders can assess and track the consistency, integrity and reliability of data representations.
References
Reich, C., Ostropolets, A., Ryan, P., Rijnbeek, P., Schuemie, M., Davydov, A., . . . Hripcsak, G. (2024, Feb 16). OHDSI Standardized Vocabularies-a large-scale centralized reference ontology for international data harmonization. J Am Med Inform Assoc, 31(3), 583-590. doi:10.1093/jamia/ocad247