De-identified EHR for Obstetric and Maternal Care Dataset
Description
The dataset contains anonymized clinical information standardized to the OMOP Common Data Model (CDM) version 5.4. It covers the clinical profiles of 23,879 women aged 18 to 47 years who received care in the Gynecology and Obstetrics Department of the Clínica Universitaria Bolivariana (CUB) between 2015 and 2017. Each patient has at least one visit related to pregnancy, childbirth, and/or the postpartum period, documented through both structured data and unstructured clinical notes.
In total, the database comprises 200,070 clinical observations, 143,385 documented clinical conditions, 2,494,424 clinical measurements, and 3,776,555 clinical notes in natural language. Together these form a cohesive set of obstetric clinical events linked through primary and foreign keys.
The dataset consists of seven files: six CSV tables and one .ipynb notebook containing the exploratory data analysis (EDA). Each CSV file adheres to OMOP CDM v5.4, which supports interoperability and comparative analyses with other standardized healthcare databases. Formal descriptions of tables and attributes are available in the official OHDSI documentation, including the Concept table, which references international medical terminologies such as SNOMED CT, LOINC, and RxNorm.
The files included are:
- 01_Person: contains the person_id and demographic information.
- 02_Visit: documents each medical visit (89,893 records).
- 03_Observation: includes coded clinical observations (200,070 records).
- 04_Condition: records diagnoses and reasons for consultation (143,385 records).
- 05_Measurement: stores quantitative and qualitative clinical measurements (2,494,424 records).
- 06_Note: contains clinical notes in natural language (3,776,555 records).
- 07_Notebook: an .ipynb file with the exploratory data analysis.
A hierarchical relational structure underlies the dataset: each record in 01_Person is associated with at least one record in 02_Visit, which in turn links to records in 03_Observation, 04_Condition, 05_Measurement, and 06_Note. The person_id field functions as the central key, enabling reconstruction of full clinical cases across all tables. Because the dataset is fully standardized to OMOP v5.4, covering both structured data and unstructured clinical notes, it is a robust source for real-world evidence generation, supporting research, clinical surveillance, maternal health analytics, and outcomes-driven medicine.
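The person-to-visit-to-event hierarchy described above can be sketched with the Python standard library. This is a minimal illustration, not the authors' code. It assumes the CSV files use the standard OMOP v5.4 column names person_id and visit_occurrence_id. The in-memory rows are made up, and the helper names (read_rows, group_by, build_cases) are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-ins for 01_Person, 02_Visit and 04_Condition (assumed column names).
PERSON_CSV = "person_id\n1\n2\n"
VISIT_CSV = "visit_occurrence_id,person_id\n10,1\n11,1\n20,2\n"
CONDITION_CSV = (
    "condition_occurrence_id,person_id,visit_occurrence_id\n"
    "100,1,10\n"
    "101,2,20\n"
)


def read_rows(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def group_by(rows, key):
    """Index rows by the value of one column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped


def build_cases(persons, visits, events_by_table):
    """Nest visits under each person and event rows under each visit."""
    visits_by_person = group_by(visits, "person_id")
    events_by_visit = {
        name: group_by(rows, "visit_occurrence_id")
        for name, rows in events_by_table.items()
    }
    cases = {}
    for person in persons:
        pid = person["person_id"]
        cases[pid] = []
        for visit in visits_by_person.get(pid, []):
            vid = visit["visit_occurrence_id"]
            entry = {"visit_occurrence_id": vid}
            for name, grouped in events_by_visit.items():
                entry[name] = grouped.get(vid, [])
            cases[pid].append(entry)
    return cases


cases = build_cases(
    read_rows(PERSON_CSV),
    read_rows(VISIT_CSV),
    {"condition": read_rows(CONDITION_CSV)},
)
```

The same pattern extends to the observation, measurement, and note tables by adding them to the events_by_table dictionary. For the full files, a streaming or database-backed join would likely be more practical than in-memory dictionaries.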
Files
Steps to reproduce
This dataset contains anonymized clinical information standardized under the OMOP Common Data Model v5.4 and derived from electronic health records of 23,879 women treated in the Gynecology and Obstetrics service between 2015 and 2017. The dataset integrates both structured data (demographics, visits, observations, diagnoses, measurements) and unstructured clinical narratives, enabling reproducible research in maternal health, clinical text mining, and real-world evidence studies.
The raw data were extracted from the Servinte Clinical Suite EHR system using period-, service-, and event-based SQL filters. Personally identifiable information (PII) was removed from all structured fields, while clinical notes were anonymized using a fine-tuned language model (“bsc-bio-ehr-es”) that automatically masks sensitive identifiers. The resulting data were transformed into OMOP v5.4 using the OHDSI tools WhiteRabbit and Usagi, with standardized mappings to SNOMED CT, LOINC, and RxNorm. A secondary privacy review identified residual date-of-birth information, which was replaced by age-at-visit computed programmatically prior to distribution. Additional filtering removed patients under 18 years. The final OMOP tables were exported as UTF-8 CSV files.
Table-level characteristics are summarized as follows:
- Person: 23,879 records
- Visit_Occurrence: 89,893 records
- Observation: 200,070 records
- Condition_Occurrence: 143,385 records
- Measurement: 2,494,424 records
- Note: 3,776,555 records
Together, these tables preserve the semantic and relational structure of OMOP, facilitating interoperability and supporting downstream applications such as epidemiological modeling, clinical NLP, and maternal-fetal risk analysis.
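The replacement of date of birth by age-at-visit, followed by the removal of records under 18 years, might look like the sketch below. It is an illustration under stated assumptions, not the procedure actually used. The field names birth_date and visit_date, the function names, and the sample rows are hypothetical. The source also does not say whether the age filter was applied per patient or per visit, so this sketch filters per visit row.

```python
from datetime import date


def age_at_visit(birth_date, visit_date):
    """Return the number of completed years between birth and visit."""
    years = visit_date.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the visit year.
    if (visit_date.month, visit_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def replace_birth_date(rows, min_age=18):
    """Swap birth_date for age_at_visit and drop rows below min_age."""
    cleaned = []
    for row in rows:
        age = age_at_visit(row["birth_date"], row["visit_date"])
        if age < min_age:
            continue  # assumed per-visit filter; the source does not specify
        # Copy every field except the date of birth, then add the derived age.
        out = {k: v for k, v in row.items() if k != "birth_date"}
        out["age_at_visit"] = age
        cleaned.append(out)
    return cleaned


# Made-up rows for illustration only.
sample = [
    {"person_id": 1, "birth_date": date(1990, 6, 15), "visit_date": date(2016, 6, 14)},
    {"person_id": 2, "birth_date": date(1999, 3, 1), "visit_date": date(2016, 12, 31)},
]
deidentified = replace_birth_date(sample)
```

Computing completed years from the (month, day) comparison avoids the drift that comes from dividing a day count by 365.25, which can be off by one around birthdays.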
Institutions
- Instituto Tecnologico Metropolitano
- Universidad Pontificia Bolivariana