Synthetic Indian Clinical Notes for Natural Language Processing Research

Name: Synthetic Indian Clinical Notes for Natural Language Processing Research
Creator: Amandeep Aman
Published: 2026-02-10T08:57:51.563Z
Keywords: Health

Aman, Amandeep

doi:10.17632/bzgjmph5n2.1


Published: 10 February 2026 | Version 1 | DOI: 10.17632/bzgjmph5n2.1

Contributor:

Amandeep Aman

Description

This dataset contains 10,000 synthetic clinical notes designed to reflect the linguistic and structural characteristics of clinical documentation in India. The notes were generated with a custom Python script that draws on a variety of templates and semantic patterns derived from real-world examples of Indian clinical documentation. The dataset is intended for researchers and developers working on Natural Language Processing (NLP) tasks in the medical domain, particularly those focused on the Indian context. It can be used to train and evaluate language models, develop information extraction systems, and explore the semantic nuances of Indian clinical text. The notes cover five medical specialties: cardiology, neurology, pediatrics, general medicine, and orthopaedics. All notes are of medium length, balancing detail and conciseness.
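The description mentions template-based generation but does not publish the script. As a rough illustration of the general technique only, the sketch below fills per-specialty templates with randomly chosen slot values. Every name, template, and vocabulary item here (TEMPLATES, SLOTS, generate_note, generate_dataset) is hypothetical and is not taken from the dataset or its generator.

```python
import random

# Hypothetical templates and vocabularies, for illustration only.
# The dataset's actual templates are not given in this description.
TEMPLATES = {
    "cardiology": "{age} yr {sex} c/o {symptom} x {days} days. Adv: {advice}.",
    "orthopaedics": "{age} yr {sex} with {symptom} since {days} days. Plan: {advice}.",
}
SLOTS = {
    "cardiology": {
        "symptom": ["chest pain", "palpitations"],
        "advice": ["ECG", "2D echo"],
    },
    "orthopaedics": {
        "symptom": ["knee pain", "low back pain"],
        "advice": ["X-ray", "physiotherapy"],
    },
}


def generate_note(specialty, rng):
    """Fill the template for one specialty with randomly drawn slot values."""
    slots = SLOTS[specialty]
    return TEMPLATES[specialty].format(
        age=rng.randint(1, 90),
        sex=rng.choice(["M", "F"]),
        symptom=rng.choice(slots["symptom"]),
        days=rng.randint(1, 14),
        advice=rng.choice(slots["advice"]),
    )


def generate_dataset(n, seed=0):
    """Return n records of the form {'specialty': ..., 'note': ...}."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    specialties = sorted(TEMPLATES)
    records = []
    for _ in range(n):
        specialty = rng.choice(specialties)
        records.append({"specialty": specialty, "note": generate_note(specialty, rng)})
    return records
```

Calling generate_dataset(10000) would yield a list of the same size as the published dataset, though the real notes presumably draw on far richer templates and all five specialties.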
