STEM_QA_Hindi

Published: 7 January 2026 | Version 1 | DOI: 10.17632/53hyvjzj3b.1
Contributor:
InfoBay AI

Description

The full corpus is curated across multiple STEM disciplines and structured for use in LLM training, evaluation, and instruction tuning (SFT/RLHF). This sample represents the structure and quality of the larger dataset.

Dataset composition (full corpus):
- Text corpus: 1.6B+ words of curated STEM and non-STEM educational content from 22,000+ textbooks in 7 languages (English, Hindi, Arabic, Bahasa, Tamil, Telugu, Kannada)
- Question–answer pairs: 6.5M+ high-quality STEM and non-STEM Q&A pairs in English, Arabic, Hindi, and other Indic languages
- Video data: 100K+ hours of STEM videos and 30K+ hours of UGC
- Audio data: 821K+ hours of podcasts and call center data (dual channel)
- Medical datasets: 30M+ files of clinical and diagnostic data, including CT scans, MRI, X-ray, pathology, EHRs, USG reports, and echo reports

This repository includes:
- A small preview subset of the STEM English Q&A data
- A flat, viewer-friendly schema for inspection
- Parquet files suitable for benchmarking and evaluation

Purpose of this dataset:
- Dataset preview and validation
- Model evaluation and experimentation
- Schema and format inspection before full-scale access

⚠️ Note: This repository contains sample data only, not the full dataset. Access to the complete dataset is available separately under appropriate licensing or partnership terms. For full details, please contact vipul.mishra@infobay.ai.

Files

Steps to reproduce

1. Manual Data Creation: All question–answer pairs were manually crafted by domain experts to ensure accuracy, clarity, and relevance. No automated scraping or third-party datasets were used.
2. Topic Selection: Topics were selected based on commonly referenced concepts in the covered STEM disciplines, ensuring comprehensive coverage.
3. Formatting & Structure: The data is structured in JSON/CSV format; the data sample, however, is uploaded in Parquet format. Each entry contains:
   - Question
   - Answer
   - Category/topic (if applicable)
4. Quality Assurance: Each Q&A pair was reviewed for duplication, bias, and factual correctness before final inclusion.
5. Tooling: Basic editing and formatting were done using spreadsheet tools, and Python (Pandas) was used for the final dataset export.
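
The structural part of the quality assurance step (checking for repeated questions and blank answers) could be approximated in Pandas along the following lines. This is a minimal sketch, not the contributor's actual review procedure: the column names `question`, `answer`, and `category` are assumptions based on the entry fields listed above, the helper names `normalise` and `qa_report` are hypothetical, and the three rows are made-up stand-ins for the sample. Checks for bias and factual correctness need human review and are not covered here.

```python
import pandas as pd

# Assumed column names; the published Parquet schema may use different ones.
QUESTION_COL = "question"
ANSWER_COL = "answer"


def normalise(text: pd.Series) -> pd.Series:
    """Trim and collapse whitespace so trivially different strings compare equal."""
    return text.fillna("").astype(str).str.strip().str.replace(r"\s+", " ", regex=True)


def qa_report(df: pd.DataFrame) -> dict:
    """Count rows, repeated questions, and blank answers in a Q&A frame."""
    missing = [c for c in (QUESTION_COL, ANSWER_COL) if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    questions = normalise(df[QUESTION_COL])
    answers = normalise(df[ANSWER_COL])
    return {
        "rows": len(df),
        "duplicate_questions": int(questions.duplicated().sum()),
        "empty_answers": int((answers == "").sum()),
    }


# Tiny in-memory stand-in for the sample (illustrative rows only). With the
# real file, one would load it with pd.read_parquet on the downloaded path.
sample = pd.DataFrame(
    {
        "question": ["गति क्या है?", "गति  क्या है? ", "जल का रासायनिक सूत्र क्या है?"],
        "answer": ["स्थिति में परिवर्तन की दर।", "स्थिति में परिवर्तन की दर।", None],
        "category": ["Physics", "Physics", "Chemistry"],
    }
)
report = qa_report(sample)
print(report)
```

Whitespace is normalised before comparison so that two entries differing only in spacing are flagged as duplicates; stricter or looser matching (for example, ignoring punctuation) would be a separate design decision.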

Categories

Chemistry, Mathematics, Physics, Hindi Language, Meta Dataset

Licence