STEM_QA_Hindi
Description
The full corpus is curated across multiple STEM disciplines and structured for use in LLM training, evaluation, and instruction tuning (SFT/RLHF). This sample represents the structure and quality of the larger dataset.

Dataset composition (full corpus):
Text corpus: 1.6B+ words of curated STEM and non-STEM educational content from 22,000+ textbooks in 7 languages (English, Hindi, Arabic, Bahasa, Tamil, Telugu, Kannada)
Question–Answer pairs: 6.5M+ high-quality STEM and non-STEM Q&A pairs in English, Arabic, Hindi, and other Indic languages
Video data: 100K+ hours of STEM videos and 30K+ hours of UGC (user-generated content)
Audio data: 821K+ hours of podcasts and call center data (dual channel)
Medical datasets: 30M+ files including clinical and diagnostic data such as CT scans, MRI, X-rays, pathology, EHRs, USG reports, and echo reports

This repository includes:
A small preview subset of the STEM English Q&A data
A flat, viewer-friendly schema for inspection
Parquet files suitable for benchmarking and evaluation

Purpose of this dataset:
Dataset preview and validation
Model evaluation and experimentation
Schema and format inspection before full-scale access

⚠️ Note: This repository contains sample data only, not the full dataset. Access to the complete dataset is available separately under appropriate licensing or partnership terms. For full details, please contact vipul.mishra@infobay.ai.
Files
Steps to reproduce
1. Manual Data Creation
All question–answer pairs were manually crafted by domain experts to ensure accuracy, clarity, and relevance. No automated scraping or third-party datasets were used.

2. Topic Selection
Topics were selected based on commonly referenced concepts in the STEM disciplines covered by the dataset, ensuring comprehensive coverage.

3. Formatting & Structure
The data is structured in JSON/CSV format; the sample in this repository is uploaded in Parquet format. Each entry contains:
   1. Question
   2. Answer
   3. Category/topic (if applicable)

4. Quality Assurance
Each Q&A pair was reviewed for duplication, bias, and factual correctness before final inclusion.

5. Tooling
Basic editing and formatting were done using spreadsheet tools, and Python (Pandas) was used for the final dataset export.
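
The structural part of the quality-assurance step can be approximated with a few Pandas checks. The following is a minimal sketch, not the authors' actual tooling: the column names `question`, `answer`, and `category` are assumptions based on the entry fields listed above, the helper names `load_preview` and `qa_report` are hypothetical, and the sample rows are made up for illustration rather than taken from the dataset. Bias and factual correctness still require human review.

```python
import pandas as pd

# Assumed column names; the actual Parquet schema may differ.
REQUIRED_COLUMNS = ["question", "answer"]


def load_preview(path):
    """Load a preview Parquet file (needs pyarrow or fastparquet installed)."""
    return pd.read_parquet(path)


def _normalise(series):
    """Fill missing values, collapse whitespace, and lowercase the text."""
    text = series.fillna("").astype(str)
    return text.str.split().str.join(" ").str.lower()


def qa_report(df):
    """Count empty fields and repeated questions in a Q&A table."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    questions = _normalise(df["question"])
    answers = _normalise(df["answer"])

    # A repeated question is any non-empty question seen earlier in the table.
    repeated = questions.duplicated() & (questions != "")

    return {
        "rows": int(len(df)),
        "empty_questions": int((questions == "").sum()),
        "empty_answers": int((answers == "").sum()),
        "duplicate_questions": int(repeated.sum()),
    }


# Illustrative rows only (not from the dataset).
sample = pd.DataFrame(
    {
        "question": [
            "What is the SI unit of force?",
            "what is the SI unit of  force? ",
            "Define velocity.",
            "",
        ],
        "answer": ["Newton", "Newton", None, "Joule"],
        "category": ["Physics", "Physics", "Physics", "Physics"],
    }
)

report = qa_report(sample)
print(report)
```

To run the same checks on a downloaded preview file, pass the result of `load_preview` to `qa_report`, adjusting the assumed column names to match the schema actually shipped in the Parquet file.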