Datasets Comparison

Version 1

Non-STEM_QA_English

Published: 7 January 2026 | Version 1 | DOI: 10.17632/rfd4g9p8sc.1

Contributor: InfoBay AI

Description

The full corpus is curated across multiple STEM and Non-STEM disciplines and structured for use in LLM training, evaluation, and instruction tuning (SFT/RLHF). This sample represents the structure and quality of the larger dataset.

Dataset composition (full corpus):
Text corpus: 1.6B+ words of curated STEM and Non-STEM educational content across 22,000+ textbooks in 7 languages (English, Hindi, Arabic, Bahasa, Tamil, Telugu, Kannada)
Question–Answer pairs: 6.5M+ high-quality STEM and Non-STEM Q&A pairs in English, Arabic, Hindi, and Indic languages
Video data: 100K+ hours of STEM videos and 30K+ hours of UGC
Audio data: 821K+ hours of podcasts and call center data (dual channel)
Medical datasets: 30M+ files including clinical and diagnostic data such as CT scans, MRI, X-rays, pathology, EHRs, USG reports, and echo reports

This repository includes:
A small preview subset of the STEM English Q&A data
Flat, viewer-friendly schema for inspection
Parquet files suitable for benchmarking and evaluation

Purpose of this dataset:
Dataset preview and validation
Model evaluation and experimentation
Schema and format inspection before full-scale access

⚠️ Note: This repository contains sample data only, not the full dataset. Access to the complete dataset is available separately under appropriate licensing or partnership terms. For further details, contact vipul.mishra@infobay.ai
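As a rough illustration of the schema inspection mentioned above, the sketch below summarizes which value types appear in each column of a set of flat rows. The column names (question, answer, category) and the sample rows are assumptions made for illustration, not the repository's documented schema. Rows from a preview Parquet file could be obtained with, for example, pandas.read_parquet(path).to_dict("records"), assuming pandas and a Parquet engine are installed.

```python
from collections import defaultdict


def summarize_schema(rows):
    """Map each column name to the sorted list of Python type names seen in it."""
    seen = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            seen[column].add(type(value).__name__)
    return {column: sorted(types) for column, types in seen.items()}


# Hypothetical preview rows; the real column names may differ.
preview_rows = [
    {"question": "What is inflation?",
     "answer": "A sustained rise in the general price level.",
     "category": "Economics"},
    {"question": "Who wrote 'Hamlet'?",
     "answer": "William Shakespeare.",
     "category": None},
]

schema = summarize_schema(preview_rows)
for column, types in schema.items():
    print(column, types)
```

A column that reports more than one type (here, category holds both strings and missing values) is worth checking before using the data for evaluation.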

Steps to reproduce

1. Manual Data Creation: All question–answer pairs were manually crafted by domain experts to ensure accuracy, clarity, and relevance. No automated scraping or third-party datasets were used.
2. Topic Selection: Topics were selected based on commonly referenced concepts in [insert domain: e.g., medical education, AI fundamentals, etc.], ensuring comprehensive coverage.
3. Formatting & Structure: Data was structured in JSON/CSV format, with each entry containing: (1) Question, (2) Answer, (3) Category/topic (if applicable).
4. Quality Assurance: Each Q&A pair was reviewed for duplication, bias, and factual correctness before final inclusion.
5. Tooling: Basic editing and formatting were done using spreadsheet tools and Python (Pandas) for final dataset export.
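The entry structure and duplication check described in these steps can be sketched as follows. This is a minimal illustration that uses only the Python standard library rather than the Pandas export the steps mention. The field names, the normalization rule (lowercasing and collapsing whitespace), and the sample pairs are assumptions, not the contributor's actual procedure.

```python
import csv
import io

FIELDS = ["question", "answer", "category"]  # assumed column names


def normalize(text):
    """Lowercase and collapse whitespace so near-identical questions compare equal."""
    return " ".join(text.lower().split())


def drop_duplicate_questions(pairs):
    """Keep the first occurrence of each question, comparing normalized text."""
    seen = set()
    unique = []
    for pair in pairs:
        key = normalize(pair["question"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


def to_csv(pairs):
    """Serialize the pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(pairs)
    return buffer.getvalue()


# Hypothetical sample pairs for illustration only.
pairs = [
    {"question": "What is a sonnet?", "answer": "A 14-line poem.",
     "category": "Literature"},
    {"question": "what is a  sonnet?", "answer": "A poem of fourteen lines.",
     "category": "Literature"},
    {"question": "What is GDP?",
     "answer": "The total value of goods and services produced in a country.",
     "category": ""},
]

unique_pairs = drop_duplicate_questions(pairs)
csv_text = to_csv(unique_pairs)
print(csv_text)
```

An exact-match check like this only catches trivially repeated questions. Reviewing for bias and factual correctness, as the steps describe, remains a manual task.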

Licence

Creative Commons Attribution 4.0 International

Version 2

Non-STEM_QA_English

Published: 14 January 2026 | Version 2 | DOI: 10.17632/rfd4g9p8sc.2

Contributor: InfoBay AI

Description

Steps to reproduce

1. Source Material Collection: All textual content was sourced directly from textbooks published and owned by InfoBay AI (formerly EduGorilla). As the organization holds full copyright and distribution rights over these materials, no external copyrighted content, web-scraped data, or third-party datasets were used in the creation of this dataset.
2. Content Selection: Chapters and sections were selected based on their relevance to the dataset’s educational objectives. Content was chosen to provide thorough coverage of essential concepts, explanations, examples, and practice material included within InfoBay AI’s textbooks.
3. Data Extraction & Structuring: Selected textbook material was manually extracted and organized into a structured, machine-readable format. Each entry includes: Chapter/Section Title; Extracted Text; Subtopic / Category / Concept Tag (when applicable); Source Metadata (e.g., textbook title, edition, and internal reference). The dataset was formatted as JSON/CSV to ensure compatibility with downstream machine learning, NLP, and analytics workflows.
4. Quality Assurance: All extracted content was reviewed internally by the InfoBay AI team to ensure: accuracy and fidelity to the original textbook content; removal of layout or formatting inconsistencies; correction of typographical errors; consistent structure and labeling across all entries. Because the textbooks are owned and published by InfoBay AI (formerly EduGorilla), all included text is fully compliant with copyright and distribution guidelines.
5. Tooling: Initial extraction and formatting were performed using standard document processing tools. Final cleaning, structuring, and dataset export were conducted using Python (Pandas) to maintain reproducibility and standardized formatting.
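The entry layout listed in step 3 can be sketched as a small record type with a cleaning step and JSON export. This is an illustrative sketch using only the Python standard library, not the contributor's Pandas pipeline. The field names (section_title, extracted_text, concept_tag, source), the whitespace-collapsing rule, and the sample record are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Entry:
    """One extracted textbook passage; field names are assumed for illustration."""
    section_title: str
    extracted_text: str
    concept_tag: Optional[str] = None           # present only when applicable
    source: dict = field(default_factory=dict)  # e.g. textbook title, edition


def clean_text(text):
    """Collapse runs of whitespace left over from page layout."""
    return " ".join(text.split())


def build_entry(raw):
    """Clean a raw record and reject entries that have no extracted text."""
    text = clean_text(raw["extracted_text"])
    if not text:
        raise ValueError("extracted_text must not be empty")
    return Entry(
        section_title=clean_text(raw["section_title"]),
        extracted_text=text,
        concept_tag=raw.get("concept_tag"),
        source=dict(raw.get("source", {})),
    )


# Hypothetical raw record for illustration only.
raw = {
    "section_title": "  Chapter 1:  Introduction ",
    "extracted_text": "History is the study\n  of past events.",
    "source": {"textbook_title": "Sample Textbook", "edition": "1"},
}

entry = build_entry(raw)
json_line = json.dumps(asdict(entry), ensure_ascii=False)
print(json_line)
```

Using one record type for every entry is one way to get the consistent structure and labeling that step 4 calls for, since a record with missing text fails at build time instead of reaching the export.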

Licence

Creative Commons Attribution 4.0 International
