Datasets Comparison
Version 1
Non-STEM_QA_English
Description
The full corpus is curated across multiple STEM/Non-STEM disciplines and structured for use in LLM training, evaluation, and instruction tuning (SFT/RLHF). This sample represents the structure and quality of the larger dataset.
Dataset composition (full corpus):
Text corpus: 1.6B+ words of curated STEM and Non-STEM educational content across 22,000+ textbooks in 7 languages (English, Hindi, Arabic, Bahasa, Tamil, Telugu, Kannada)
Question–Answer pairs: 6.5M+ high-quality Q&A pairs covering STEM and Non-STEM subjects in English, Arabic, Hindi, and other Indic languages
Video data: 100K+ hours of STEM Videos and 30K+ hours of UGC.
Audio data: 821K+ hours of Podcasts and Call Center data (Dual Channel)
Medical datasets: 30M+ files including clinical and diagnostic data like CT Scan, MRI, X-ray, Pathology, EHRs, USG Reports and Echo Reports.
This repository includes:
A small preview subset of the STEM English Q&A data
Flat, viewer-friendly schema for inspection
Parquet files suitable for benchmarking and evaluation
Purpose of this dataset:
Dataset preview and validation
Model evaluation and experimentation
Schema and format inspection before full-scale access
⚠️ Note: This repository contains sample data only; it is not the full dataset. Access to the complete dataset is available separately under appropriate licensing or partnership terms.
For further details, contact vipul.mishra@infobay.ai.
Steps to reproduce
1. Manual Data Creation
All question–answer pairs were manually crafted by domain experts to ensure accuracy, clarity, and relevance. No automated scraping or third-party datasets were used.
2. Topic Selection
Topics were selected based on commonly referenced concepts in [insert domain: e.g., medical education, AI fundamentals, etc.], ensuring comprehensive coverage.
3. Formatting & Structure
Data was structured in JSON/CSV format with each entry containing:
1. Question
2. Answer
3. Category/topic (if applicable)
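The entry layout described above can be sketched with the Python standard library. This is a minimal illustration, not the authors' export code: the field names `question`, `answer`, and `category` are assumptions (the source names the fields only in prose), and the record values are placeholders rather than real dataset content.

```python
import csv
import io
import json

# Hypothetical field names; the source only says each entry has a
# question, an answer, and an optional category/topic.
FIELDS = ["question", "answer", "category"]

# Placeholder records for illustration only, not taken from the dataset.
records = [
    {"question": "<question text>", "answer": "<answer text>", "category": "<topic>"},
    {"question": "<another question>", "answer": "<another answer>", "category": None},
]


def to_json(rows):
    """Serialize the records as a JSON array."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


def to_csv(rows):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        # A missing category is written as an empty cell.
        writer.writerow({key: row.get(key) or "" for key in FIELDS})
    return buffer.getvalue()


json_text = to_json(records)
csv_text = to_csv(records)
```

Keeping the optional category as `None` in JSON and as an empty cell in CSV is one possible convention; the source does not say how missing categories are represented.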
4. Quality Assurance
Each Q&A pair was reviewed for duplication, bias, and factual correctness before final inclusion.
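The duplication part of this review could be supported by a simple automated check. The sketch below is an assumption about how such a check might look, not the procedure the reviewers actually used; the `question` key and the sample records are hypothetical. Bias and factual correctness are not covered by it and would still need human review.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def find_duplicate_questions(records):
    """Return groups of record indices whose normalized questions match."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[normalize(record["question"])].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Placeholder records for illustration only.
sample = [
    {"question": "What is a metaphor?", "answer": "<answer A>"},
    {"question": "what is a  metaphor", "answer": "<answer B>"},
    {"question": "What is a simile?", "answer": "<answer C>"},
]
duplicates = find_duplicate_questions(sample)
```

Here the first two records differ only in case, spacing, and punctuation, so they fall into the same group. Paraphrased duplicates would not be caught by exact matching on normalized text.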
5. Tooling
Basic editing and formatting were done using spreadsheet tools and Python (Pandas) for final dataset export.
Categories
Arts and Humanities, English, Reasoning, Meta Dataset
Licence
Creative Commons Attribution 4.0 International
Version 2
Non-STEM_QA_English
Description
The full corpus is curated across multiple STEM/Non-STEM disciplines and structured for use in LLM training, evaluation, and instruction tuning (SFT/RLHF). This sample represents the structure and quality of the larger dataset.
Dataset composition (full corpus):
Text corpus: 1.6B+ words of curated STEM and Non-STEM educational content across 22,000+ textbooks in 7 languages (English, Hindi, Arabic, Bahasa, Tamil, Telugu, Kannada)
Question–Answer pairs: 6.5M+ high-quality Q&A pairs covering STEM and Non-STEM subjects in English, Arabic, Hindi, and other Indic languages
Video data: 100K+ hours of STEM Videos and 30K+ hours of UGC.
Audio data: 821K+ hours of Podcasts and Call Center data (Dual Channel)
Medical datasets: 30M+ files including clinical and diagnostic data like CT Scan, MRI, X-ray, Pathology, EHRs, USG Reports and Echo Reports.
This repository includes:
A small preview subset of the STEM English Q&A data
Flat, viewer-friendly schema for inspection
Parquet files suitable for benchmarking and evaluation
Purpose of this dataset:
Dataset preview and validation
Model evaluation and experimentation
Schema and format inspection before full-scale access
⚠️ Note: This repository contains sample data only; it is not the full dataset. Access to the complete dataset is available separately under appropriate licensing or partnership terms.
For further details, contact vipul.mishra@infobay.ai.
Steps to reproduce
1. Source Material Collection
All textual content was sourced directly from textbooks published and owned by InfoBay AI (formerly EduGorilla). As the organization holds full copyright and distribution rights over these materials, no external copyrighted content, web-scraped data, or third-party datasets were used in the creation of this dataset.
2. Content Selection
Chapters and sections were selected based on their relevance to the dataset’s educational objectives. Content was chosen to provide thorough coverage of essential concepts, explanations, examples, and practice material included within InfoBay AI’s textbooks.
3. Data Extraction & Structuring
Selected textbook material was manually extracted and organized into a structured, machine-readable format. Each entry includes:
Chapter/Section Title
Extracted Text
Subtopic / Category / Concept Tag (when applicable)
Source Metadata (e.g., textbook title, edition, and internal reference)
The dataset was formatted as JSON/CSV to ensure compatibility with downstream machine learning, NLP, and analytics workflows.
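One way to picture the entry structure listed above is as a pair of Python dataclasses serialized to JSON. This is a sketch under assumptions: the class and field names (`TextbookEntry`, `SourceMetadata`, `section_title`, `concept_tag`, and so on) are hypothetical, chosen to mirror the prose, and all values are placeholders.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SourceMetadata:
    """Where an entry came from (hypothetical field names)."""
    textbook_title: str
    edition: str
    internal_reference: str


@dataclass
class TextbookEntry:
    """One extracted entry (hypothetical field names)."""
    section_title: str
    extracted_text: str
    source: SourceMetadata
    concept_tag: Optional[str] = None  # present only when applicable


# Placeholder values for illustration only.
entry = TextbookEntry(
    section_title="<chapter or section title>",
    extracted_text="<extracted text>",
    source=SourceMetadata(
        textbook_title="<textbook title>",
        edition="<edition>",
        internal_reference="<internal reference>",
    ),
)

# asdict() converts nested dataclasses into nested dictionaries.
entry_json = json.dumps(asdict(entry), ensure_ascii=False)
```

Nesting the source metadata keeps it separate from the text fields in JSON; a CSV export would need those nested fields flattened into columns.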
4. Quality Assurance
All extracted content was reviewed internally by the InfoBay AI team to ensure:
accuracy and fidelity to the original textbook content
removal of layout or formatting inconsistencies
correction of typographical errors
consistent structure and labeling across all entries
Because the textbooks are owned and published by InfoBay AI (formerly EduGorilla), all included text is fully compliant with copyright and distribution guidelines.
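The "consistent structure and labeling" item in the review list above lends itself to an automated check. The following is a minimal sketch, assuming hypothetical key names for the entry fields; it illustrates the idea and is not the InfoBay AI team's internal tooling. Fidelity to the original textbooks and typo correction would still require human review.

```python
# Hypothetical key names; the source describes the fields only in prose.
REQUIRED_KEYS = {"section_title", "extracted_text", "source"}
OPTIONAL_KEYS = {"concept_tag"}


def check_entry(entry):
    """Return a list of structural problems found in one entry."""
    problems = []
    keys = set(entry)
    missing = REQUIRED_KEYS - keys
    unexpected = keys - REQUIRED_KEYS - OPTIONAL_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if unexpected:
        problems.append(f"unexpected keys: {sorted(unexpected)}")
    text = entry.get("extracted_text")
    if isinstance(text, str) and text != text.strip():
        problems.append("extracted_text has leading or trailing whitespace")
    return problems


def check_dataset(entries):
    """Map the index of each problematic entry to its list of problems."""
    report = {}
    for index, entry in enumerate(entries):
        problems = check_entry(entry)
        if problems:
            report[index] = problems
    return report
```

An empty report from `check_dataset` means every entry has the required keys, no unexpected ones, and no stray whitespace around the text.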
5. Tooling
Initial extraction and formatting were performed using standard document processing tools. Final cleaning, structuring, and dataset export were conducted using Python (Pandas) to maintain reproducibility and standardized formatting.
Categories
Arts and Humanities, English, Reasoning, Meta Dataset
Licence
Creative Commons Attribution 4.0 International