STEM_TextBook_Arabic
Description
The full corpus is curated across multiple STEM and non-STEM disciplines and structured for use in LLM training, evaluation, and instruction tuning (SFT/RLHF). This sample represents the structure and quality of the larger dataset.

Dataset composition (full corpus):
- Text corpus: 1.6B+ words of curated STEM and non-STEM educational content across 22,000+ textbooks in 7 languages (English, Hindi, Arabic, Bahasa, Tamil, Telugu, Kannada)
- Question-answer pairs: 6.5M+ high-quality Q&A pairs covering STEM and non-STEM subjects in English, Arabic, Hindi, and Indic languages
- Video data: 100K+ hours of STEM videos and 30K+ hours of user-generated content (UGC)
- Audio data: 821K+ hours of podcasts and call center data (dual channel)
- Medical datasets: 30M+ files including clinical and diagnostic data such as CT scans, MRI, X-rays, pathology, EHRs, USG reports, and echo reports

This repository includes:
- A small preview subset of the STEM Arabic textbook data
- Flat, viewer-friendly schema for inspection
- Parquet files suitable for benchmarking and evaluation

Purpose of this dataset:
- Dataset preview and validation
- Model evaluation and experimentation
- Schema and format inspection before full-scale access

Note: This repository contains sample data only, not the full dataset. Access to the complete dataset is available separately under appropriate licensing or partnership terms. For full details, please contact vipul.mishra@infobay.ai.
Steps to reproduce
1. Source Material Collection
All textual content was sourced directly from textbooks published and owned by InfoBay AI (formerly EduGorilla). As the organization holds full copyright and distribution rights over these materials, no external copyrighted content, web-scraped data, or third-party datasets were used in the creation of this dataset.

2. Content Selection
Chapters and sections were selected based on their relevance to the dataset’s educational objectives. Content was chosen to provide thorough coverage of essential concepts, explanations, examples, and practice material included within InfoBay AI’s textbooks.

3. Data Extraction & Structuring
Selected textbook material was manually extracted and organized into a structured, machine-readable format. Each entry includes:
- Chapter/Section Title
- Extracted Text
- Subtopic / Category / Concept Tag (when applicable)
- Source Metadata (e.g., textbook title, edition, and internal reference)
The dataset was formatted as JSON/CSV to ensure compatibility with downstream machine learning, NLP, and analytics workflows.

4. Quality Assurance
All extracted content was reviewed internally by the InfoBay AI team to ensure:
- accuracy and fidelity to the original textbook content
- removal of layout or formatting inconsistencies
- correction of typographical errors
- consistent structure and labeling across all entries
Because the textbooks are owned and published by InfoBay AI (formerly EduGorilla), all included text is fully compliant with copyright and distribution guidelines.

5. Tooling
Initial extraction and formatting were performed using standard document processing tools. Final cleaning, structuring, and dataset export were conducted using Python (Pandas) to maintain reproducibility and standardized formatting.
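
As a rough illustration of steps 3 and 4, the sketch below shows one way an entry with the listed fields could be represented, checked for structural consistency, and exported as JSON Lines and CSV. It is not InfoBay AI's actual pipeline: the field names, the `Entry` class, the helper functions, and the sample values are all hypothetical, and it uses only the Python standard library rather than the Pandas tooling mentioned in step 5.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class Entry:
    """One extracted passage. Field names are assumptions, not the real schema."""
    chapter_title: str
    text: str
    concept_tag: str = ""      # "when applicable", so allowed to be empty
    source_title: str = ""
    source_edition: str = ""
    source_ref: str = ""


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace and line breaks left over from page layout."""
    return " ".join(raw.split())


def validate(entry: Entry) -> list[str]:
    """Return a list of structural problems; an empty list means the entry passes."""
    problems = []
    if not entry.chapter_title.strip():
        problems.append("missing chapter_title")
    if not entry.text.strip():
        problems.append("missing text")
    return problems


def to_jsonl(entries: list[Entry]) -> str:
    """Serialize entries as JSON Lines, keeping Arabic characters unescaped."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in entries)


def to_csv(entries: list[Entry]) -> str:
    """Serialize entries as CSV with one column per Entry field."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(Entry)])
    writer.writeheader()
    for e in entries:
        writer.writerow(asdict(e))
    return buf.getvalue()


# Made-up sample values, for illustration only.
sample = Entry(
    chapter_title="الفصل الأول",
    text=clean_text("  السرعة هي   معدل تغير\nالموضع.  "),
    concept_tag="physics",
    source_title="Hypothetical Physics Textbook",
    source_edition="1",
    source_ref="ch01-sec02",
)

issues = validate(sample)
jsonl_output = to_jsonl([sample])
csv_output = to_csv([sample])
```

Keeping the validation separate from the export means entries that fail a check can be sent back for manual review before anything is written out, which matches the review-then-export order described above.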