XCommonsense-BN: A Bangla Multiple-Choice Commonsense Reasoning Dataset
Description
XCommonsense-BN is a curated collection of Bangla multiple-choice questions for commonsense reasoning research. The data is tabular: each entry has a unique identifier, a reasoning category, the question in Bangla, four answer options labeled A through D, the correct option, and a short Bangla explanation justifying the correct answer. The dataset contains over 1,000 entries across five categories (causal, temporal, social, physical, and intentional reasoning), with at least 150 questions in each category so that coverage stays balanced. It is distributed in Excel format with UTF-8 encoding to support the Bangla script. Sample entries show typical questions, answer options, correct labels, and explanations, giving a representative view of the dataset's structure and content. The dataset supports developing, evaluating, and benchmarking machine learning models on Bangla commonsense reasoning tasks and contributes to NLP research for low-resource languages.

Value of the Data:
1. Enables research in Bangla NLP, particularly in commonsense reasoning.
2. Can be used to train, evaluate, and benchmark machine learning and AI models for Bangla question-answering systems.
3. Promotes data-driven AI research in low-resource languages.
4. Supports cross-lingual and multilingual model development by providing curated Bangla data.
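As a minimal sketch of how the entry structure described above might be used for benchmarking, the Python snippet below models one row and computes per-category accuracy. The class name `Entry`, its field names, and the function `accuracy_by_category` are illustrative assumptions, not part of the dataset or any official tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One dataset row; field names are assumed, not the file's real headers."""
    entry_id: str
    category: str        # causal, temporal, social, physical, or intentional
    question: str        # question text in Bangla
    options: dict        # maps "A".."D" to the option text
    correct_option: str  # one of "A", "B", "C", "D"
    explanation: str     # short justification in Bangla


def accuracy_by_category(entries, predictions):
    """Return {category: accuracy}.

    `predictions` maps entry_id -> predicted letter; an entry with no
    prediction is counted as wrong.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for entry in entries:
        totals[entry.category] += 1
        if predictions.get(entry.entry_id) == entry.correct_option:
            hits[entry.category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}
```

Reporting accuracy per category rather than as a single number makes it easier to see whether a model is weaker on, say, temporal than on social reasoning.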
Steps to reproduce
The XCommonsense-BN dataset was generated through a reproducible workflow that combines AI-based synthetic data generation with manual validation. Researchers who want to reproduce or extend the dataset can follow these steps:

1. Define Commonsense Categories: Identify the reasoning categories to be covered: causal, temporal, social, physical, and intentional reasoning.
2. Design Question Templates and Prompts: Create structured prompts for LLMs to generate multiple-choice questions in Bangla. Each prompt should require four answer options, one correct answer, and a short explanatory note in Bangla.
3. Synthetic Data Generation using LLMs: Use advanced LLMs such as ChatGPT, Gemini, or Claude to generate questions directly in Bangla. Run the prompts iteratively to produce a large, diverse set of questions for each category.
4. Automatic Annotation: Each generated question comes with its four answer options and the correct answer. The explanatory note is generated together with the question, so no separate labeling pass is needed.
5. Manual Review and Refinement: Review all generated entries for clarity, logical consistency, and cultural appropriateness. Adjust phrasing or explanations where necessary so that the Bangla questions are accurate and easy to understand.
6. Validation and Quality Checks: Confirm that each question aligns with commonsense reasoning principles. Ensure balanced representation across all categories and eliminate duplicates and errors.
7. Finalize Dataset: Compile the validated questions into a CSV/Excel file (UTF-8 encoded). Include all columns: EntryID, Category, Question (Bangla), Option A through Option D, Correct Option, and Explanation (Bangla).

By following these steps, other researchers can reproduce a comparable Bangla multiple-choice commonsense reasoning dataset, whether to extend this one, create domain-specific subsets, or benchmark AI models on Bangla NLP tasks.
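The prompt-design, quality-check, and finalization steps above could be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the authors' actual pipeline: the prompt wording, the exact header strings in `COLUMNS`, the lowercase category labels, and all function names are hypothetical, and the LLM call itself is left out.

```python
import csv
import io
from collections import Counter

# Assumed category labels; the real file may capitalize them differently.
CATEGORIES = ("causal", "temporal", "social", "physical", "intentional")
# Hypothetical header strings, modelled on the column list in the final step.
COLUMNS = ["EntryID", "Category", "Question", "Option A", "Option B",
           "Option C", "Option D", "Correct Option", "Explanation"]


def build_prompt(category, n_questions):
    """Return an illustrative generation prompt for one category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return (
        f"Write {n_questions} multiple-choice questions in Bangla that test "
        f"{category} commonsense reasoning. For each question give four "
        "options labelled A to D, the letter of the single correct option, "
        "and a short explanation in Bangla."
    )


def row_problems(row):
    """List structural problems in one row; an empty list means it passes."""
    problems = [f"missing {c}" for c in COLUMNS
                if not str(row.get(c, "")).strip()]
    if row.get("Category") not in CATEGORIES:
        problems.append("unknown category")
    if row.get("Correct Option") not in ("A", "B", "C", "D"):
        problems.append("correct option is not A-D")
    return problems


def validate(rows):
    """Drop malformed rows and repeated questions; count rows per category."""
    seen, kept = set(), []
    for row in rows:
        # Collapse whitespace so trivially different copies count as duplicates.
        key = " ".join(str(row.get("Question", "")).split())
        if row_problems(row) or key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept, Counter(r["Category"] for r in kept)


def to_csv_text(rows):
    """Serialize validated rows to CSV text using the assumed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

When saving the result to disk, opening the file with `encoding="utf-8-sig"` and `newline=""` is a common way to help Excel recognize the UTF-8 Bangla text. These automated checks cover only structure, duplicates, and category balance; judging clarity, logical consistency, and cultural appropriateness still requires the manual review described above.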
Institutions
- American International University Bangladesh