Bangla-REX: A Distinct Dataset for Relation Extraction
Description
The dataset is built on the premise that structured knowledge bases and annotated corpora are essential for effective relation extraction. To generate it, we compiled a Bangla Knowledge Base (KB) of 63,256 entries, which serves as the foundation for automatically labeling text with relation tags. The corpus comprises 90,441 text entries, each processed with Named Entity Recognition (NER) and Part-of-Speech (POS) tagging so that it can be used directly in relation extraction tasks. We also developed mnemonics for 440 distinct locations in Bangla to improve location-based relation extraction. These mnemonics are particularly useful for distant supervision-based relation extraction, where they help establish clear associations between locations and their corresponding entities or contexts.
Steps to reproduce
We created a knowledge base from Bangla Wikidata and built our corpus by compiling text from Bangla Wikipedia and other online sources. We then applied Named Entity Recognition (NER) and Part-of-Speech (POS) tagging to the corpus to prepare it for relation extraction. Finally, we manually curated location mnemonics in Bangla to enhance the dataset.
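As a rough illustration of how a knowledge base can drive automatic relation labeling under distant supervision, the sketch below tags a sentence with a relation whenever both entities of a KB triple appear in it. It is not the authors' pipeline: the `Triple` type, the `label_sentences` function, the `capital_of` relation name, and the sample sentences are all hypothetical, and plain substring matching stands in for the NER-based entity matching a real system would use.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One hypothetical KB entry: (head entity, relation, tail entity)."""
    head: str
    relation: str
    tail: str


def label_sentences(sentences, triples):
    """Distant supervision heuristic: if a sentence mentions both the head
    and the tail of a KB triple, label it with that triple's relation."""
    labeled = []
    for sentence in sentences:
        for t in triples:
            # Substring matching is a simplification; a real pipeline
            # would match against NER-tagged entity spans instead.
            if t.head in sentence and t.tail in sentence:
                labeled.append((sentence, t.head, t.relation, t.tail))
    return labeled


# Illustrative data only (not taken from the dataset).
kb = [Triple("ঢাকা", "capital_of", "বাংলাদেশ")]
corpus = [
    "ঢাকা বাংলাদেশের রাজধানী।",   # mentions both entities
    "আজ আবহাওয়া ভালো।",          # mentions neither entity
]

labeled = label_sentences(corpus, kb)
for sentence, head, relation, tail in labeled:
    print(f"{head} --{relation}--> {tail} | {sentence}")
```

Because this heuristic labels every co-occurrence, it can produce noisy tags when two entities appear together without expressing the relation, which is the usual trade-off of distant supervision.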