Dhivehi News Categories Dataset

Published: 10 December 2024 | Version 1 | DOI: 10.17632/m397m9n99v.1
Contributor:
RASHMA MOHAMED

Description

The Dhivehi News Categories Dataset addresses the lack of publicly available text resources for Dhivehi, a low-resource language spoken in the Maldives. It is intended for training and evaluating machine learning (ML) algorithms such as k-Nearest Neighbors, Decision Trees, XGBoost, SVM, Naïve Bayes, Random Forest, and Artificial Neural Networks on Dhivehi text, for tasks such as text classification and language modeling. The dataset comprises 6,000 curated news articles from reputable sources (e.g., Sunmv, Haveeru, Raajje.mv) and is balanced across four categories (Business, Sports, Entertainment, and World News) with 1,500 articles each. Articles were collected with Python-based web scraping tools, cleaned to remove duplicates and irrelevant content, and categorized manually. The dataset supports NLP tasks such as text classification, sentiment analysis, and topic modeling. Because World News articles can overlap thematically with the other categories, that category can optionally be excluded to obtain clearer class boundaries. All text is stored in UTF-8 for compatibility. Beyond NLP, the dataset can contribute to linguistic, cultural, and media studies. As a pioneering Dhivehi resource, it enables comparative cross-linguistic studies and helps ensure that underrepresented languages like Dhivehi are included in AI and multilingual NLP research.

Steps to reproduce

Data collection: Articles are gathered from reputable Dhivehi news portals such as SunMV, Haveeru, RaajjeMV, and Vaguthu using Python-based web scraping tools like BeautifulSoup and Selenium. The headline, body text, publication date, and available metadata are extracted for each article.

Data cleaning: Duplicate articles and irrelevant content such as advertisements are removed. Formatting is standardized for consistency, and all text is converted to UTF-8 encoding.

Categorization: Each article is manually assigned to one of four categories: Business, Sports, Entertainment, or World News. Overlaps, such as World News articles that also fit another category, are reviewed and adjusted. The dataset is balanced at 1,500 articles per category.

Refinement: The World News category can optionally be excluded to reduce thematic overlap and sharpen classification boundaries. The remaining three categories stay balanced at 1,500 articles each.

Storage: Each article is saved as an individual UTF-8 encoded text file, and the files are organized into folders named after their respective categories.

Verification and validation: The dataset is reviewed for accuracy and completeness so that it is well structured and ready for machine learning workflows.

Use: The dataset can be loaded into machine learning models for tasks such as text classification, sentiment analysis, or topic modeling. Researchers can evaluate model performance under different configurations, for example with or without specific categories.
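
The storage layout and cleaning steps described above can be illustrated with a minimal Python sketch. It is not the contributor's original code. It assumes the archive extracts to one folder per category containing `.txt` files, and that the folder names match the category names exactly; the names in `CATEGORIES`, the path `dhivehi_news`, and the helper functions `load_articles` and `drop_exact_duplicates` are hypothetical and may need adjusting to the published files. Duplicate removal here is a simple exact-match check on whitespace-normalized text, which may be weaker than the cleaning actually applied.

```python
import hashlib
from collections import Counter
from pathlib import Path

# Assumed folder names; the names in the published archive may differ.
CATEGORIES = ("Business", "Sports", "Entertainment", "World News")


def load_articles(root, exclude=()):
    """Read <root>/<category>/*.txt and return parallel lists (texts, labels).

    Categories listed in `exclude` (for example "World News") are skipped.
    Category folders that do not exist are silently ignored.
    """
    texts, labels = [], []
    for category in CATEGORIES:
        if category in exclude:
            continue
        folder = Path(root) / category
        if not folder.is_dir():
            continue
        # Sort for a reproducible article order across runs.
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(category)
    return texts, labels


def drop_exact_duplicates(texts, labels):
    """Keep the first occurrence of each article.

    Two articles count as duplicates when their text is identical after
    collapsing runs of whitespace; near-duplicates are not detected.
    """
    seen = set()
    kept_texts, kept_labels = [], []
    for text, label in zip(texts, labels):
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept_texts.append(text)
        kept_labels.append(label)
    return kept_texts, kept_labels


if __name__ == "__main__":
    # "dhivehi_news" is a placeholder path for the extracted dataset.
    texts, labels = load_articles("dhivehi_news", exclude=("World News",))
    texts, labels = drop_exact_duplicates(texts, labels)
    # Per-category article counts, useful for checking class balance.
    print(Counter(labels))
```

The resulting `texts` and `labels` lists can be passed to any text classification pipeline. Comparing the printed counts with and without the `exclude` argument is a quick way to confirm that each configuration is balanced before training.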

Categories

Linguistics, Natural Language Processing, Classification System, Text Processing, Digital Media

Licence