Kurdish News Summarization Dataset

Published: 12 June 2023| Version 1 | DOI: 10.17632/jbvrd44m5g.1
Contributor:
Soran Badawi

Description

The Kurdish News Summarization Dataset (KNSD) is a newly constructed and comprehensive dataset specifically curated for the task of news summarization in the Kurdish language. The dataset includes a collection of 130,000 news articles and their corresponding headlines sourced from popular Kurdish news websites such as Ktv, NRT, RojNews, K24, KNN, Kurdsat, and more. The KNSD has been meticulously compiled to encompass a diverse range of topics, covering various domains such as politics, economics, culture, sports, and regional affairs. This ensures that the dataset provides a comprehensive representation of the news landscape in the Kurdish language. Key Features Size and Variety: The dataset comprises a substantial collection of 130,000 news articles, offering a wide range of textual content for training and evaluating news summarization models in the Kurdish language. The articles are sourced from reputable and popular Kurdish news websites, ensuring credibility and authenticity. Article-Headline Pairs: Each news article in the KNSD is associated with its corresponding headline, allowing researchers and developers to explore the task of generating concise and informative summaries for news content specifically in Kurdish. Data Quality: Great attention has been given to ensuring the quality and reliability of the dataset. The articles and headlines have undergone careful curation and preprocessing to remove duplicates, ensure linguistic consistency, and filter out irrelevant or spam-like content. This guarantees that the dataset is of high quality and suitable for training robust and accurate news summarization models. Language and Cultural Context: The KNSD is specifically tailored for the Kurdish language, taking into account the unique linguistic characteristics and cultural context of the Kurdish-speaking population. This allows researchers to develop models that are attuned to the nuances and specificities of Kurdish news content. Applications: The KNSD can be utilized in various applications and research areas, including but not limited to: News Summarization: The dataset provides a valuable resource for developing and evaluating news summarization models specifically for the Kurdish language. Researchers can explore different techniques, such as extractive or abstractive summarization, to generate concise and coherent summaries of Kurdish news articles. Machine Learning and Natural Language Processing (NLP): The KNSD can be used to train and evaluate machine learning models, deep learning architectures, and NLP algorithms for tasks related to news summarization, text generation, and semantic understanding in the Kurdish language. The Kurdish News Summarization Dataset (KNSD) offers an extensive and diverse collection of news articles and headlines in the Kurdish language, providing researchers with a valuable resource for advancing the field of news summarization specifically for Kurdish-speaking audiences.

Files

Categories

Machine Learning Algorithm, Text Extraction, Kurd, Deep Learning

Licence