SKNAD: Sorani Kurdish News Article Dataset
Description
SKNAD (Sorani Kurdish News Article Dataset) is a large-scale collection of 575,543 Sorani Kurdish news articles compiled from fourteen online news agencies: Ava, Bas News, Channel 8, GKSat, Haremnews, Kurdistan TV, Kurdistan24, NRT, Payam, Rachlaken, Rudaw, Shanpress, Sharpress, and Xelk. Each article record contains seven fields: Title, Subtitle, Content, Category, Published_Date, URL, and Source. Together these provide both the article text and the descriptive metadata needed for further analysis.

The dataset covers a wide range of topics that reflect the diversity of Kurdish news reporting. After preprocessing and category standardization, all articles were organized into 13 predefined news categories: Politics, World & Regional, Sports, Business & Economy, Society & Lifestyle, Health, Culture & Arts, Opinion & Editorial, Environment, Education, Technology & Science, Religion, and Media & Multimedia. Politics is the largest category with 219,930 articles (roughly 38% of the corpus), followed by World & Regional (87,522) and Sports (80,112).

The dataset was constructed through a systematic preparation process comprising data cleaning, metadata normalization, and automated categorization, with the aim of ensuring consistency and reliability. The result is a structured corpus suitable for applications such as natural language processing, text classification, topic analysis, information retrieval, and linguistic studies of Kurdish-language media content.
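The record structure described above can be explored with a few lines of Python. The following is a minimal sketch only: the description does not state the distribution file format, so a CSV export with a header row matching the seven field names is an assumption, as is the date format. The helper names `load_articles` and `count_by` are hypothetical, and the sample rows are placeholders written for illustration, not records from the dataset.

```python
import csv
import io
from collections import Counter

# Field names as listed in the dataset description.
FIELDS = ["Title", "Subtitle", "Content", "Category",
          "Published_Date", "URL", "Source"]

# Placeholder rows for illustration only (NOT real SKNAD records).
# The CSV layout and the date format are assumptions.
SAMPLE_CSV = """Title,Subtitle,Content,Category,Published_Date,URL,Source
t1,s1,c1,Politics,2020-01-01,https://example.com/1,Rudaw
t2,s2,c2,Sports,2020-01-02,https://example.com/2,NRT
t3,s3,c3,Politics,2020-01-03,https://example.com/3,Rudaw
"""


def load_articles(handle):
    """Read article records from a CSV file object into a list of dicts."""
    reader = csv.DictReader(handle)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected fields: {sorted(missing)}")
    return list(reader)


def count_by(articles, field):
    """Count how many articles share each value of the given field."""
    return Counter(row[field] for row in articles)


articles = load_articles(io.StringIO(SAMPLE_CSV))
category_counts = count_by(articles, "Category")
source_counts = count_by(articles, "Source")

# Print categories from most to least frequent.
for category, n in category_counts.most_common():
    print(category, n)
```

To run the same counts on an actual file, the in-memory sample could be replaced with something like `open(path, encoding="utf-8", newline="")`, where `path` points to a local copy of the data; UTF-8 is assumed here because the text is written in Sorani Kurdish script.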