ViClickbait-2025: A Comprehensive Dataset for Vietnamese Clickbait Detection
Description
ViClickbait-2025 is a Vietnamese-language dataset of 3,414 news headlines annotated for clickbait detection. The data were collected by automated web scraping from eight major Vietnamese news platforms. Each record has nine fields, including the title, lead paragraph, category, publish time, and a binary clickbait label. The dataset supports research in natural language processing and machine learning, especially the detection of misleading or exaggerated content in Vietnamese media.
Steps to reproduce
The dataset can be reproduced by scraping Vietnamese news headlines from the listed sources with Python. We used the Requests, BeautifulSoup, and Selenium libraries, ran the scraper on a 6-hour schedule, and complied with each site’s robots.txt. The collected data included the title, lead paragraph, metadata, and label annotations. After collection, we applied a two-step preprocessing pipeline: HTML noise removal followed by duplicate filtering.
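The robots.txt check and the two preprocessing steps could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: it uses only the Python standard library (regular expressions instead of BeautifulSoup) so that it runs without extra packages, the function names and the record keys `title` and `lead` are hypothetical, and treating two records as duplicates when their cleaned titles match case-insensitively is an assumed rule that the description does not specify.

```python
import html
import re
from urllib.robotparser import RobotFileParser

TAG_RE = re.compile(r"<[^>]+>")   # any HTML tag
WS_RE = re.compile(r"\s+")        # runs of whitespace


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def strip_html_noise(text: str) -> str:
    """Step 1: drop tags, decode HTML entities, collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return WS_RE.sub(" ", text).strip()


def drop_duplicates(records: list[dict]) -> list[dict]:
    """Step 2: keep the first record for each title (assumed duplicate key)."""
    seen = set()
    unique = []
    for rec in records:
        key = rec["title"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def preprocess(records: list[dict]) -> list[dict]:
    """Apply HTML noise removal to the text fields, then filter duplicates."""
    cleaned = [
        {**rec,
         "title": strip_html_noise(rec["title"]),
         "lead": strip_html_noise(rec["lead"])}
        for rec in records
    ]
    return drop_duplicates(cleaned)
```

In a real run, `is_allowed` would be called with the robots.txt text downloaded from each source before any article page is requested, and `preprocess` would be applied to the accumulated records after each scheduled scrape.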