ViClickbait-2025: A Comprehensive Dataset for Vietnamese Clickbait Detection

Published: 15 July 2025 | Version 1 | DOI: 10.17632/3wc46bfcjc.1
Contributors:
Dai Nguyen Phuoc, Y Nguyen Minh, Bay Vo, Thien Tran Khai

Description

ViClickbait-2025 is a Vietnamese-language dataset of 3,414 news headlines annotated for clickbait detection. The data were collected by automated web scraping from eight major Vietnamese news platforms. Each record has nine fields, including title, lead paragraph, category, publication time, and a binary label. The dataset supports research in natural language processing and machine learning, particularly the detection of misleading or exaggerated content in Vietnamese media.

Steps to reproduce

The dataset can be reproduced by scraping Vietnamese news headlines from the listed sources with Python. We used the Requests, BeautifulSoup, and Selenium libraries, ran the scraper on a 6-hour schedule, and complied with each site’s robots.txt. Each collected record includes the title, lead paragraph, metadata, and label annotation. After collection, we applied a two-step preprocessing pipeline: HTML noise removal followed by duplicate filtering.
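The robots.txt check and the two preprocessing steps could look roughly like the sketch below. It is not the authors' code. It uses only the Python standard library in place of BeautifulSoup so that it runs without extra dependencies. The field names `title` and `lead`, the helper names, and the choice to deduplicate on the case-folded title are all assumptions made for illustration.

```python
import html
import re
import unicodedata
from urllib.robotparser import RobotFileParser

TAG_RE = re.compile(r"<[^>]+>")   # any HTML tag
WS_RE = re.compile(r"\s+")        # runs of whitespace, including non-breaking spaces


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def strip_html_noise(text: str) -> str:
    """Step 1 (assumed form): drop tags, decode entities, normalize whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    # NFC keeps Vietnamese diacritics in one consistent composed form.
    text = unicodedata.normalize("NFC", text)
    return WS_RE.sub(" ", text).strip()


def drop_duplicates(records):
    """Step 2 (assumed form): keep the first record for each distinct title."""
    seen = set()
    unique = []
    for record in records:
        key = record["title"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def preprocess(records):
    """Clean the hypothetical 'title' and 'lead' fields, then deduplicate."""
    cleaned = [
        {**r,
         "title": strip_html_noise(r["title"]),
         "lead": strip_html_noise(r["lead"])}
        for r in records
    ]
    return drop_duplicates(cleaned)
```

Cleaning before deduplication matters here: two copies of the same headline that differ only in leftover markup or spacing collapse to the same key only after the HTML noise is removed.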

Institutions

Ho Chi Minh City University of Technology, Ho Chi Minh City University of Food Industry

Categories

Natural Language Processing, Machine Learning, Vietnamese Language

Licence