ViClickbait-2025: A Comprehensive Dataset for Vietnamese Clickbait Detection

Name: ViClickbait-2025: A Comprehensive Dataset for Vietnamese Clickbait Detection
Creator: Dai Nguyen Phuoc
Published: 2025-07-15T18:01:35.739Z
Keywords: Natural Language Processing, Machine Learning, Vietnamese Language

Nguyen Phuoc, Dai; Nguyen Minh, Y; Vo, Bay; Tran Khai, Thien

doi:10.17632/3wc46bfcjc.1

Published: 15 July 2025 | Version 1 | DOI: 10.17632/3wc46bfcjc.1

Contributors:

Dai Nguyen Phuoc, Y Nguyen Minh, Bay Vo, Thien Tran Khai

Description

ViClickbait-2025 is a Vietnamese-language dataset of 3,414 news headlines annotated for clickbait detection. The data were collected via automated web scraping from eight major Vietnamese news platforms. Each record has nine fields, including title, lead paragraph, category, publish time, and a binary clickbait label. The dataset supports research in natural language processing and machine learning, especially the detection of misleading or exaggerated content in Vietnamese media.

Steps to reproduce

The dataset can be reproduced by automated web scraping of Vietnamese news headlines from the listed sources using Python. We used the Requests, BeautifulSoup, and Selenium libraries, ran the scraper on a 6-hour schedule, and complied with each site’s robots.txt. The collected records include the title, lead paragraph, metadata, and label annotations. After collection, we applied a two-step preprocessing pipeline: HTML noise removal followed by duplicate filtering.
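
The robots.txt check and the two preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual code: it uses only the Python standard library (regular expressions stand in for BeautifulSoup), the field names "title" and "lead" are hypothetical, and treating titles as duplicates when they match case-insensitively is an assumed criterion.

```python
import html
import re
import unicodedata
from urllib.robotparser import RobotFileParser

TAG_RE = re.compile(r"<[^>]+>")   # any HTML tag
WS_RE = re.compile(r"\s+")        # runs of whitespace, including non-breaking spaces


def allowed_by_robots(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt content permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def strip_html_noise(text: str) -> str:
    """Step 1: remove tags, decode HTML entities, normalize Unicode and whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    # NFC keeps Vietnamese diacritics in a single canonical form.
    text = unicodedata.normalize("NFC", text)
    return WS_RE.sub(" ", text).strip()


def drop_duplicates(records: list[dict]) -> list[dict]:
    """Step 2: keep the first record for each title (assumed key: case-folded title)."""
    seen = set()
    unique = []
    for record in records:
        key = record["title"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def preprocess(records: list[dict]) -> list[dict]:
    """Apply HTML noise removal to the text fields, then filter duplicates."""
    cleaned = [
        {
            **record,
            "title": strip_html_noise(record["title"]),
            "lead": strip_html_noise(record["lead"]),
        }
        for record in records
    ]
    return drop_duplicates(cleaned)
```

In this sketch, a scraper would call allowed_by_robots on each candidate URL before fetching it and pass the collected records through preprocess once per scheduled run. Cleaning runs before deduplication so that two copies of a headline that differ only in markup are recognized as the same item.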

Institutions

Ho Chi Minh City University of Technology, Ho Chi Minh City University of Food Industry
