Ground truth dataset: objectionable web contents
Published: 8 August 2022 | Version 2 | DOI: 10.17632/f239556fkr.2
Contributor:
Hamza Altarturi
Description
This ground truth dataset contains 8,000 labelled websites: 4,000 objectionable and 4,000 unobjectionable. Together, these websites comprise more than 2 million web pages. The dataset consists of two files. The "metadata.json" file gives an overview of the websites and their features. The "webpages_detail.json" file gives detailed information on the web pages (internal URLs) of each collected website and their features.
Steps to reproduce
The dataset is in JSON format, so it can be read by almost any programming language and by a wide range of tools for viewing, modifying, and using the data.
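As an illustration of reading the files, the following is a minimal Python sketch using only the standard library. The record does not document the internal schema of either file, so the sketch assumes nothing about field names and only reports the top-level structure; the helper names `load_json` and `describe` are hypothetical and not part of the dataset.

```python
import json
from pathlib import Path


def load_json(path):
    """Parse a JSON file and return the resulting Python object."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def describe(data, sample=3):
    """Summarise the top-level structure without assuming any schema."""
    if isinstance(data, dict):
        # A JSON object: report how many keys it has and show a few of them.
        return {"type": "object", "size": len(data), "sample_keys": list(data)[:sample]}
    if isinstance(data, list):
        # A JSON array: report its length and, if the first item is an
        # object, a few of its (sorted) keys.
        first = data[0] if data else None
        keys = sorted(first)[:sample] if isinstance(first, dict) else None
        return {"type": "array", "size": len(data), "first_record_keys": keys}
    # Any other top-level value (number, string, boolean, null).
    return {"type": type(data).__name__, "size": None}


if __name__ == "__main__":
    # File names are taken from the description; they are assumed to sit
    # in the current working directory.
    for name in ("metadata.json", "webpages_detail.json"):
        if Path(name).exists():
            print(name, describe(load_json(name)))
```

Note that `json.load` reads the whole file into memory. Since the dataset covers more than 2 million web pages, "webpages_detail.json" may be large, in which case a streaming JSON parser could be a better fit.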
Institutions
- Universiti Malaya
Categories
Machine Learning, Web Mining, Intelligent Web, Classification System