Ground truth dataset: objectionable web contents
Published: 8 August 2022 | Version 2 | DOI: 10.17632/f239556fkr.2
Contributor:
Hamza Altarturi
Description
This is a ground truth dataset of 8,000 labelled websites: 4,000 objectionable and 4,000 unobjectionable. Together, these websites comprise more than 2 million web pages. The dataset contains two files. The "metadata.json" file gives an overview of the websites and their features. The "webpages_detail.json" file gives detailed information on the web pages (internal URLs) of each collected website and their features.
Steps to reproduce
The dataset is in JSON format, so it can be read, modified, and used with almost any programming language and with many common tools.
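As a rough illustration of reading the files, the Python sketch below parses a JSON file with the standard library and reports its top-level shape. The file name "metadata.json" comes from the description; its location on disk and its internal structure are not documented here, so the helpers `load_json` and `describe` (hypothetical names) only inspect the top level and assume no particular field names.

```python
import json
from pathlib import Path


def load_json(path):
    """Parse a JSON file and return the resulting Python object."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def describe(data):
    """Return a short description of the top-level JSON structure."""
    if isinstance(data, dict):
        return f"object with {len(data)} keys"
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"


if __name__ == "__main__":
    # Assumption: the file was downloaded into the current directory.
    path = Path("metadata.json")
    if path.exists():
        print(describe(load_json(path)))
    else:
        print(f"{path} not found; download the dataset first")
```

Because "webpages_detail.json" covers more than 2 million web pages, loading it in one call like this may need a lot of memory; a streaming JSON parser may be a better fit for that file.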
Institutions
Universiti Malaya
Categories
Machine Learning, Web Mining, Intelligent Web, Classification System