Ground truth dataset: objectionable web contents

Published: 8 August 2022 | Version 2 | DOI: 10.17632/f239556fkr.2
Contributor:
Hamza Altarturi

Description

This ground truth dataset contains 8,000 labelled websites: 4,000 objectionable and 4,000 unobjectionable. Together, these websites comprise more than 2 million web pages. The dataset consists of two files. The "metadata.json" file gives an overview of the websites and their features. The "webpages_detail.json" file gives detailed information on the web pages (internal URLs) of each collected website and on their features.

Files

Steps to reproduce

The dataset is in JSON format, so it can be parsed, viewed, and modified with standard JSON tools or with the JSON libraries available in almost all programming languages.
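
As a minimal sketch of loading the files, the Python snippet below parses one of them with the standard library and reports its top-level structure. The field layout inside the files is not documented here, so the code deliberately assumes no key names; the helper names `load_json` and `describe` are hypothetical, and the file path is assumed to point at a local copy of the download.

```python
import json
from pathlib import Path


def load_json(path):
    """Parse a JSON file and return the resulting Python object."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def describe(data):
    """Summarize the top-level structure of parsed JSON without assuming a schema."""
    if isinstance(data, dict):
        return {"type": "object", "size": len(data), "sample_keys": list(data)[:5]}
    if isinstance(data, list):
        first = data[0] if data else None
        keys = sorted(first) if isinstance(first, dict) else []
        return {"type": "array", "size": len(data), "first_record_keys": keys}
    return {"type": type(data).__name__, "size": 1}


if __name__ == "__main__":
    # File name taken from the description above; the location is an assumption.
    path = Path("metadata.json")
    if path.exists():
        print(describe(load_json(path)))
```

The same two calls should work for "webpages_detail.json". Because that file covers a far larger number of web pages, loading it in one pass may need a substantial amount of memory, in which case an incremental JSON parser may be a better fit.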

Institutions

Universiti Malaya

Categories

Machine Learning, Web Mining, Intelligent Web, Classification System

Licence