Ground truth dataset: objectionable web contents

Published: 8 August 2022 | Version 2 | DOI: 10.17632/f239556fkr.2
Contributor: Hamza Altarturi

Description

This ground truth dataset contains 8,000 labelled websites: 4,000 objectionable and 4,000 unobjectionable. Together, these websites comprise more than 2 million web pages. The dataset consists of two files. The "metadata.json" file gives an overview of the websites and their features. The "webpages_detail.json" file gives detailed information on the web pages (internal URLs) of each collected website and on their features.

Files

Steps to reproduce

The dataset is in JSON format, so it can be parsed by almost any programming language and opened with a wide range of tools to view, modify, and use the data.
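As an illustration, the following minimal Python sketch loads each file and reports its top-level structure. The internal schema of the files is not described on this page, so the sketch makes no assumptions about field names; the helper names `load_json` and `describe` are hypothetical, and the file paths assume the two files sit in the current working directory.

```python
import json
from pathlib import Path


def load_json(path):
    """Parse a JSON file and return the resulting Python object."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def describe(data, max_keys=10):
    """Summarise the top-level structure of parsed JSON.

    No schema is assumed: the function only reports whether the top level
    is an object or an array, its size, and a few keys to inspect.
    """
    if isinstance(data, dict):
        return {"type": "object", "size": len(data),
                "sample_keys": list(data)[:max_keys]}
    if isinstance(data, list):
        first = data[0] if data else None
        keys = list(first)[:max_keys] if isinstance(first, dict) else []
        return {"type": "array", "size": len(data), "sample_keys": keys}
    return {"type": type(data).__name__, "size": None, "sample_keys": []}


if __name__ == "__main__":
    # Assumed locations: both files in the current working directory.
    for name in ("metadata.json", "webpages_detail.json"):
        if Path(name).exists():
            print(name, describe(load_json(name)))
        else:
            print(name, "not found")
```

Note that `json.load` reads the whole file into memory. Since the dataset covers more than 2 million web pages, "webpages_detail.json" may be large, in which case a streaming JSON parser could be a better fit.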

Institutions

  • Universiti Malaya

Categories

Machine Learning, Web Mining, Intelligent Web, Classification System

Licence