Relevant Image Dataset

Published: 22 December 2020 | Version 1 | DOI: 10.17632/mbk294tthf.1


The dataset contains relevant and irrelevant image tags from the Web pages of 125 different domains. For each image, it records the web domain, the file number, the text of the image HTML element, the attributes of the image element, the size attributes, the parent HTML element of the image, and the relevancy of the image. Each Web domain contains 100 Web pages, each with a varying number of image elements.


Steps to reproduce

The file contains image tags with embedded quotes, so standard CSV readers may split a line inside an image tag. For each line, first detect the image tag and replace it with an empty placeholder symbol, then split the line into fields. After the split, put the image tag back in place of the placeholder. Each line corresponds to one sample, that is, one image element. Note that each domain should be trained and tested separately.
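
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' own loader: it assumes that fields are separated by commas, that an image tag starts with `<img` and ends at the next `>`, and that the web domain is the first field. The names `split_line`, `group_by_domain`, and `PLACEHOLDER` are hypothetical, so check the delimiter and column order against the actual file before relying on it.

```python
import re
from collections import defaultdict

# Assumption: a tag starts with "<img" and ends at the first ">" after it.
# A ">" inside an attribute value would cut the match short.
IMG_TAG = re.compile(r"<img\b.*?>", re.IGNORECASE | re.DOTALL)

# Hypothetical placeholder token, chosen because it is unlikely to occur in the data.
PLACEHOLDER = "\x00IMG\x00"


def split_line(line, delimiter=","):
    """Mask image tags, split the line, then restore the tags in order."""
    tags = []

    def stash(match):
        # Remember the tag and leave the placeholder in its position.
        tags.append(match.group(0))
        return PLACEHOLDER

    masked = IMG_TAG.sub(stash, line.rstrip("\r\n"))
    fields = masked.split(delimiter)

    restored = iter(tags)
    pattern = re.escape(PLACEHOLDER)
    # A function replacement avoids backslash interpretation in the tag text.
    return [re.sub(pattern, lambda _m: next(restored), f) for f in fields]


def group_by_domain(rows, domain_index=0):
    """Group parsed rows per domain (assumed to be the first field)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[domain_index]].append(row)
    return dict(groups)
```

Grouping the parsed rows by domain keeps the samples of each domain together, so that a model can be trained and tested on one domain at a time as the note above requires.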


Bursa Teknik Universitesi, Namik Kemal Universitesi


Information Retrieval, Machine Learning, Web Mining, Feature Extraction, Text Processing