Data for: Machine Learning based Heterogeneous Web Advertisements Detection Using a Diverse Feature Set

Published: 29-06-2018 | Version 1 | DOI: 10.17632/5bzh52txpn.1
KS Kuppusamy,
Ab Shaqoor Nengroo


Advertisement identification and filtering in web pages have gained significance due to factors such as accessibility, security, privacy, and obtrusiveness. Current practice relies on maintaining filter lists, which are collections of URL-based regular expressions. Each URL found on a web page is matched against the filter list. While effective, this procedure scales poorly because the filter list requires regular maintenance. To counter these limitations, we devise a machine learning based advertisement detection system that uses a diverse feature set to distinguish advertisement blocks from non-advertisement blocks. The method can serve as a base for accessibility-related features such as smooth browsing and text summarization for persons with visual impairments, cognitive impairments, or photosensitive epilepsy. A classifier trained on the proposed feature set achieves 93.4% accuracy in identifying advertisements.
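
To make the filter-list baseline described above concrete, the following is a minimal Python sketch of matching page URLs against a list of regular expressions. It is only an illustration: the patterns, the example URLs, and the helper names `is_ad_url` and `filter_urls` are hypothetical and are not taken from this dataset or from any real filter list.

```python
import re

# Hypothetical filter list: these patterns are illustrative only and are
# not drawn from any real filter list or from the dataset.
FILTER_PATTERNS = [
    r"^https?://ads\.",   # hosts that begin with "ads."
    r"/banners?/",        # path segment "banner" or "banners"
    r"[?&]ad_id=\d+",     # query parameter carrying a numeric ad identifier
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in FILTER_PATTERNS]


def is_ad_url(url: str) -> bool:
    """Return True if the URL matches any pattern in the filter list."""
    return any(p.search(url) for p in COMPILED)


def filter_urls(urls):
    """Split URLs into (blocked, allowed) lists using the filter list."""
    blocked = [u for u in urls if is_ad_url(u)]
    allowed = [u for u in urls if not is_ad_url(u)]
    return blocked, allowed


# Made-up URLs standing in for those collected from one web page.
page_urls = [
    "https://ads.example.com/track.js",
    "https://example.com/static/banner/top.png",
    "https://example.com/article?id=7",
    "https://example.com/promo?ad_id=42",
]
blocked, allowed = filter_urls(page_urls)
```

A URL that no pattern anticipates, such as the made-up `https://cdn.example.com/sponsored/unit.js`, passes through unblocked until someone adds a matching rule. This is the maintenance burden that motivates a learned classifier.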