Phish-IRIS Dataset: A Small Scale Multi-Class Phishing Web Page Screenshots Dataset
The Phish-IRIS dataset provides researchers with a ground-truth corpus for evaluating vision-based, multi-class anti-phishing methods. It consists of unique screenshots in 15 (14+1) classes: 14 classes correspond to different highly phished brands, and one class represents the "unknown" or "legitimate" samples. Note that Phish-IRIS is intended as a benchmark for computer-vision-based anti-phishing studies only.

The dataset was collected between March and May 2018. The phishing pages were collected from Phishtank.com and OpenPhish.com, while the legitimate pages were collected randomly. The dataset contains 1313 training and 1539 testing samples. It is split into "train" and "val" folders, each of which contains the respective brand names and an "other" category. Because anti-phishing is based on discriminating legitimate web pages from phished targets, we also provide a relatively larger set of legitimate samples (i.e., the "other" category).

Our first paper using this dataset was published under the title "Phish-IRIS: A New Approach for Vision Based Brand Prediction of Phishing Web Pages via Compact Visual Descriptors". We believe the dataset will be useful to researchers interested in vision-based anti-phishing. The official home page of the dataset is https://web.cs.hacettepe.edu.tr/~selman/phish-iris-dataset/
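
As a quick sanity check after downloading, the "train" and "val" layout described above can be walked to count screenshots per class. The following is a minimal sketch, not an official loader. It assumes that each class is a subfolder of "train" or "val", that screenshots are stored as PNG or JPEG files, and that the archive was extracted to a local folder named phish-iris; the folder name, the extension list, and the function name count_samples are all illustrative assumptions.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; the dataset description does not state the file format.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def count_samples(root):
    """Return {split: Counter({class_name: n_images})} for the "train" and "val" folders."""
    counts = {}
    for split in ("train", "val"):
        split_dir = Path(root) / split
        per_class = Counter()
        # A missing split folder simply yields an empty counter.
        if split_dir.is_dir():
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
                per_class[class_dir.name] = sum(
                    1
                    for f in class_dir.iterdir()
                    if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
                )
        counts[split] = per_class
    return counts


if __name__ == "__main__":
    # "phish-iris" is a hypothetical local path to the extracted dataset.
    for split, per_class in count_samples("phish-iris").items():
        total = sum(per_class.values())
        print(f"{split}: {total} samples in {len(per_class)} classes")
        for name, n in per_class.most_common():
            print(f"  {name}: {n}")
```

If the extracted archive matches the description, the totals should come to 1313 for "train" and 1539 for "val" (assuming "val" holds the testing samples), with 15 class folders in each split and "other" as the largest class.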