PhishLegitURLs: A Comprehensive Dataset of Legitimate and Phishing URLs
Description
This dataset contains a collection of approximately 1,000 URLs, evenly distributed between phishing and legitimate web addresses, designed for use in research and development of phishing detection models. The dataset is structured as follows: Phishing URLs (Approx. 700): These URLs have been sourced from the URLHaus database, a well-known repository of malicious websites actively used in phishing attacks. Each entry in this subset has been manually verified and is labeled as a phishing URL, making this dataset highly reliable for identifying harmful web content. Legitimate URLs (Approx. 300): The legitimate URLs have been collected from reputable sources such as Wikipedia and Stack Overflow. These websites are known for hosting user-generated content and community discussions, ensuring that the URLs represent safe, legitimate web addresses. The URLs were randomly scraped to ensure diversity in the types of legitimate sites included. Dataset Features: URL: The full web address of each entry, providing the primary feature for analysis. Label: A binary label indicating whether the URL is legitimate (1) or phishing (0). Applications: This dataset is suitable for training and evaluating machine learning models aimed at distinguishing between phishing and legitimate websites. It can be used in a variety of cybersecurity research projects, including URL-based phishing detection, web content analysis, and the development of real-time protection systems. Usage: Researchers can leverage this balanced dataset to develop and test algorithms for identifying phishing websites with high accuracy, using features such as URL structure, and class label attributes. The inclusion of both phishing and legitimate URLs provides a comprehensive basis for creating robust models capable of detecting phishing attempts in diverse online environments.