SQL Injection Attack Dataset

Published: 15 July 2025 | Version 1 | DOI: 10.17632/mmc4sdmnrc.1
Contributors:
Hasanen Alyasiri

Description

The dataset presented in this paper is intended for training supervised machine learning algorithms that detect SQL injection (SQLi) attacks. We manually gathered the source datasets from Kaggle and GitHub. The result contains 47,464 distinct SQL queries, both legitimate and malicious: 25,800 benign queries and 21,664 malicious queries (see Fig. 6). Each entry preserves every component of the SQL query, including semicolons, single quotes, intermediate data, text fragments, and SQL keywords. Each row carries a binary label in a single column, where 1 indicates an attack query and 0 a normal query, which makes the type of each query easy to identify.

The second primary contribution of this work is a 19-feature numeric training dataset: eighteen independent features plus one dependent feature, the label designating the type of query (0 for normal, 1 for malicious). Keeping the same features across every record is intended to improve the accuracy and precision of machine learning algorithms. As the first step in the development process, 18 useful numerical features were extracted from typical SQLi datasets; they were computed from the text of every query in the chosen original dataset. The extracted data are purely numerical and cover, among others, constants, punctuation, logical operators, the length of the query, and the number of nested queries. The resulting dataset has 47,464 records, each with the 18 extracted features and the label.
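
As an illustration of the kind of numeric feature extraction described above, here is a minimal Python sketch. It is not the authors' extraction code: the function name `extract_features`, the six feature names, and the rule that treats every `SELECT` after the first as a nested query are all assumptions made for this example, and the dataset's actual 18 features may be defined differently.

```python
import re
from typing import Dict

# Hypothetical patterns; the dataset's real feature definitions may differ.
LOGICAL_OPERATORS = re.compile(r"\b(?:AND|OR|NOT)\b", re.IGNORECASE)
NUMERIC_CONSTANTS = re.compile(r"\b\d+(?:\.\d+)?\b")
SELECT_KEYWORD = re.compile(r"\bSELECT\b", re.IGNORECASE)


def extract_features(query: str) -> Dict[str, int]:
    """Map one raw SQL query string to a few purely numeric features."""
    select_count = len(SELECT_KEYWORD.findall(query))
    return {
        # Length of the query in characters.
        "query_length": len(query),
        # Punctuation counts.
        "single_quotes": query.count("'"),
        "semicolons": query.count(";"),
        # Occurrences of AND / OR / NOT as whole words.
        "logical_operators": len(LOGICAL_OPERATORS.findall(query)),
        # Integer or decimal literals.
        "numeric_constants": len(NUMERIC_CONSTANTS.findall(query)),
        # Assumption: each SELECT beyond the first marks a nested query.
        "nested_queries": max(select_count - 1, 0),
    }


if __name__ == "__main__":
    sample = "SELECT * FROM users WHERE id = 1 OR 1=1; --"
    print(extract_features(sample))
```

Applying a function like this to every query and appending the 0/1 label as a final column would give a table with the same overall shape as the numeric dataset: one row per query, a fixed set of numeric columns, and one label column.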

Files

Institutions

  • University of Kufa

Categories

Cybersecurity, Machine Learning, Structured Query Language

Licence