Rule-Based SQL Injection (RbSQLi) Dataset
Description
The RbSQLi dataset was developed to support research on the detection of SQL injection (SQLi) vulnerabilities. It contains 10,190,450 structured entries, of which 2,699,570 are labeled as malicious and 7,490,880 as benign. The malicious entries are categorized into six types of SQL injection attack: Union-based (398,070 samples), Stack queries-based (223,800 samples), Time-based (564,900 samples), Meta-based (481,280 samples), Boolean-based (207,900 samples), and Error-based (823,620 samples).
The malicious payloads for the Union-based, Time-based, and Error-based types were sourced directly from the widely used open-source GitHub repository "Payloads All The Things – SQL Injection Payload List" (https://github.com/payloadbox/sql-injection-payload-list). ChatGPT was used to generate additional payloads for the Boolean-based, Stack queries-based, and Meta-based categories. This hybrid approach means the dataset reflects both known attack patterns and generated variants, giving a broader representation of SQLi techniques.
Some queries in the dataset are syntactically invalid yet still contain malicious payloads, so that models can learn to detect SQL injection attempts even when attackers submit malformed queries. This reflects the importance of training models to recognize malicious intent rather than relying solely on syntactic correctness.
All payloads were curated, anonymized, and structured during preprocessing. Sensitive data was replaced with standardized placeholders, preserving semantic meaning while protecting privacy. The dataset also underwent a sanitization pass to ensure consistency and usability. To support scalability and reproducibility, a rule-based classification algorithm was used to automate the labeling and organization of each payload by type.
This methodology promotes standardization and prepares the dataset for use in machine learning pipelines, anomaly detection models, and intrusion detection systems. The dataset also provides a substantial volume of clean (benign) data, making it well suited to supervised learning, comparative experiments, and robustness testing in cybersecurity research. It is intended to support the development of more accurate and generalizable SQL injection detection systems and to serve as a benchmark for the security and machine learning communities.
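The class counts quoted in the description can be cross-checked with a few lines of Python. This is only an arithmetic sketch: the dictionary keys are illustrative labels and may not match the label strings used in the released files.

```python
# Per-class counts of malicious samples, copied from the description.
# The key strings are illustrative; the released files may spell labels differently.
malicious_counts = {
    "Union-based": 398_070,
    "Stack queries-based": 223_800,
    "Time-based": 564_900,
    "Meta-based": 481_280,
    "Boolean-based": 207_900,
    "Error-based": 823_620,
}
benign_count = 7_490_880

malicious_total = sum(malicious_counts.values())
dataset_total = malicious_total + benign_count

# Fraction of entries that are malicious; relevant when choosing
# class weights or a stratified train/test split.
malicious_share = malicious_total / dataset_total

print(f"malicious: {malicious_total:,}")
print(f"total:     {dataset_total:,}")
print(f"malicious share: {malicious_share:.1%}")
```

The six per-type counts sum to the stated 2,699,570 malicious entries, and malicious entries make up roughly 26.5% of the dataset, so benign samples outnumber malicious ones by close to three to one.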
Steps to reproduce
The dataset was constructed from a combination of curated open-source resources and AI-assisted generation. First, malicious SQL injection payloads for Union-based, Time-based, and Error-based attacks were sourced from the publicly available GitHub repository "Payloads All The Things – SQL Injection Payload List" (https://github.com/payloadbox/sql-injection-payload-list). These payloads were extracted with custom Python scripts that automated the collection, normalization, and formatting of the data into a structured form suitable for further processing.
To broaden coverage of attack types and increase payload diversity, additional examples of Boolean-based, Stack queries-based, and Meta-based SQL injection techniques were generated using ChatGPT. Structured prompts guided the model to produce contextually realistic SQL injection payloads. The generated outputs were reviewed for relevance and quality before inclusion in the dataset.
All collected payloads, whether open-source or AI-generated, then underwent sanitization and anonymization. Sensitive elements such as database names, table identifiers, and user-specific content were replaced with standardized placeholders to protect privacy while preserving the semantics of the attacks.
After collection and preprocessing, a rule-based classification system was applied to label each payload according to its injection type. This system used pattern-matching logic and contextual analysis to assign labels consistently across the dataset. Finally, the labeled malicious payloads were combined with a large volume of benign samples to support supervised learning and model evaluation. This workflow allows the dataset to be reproduced, extended, or adapted for similar research in SQL injection detection.
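The anonymization and rule-based labeling steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expressions, the rule order, the `<TABLE>` placeholder, the `Unlabeled` fallback, and the function names `anonymize` and `classify` are all assumptions made for illustration. In particular, the source does not define "Meta-based"; the sketch assumes it refers to payloads that query database metadata.

```python
import re

# Hypothetical rules, checked in order; the first matching rule decides the label.
# These patterns are illustrative and are NOT the rules used to build RbSQLi.
RULES = [
    ("Time-based", re.compile(r"\b(sleep|benchmark|pg_sleep)\s*\(|waitfor\s+delay", re.I)),
    ("Union-based", re.compile(r"\bunion\b(\s+all)?\s+select\b", re.I)),
    ("Error-based", re.compile(r"\b(extractvalue|updatexml)\s*\(", re.I)),
    ("Stack queries-based", re.compile(r";\s*(select|insert|update|delete|drop|create|exec)\b", re.I)),
    # Assumption: "Meta-based" means probing database metadata.
    ("Meta-based", re.compile(r"information_schema|@@version", re.I)),
    ("Boolean-based", re.compile(r"\b(or|and)\b\s+['\"]?\w+['\"]?\s*=\s*['\"]?\w+", re.I)),
]


def anonymize(payload: str, sensitive: dict[str, str]) -> str:
    """Replace known sensitive identifiers with placeholders (case-insensitive)."""
    for name, placeholder in sensitive.items():
        pattern = rf"\b{re.escape(name)}\b"
        # A function replacement avoids any escape handling in the placeholder text.
        payload = re.sub(pattern, lambda _m, p=placeholder: p, payload, flags=re.I)
    return payload


def classify(payload: str) -> str:
    """Return the label of the first matching rule, or 'Unlabeled' if none match."""
    for label, pattern in RULES:
        if pattern.search(payload):
            return label
    return "Unlabeled"


if __name__ == "__main__":
    raw = "1' UNION SELECT password FROM acme_users--"
    clean = anonymize(raw, {"acme_users": "<TABLE>"})
    print(clean, "->", classify(clean))
    for sample in ["1' AND SLEEP(5)--", "1; DROP TABLE <TABLE>--", "' OR 1=1--"]:
        print(sample, "->", classify(sample))
```

Because the first matching rule wins, rule order determines how payloads that combine techniques are labeled (for example, a UNION query that also calls `SLEEP`). Any reproduction would need to fix and document that order.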