StegPDF-21: A Feature-Engineered Dataset for PDF Steganography Detection
Description
StegPDF-21 is a dataset for research on PDF steganography detection and document-level steganalysis with machine learning. It contains feature-based representations of PDF documents in two classes: clean documents (label = 0), which carry no hidden information, and steganographic documents (label = 1), in which hidden data has been intentionally embedded.
The source PDFs come from a publicly available stress-testing repository (the PDF Association Stressful Corpus). Approximately 32,000 documents were collected initially. Validation and cleaning removed corrupted, encrypted, malformed, and duplicate files, leaving around 20,000 valid documents. From these, 10,000 clean PDFs were selected as the base set for generating steganographic samples.
Steganographic documents were generated with eight embedding techniques, each applied with three payload variants (24 variant groups in total), to simulate a range of hiding strategies. A controlled number of samples was selected from each variant group to maintain class balance and dataset consistency.
Each document was processed with Python-based tools to extract structural and content-related features, including properties of PDF objects, metadata characteristics, and text-based patterns. Of the 25 features extracted initially, 4 were found redundant by correlation analysis and removed, leaving a final set of 21 numerical features.
The dataset is provided in CSV format. Each row corresponds to one document; the columns hold the 21 features plus a binary label column. The final dataset contains 19,372 instances, nearly balanced between the two classes. All processing steps were implemented as automated Python scripts so that the dataset construction is reproducible and consistent.
This dataset can be used for machine learning experiments, cybersecurity research, and the evaluation of PDF steganography detection methods.
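The correlation-based feature reduction described above (25 features reduced to 21) can be illustrated with a minimal sketch. The description does not state the correlation threshold or the tie-breaking rule, so the 0.95 cutoff, the keep-first rule, and the feature names below are assumptions for illustration only, not the authors' actual procedure. The sketch uses only the standard library; with Pandas, `DataFrame.corr()` would normally supply the correlation matrix.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Return the feature names to keep.

    Features are scanned in insertion order; a feature is dropped when its
    absolute correlation with any already-kept feature exceeds `threshold`.
    The threshold value is an assumption, not taken from the dataset notes.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data with hypothetical feature names (not the real StegPDF-21 columns).
features = {
    "num_objects": [10, 20, 30, 40, 50],
    "num_streams": [5, 10, 15, 20, 25],  # exactly half of num_objects
    "metadata_len": [7, 3, 9, 1, 4],
}
kept = drop_correlated(features)
print(kept)
```

Here `num_streams` is a scaled copy of `num_objects`, so it is dropped as redundant, while `metadata_len` is only weakly correlated with it and is kept.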
Steps to reproduce
• Collect PDF documents from the PDF Association Stressful Corpus.
• Remove corrupted, encrypted, malformed, and duplicate files.
• Select a subset of clean PDFs as the base dataset.
• Generate steganographic documents using multiple embedding techniques with different payload variants.
• Extract structural and statistical features using Python libraries (pypdf/PyPDF2, NumPy, Pandas).
• Perform correlation analysis and remove redundant features.
• Assign binary labels (0 = clean, 1 = stego).
• Store the final dataset in CSV format.
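As an illustration of the cleaning step, the sketch below removes exact duplicates by content hash and skips files that do not start with the `%PDF-` header. It is a simplified assumption of how such a filter might look, not the authors' script: the function names are hypothetical, the header test is only a cheap first check, and detecting encrypted or structurally malformed files would additionally require a PDF parser such as pypdf, which is not shown here.

```python
import hashlib
from pathlib import Path


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: the file content begins with the PDF header."""
    return data.startswith(b"%PDF-")


def unique_valid_pdfs(paths):
    """Return paths in sorted order, skipping non-PDFs and exact duplicates.

    Duplicates are detected by the SHA-256 digest of the file bytes, so only
    byte-identical copies are removed; the first path in sorted order wins.
    """
    seen = set()
    kept = []
    for path in sorted(Path(p) for p in paths):
        data = path.read_bytes()
        if not looks_like_pdf(data):
            continue  # missing header: treat as corrupted or not a PDF
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # byte-identical to a file already kept
        seen.add(digest)
        kept.append(path)
    return kept
```

A call such as `unique_valid_pdfs(Path("corpus").glob("*.pdf"))`, where `corpus` is a placeholder directory name, would return the de-duplicated candidates for the later steps.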