Federated IoT/PC Network Traffic Datasets for FTARL IDS
Description
The dataset comprises CSV files for training and testing an Intrusion Detection System (IDS) that integrates a Transformer‑based Autoencoder, Federated Learning, and Reinforcement‑Learning‑tuned thresholding (FTARL).

Contents
- IoT training: `iot_devices_dataset_for_training.csv` (30000 rows)
- IoT testing: `iot_devices_dataset_for_testing.csv` (10000 rows)
- PC training: `PCs_devices_dataset_for_training.csv` (20005 rows)
- PC testing: `PCs_devices_dataset_for_testing.csv` (9883 rows)

Schema & Labels
See `data_dictionary.csv` for per‑column metadata. Label columns (if present) are listed with their distributions in the README.

Resolution / Format
All files are UTF‑8 CSV with headers. Numeric features are scaled to [0,1] where applicable; sequences were formed over 20‑packet windows during modeling.

Intended Use
Benchmarking anomaly/attack detection in mixed IoT/PC environments, federated learning experiments, and threshold‑selection research. Suitable for reproducing the FTARL results.

Ethics & Privacy
The data contains no PII; IPs and device identifiers are synthetic or anonymized. Downstream users must cite third‑party sources (e.g., BoT‑IoT).

Funding
INCIBE–USAL SCRIN Project (C068/23), Recovery, Transformation and Resilience Plan (NextGenerationEU).
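The 20‑packet windowing mentioned under Resolution / Format can be sketched as follows. This is a minimal illustration, not the FTARL authors' code: the function name `make_windows`, the `stride` parameter, and the default of non‑overlapping windows are assumptions, since the record only states the window length of 20.

```python
from typing import List, Sequence

WINDOW = 20  # packets per sequence, as stated in the description


def make_windows(
    rows: Sequence[Sequence[float]],
    window: int = WINDOW,
    stride: int = WINDOW,  # assumption: non-overlapping windows by default
) -> List[List[Sequence[float]]]:
    """Group consecutive feature rows into fixed-length sequences.

    Trailing rows that do not fill a complete window are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(rows) - window
    return [
        list(rows[start:start + window])
        for start in range(0, last_start + 1, stride)
    ]


# Toy input (hypothetical): 45 packets with a single feature each.
toy_rows = [[float(i)] for i in range(45)]
toy_windows = make_windows(toy_rows)
print(len(toy_windows), len(toy_windows[0]))
```

With a pandas DataFrame of feature columns, the rows could be passed as `df.to_numpy().tolist()`. Setting `stride=1` yields overlapping (sliding) windows instead; which variant the original modeling used is not stated in this record.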
Steps to reproduce
1. Download & verify. Download all files in this record and verify their integrity with CHECKSUMS.csv (SHA-256). If any hash differs, re-download.
2. Environment. Python 3.10+ with pandas, numpy, scikit-learn, and matplotlib (optional, for plots).
3. Load data. Files:
   - iot_devices_dataset_for_training.csv
   - PCs_devices_dataset_for_training.csv
   - iot_devices_dataset_for_testing.csv (has column attack ∈ {0,1})
   - PCs_devices_dataset_for_testing.csv (may include attack)

   Example:

       import pandas as pd
       iot_tr = pd.read_csv('iot_devices_dataset_for_training.csv')
       pc_tr = pd.read_csv('PCs_devices_dataset_for_training.csv')
       iot_te = pd.read_csv('iot_devices_dataset_for_testing.csv')
       pc_te = pd.read_csv('PCs_devices_dataset_for_testing.csv')

4. Schema. See data_dictionary.csv for column names, dtypes, and null counts. Align your code to those columns. (Training CSVs are unlabeled for unsupervised training; testing CSVs contain attack for evaluation.)
5. Pre-processing. Features are provided as ready-to-use numeric columns (scaled to [0,1] where applicable). If you change the feature set, fit any scaler only on training data and apply the same transform to test data.
6. Training (unsupervised AE or your model of choice).
   - Train on benign-dominant training data (iot_tr, pc_tr) using only feature columns.
   - Save the trained model and (if used) the scaler.
7. Threshold selection (if using an Autoencoder).
   - Compute reconstruction error on a validation subset (or via K-fold).
   - Choose a threshold (e.g., a percentile of benign errors) or use your RL/heuristic to tune it.
8. Evaluation.
   - On iot_te and pc_te, compute anomaly/attack scores and apply the threshold.
   - If attack exists, report Accuracy, Precision, Recall, F1, and the confusion matrix.
9. Reproducibility knobs.
   - Fix random seeds (numpy, framework).
   - Document any rows/columns you drop or impute.
   - Keep versions of libraries in your notes or requirements.txt.
10. Attribution & license. Use the dataset under CC BY 4.0 (see LICENSE.txt) and cite this record’s DOI. If you combine it with third-party corpora, also cite their sources.
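Steps 7 and 8 can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the FTARL method itself: it uses a fixed nearest-rank percentile rather than RL-tuned thresholding, the function names `percentile_threshold` and `evaluate` are hypothetical, and the error values are made-up toy numbers rather than results from this dataset.

```python
import math
from typing import Dict, Sequence


def percentile_threshold(benign_errors: Sequence[float], pct: float = 95.0) -> float:
    """Nearest-rank percentile of validation reconstruction errors."""
    if not benign_errors:
        raise ValueError("need at least one validation error")
    ordered = sorted(benign_errors)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[min(rank, len(ordered)) - 1]


def evaluate(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Dict[str, float]:
    """Flag scores above the threshold as attacks and compute basic metrics."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1,
    }


# Toy values (hypothetical): validation errors from benign-dominant data,
# then test scores with their `attack` labels.
val_errors = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.14, 0.12, 0.11, 0.50]
test_scores = [0.08, 0.20, 0.13, 0.90, 0.15, 0.11]
test_labels = [0, 1, 0, 1, 0, 1]

threshold = percentile_threshold(val_errors, pct=90)
metrics = evaluate(test_scores, test_labels, threshold)
print(threshold, metrics)
```

In practice the scores would be the model's reconstruction errors on iot_te or pc_te and the labels would come from the attack column. scikit-learn's metric functions could replace the hand-written counts; they are spelled out here only to keep the sketch dependency-free.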
Institutions
- Universidad de Salamanca
Funders
- INCIBE–USAL SCRIN Project (C068/23), NextGenerationEU