Datasets Comparison
Version 1
SysCall Dataset: A Dataset for Context Modeling and Anomaly Detection using System Calls
Description
Context modeling and anomaly detection use abstractions of processes and applications to create state-transition graphs that verify system performance. However, this approach to verification is limited: the state explosion problem forces designers to use process abstractions, which do not capture the intricate interactions among the processes, the hardware, and the kernel during execution. In addition, the timing constraints of some process executions are difficult to model with simple state-transition graphs.
In this paper, we describe a dataset of system call events from an uncrewed aerial vehicle (UAV). The dataset captures the order and type of the system calls, as well as the timestamp of each system call event, as the UAV operates on a simulated platform. Because processes issue the system calls, the events generated by each process can be used to reverse engineer and audit the behavior of the application. The system call events provide an in-depth view of the process interactions, while the timestamps help in modeling timing requirements during process execution.
The UAV application is modeled using state machines. As the application runs from the start state to the end state, we record the system call events and their timestamps, together with the process identifiers and other IDs that show that the monitored process generated each event. We package the UAV application, the instrumentation script, and the Bochs CPU emulator into a Docker container so that datasets similar to field datasets can be generated in the laboratory at minimal cost. The dataset is therefore useful for in-depth analysis of modern cyber threats.
Steps to reproduce
Overview
Everything executes in one Docker container. Within this Docker container is an instrumented version of the Bochs CPU emulator.
The instrumentation works in such a way that syscall logging begins when the VM user executes the Linux command mkdir FlightBegin and ends when the user executes the Linux command mkdir FlightEnd.
Therefore, to log all syscalls that occur while a process is running, run: mkdir FlightBegin; ./program; mkdir FlightEnd
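The bracketing sequence above can also be driven from a script inside the VM. The following is a minimal sketch, not part of the published tooling: the helper names bracketed_commands and run_logged are hypothetical, and ./program stands for whatever executable is being monitored.

```python
import subprocess
from pathlib import Path


def bracketed_commands(program_argv):
    """Return the three commands that bracket one logged run."""
    return [
        ["mkdir", "FlightBegin"],  # marker: the instrumented Bochs starts logging
        list(program_argv),        # the process whose syscalls should be logged
        ["mkdir", "FlightEnd"],    # marker: the instrumented Bochs stops logging
    ]


def run_logged(program_argv, workdir="."):
    """Run the bracketed sequence in workdir, stopping at the first failure."""
    for argv in bracketed_commands(program_argv):
        subprocess.run(argv, cwd=Path(workdir), check=True)
```

Calling run_logged(["./program"]) inside the VM would issue the same three commands as the one-liner. Because mkdir fails when the directory already exists, the FlightBegin and FlightEnd directories would need to be removed between runs.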
Drone Physics Simulation
Taking ideas from the WVU AtLAS project, I created a simple physics model of a drone that simulates:
3 throttles (X-axis, Y-axis, Z-axis)
Gravity of 9.81 m/s^2
Aerodynamic drag
This drone model connects to a virtual serial port and can be interfaced with through the following commands:
TODO
This drone is connected to the virtual serial port (using a Linux PTY) of the Bochs VM.
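As an illustration of what such a model might look like, here is a minimal sketch of a point-mass drone with three throttles, gravity, and drag. Only the 9.81 m/s^2 gravity value comes from the description; the mass, the linear drag law and its coefficient, the time step, and the class and method names are assumptions made for the example, and the serial-port interface is left out.

```python
from dataclasses import dataclass, field

GRAVITY = 9.81     # m/s^2, from the model description
MASS = 1.0         # kg, assumed
DRAG_COEFF = 0.1   # N*s/m, assumed linear drag coefficient
DT = 0.01          # s, assumed integration step


@dataclass
class Drone:
    pos: list = field(default_factory=lambda: [0.0, 0.0, 0.0])  # x, y, z in m
    vel: list = field(default_factory=lambda: [0.0, 0.0, 0.0])  # m/s

    def step(self, throttle, dt=DT):
        """Advance one semi-implicit Euler step; throttle is (fx, fy, fz) in N."""
        for axis in range(3):
            # Throttle force minus drag opposing the current velocity.
            force = throttle[axis] - DRAG_COEFF * self.vel[axis]
            if axis == 2:
                force -= MASS * GRAVITY  # gravity acts along -Z
            self.vel[axis] += force / MASS * dt
            self.pos[axis] += self.vel[axis] * dt
        if self.pos[2] < 0.0:
            # Ground contact: clamp altitude and stop vertical motion.
            self.pos[2] = 0.0
            self.vel[2] = 0.0
```

In this sketch a Z-axis throttle equal to MASS * GRAVITY holds a stationary drone at constant altitude, which corresponds to counteracting gravity while cruising.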
Auto-Piloting The Virtual Serial Port Drone
A Python script running within the Bochs VM controls the drone.
The controller goes through the following states:
Take off:
Apply Z-axis throttle to move up to cruising altitude
Cruising:
Adjust Z-axis throttle to counteract gravity
Adjust X-axis and Y-axis throttles to move at a constant velocity to the destination
Landing:
Turn off X-axis and Y-axis throttles
Lower Z-axis throttle so that the drone descends smoothly
Landed:
The drone is on the ground and the Python script exits
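The four controller states listed above can be sketched as a small state machine. This is an illustrative outline only, not the controller script itself: the cruising altitude, the arrival tolerance, and the names State and next_state are assumptions.

```python
from enum import Enum, auto


class State(Enum):
    TAKE_OFF = auto()
    CRUISING = auto()
    LANDING = auto()
    LANDED = auto()


CRUISE_ALT = 10.0  # m, assumed cruising altitude
ARRIVE_TOL = 0.5   # m, assumed horizontal tolerance at the destination


def next_state(state, altitude, dist_to_dest):
    """Return the state that follows `state` for the given sensor readings."""
    if state is State.TAKE_OFF and altitude >= CRUISE_ALT:
        return State.CRUISING   # reached cruising altitude
    if state is State.CRUISING and dist_to_dest <= ARRIVE_TOL:
        return State.LANDING    # arrived above the destination
    if state is State.LANDING and altitude <= 0.0:
        return State.LANDED     # touched down; the script can exit
    return state                # no transition condition met
```

A controller loop would poll the drone's sensors, call next_state, and set the throttles according to the current state.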
I have generated data logs from three scenarios:
Normal operation
Random delay - the controller randomly lags behind because of computationally expensive operations
Random syscall - to test anomaly detection, the controller sometimes sends random UDP data to a port on 127.0.0.1
The Results
001_NORMAL_Flight.txt: Runs properly; no code interferes with the performance.
002_BUSYDELAY_Flight.txt: Time is sometimes spent in code that factors large numbers. This makes the polling of the drone's sensors too slow, and consequently the drone flies too high before it comes down and crashes. It does not reach its destination.
003_SOCKETS_Flight.txt: During cruising mode, UDP packets are sent to random localhost ports. The flight proceeds properly, but new syscalls should appear mixed into the log.
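The two perturbations described above could be injected into a controller loop along the following lines. This is a hedged sketch rather than the code used to produce the logs: the function names, the injection probability, the number being factored, the payload size, and the port range are all assumptions.

```python
import random
import socket


def factor(n):
    """Trial-division factorisation; slow when n has large prime factors."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def maybe_busy_delay(rng, probability=0.05, n=600851475143):
    """With the given probability, burn CPU time by factoring n."""
    if rng.random() < probability:
        factor(n)
        return True
    return False


def send_random_udp(rng, size=16):
    """Send `size` random bytes to a random unprivileged port on 127.0.0.1."""
    payload = bytes(rng.randrange(256) for _ in range(size))
    port = rng.randrange(1024, 65536)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, ("127.0.0.1", port))
```

The busy delay changes only the timing between syscalls, whereas the UDP sender introduces socket-related syscalls that a normal flight would not contain.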
Raw Dataset Format
For entries with SYSCALL (left to right)
Timestamp
"SYSCALL"
RAX (the syscall number, which identifies the syscall; see: https://filippo.io/linux-syscall-table/)
RDI (optional syscall argument)
RSI (optional syscall argument)
RDX (optional syscall argument)
R10 (optional syscall argument)
R8 (optional syscall argument)
R9 (optional syscall argument)
CR3 (page table pointer of the process that made this syscall)
For entries with SYSRET
Timestamp
CR3 (page table pointer of the process that made this syscall)
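A raw log line in the layout above could be split into fields as follows. This is a sketch under stated assumptions: fields are taken to be whitespace-separated, the SYSRET tag is taken to sit in the second position as the SYSCALL tag does, and values are kept as strings because the number base is not specified here. The names Event and parse_line are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Event(NamedTuple):
    timestamp: str
    kind: str              # "SYSCALL" or "SYSRET"
    rax: Optional[str]     # syscall number; None for SYSRET entries
    args: Tuple[str, ...]  # RDI, RSI, RDX, R10, R8, R9, as many as are present
    cr3: str               # page table pointer of the calling process


def parse_line(line):
    """Parse one raw log line; return None for lines that are not events."""
    tokens = line.split()
    if len(tokens) >= 4 and tokens[1] == "SYSCALL":
        # Layout: timestamp, tag, RAX, optional arguments, CR3 (last field).
        return Event(tokens[0], "SYSCALL", tokens[2], tuple(tokens[3:-1]), tokens[-1])
    if len(tokens) >= 3 and tokens[1] == "SYSRET":
        # Layout assumed: timestamp, tag, CR3 (last field).
        return Event(tokens[0], "SYSRET", None, (), tokens[-1])
    return None
```

Grouping the parsed events by their cr3 value would separate the syscalls of the monitored process from those of other processes running in the VM.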
Processed Dataset Format
Run process_data.py to generate the processed dataset. Use the full path to the location where your repository is saved to avoid a "File does not exist" error.
From left to right
T_i - T_{i-1} (difference between the current timestamp and the previous timestamp)
SysCall ID
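The two processed columns can be derived from a sequence of timestamped syscalls as sketched below. This is not the contents of process_data.py: the function name to_processed is hypothetical, timestamps are assumed to be numeric already, and the first event is assumed to be dropped because it has no predecessor.

```python
def to_processed(events):
    """Turn (timestamp, syscall_id) pairs into (T_i - T_{i-1}, syscall_id) rows."""
    rows = []
    # Pair each event with its predecessor; the first event yields no row.
    for (prev_ts, _), (ts, syscall_id) in zip(events, events[1:]):
        rows.append((ts - prev_ts, syscall_id))
    return rows


# Hypothetical timestamps and syscall numbers, for illustration only.
print(to_processed([(100, 83), (130, 0), (190, 1)]))  # [(30, 0), (60, 1)]
```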
Institutions
University of Ontario Institute of Technology Faculty of Engineering and Applied Science
Categories
Machine Learning, Artificial Intelligence Applications, Applied Computer Science, Learning Context
Related Links
Licence
Creative Commons Attribution 4.0 International
Version 2
The record text shown for Version 2 is identical to that of Version 1 above.