SysCall Dataset: A Dataset for Context Modeling and Anomaly Detection using System Calls
Context modeling and anomaly detection use abstractions of processes and applications to build state-transition graphs that verify system performance. This approach to performance verification is limited, however: the state-explosion problem forces designers to use process abstractions, which do not capture the intricate interactions among the processes, the hardware, and the kernel during execution. The timing constraints of some process executions are also difficult to model with simple state-transition graphs. In this paper, we describe a dataset of system call events from an uncrewed aerial vehicle (UAV). The dataset captures the order and type of the system calls, as well as the timestamp of each system call event, as the UAV operates on a simulated platform. Because system call events are generated by processes, the events generated by each process can be used to reverse-engineer and audit the behavior of the application. The system call events provide an in-depth view of the process interactions, while the event timestamps help in modeling timing requirements during process execution. The UAV application is modeled using state machines. As the application runs from the start state to the end state, we record each system call event and its timestamp, together with the process identifier and other IDs that show that the monitored process generated the event. We package the UAV application, the instrumentation script, and the Bochs CPU emulator into a Docker container so that datasets similar to field datasets can be generated in the laboratory at minimal cost. The dataset is therefore useful for in-depth analysis of modern cyber threats.
Steps to reproduce
Overview
Everything executes in one Docker container. Within this Docker container is an instrumented version of the Bochs CPU emulator. The instrumentation works in such a way that syscall logging begins when the VM user executes the Linux command mkdir FlightBegin and ends when the user executes the Linux command mkdir FlightEnd. Therefore, to log all CPU syscalls that occur while a process is running, run:

    mkdir FlightBegin; ./program; mkdir FlightEnd

Drone Physics Simulation
Taking the ideas from the WVU AtLAS project, I created a simple physics model of a drone that simulates:
- 3 throttles (X-axis, Y-axis, Z-axis)
- Gravity of 9.81 m/s^2
- Aerodynamic drag
This drone model connects to a virtual serial port and can be interfaced with through the following commands: TODO
The drone is connected to the virtual serial port (using a Linux PTY) of the Bochs VM.

Auto-Piloting The Virtual Serial Port Drone
A Python script running within the Bochs VM controls the drone. The controller goes through the following states:
- Take off: Apply Z-axis throttle to move up to cruising altitude.
- Cruising: Adjust the Z-axis throttle to counteract gravity, and adjust the X-axis and Y-axis throttles to move at a constant velocity to the destination.
- Landing: Turn off the X-axis and Y-axis throttles, and lower the Z-axis throttle so that the drone descends smoothly.
- Landed: The drone is on the ground and the Python script exits.

I have generated data logs from three scenarios:
- Normal operation
- Random delay: the controller randomly lags behind due to computationally expensive operations
- Random syscall: to test anomaly detection, the controller sometimes sends random UDP data to a port on 127.0.0.1

The Results
- 001_NORMAL_Flight.txt: Runs properly, with no code interfering with the performance.
- 002_BUSYDELAY_Flight.txt: Time is sometimes spent in code that factors large numbers. This makes the polling of the drone's sensors too slow; consequently, the drone flies too high before it comes down and crashes. It does not reach its destination.
- 003_SOCKETS_Flight.txt: During cruising mode, UDP packets are sent to random localhost ports. The flight proceeds properly, but there should be new syscalls mixed in.

Raw Dataset Format
For entries with SYSCALL (left to right):
- Timestamp
- "SYSCALL"
- RAX (the syscall number, which identifies the syscall; for the matching names, see: https://filippo.io/linux-syscall-table/)
- RDI (optional syscall argument)
- RSI (optional syscall argument)
- RDX (optional syscall argument)
- R10 (optional syscall argument)
- R8 (optional syscall argument)
- R9 (optional syscall argument)
- CR3 (page table pointer of the process that called this syscall)

For entries with SYSRET:
- Timestamp
- CR3 (page table pointer of the process that called this syscall)

Processed Dataset Format
Execute process_data.py to generate the processed dataset. Remember to use the full path to where your repository is saved to avoid a "File does not exist" error.
From left to right:
- T_i - T_(i-1) (difference between the current and previous timestamp values)
- SysCall ID
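
The relationship between the raw and processed formats can be illustrated with a minimal sketch. This is not the repository's process_data.py. It assumes, without confirmation from the description above, that raw fields are separated by whitespace, that numbers are written in decimal or with a 0x prefix, that only SYSCALL entries contribute to the processed output, and that the first entry gets a time difference of 0. The names parse_int, parse_syscall_lines, to_deltas, and SAMPLE are hypothetical, and the sample lines are made up for illustration rather than taken from the dataset.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_int(text: str) -> int:
    """Parse a decimal or 0x-prefixed hexadecimal integer (assumed encodings)."""
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def parse_syscall_lines(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (timestamp, syscall_id) for every SYSCALL entry.

    Field order follows the raw format: timestamp, the literal "SYSCALL",
    then RAX (the syscall number). Anything else is skipped, including
    SYSRET entries and lines that are too short.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[1] != "SYSCALL":
            continue
        yield parse_int(fields[0]), parse_int(fields[2])


def to_deltas(events: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Convert (timestamp, syscall_id) pairs to (T_i - T_(i-1), syscall_id)."""
    rows: List[Tuple[int, int]] = []
    previous = None
    for timestamp, syscall_id in events:
        # Assumption: the first event has no predecessor, so its delta is 0.
        delta = 0 if previous is None else timestamp - previous
        rows.append((delta, syscall_id))
        previous = timestamp
    return rows


# Hypothetical raw lines; the real files may use a different layout.
SAMPLE = [
    "1000 SYSCALL 1 3 0 16 0 0 0 0x1a2b000",
    "1004 SYSRET 0x1a2b000",
    "1010 SYSCALL 0 3 0 64 0 0 0 0x1a2b000",
    "1025 SYSCALL 44 5 0 8 0 0 0 0x1a2b000",
]

if __name__ == "__main__":
    for delta, syscall_id in to_deltas(parse_syscall_lines(SAMPLE)):
        print(delta, syscall_id)
```

To use it on a log file, the lines of an opened file can be passed to parse_syscall_lines in place of SAMPLE. If the real delimiter or number encoding differs, only parse_int and the split call need to change.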