Claims Management Log Dataset with Digital Documents
Description
This is an event log dataset from a real-world claims management process of a mid-sized German insurance company. It is used in the article "Utilizing the Omnipresent: Incorporating Digital Documents into Predictive Process Monitoring Using Deep Neural Networks". This event log is special in that it associates individual events with external context information in the form of digital documents with multiple pages. These digital documents were either received or produced during each event.

The event log ("process_log.csv") is provided as a CSV file and contains the following attributes:

* timestamp: timestamp of the event
* instance_id: unique identifier of the process instance
* state: event type as integer
* state_name: event type
* type: damage type (instance outcome) as integer
* type_name: damage type (instance outcome)
* file_name: file name of the associated digital document
* n_pages: number of pages in the associated document
* time_since_last_event: elapsed seconds since the last event occurred
* log_time_since_last_event: natural logarithm of elapsed seconds since the last event occurred

The original digital documents are stored in the PDF format and contain up to ten pages. For reasons of data privacy, they can't be published directly. However, to enable research to utilize this data, the digital documents are published as feature vectors extracted by established pretrained neural networks (feature extractors). While these feature vectors cannot be used to reconstruct the source document, they contain meaningful information that can be used in applications such as Predictive Process Monitoring (PPM).

Feature vectors are extracted using four models:

* VGG-16 pretrained on the ImageNet dataset
* VGG-16 pretrained on the RVL-CDIP dataset
* BERT pretrained on German texts
* LayoutXLM pretrained on multilingual document data

They are stored as zipped numpy arrays (e.g., "features/vgg_rvl.zip").
The file name serves as the unique key to link the digital documents to their corresponding events in the log. Details and references are provided in the associated article. Finally, we also publish the exact data splits ("folds_and_splits.csv") that were used for model evaluation in the associated article.
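
As an illustration of how the file name can act as the link between the log and the feature archives, the following is a minimal sketch using only the Python standard library. It is not the procedure from the associated article. Several points are assumptions, because the internal layout of the archives is not described here: that "process_log.csv" is comma-separated, that each archive holds one array file per document, and that an array file's base name (without extension) equals the base name of the corresponding file_name value. The helper names `read_event_log`, `index_archive`, and `link_documents` are hypothetical.

```python
import csv
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def read_event_log(csv_path, delimiter=","):
    """Read the event log and group its rows by process instance.

    ASSUMPTION: the file is comma-separated; pass another delimiter if not.
    Events are kept in the order in which they appear in the file.
    """
    traces = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            traces[row["instance_id"]].append(row)
    return dict(traces)


def index_archive(zip_path):
    """Map each archive member's base name (no extension) to its full name.

    ASSUMPTION: one member per document, named after the document.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return {
            PurePosixPath(name).stem: name
            for name in archive.namelist()
            if not name.endswith("/")  # skip directory entries
        }


def link_documents(traces, member_index):
    """Pair every event with the archive member for its document.

    Returns (linked, missing): linked is a list of
    (instance_id, state_name, member_name) tuples, and missing lists the
    file_name values for which no member was found.
    """
    linked, missing = [], []
    for instance_id, events in traces.items():
        for event in events:
            key = PurePosixPath(event["file_name"]).stem
            member = member_index.get(key)
            if member is None:
                missing.append(event["file_name"])
            else:
                linked.append((instance_id, event["state_name"], member))
    return linked, missing
```

Under these assumptions, calling `link_documents(read_event_log("process_log.csv"), index_archive("features/vgg_rvl.zip"))` would yield the event-to-member pairs, and each member could then be read from the archive and passed to `numpy.load`. A non-empty `missing` list would indicate that the naming assumption does not hold for the archive at hand and that the key derivation needs adjusting.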