Process Understandability for DFG Notation

Published: 19 November 2024 | Version 1 | DOI: 10.17632/2247f6kygy.1

Description

This data set consists of 3000 JPEG files and an MS Excel file that contains the attributes of each JPEG file. It was prepared to measure process understandability and structuredness for diagrams in Directly-follows Graph (DFG) notation. The main input for creating a DFG is a transition matrix built from event logs. Disregarding how many times a loop between nodes repeats in a control-flow diagram, and including self-loops, there can be at most n^2 distinct process diagrams with n nodes. Under these conditions, a 10-step process has 100 possible process diagrams.
From a practical point of view, a diagram with fewer than 3 nodes is usually not considered a process, and creating process diagrams with more than 43 nodes causes system performance problems, so the number of nodes in the data set was restricted to the range 3 to 43. The complete set therefore contains 27429 diagrams (the sum of n^2 for n = 3 to 43). Of these possible process diagrams, 3000 were generated randomly with Python. The visual data set sample thus corresponds to 10.94% of the possible diagrams in the selected range and can represent all possible process diagrams in the defined universe.
For simplification, all edges have the same thickness, activity names are standardized (Activity1, Activity2, etc.), artificial start/end nodes are not included, and diagrams run from left to right. Starting from 3 nodes, the process diagram generation algorithm determined the number of arcs and created the transition matrix randomly. A process diagram was then created from this information, and the numbers of nodes, arcs, and self-loops were saved in the MS Excel spreadsheet.
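
The figures and the generation step described above can be checked with a minimal sketch. It is not the data set's actual generation script: the function names, the uniform choice of the arc count between 1 and n^2, and the 0/1 matrix encoding are assumptions made for illustration only.

```python
import random

MIN_NODES, MAX_NODES = 3, 43   # node range stated in the description
SAMPLE_SIZE = 3000             # number of generated diagrams

# Universe size as defined in the description: n^2 diagrams per node count n.
UNIVERSE_SIZE = sum(n * n for n in range(MIN_NODES, MAX_NODES + 1))
SAMPLE_SHARE_PCT = 100 * SAMPLE_SIZE / UNIVERSE_SIZE


def random_transition_matrix(n, rng):
    """Hypothetical generator: pick an arc count, then place arcs at random.

    Assumption: the arc count is drawn uniformly from 1..n^2 and each arc
    is a 1 in an n x n matrix (row = source node, column = target node).
    """
    num_arcs = rng.randint(1, n * n)
    cells = [(i, j) for i in range(n) for j in range(n)]
    matrix = [[0] * n for _ in range(n)]
    for i, j in rng.sample(cells, num_arcs):
        matrix[i][j] = 1
    return matrix


def count_elements(matrix):
    """Count nodes, arcs, and self-loops (diagonal entries) of a 0/1 matrix."""
    n = len(matrix)
    arcs = sum(sum(row) for row in matrix)
    self_loops = sum(matrix[i][i] for i in range(n))
    return {"nodes": n, "arcs": arcs, "self_loops": self_loops}


if __name__ == "__main__":
    print(UNIVERSE_SIZE, round(SAMPLE_SHARE_PCT, 2))
    print(count_elements(random_transition_matrix(5, random.Random(0))))
```

The three counts returned by `count_elements` correspond to the values the description says were saved to the spreadsheet for each diagram.
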
From the saved numbers of nodes, arcs, and self-loops, the following columns were obtained:
- Number of Nodes (Size)
- Number of Arcs
- Total Number of Elements (including all nodes and arcs)
- Number of Self Loops (arcs starting and ending at the same node)
- % of All Possible Behaviors (Density)
- Arcs per Node (CNC: Coefficient of Network Connectivity)
- Arcs per Node Excluding Self Loops (CNCX: Coefficient of Network Connectivity Excluding Self Loops)
- Logarithm of Arcs per Node (LogCNC)
- Logarithm of Arcs per Node Excluding Self Loops (LogCNCX)
In addition to the input variables above, a process expert made an initial evaluation of whether a given diagram is structured or not; the result is given in the Classification column. In the evaluation phase, alongside the adjectives structured/unstructured, simple/complex, spaghetti-like, and easy/hard to understand, and the metrics used in the literature, 3 other guiding criteria were included:
- Is it possible to follow the flow and read it as a process diagram?
- What would be the effort to make the process diagram more structured?
- Would this process diagram be an acceptable output for a customer?
The Classification column is the output variable on a 0-1 scale, where 0 is structured and 1 is unstructured.
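
The derived columns listed above could be computed from the three saved counts roughly as follows. This is a sketch under stated assumptions, not the authors' code: the description does not give the formulas, so density is taken here as arcs divided by n^2 (the maximum number of arcs including self-loops), the logarithm is assumed to be the natural logarithm, and the function and key names are hypothetical.

```python
import math


def dfg_metrics(nodes, arcs, self_loops):
    """Derive the spreadsheet-style metrics from node, arc, and self-loop counts.

    Assumptions (not confirmed by the data set description):
    - density is arcs / nodes^2, expressed as a percentage;
    - logarithms are natural logarithms;
    - a logarithm is reported as None when its argument is zero.
    """
    cnc = arcs / nodes                   # arcs per node
    cncx = (arcs - self_loops) / nodes   # arcs per node excluding self-loops
    return {
        "total_elements": nodes + arcs,
        "density_pct": 100 * arcs / nodes ** 2,
        "cnc": cnc,
        "cncx": cncx,
        "log_cnc": math.log(cnc) if cnc > 0 else None,
        "log_cncx": math.log(cncx) if cncx > 0 else None,
    }


if __name__ == "__main__":
    # Example: 4 nodes, 8 arcs, 2 of which are self-loops.
    print(dfg_metrics(4, 8, 2))
```

The guard on the logarithms matters for diagrams whose only arcs are self-loops, where CNCX is zero and its logarithm is undefined.
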

Institutions

Istanbul Teknik Universitesi

Categories

Data Science, Machine Learning

Licence