This dataset of student projects was used in a case study on code idiom inference. The projects are ordinary web applications built with the ASP.NET framework; they share the same structure and follow similar practices. The dataset contains over 40 projects and 4000 C# files. The resulting code idioms were inferred using the Type-Based Markov Chain Monte Carlo method and, for clarity, are presented here without the context needed for the API. The tool used to infer the code idioms is available here: https://github.com/lukic-aleksandar/RoseLibML.
Steps to reproduce
We ran the inference process with the following setup:
- We fixed the probability of marking a node as a fragment root while creating the initial fragments to 0.9. This value was chosen so that the initial steps can be larger.
- We set the parameter of the geometric distribution given in Equation 2 to 0.0085. This value was chosen so that the penalty for fragment size is not too strong.
- The value of alpha can be tuned to suit the dataset. We experimented with a range of values; the value used to infer the idioms is 5.
- The burn-in period for the run was 75 iterations, and the total number of iterations was 100.
- The threshold for the number of occurrences of a single idiom was 3. The goal was to explore the dataset, so we set it quite low. The threshold can be made to depend on an idiom's size or otherwise adjusted to suit the dataset.
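The settings listed above can be collected in one place, as in the following minimal Python sketch. It is not the RoseLibML configuration format: the class and field names are hypothetical and chosen only for illustration. The sketch also assumes that Equation 2 is the standard geometric probability mass function over fragment sizes 1, 2, 3, ..., and that the 75 burn-in iterations are counted within the 100 total iterations; neither assumption is confirmed by the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceSettings:
    """Hypothetical container for the run settings; names are illustrative only."""
    initial_root_probability: float = 0.9   # chance a node starts as a fragment root
    geometric_p: float = 0.0085             # parameter of the geometric size prior
    alpha: float = 5.0                      # value of alpha used to infer the idioms
    burn_in_iterations: int = 75
    total_iterations: int = 100
    idiom_occurrence_threshold: int = 3     # minimum occurrences to report an idiom


def geometric_pmf(size: int, p: float) -> float:
    """Standard geometric pmf on sizes 1, 2, 3, ...: (1 - p)^(size - 1) * p.

    Assumed form of the size prior; the actual Equation 2 is not shown here.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return (1.0 - p) ** (size - 1) * p


def size_penalty_ratio(size: int, p: float) -> float:
    """Prior probability of a fragment of `size` nodes relative to a single node."""
    return geometric_pmf(size, p) / geometric_pmf(1, p)


settings = InferenceSettings()

# Assumption: burn-in is part of the total, so this many iterations remain after it.
iterations_after_burn_in = settings.total_iterations - settings.burn_in_iterations

# Compare how quickly the prior decays with fragment size for the chosen parameter.
for size in (1, 10, 50):
    print(size, size_penalty_ratio(size, settings.geometric_p))
```

Under these assumptions, a small parameter such as 0.0085 makes the prior decay slowly with fragment size, which matches the stated aim of not penalizing larger fragments too strongly; a larger parameter would shrink the ratio much faster.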