Software code quality and source code metrics dataset

Published: 24 March 2022 | Version 3 | DOI: 10.17632/77p6rzb73n.3
Sayed Mohsin Reza,


The dataset contains code quality and source code metrics for 60 versions across 10 different repositories. Metrics are extracted at three levels: (1) class, (2) method, and (3) package. The dataset was created by analyzing 9,420,246 lines of code and 173,237 classes. It consists of one quality_attributes folder and three associated files: repositories.csv, versions.csv, and attribute-details.csv. The first file (repositories.csv) contains general information about each repository (repository name, repository URL, number of commits, stars, forks, etc.) to convey its size, popularity, and maintainability. The file versions.csv contains general information about each version (unique version ID, number of classes, packages, external classes, external packages, and version repository link) to give an overview of the versions and of how each repository grows over time. The file attribute-details.csv contains detailed information (attribute name, attribute short form, category, and description) about the extracted static analysis metrics and code quality attributes. In the dataset itself, the short form serves as the unique identifier under which each attribute's values are reported for packages, classes, and methods.
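As a first look at the three metadata files, the following minimal Python sketch reports the column names and row count of each file it finds. The file names come from the description above; the directory path passed to `summarize_dataset` and the helper names themselves are assumptions made for illustration, and the actual column headers are not stated here, so the code reads them from the files rather than hard-coding them.

```python
import csv
from pathlib import Path

# File names are taken from the dataset description; where they live on disk
# is an assumption, so the root directory is passed in by the caller.
METADATA_FILES = ("repositories.csv", "versions.csv", "attribute-details.csv")


def summarize_csv(path):
    """Return (column_names, row_count) for a single CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)  # data rows only; header excluded
        return list(reader.fieldnames or []), row_count


def summarize_dataset(root):
    """Summarize every metadata file that exists under `root`."""
    root = Path(root)
    return {
        name: summarize_csv(root / name)
        for name in METADATA_FILES
        if (root / name).is_file()
    }


if __name__ == "__main__":
    # "." is a placeholder for wherever the dataset was unpacked.
    for name, (columns, rows) in summarize_dataset(".").items():
        print(f"{name}: {rows} rows, columns = {columns}")
```

Reading the headers at run time keeps the sketch independent of the exact column spelling used in the published files.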


Steps to reproduce

To reproduce this dataset: (1) visit the version link listed in versions.csv; (2) download that version from the link; (3) analyze each version with the CODEMR tool; (4) export the analysis results as data.
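Steps (1) and (2) could be scripted roughly as follows. This is a sketch under stated assumptions: the name of the column holding the version link is not given here, so it is passed in as a parameter, and the `download` helper assumes each link resolves directly to a downloadable file, which may not hold if the links point to repository web pages. Steps (3) and (4) are performed in the CODEMR tool and are not automated in this sketch.

```python
import csv
import urllib.request
from pathlib import Path


def read_version_links(versions_csv, link_column):
    """Collect the non-empty values of `link_column` from versions.csv.

    `link_column` is a hypothetical parameter: check the real header row
    of versions.csv and pass the matching column name.
    """
    with open(versions_csv, newline="", encoding="utf-8") as handle:
        return [
            row[link_column].strip()
            for row in csv.DictReader(handle)
            if (row.get(link_column) or "").strip()
        ]


def download(url, dest_dir):
    """Save the resource at `url` into `dest_dir` and return the local path.

    Assumes the URL points at a file; a link to a web page would simply
    save that page's HTML.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    file_name = url.rstrip("/").rsplit("/", 1)[-1] or "download"
    target = dest_dir / file_name
    urllib.request.urlretrieve(url, target)
    return target


# Example usage (column name and output directory are placeholders):
# for link in read_version_links("versions.csv", "<link column name>"):
#     print(download(link, "versions"))
```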


University of Texas at El Paso, University of Texas at El Paso College of Engineering


Machine Learning, Software Design, Software Quality Assurance