CSIQ: A Synthesized Dataset of Code Smells, Issues and Quality related Artifacts from Open Source Repositories

Published: 5 August 2022 | Version 5 | DOI: 10.17632/77p6rzb73n.5
Sayed Mohsin Reza,


The dataset contains synthesized code smell, issue, quality, and source code metric information for 60 versions across 10 different repositories. The data is extracted at three levels: (1) class, (2) method, and (3) package. It was created by analyzing 9,420,246 lines of code and 173,237 classes. The dataset contains the following folders: code smells, issues, quality attributes, and synthesized, along with four associated comma-separated values (CSV) files: repositories.csv, versions.csv, codesmells.csv, and attribute-details.csv. The file repositories.csv contains general information about each repository (repository name, URL, number of commits, stars, forks, etc.) that indicates its size, popularity, and maintainability. The file versions.csv contains general information about each version (unique version ID, number of classes, packages, external classes, external packages, and version repository link) to give an overview of the versions and of how the repository grows over time. The file attribute-details.csv contains detailed information (attribute name, attribute short form, category, and description) about the extracted static analysis metrics and code quality attributes. The short form is used in the dataset itself as a unique identifier for the values reported for packages, classes, and methods. The file codesmells.csv provides information (rule, code smell, rule description) about the code smells analyzed in each version.
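As an illustration of how the short forms in attribute-details.csv could be resolved to full attribute names, here is a minimal Python sketch. The column headers (`attribute_name`, `short_form`) and the sample rows are assumptions made for this example and are not taken from the dataset; check the actual header row of the file and adjust the arguments accordingly.

```python
import csv
import io


def load_rows(text_stream):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(text_stream))


def build_attribute_lookup(rows, short_col="short_form", name_col="attribute_name"):
    """Map each attribute short form to its full attribute name.

    The default column names are hypothetical; pass the real header
    names used in attribute-details.csv.
    """
    return {row[short_col].strip(): row[name_col].strip() for row in rows}


# Made-up sample standing in for attribute-details.csv (not real dataset rows).
SAMPLE = """attribute_name,short_form,category,description
Lines of Code,LOC,Size,Number of code lines
Coupling Between Objects,CBO,Coupling,Number of coupled classes
"""

rows = load_rows(io.StringIO(SAMPLE))
lookup = build_attribute_lookup(rows)
print(lookup["LOC"])
```

With the downloaded dataset, the same functions could be applied to the real file by opening it with `open("attribute-details.csv", newline="")` and passing the file object to `load_rows`.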


Steps to reproduce

Follow these steps to reproduce this dataset: (1) visit the version link listed in versions.csv; (2) download the version from that link; (3) use the CODEMR tool to analyze each version; (4) export the analysis results as CSV data.
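Steps (1) and (2) can be partly scripted. The sketch below only builds a list of (link, target folder) pairs from versions.csv and does not download anything; the analysis and export in steps (3) and (4) still happen in the tool itself. The column names `version_id` and `version_link`, the `versions` output folder, and the example.org URLs are assumptions for illustration, not values from the dataset.

```python
import csv
import io
from pathlib import PurePosixPath


def plan_downloads(rows, id_col="version_id", link_col="version_link", out_dir="versions"):
    """Return (link, target_folder) pairs for every row that has a link.

    The default column names are hypothetical; use the header names
    that actually appear in versions.csv.
    """
    plan = []
    for row in rows:
        link = row[link_col].strip()
        if not link:
            continue  # skip rows without a version link
        target = PurePosixPath(out_dir) / row[id_col].strip()
        plan.append((link, str(target)))
    return plan


# Made-up sample standing in for versions.csv (not real dataset rows).
SAMPLE = """version_id,version_link
v1,https://example.org/repo/v1.zip
v2,
v3,https://example.org/repo/v3.zip
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
plan = plan_downloads(rows)
for link, target in plan:
    print(f"{link} -> {target}")
```

Each planned link can then be opened or fetched manually, or with a download tool of choice, before the version is loaded into the analysis tool.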


University of Texas at El Paso, University of Texas at El Paso College of Engineering, University of Texas at El Paso College of Science


Data Mining, Software Evolution, Code Refactoring, Code Metrics, Software Quality Assurance