Vulnerability-related Source Code Comments in GitHub
Description
This artefact is a repository containing the collected dataset, which includes (i) 1,491 vulnerability-related source code comments extracted from GitHub, and (ii) our manual analysis of those 1,491 code comments. It aims to enable researchers to replicate our dataset and reuse it in further research.
Steps to reproduce
The data were extracted in two steps: (i) definition of keywords and (ii) extraction of source code comments.

A. Vulnerability Identifiers Definition

To define the vulnerability identifiers, we first extracted all identifiers from two popular vulnerability databases: the National Vulnerability Database (NVD) (https://nvd.nist.gov/) and the Coordination Center of the Computer Emergency Response Team (CERT/CC) Vulnerability Notes Database (https://www.kb.cert.org/vuls/). Across both databases, we found four unique vulnerability identifiers plus one additional keyword that indicates the vulnerability scoring system:

1. CVE id (Common Vulnerabilities and Exposures identification number), which identifies a publicly disclosed cybersecurity vulnerability.
2. CWE id (Common Weakness Enumeration specification), which represents a vulnerability type.
3. CPE Name (Common Platform Enumeration), a structured naming scheme for information technology systems, software, and packages.
4. VU Notes, which include summaries, technical details, remediation information, and lists of affected vendors.
5. CVSS Metric (Common Vulnerability Scoring System), an open framework for communicating the characteristics and severity of software vulnerabilities.

B. Source Code Comments Extraction

The source code comments were extracted using the same procedure as prior work (Hata et al., 2019). By applying regular expressions for our five defined keywords, we extracted code comments from 32,007 GitHub repositories across seven languages: C, C++, Java, JavaScript, PHP, Python, and Ruby (on August 10, 2020). We selected these languages because they ranked consistently among the top 10 languages on GitHub between 2014 and 2019 (based on the number of pull requests, pushes, stars, and issues). The extraction yielded 6,751 comments that contain at least one keyword. After removing duplicates, we obtained 1,491 distinct comments.
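
The keyword matching and deduplication described above can be illustrated with a minimal sketch. This is not the script used to build the dataset: the exact regular expressions are not given in this description, so the patterns below are assumptions based on the commonly seen written forms of each identifier, and the function names (matched_keywords, extract_distinct) are hypothetical. Treating two comments as duplicates when they are identical after whitespace normalization is likewise an assumption.

```python
import re

# Assumed patterns (not taken from the dataset): one per keyword,
# modelled on the usual written form of each identifier.
KEYWORD_PATTERNS = {
    "CVE": r"\bCVE-\d{4}-\d{4,}\b",        # e.g. CVE-2019-1234
    "CWE": r"\bCWE-\d+\b",                 # e.g. CWE-79
    "CPE": r"\bcpe:(?:/|2\.3:)[aho]:",     # e.g. cpe:/a:... or cpe:2.3:a:...
    "VU": r"\bVU#\d+\b",                   # e.g. VU#123456
    "CVSS": r"\bCVSS",                     # e.g. CVSS, CVSSv3, CVSS:3.1/...
}

# One combined expression; each alternative is a named group so the
# matching keyword can be recovered from the match object.
KEYWORD_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in KEYWORD_PATTERNS.items()),
    re.IGNORECASE,
)


def matched_keywords(comment):
    """Return the set of keyword names found in a single comment."""
    return {match.lastgroup for match in KEYWORD_RE.finditer(comment)}


def extract_distinct(comments):
    """Keep comments containing at least one keyword, dropping duplicates.

    Assumption: duplicates are comments that are identical after
    collapsing runs of whitespace.
    """
    seen = set()
    distinct = []
    for comment in comments:
        normalized = " ".join(comment.split())
        if not matched_keywords(normalized):
            continue
        if normalized in seen:
            continue
        seen.add(normalized)
        distinct.append(normalized)
    return distinct


if __name__ == "__main__":
    sample = [
        "// Fix for CVE-2019-1234",
        "// Fix for   CVE-2019-1234",
        "# plain comment",
        "/* see CWE-79 and CVSS score */",
    ]
    for comment in extract_distinct(sample):
        print(sorted(matched_keywords(comment)), comment)
```

In this sketch the second sample comment is dropped as a duplicate of the first and the third is dropped because it contains no keyword, which mirrors the reduction from keyword-matching comments to distinct comments described above.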