Pypip packages. Their vulnerabilities and dependencies
Description
This dataset is built from two sources: the PyPI JSON Data repository and the Python Packaging Advisory Database. It was assembled for a university project to support the study of supply chain attacks in the PyPI ecosystem. The PyPI JSON Data repository provides metadata on Python packages, including their names, dependencies, and basic vulnerability information. The Python Packaging Advisory Database provides security advisories that link packages to known vulnerabilities through CVE identifiers. The dataset is meant to simplify the analysis of supply chain attacks, in which a vulnerability in one package can propagate through dependencies and potentially affect thousands of users. Package versions are not tracked. This decision reduces complexity and keeps the focus on broader patterns of how vulnerabilities spread through the ecosystem. By combining package metadata with security data, the dataset supports research into dependency-related risk in the PyPI ecosystem.
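As an illustration of the propagation idea, the following minimal sketch finds which packages reach a vulnerable package through their dependencies. It assumes each record is a dictionary with the hypothetical keys `name`, `dependencies`, and `vulnerabilities`; the actual field names in the published file may differ, and `exposed_packages` is an illustrative helper, not part of the dataset.

```python
from collections import defaultdict, deque


def exposed_packages(records):
    """Map each package to the vulnerable packages it reaches via dependencies.

    `records` is assumed to be a list of dicts with the hypothetical keys
    "name", "dependencies" (list of names) and "vulnerabilities" (list of IDs).
    """
    # Reverse edges: dependency -> packages that require it directly.
    dependents = defaultdict(set)
    for record in records:
        for dep in record["dependencies"]:
            dependents[dep].add(record["name"])

    exposure = defaultdict(set)
    for record in records:
        if not record["vulnerabilities"]:
            continue
        source = record["name"]
        seen = {source}
        queue = deque([source])
        # Breadth-first walk up the reverse edges; `seen` guards against cycles.
        while queue:
            current = queue.popleft()
            for parent in dependents[current]:
                if parent not in seen:
                    seen.add(parent)
                    exposure[parent].add(source)
                    queue.append(parent)
    return dict(exposure)
```

Because versions are not tracked, a result from this sketch only says that some version of a package depends on some version of a vulnerable package; it is an upper bound on exposure, not a confirmed finding.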
Steps to reproduce
To reproduce the dataset, first gather data from the two sources. Clone the PyPI JSON Data repository from GitHub, which provides metadata for Python packages, including package names, dependencies, and basic vulnerability information. The repository stores the data for each package as a JSON file under the pypi-json-data/release_data/ directory. Then clone the Python Packaging Advisory Database, which contains security advisories and vulnerability reports (CVE data) for PyPI packages. The vulnerability data is located in the advisory-database/vulns/ directory.

Once both repositories are downloaded, extract the relevant information. For the package data, parse the JSON files to collect the package name, its dependencies (requires_dist), and any listed vulnerabilities. For the CVE data, collect the vulnerabilities that affect each package. The package name is the key that links the package data to the advisory data.

Finally, link the packages to their vulnerabilities: iterate over the PyPI package list and associate each package with the CVEs from the advisory database that affect it. Write the combined data (package name, dependencies, and linked vulnerabilities) to a final JSON file. This linked dataset can then be used to analyze supply chain vulnerabilities in the PyPI ecosystem.
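The steps above can be sketched roughly as follows. This is not the authors' script, and it rests on several assumptions:

- All function names and the record keys `name`, `dependencies`, and `vulnerabilities` are hypothetical.
- Each JSON file under release_data/ is assumed to be named after its package, with `requires_dist` lists somewhere inside it.
- Each advisory is assumed to sit at vulns/<package>/<advisory id>.yaml.
- Advisories are identified by file name. The CVE aliases live inside the YAML files, and reading them would need a YAML parser such as PyYAML.

```python
import json
import re
from pathlib import Path

# Leading project name of a requirement such as "requests (>=2.0) ; extra == 'socks'".
_REQ_NAME = re.compile(r"\s*([A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?)")


def normalize(name):
    """Normalize a project name: lowercase, runs of '-', '_', '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def dependency_names(requires_dist):
    """Reduce requirement strings to sorted, de-duplicated names (versions dropped)."""
    names = set()
    for requirement in requires_dist or []:
        match = _REQ_NAME.match(requirement)
        if match:
            names.add(normalize(match.group(1)))
    return sorted(names)


def find_requires_dist(node):
    """Collect every requires_dist entry found anywhere in a parsed JSON document."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "requires_dist" and isinstance(value, list):
                found.extend(v for v in value if isinstance(v, str))
            else:
                found.extend(find_requires_dist(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_requires_dist(item))
    return found


def load_packages(release_dir):
    """Assumption: one JSON file per package, named after the package."""
    packages = {}
    for path in Path(release_dir).rglob("*.json"):
        with path.open(encoding="utf-8") as handle:
            packages[path.stem] = find_requires_dist(json.load(handle))
    return packages


def load_advisories(vulns_dir):
    """Assumption: advisories are stored as vulns/<package>/<advisory id>.yaml."""
    advisories = {}
    for path in Path(vulns_dir).glob("*/*.yaml"):
        advisories.setdefault(path.parent.name, []).append(path.stem)
    return advisories


def link(packages, advisories):
    """Join packages and advisories on the normalized package name."""
    by_name = {}
    for name, ids in advisories.items():
        by_name.setdefault(normalize(name), set()).update(ids)
    records = []
    for name, requires_dist in sorted(packages.items()):
        key = normalize(name)
        records.append({
            "name": key,
            "dependencies": dependency_names(requires_dist),
            "vulnerabilities": sorted(by_name.get(key, ())),
        })
    return records


def build_dataset(release_dir, vulns_dir, out_path):
    """Run the three steps and write the linked records to a JSON file."""
    records = link(load_packages(release_dir), load_advisories(vulns_dir))
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump(records, out, indent=2)
    return records
```

Calling `build_dataset("pypi-json-data/release_data", "advisory-database/vulns", "linked_packages.json")` would run the three steps end to end; the output file name is a placeholder. Normalizing names on both sides before joining avoids missed links when the two sources spell a package differently, for example `Zope.Interface` and `zope-interface`.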