Portuguese Public Procurement Data (PPPData)
PPPData comprises 5214 records (public procurement contracts with unique properties), each characterised by 37 fields. The data was collected throughout 2022 from two websites: Portal Base, the Portuguese national repository of public procurement procedures, and Diário da República Eletrónico, the official national gazette of Portugal. The gathered data was restricted to public construction project contracts with a closing date between 2015 and 2022. The resulting dataset was stored in a Microsoft Excel file (.xlsx). It provides filterable and queryable data to support the development of statistical analysis techniques and AI-based applications for the construction procurement phase. Researchers, construction analysts, bidders, and clients can benefit from this structured and systematised procurement information. ML and/or NLP researchers may use this data to train and test new models for both supervised and unsupervised learning problems, for instance to gain insights into the procurement factors that affect a project's performance and to assess a procurement's success at its early stages.
Steps to reproduce
The methodology began with the collection of information from Portal Base. This collection was restricted to public construction project contracts with closing dates between 2015 and 2022. Contracts prior to 2015 were excluded because changes to the Diário da República Eletrónico (DRE) broke the hyperlinks between it and Portal Base. This first phase yielded 5253 contracts, whose data was extracted and stored by a web scraper. Among the extracted fields of a contract, there may be a URL pointing to the procedure's notice file published on the DRE website. If the URL existed, the PDF of the procedure notice was downloaded and its information was collected and stored by a PDF scraper. If it did not exist, an error message was appended to the contract in its place. Next, the contractual data from Portal Base and the procedure notices from the DRE were compiled in a single JSON file, gathering all the information about the 5253 identified contracts. Subsequently, the JSON file was exported to Microsoft Excel (.xlsx), where the data was processed manually to homogenise the information, clean erroneous or missing data, and remove outliers. This processing reduced the number of contracts to 5214. Finally, a script was developed to translate all of the text variables from Portuguese to English. This script used deep-translator 1.9.1, a Python package, to translate the Short Description, Award Criteria, Justification for deadline change, and Justification for price change, as these were written in an unstructured way. Other text variables were translated using simple Python dictionaries. Following the script translation, the translated variables were manually proofread. This process produced an English-language database of 5214 public construction procurement contracts.
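The compilation and translation steps described above can be illustrated with a minimal sketch. It is not the authors' script. The "Procedure type" field, the lookup table, the wording of the error message, and the function names are hypothetical. The free-text translator is passed in as a callable so that the example runs offline; with deep-translator it could be, for example, GoogleTranslator(source="pt", target="en").translate, although the source does not state which backend was used.

```python
import json
from typing import Callable, Optional

# Free-text fields named in the description above.
FREE_TEXT_FIELDS = (
    "Short Description",
    "Award Criteria",
    "Justification for deadline change",
    "Justification for price change",
)

# Hypothetical wording; the actual error message is not given in the source.
NOTICE_ERROR = "ERROR: procedure notice not available"

# Hypothetical lookup table for one categorical field (assumed for illustration).
PROCEDURE_TYPE_PT_EN = {
    "Concurso público": "Open tender",
    "Ajuste direto": "Direct award",
}


def merge_contract(contract: dict, notice: Optional[dict]) -> dict:
    """Combine one Portal Base record with its DRE notice data, if any."""
    merged = dict(contract)
    # When no notice could be retrieved, store an error message in its place.
    merged["notice"] = notice if notice is not None else NOTICE_ERROR
    return merged


def translate_record(record: dict, translate_text: Callable[[str], str]) -> dict:
    """Translate one record: dictionary lookup for categories, callable for free text."""
    out = dict(record)
    # Categorical field: dictionary lookup, leaving unknown values untouched.
    if "Procedure type" in out:
        value = out["Procedure type"]
        out["Procedure type"] = PROCEDURE_TYPE_PT_EN.get(value, value)
    # Free-text fields: delegate to the supplied translator, skipping empty values.
    for field in FREE_TEXT_FIELDS:
        if out.get(field):
            out[field] = translate_text(out[field])
    return out


if __name__ == "__main__":
    # Stand-in translator so the demo needs no network access.
    def stub_translate(text: str) -> str:
        return f"[en] {text}"

    contract = {
        "Procedure type": "Ajuste direto",
        "Short Description": "Reabilitação de escola",
    }
    merged = merge_contract(contract, notice=None)
    translated = translate_record(merged, stub_translate)
    print(json.dumps(translated, ensure_ascii=False, indent=2))
```

Keeping the translator as a parameter separates the dictionary-based and machine-translated paths, which also makes the manual proofreading step easier to target at the free-text fields.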
CONSTRUCT Institute of R&D in Structures and Constructions
European Social Fund