Deep-tech Keywords - EIT and Public report Thesauruses (2023)
Description
These data were collected as part of a project to identify which factors facilitate the emergence of deep-tech startups in regions, using the entrepreneurial ecosystems framework. To identify deep-tech startups in large datasets, we created thesauruses of keywords related to technologies considered deep-tech at the end of 2023, and we searched for these keywords (or combinations thereof, as described under Steps to reproduce) in the available text fields. The keywords were collected from two types of sources: 1) the European Institute of Innovation and Technology (EIT) taxonomy of deep-tech; 2) public resources from actors working in the deep-tech field, such as research organizations, data providers, authors, and investors in the sector. Links and sources are listed in both files. These thesauruses can be used as a starting point for further improvement and refinement.
Files
Steps to reproduce
The thesauruses should be used as follows. We categorized keywords into three groups: 1) keywords that are sufficient on their own to classify a startup as deep-tech, 2) keywords related to technologies that must be paired with specific applications to classify a startup as deep-tech, and 3) keywords related to application fields. A startup was considered deep-tech if text analysis (for example, of its public description) revealed at least one keyword from group 1, or a combination of at least one group 2 keyword (technology-related but not independently sufficient) with at least one group 3 keyword (application-related). The thesauruses are not highly precise at the level of the individual startup but are quite precise at aggregate levels (for example, regional or national). The keywords in the thesauruses are the result of a refinement process, so not all keywords available in the cited sources are present in the files. This is because we noted that certain keywords generated more false positives (startups classified as deep-tech that were not in reality) than true positives. Keywords that generated false positives were carefully evaluated, and were removed or moved to a different group (for example, from group 1 to group 2) only if the benefit of doing so outweighed the benefit of leaving them unchanged, since removing or moving a keyword could introduce other errors, such as no longer recognizing a genuine deep-tech startup.
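The classification rule described above can be sketched in Python as follows. This is a minimal illustration, not the code used in the project: the function names, the case-insensitive whole-word matching, and the three sample keyword sets are all assumptions made for the example, and the sample keywords are placeholders rather than entries taken from the files.

```python
import re

# Placeholder keyword sets (hypothetical; load the real ones from the thesaurus files).
GROUP1 = {"quantum computing"}   # sufficient on their own
GROUP2 = {"machine learning"}    # technology, needs an application
GROUP3 = {"drug discovery"}      # application field


def contains_keyword(text, keywords):
    """Return True if any keyword occurs in text as a whole word or phrase.

    Matching is case-insensitive; this matching strategy is an assumption.
    """
    lowered = text.lower()
    for keyword in keywords:
        # Lookarounds prevent matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            return True
    return False


def is_deep_tech(description, group1, group2, group3):
    """Apply the rule: a group 1 keyword, or a group 2 plus a group 3 keyword."""
    if contains_keyword(description, group1):
        return True
    return (contains_keyword(description, group2)
            and contains_keyword(description, group3))
```

For example, with these placeholder sets, `is_deep_tech("Machine learning for drug discovery.", GROUP1, GROUP2, GROUP3)` combines a group 2 and a group 3 keyword and is classified as deep-tech, whereas a description mentioning only "machine learning" is not.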