Three types of company and technology datasets towards a unified graph model

Published: 24 October 2023 | Version 3 | DOI: 10.17632/vvdcjtkg8k.3
Contributor:
Hyuk-Yoon Kwon

Description

We collect three types of publicly available datasets.

1) Company Data: We collect data on promising small and medium-sized enterprises (SMEs) located in Gyeonggi Province from 2019 to 2021, sourced from the Gyeonggi Open Data Portal, "Gyeonggi Data Dream."

2) Technology Keyword Data: Using prompt engineering with the ChatGPT service [3], we define a set of 14 initial technology keywords. Because 14 keywords are too few to construct a graph, we extend them using a patent database: we extract core technology keyword sets through query expansion, starting from the initial technology keyword set.

3) Patent Data: We acquire patent disclosure and registration datasets from the Patent Office's KIPRIS PLUS patent information utilization service.

Finally, we construct the graph data. For a total of 47,385 patents in which the 727 collected SMEs appear as applicants, we model the relationship between each company and the technology keywords included in its patents as a heterogeneous graph. The graph has two types of nodes, Company (727 nodes) and Technology (1,957 nodes), connected by three types of edges: Use, which indicates that a particular company uses a particular technology; Share, which indicates that two companies share a common technology in their main products; and Relate, which indicates that two technologies are used in the same patent document.

We generate the graph as a data object provided by the pytorch-geometric library. So that the data can be reconstructed, we converted it into PyTorch tensors and stored it as a pickle file. The graph can be restored by calling "torch.load('Graph.pt')".

Note that the original datasets were written in Korean. We converted them to English for reference: (eng) init_keywords.txt, (eng) expanded_keywords.txt, (eng) company_data.csv, and (eng) patents.csv.
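As a rough illustration of how the three edge types relate to patent records, the sketch below derives Use, Share, and Relate edges from a tiny hand-made input. It is not the pipeline used to build Graph.pt: the toy records, the function name build_edges, and the rule that a Share edge arises whenever two companies have any Use keyword in common are assumptions made only for this example (the description above ties Share to the companies' main products).

```python
from itertools import combinations

# Hypothetical toy input: one tuple per patent, holding the applicant company
# and the technology keywords found in that patent. Not taken from the dataset.
patents = [
    ("CompanyA", ["sensor", "battery"]),
    ("CompanyB", ["battery", "display"]),
    ("CompanyA", ["sensor"]),
]


def build_edges(patents):
    """Derive the three edge sets from (company, keywords) patent records."""
    use = set()     # Use: (company, technology)
    relate = set()  # Relate: (technology, technology), stored as sorted pairs

    for company, keywords in patents:
        unique = sorted(set(keywords))
        # Use: the applicant uses every keyword appearing in its patent.
        for keyword in unique:
            use.add((company, keyword))
        # Relate: two technologies co-occur in the same patent document.
        relate.update(combinations(unique, 2))

    # Share (assumed rule): two companies have at least one keyword in common.
    companies_by_keyword = {}
    for company, keyword in use:
        companies_by_keyword.setdefault(keyword, set()).add(company)

    share = set()   # Share: (company, company), stored as sorted pairs
    for companies in companies_by_keyword.values():
        share.update(combinations(sorted(companies), 2))

    return use, share, relate


use_edges, share_edges, relate_edges = build_edges(patents)
```

Storing undirected pairs in sorted order keeps each Share and Relate edge from being counted twice. Mapping company and keyword names to integer indices would be a natural next step before placing the edges into tensors.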

Files

Institutions

Seoul National University of Science and Technology

Categories

Graph Labeling

Licence