Python elaboration of Patstat and Orbit data

Published: 11-03-2021 | Version 1 | DOI: 10.17632/gfnhp8r52y.1
Riccardo Priore


The files in this repository illustrate a patent intelligence approach focused on technical implementations concerning the geographic mapping of the marine environment. A search query run on the Orbit Intelligence database (Questel) returned 315 records, which were downloaded; their priority years range from 2015 to October 2020. The files make it possible to reproduce the results of a methodology designed to enrich the information available from a typical patent search. The first goal is twofold: to associate each patent family in the original dataset with a list of technical concepts described in a concise style, and to produce a list of keywords ranked by their frequency of use in the titles of the whole dataset. Both tasks can be accomplished with a fairly simple elaboration in a couple of IPython Notebooks, which generate the MS Excel files included in this repository. This reworking of the original dataset makes the technical details of the patent data more accessible. Optionally, the resulting MS Excel files can be imported into MS Power BI to obtain a dynamic layout of tables and charts that focuses on specific features. For example, it is possible to filter the patent documents dealing with a specific technical concept, or those including a specific keyword in their titles. A further goal is to zoom in on a restricted number of patent documents selected from the original dataset, by clustering the patents according to patterns of either IPC classification codes or one or more of the 35 technology fields defined by WIPO.
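The keyword ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the notebooks in the repository: the function name `rank_title_keywords`, the stop-word list, and the sample titles are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the repository notebooks may filter differently.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def rank_title_keywords(titles):
    """Return (keyword, count) pairs sorted by decreasing frequency."""
    counts = Counter()
    for title in titles:
        # Lower-case the title and keep alphabetic tokens only.
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common()


# Made-up titles, for illustration only (not taken from the dataset).
sample_titles = [
    "Method for mapping the seabed with sonar",
    "Sonar system for underwater mapping",
    "Underwater vehicle for seabed survey",
]
ranking = rank_title_keywords(sample_titles)
```

The resulting list of pairs can be written to a spreadsheet and filtered by keyword, in the spirit of the MS Excel and MS Power BI workflow mentioned above.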
Specific patterns of IPC classification codes or technology fields appear as distinct clusters of patent documents thanks to the dimensionality reduction performed by the t-SNE algorithm. This is a significant improvement over the traditional, less sophisticated approach, in which unambiguously associating a given pattern of IPC classification codes with a well-focused, or even unique, technical topic is a critical issue. With the proposed methodology, sub-pools of patent families can be quickly partitioned into macro-categories based on the technical content of each patent document. At the same time, a quantitative analysis of the most representative IPC classification codes, or of the predominant technology fields, is obtained directly. The outcome of the clustering consists of two PDF files (included in the repository). The patent family identifiers can be located within each cluster using the search tool available in Adobe Acrobat, and the corresponding bibliographic data can then be retrieved by using each identifier as input to Patstat.


Steps to reproduce

After downloading a dataset from Orbit Intelligence with the search query included in the description of the "Original Orbit Intelligence Data" folder, the first step is the 'cleaning' and reorganization of the dataset, as detailed in the description of the "Technical concepts' analysis" folder, using the Python programming language (version 3.8). From this stage on, the docdb_family_id codes, i.e. the patent family identifiers, can be used for all the subsequent steps of the analysis. They can be used to download the corresponding patent titles from Patstat online, as detailed in the description of the "Titles' keyword analysis" folder. The same patent family identifiers can also be used to rank the IPC classification codes by how frequently each code is assigned to the patent families of the dataset, and then to cluster the patent families with a focus on the most representative IPC classification codes, as detailed in the description of the "Patent family clustering etc." folder. The same clustering methodology can easily be replicated when the 35 technology fields defined by WIPO are of interest. For both cases, SQL scripts to be run on Patstat are available in that folder. Each script generates an MS Excel file containing an array of vectors, one kind of vector per script. In both cases each patent family of the dataset identifies one specific vector; in one case the coordinates of the vector correspond to a pattern of IPC classification codes, while in the other they correspond to a pattern of WIPO technology fields. The Python script then takes such an array as input and compares the vectors pairwise; its output is a PDF file containing a diagram in which the patent family clusters can be immediately detected.
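The array of vectors and the pairwise comparison described above can be illustrated with a minimal sketch. It assumes a binary encoding (1 if a code is assigned to the family, 0 otherwise) and uses the Jaccard distance as the comparison; both are assumptions made for illustration, since the description does not state which encoding or metric the repository script applies before the t-SNE projection. All function names and sample values are hypothetical.

```python
from itertools import combinations


def build_vectors(pairs):
    """Turn (family_id, code) pairs into one binary vector per family."""
    codes = sorted({code for _, code in pairs})
    index = {code: i for i, code in enumerate(codes)}
    vectors = {}
    for family_id, code in pairs:
        vec = vectors.setdefault(family_id, [0] * len(codes))
        vec[index[code]] = 1  # mark the code as assigned to this family
    return codes, vectors


def jaccard_distance(u, v):
    """1 - |intersection| / |union| of the codes set in two binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 0.0


def pairwise_distances(vectors):
    """Compare every pair of families once, keyed by (family_a, family_b)."""
    return {
        (f1, f2): jaccard_distance(vectors[f1], vectors[f2])
        for f1, f2 in combinations(sorted(vectors), 2)
    }


# Made-up family identifiers and IPC main groups, for illustration only.
sample_pairs = [
    (1001, "G01S 15/00"),
    (1001, "G01C 13/00"),
    (1002, "G01S 15/00"),
    (1002, "G01C 13/00"),
    (1003, "B63B 22/00"),
]
codes, vectors = build_vectors(sample_pairs)
distances = pairwise_distances(vectors)
```

Families sharing the same pattern of codes end up at distance 0 and families with no code in common at distance 1, which is the kind of structure a dimensionality reduction step such as t-SNE can then display as separate clusters.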
The attributes (patent family codes) of the PDF file are also searchable, which makes it straightforward to find the cluster to which a specific patent family belongs. The analyst can therefore zoom in on specific members of a cluster and, if necessary, associate one or more of these patent families with the corresponding bibliographic data in order to focus on single patent documents, using a very simple script to be run on Patstat online. Moreover, if the composition of a specific cluster requires more detailed information, based on IPC classification codes at subgroup level, the clustering approach described above can be replicated. In that case, replacing the main groups of the IPC codes with the subgroups of interest is the only modification required to the clustering procedure illustrated in this dataset repository.
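The switch between main-group and subgroup level mentioned above amounts to deciding how much of each IPC symbol is kept. The sketch below is a hypothetical helper, not part of the repository: it assumes symbols written as subclass, main group, a slash, and subgroup (for example "G01S 15/89", possibly with padding spaces), and it represents the main-group level by setting the subgroup to "00".

```python
import re

# Section (A-H), class (two digits), subclass (letter), main group / subgroup.
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def ipc_at_level(symbol, level="main_group"):
    """Normalize an IPC symbol to main-group or subgroup level."""
    match = IPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"Unrecognized IPC symbol: {symbol!r}")
    section, cls, subclass, group, subgroup = match.groups()
    if level == "main_group":
        subgroup = "00"  # collapse all subgroups into their main group
    elif level != "subgroup":
        raise ValueError(f"Unknown level: {level!r}")
    return f"{section}{cls}{subclass} {group}/{subgroup}"


# Made-up example symbol with padding spaces.
main_group = ipc_at_level("G01S  15/89")
subgroup = ipc_at_level("G01S  15/89", level="subgroup")
```

Applying such a normalization at subgroup level before building the vectors, restricted to the families of one cluster, would yield coordinates based on the subgroups of interest instead of the main groups.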