NGS Patstat supplementary data

Published: 25 February 2021 | Version 1 | DOI: 10.17632/3twccmpdry.1
Riccardo Priore


The datasets in this repository contain the results of patent searches performed with Patstat online. The SQL scripts written for the analysis with the Patstat online Autumn 2019 edition (see Priore, Riccardo (2020), “NGS Patstat data”, Mendeley Data, V1, doi: 10.17632/f45st2xmkj.1) are flanked here by a collection of SQL scripts tested on the Patstat online Autumn 2020 edition. As in the former repository ("NGS Patstat data", regarding Next Generation Sequencing (NGS) techniques for nucleic acids), the script codes refer to the following kinds of analysis:

1) The script named ‘C1’ determines the number of patent families corresponding to patent applications filed on Next Generation Sequencing techniques and tools.
2) The script named ‘N1’ (‘Normalization NGS’) determines the number of patent applications specifically concerning NGS that were filed with different national patent authorities. It may reasonably be assumed that a larger and more populous geographical area yields more filings than a smaller one. However, the picture may change significantly when the number of NGS applications is divided by the total number of applications filed with the same authority, irrespective of the technical field. This procedure is referred to as 'normalization'.
3) The scripts named ‘Q1 – Q5’ implement search criteria aimed at estimating the value of the pooled patent documents. Metrics used for this purpose include the number of triadic patent families and the ranking of patent applicants by patent family size or by the average number of forward citations per patent family.
4) The scripts named ‘S1 – S2’ are aimed at identifying collaborations, within a specific time frame, between an applicant and a patent attorney. This search phase can reveal cases in which two applicants who worked with the very same patent attorney may be regarded as competitors or as collaborators (co-assignees, given the possibility of co-filing a patent application).

Unlike the SQL scripts stored in the "NGS Patstat data" repository mentioned above, the SQL scripts provided in this new repository have been slightly modified so that Patstat online can be used to analyse a dataset on NGS technologies produced by querying Orbit Intelligence. A preliminary elaboration of this Orbit dataset, based on Python, was performed to rank the most relevant technical concepts related to NGS technologies.
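The normalization described for script ‘N1’ can be illustrated with a minimal Python sketch. It is not the author's SQL script: the function name `normalized_share` and all filing counts are hypothetical, invented only to show how a ranking by raw counts may differ from a ranking by normalized share.

```python
def normalized_share(ngs_counts, total_counts):
    """Per authority, divide NGS filings by all filings at that authority."""
    shares = {}
    for authority, ngs in ngs_counts.items():
        total = total_counts.get(authority, 0)
        if total > 0:  # skip authorities with no recorded filings
            shares[authority] = ngs / total
    return shares


# Invented figures, for illustration only (not taken from Patstat).
ngs_counts = {"US": 400, "EP": 150, "KR": 60}
total_counts = {"US": 600_000, "EP": 180_000, "KR": 200_000}

shares = normalized_share(ngs_counts, total_counts)

# Ranking by raw NGS counts versus ranking by normalized share.
raw_ranking = sorted(ngs_counts, key=ngs_counts.get, reverse=True)
normalized_ranking = sorted(shares, key=shares.get, reverse=True)
```

With these invented numbers the authority with the most NGS filings in absolute terms is not the one with the highest NGS share, which is the effect the normalization step is meant to expose.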


Steps to reproduce

The present repository includes procedural steps that complement the information stored in a pre-existing dataset ("NGS Patstat data", Priore, Riccardo (2020), “NGS Patstat data”, Mendeley Data, V1, doi: 10.17632/f45st2xmkj.1). One folder ("Datasets to by analysed") includes a file in which the NGS dataset lists the codes identifying the patent families together with the associated technical concepts. This file may be analysed with a dedicated IPython Notebook (see the folder named "Python"). The Python elaboration of the Orbit dataset reorganises the original data so that the most frequently cited technical concepts can be ranked. An MS Excel file is provided to exemplify the output; the ranking of the technical concepts can be obtained straightforwardly from it if the file is further processed in MS Power BI (using Power Query and then the 'Unpivot' function is recommended). In summary, each concept is associated with a citation frequency equal to the number of patent families characterised by that concept. A third folder ("SQL files") contains a list of SQL scripts similar to those included in "NGS Patstat data" (Priore, Riccardo (2020), “NGS Patstat data”, Mendeley Data, V1, doi: 10.17632/f45st2xmkj.1), which divide the patent search into phases specifically regarding 1) the "coverage" of the dataset, 2) the normalization of the applications' filing events, 3) the detection of triadic families as well as the ranking of the players based on the forward citation rate and the patent families' size, 4) the detection of co-occurrence of players, not co-assignees, who might have engaged the same patent attorney.
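The unpivot-and-count idea behind the concept ranking can be sketched in plain Python. This is an assumption-laden illustration, not the notebook shipped in the "Python" folder: the column names (`family_id`, `concept_1`, ...), the helper names `unpivot` and `rank_concepts`, and the sample rows are all hypothetical.

```python
from collections import Counter

# Hypothetical wide-format rows: one row per patent family,
# with its technical concepts spread over several columns.
wide_rows = [
    {"family_id": "F1", "concept_1": "nanopore", "concept_2": "library preparation", "concept_3": None},
    {"family_id": "F2", "concept_1": "Nanopore", "concept_2": None, "concept_3": None},
    {"family_id": "F3", "concept_1": "library preparation", "concept_2": "nanopore", "concept_3": "barcode"},
]


def unpivot(rows, id_column="family_id"):
    """Turn wide rows into a set of (family, concept) pairs.

    Using a set means a concept repeated within one family is counted once.
    """
    pairs = set()
    for row in rows:
        for column, value in row.items():
            if column != id_column and value:
                pairs.add((row[id_column], value.strip().lower()))
    return pairs


def rank_concepts(rows):
    """Count the families per concept, most frequent first (ties by name)."""
    counts = Counter(concept for _, concept in unpivot(rows))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


concept_ranking = rank_concepts(wide_rows)
```

Each concept ends up paired with the number of distinct families that carry it, which corresponds to the citation frequency described above.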
What distinguishes the SQL scripts in this repository is that the 'EPO family ID' identification codes, downloadable from Orbit Intelligence and equivalent to the Patstat 'docdb_family_id' attribute, can be pasted directly into the scripts. This option is an advantageous alternative to search queries that interrogate Patstat in the traditional way, by means of specific keywords and IPC or CPC classification codes (for which the SQL scripts included in "NGS Patstat data", Priore, Riccardo (2020), “NGS Patstat data”, Mendeley Data, V1, doi: 10.17632/f45st2xmkj.1 are suitable). The advantage of using the docdb_family_id codes is that Orbit Intelligence can be queried upstream of the Patstat elaboration phases (1 to 4, described above), so that patent documents can be selected by analysing not only the titles and abstracts but also the claims and/or the description of the patents. Claims and description are searchable attributes provided by Orbit Intelligence and not yet by Patstat online.
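Pasting family IDs into a script can be sketched as follows. The helper `build_family_query` is hypothetical and the generated query is not one of the repository's scripts; it assumes the PATSTAT application table `tls201_appln` with its `appln_id` and `docdb_family_id` columns, and the IDs in the example are placeholders rather than real NGS families.

```python
def build_family_query(family_ids):
    """Build a query restricted to a pasted list of docdb_family_id values.

    Assumption: table tls201_appln exposes appln_id and docdb_family_id.
    """
    # Casting to int rejects anything that is not a plain numeric ID
    # and removes duplicates from the pasted list.
    ids = sorted({int(family_id) for family_id in family_ids})
    if not ids:
        raise ValueError("at least one family ID is required")
    id_list = ", ".join(str(family_id) for family_id in ids)
    return (
        "SELECT docdb_family_id, COUNT(DISTINCT appln_id) AS applications\n"
        "FROM tls201_appln\n"
        f"WHERE docdb_family_id IN ({id_list})\n"
        "GROUP BY docdb_family_id\n"
        "ORDER BY applications DESC;"
    )


# Placeholder IDs, as they might be copied from an Orbit Intelligence export.
query = build_family_query(["222", "111", "222"])
```

Generating the `IN (...)` list from the export avoids hand-editing long ID lists and keeps the selection step (done in Orbit Intelligence) separate from the Patstat analysis step.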


High Throughput Analysis, Patent Classification, Genome Sequencing