A Golden Set of Problem, Solution, Advantages Sentences of the Patents

Published: 11 August 2022 | Version 1 | DOI: 10.17632/kpxdzkgs3j.1
Contributors:
Vito Giordano

Description

This data contains two different datasets.

(1) The Golden Set is a dataset of sentences tagged as (A) technical problem; (B) solution to the problem; or (C) advantageous effect of the invention. It is based on a selectively extracted collection from the United States Patent and Trademark Office (USPTO) curated by Chikkamath, R., Parmar, V. R., Hewel, C., & Endres, M. (2021). Patent Sentiment Analysis to Highlight Patent Paragraphs. arXiv preprint arXiv:2111.09741. The full text of a patent is composed of three main parts: abstract, claims and description. The five IP offices (IP5) have formalized a common application format to standardize the written style of the patent description. This format includes sections related to the concepts of interest here, i.e., technical problems, solutions to the problem, and advantageous effects of the invention. USPTO makes the full text of patents publicly available to advance the state of the art in innovation. The full text of a patent is stored in a nested eXtensible Markup Language (XML) file. The XML format makes it possible to separate the patent text into the abstract, claims and description parts. It also makes it possible to distinguish the sections of the description established by the IP5 common application format, i.e., the background information, the summary, the embodiment, the description of the drawings, the technical field, and other sections. Chikkamath et al. (2021) use the USPTO data, in particular the XML files of the patent full text, to create a new dataset. That dataset contains a collection of patent texts (150,000 samples) referring to (A) technical problems; (B) solutions to the problem; and (C) advantageous effects of the invention. We use this data to build our golden set.

(2) The Test data is a database of 400 random patent grants and patent applications downloaded from USPTO. We use this data to evaluate transformer-based language models developed for extracting problems, solutions and advantages in a real use case in an open-ended domain.
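
The description notes that the nested XML of a patent separates the abstract, claims and description, and that the description is further divided into sections such as the background and the summary. A minimal Python sketch of that split follows. The tag names (`patent`, `abstract`, `claims`, `description`, `heading`, `p`), the sample document and the helper functions are assumptions made for illustration. They are not taken from the dataset files or verified against the USPTO schema.

```python
import xml.etree.ElementTree as ET

# Toy document. Tag names and text are hypothetical, chosen only to
# mirror the abstract / claims / description structure described above.
SAMPLE_XML = """
<patent>
  <abstract><p>A hinge with reduced wear.</p></abstract>
  <description>
    <heading>BACKGROUND</heading>
    <p>Existing hinges wear out quickly.</p>
    <heading>SUMMARY</heading>
    <p>The hinge uses a ceramic pin.</p>
    <p>This extends service life.</p>
  </description>
  <claims>
    <claim><claim-text>A hinge comprising a ceramic pin.</claim-text></claim>
  </claims>
</patent>
"""


def element_text(elem):
    """Return all text under an element with whitespace normalised."""
    return " ".join("".join(elem.itertext()).split())


def split_patent(xml_string):
    """Split a patent into abstract, claims and per-section description."""
    root = ET.fromstring(xml_string.strip())
    parts = {"abstract": "", "claims": "", "description": {}}

    abstract = root.find("abstract")
    if abstract is not None:
        parts["abstract"] = element_text(abstract)

    claims = root.find("claims")
    if claims is not None:
        parts["claims"] = element_text(claims)

    description = root.find("description")
    if description is not None:
        # Paragraphs are assigned to the most recent heading seen.
        current = "UNLABELLED"
        for child in description:
            if child.tag == "heading":
                current = element_text(child)
                parts["description"].setdefault(current, [])
            elif child.tag == "p":
                parts["description"].setdefault(current, []).append(
                    element_text(child)
                )
    return parts


sections = split_patent(SAMPLE_XML)
for heading, paragraphs in sections["description"].items():
    print(heading, len(paragraphs))
```

Grouping paragraphs under the preceding heading is one simple way to recover sections when headings and paragraphs are siblings. Real patent files may nest or name these elements differently, so the tag names would need to be adapted.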

Institutions

Università degli Studi di Pisa

Categories

Patent, Natural Language Processing, Problem Solving, Text Mining

Licence