C4ISTAR technologies

Published: 11-07-2020 | Version 1 | DOI: 10.17632/88g8kcwj9r.1
Contributors:
Vito Giordano,
Filippo Chiarello,
Irene Spada,
Gualtiero Fantoni,
Andrea Bonaccorsi

Description

The dataset contains the list of technologies in a defence-related domain, C4ISTAR. C4ISTAR, generally referred to as “Command and Control” or “C2”, is an acronym used in Defence for Command, Control, Communication, Computer, Information/Intelligence, Surveillance, Target Acquisition, and Reconnaissance. The technologies are mapped automatically using a text mining technique called Named Entity Recognition (NER). The approach to NER used in this article is twofold: a gazetteer-based (or terminology-driven NER) approach and a rule-based approach. The text mining techniques are applied to a collection of 166 English-language C4ISTAR documents from diverse sources, published from 2000 to 2020. The method allows us to map the C4ISTAR domain and to identify a list of 1090 technologies. The dataset published in this article contains the complete list of the technologies extracted in the C4ISTAR domain.

Steps to reproduce

The method starts with the collection of documents related to the C4ISTAR field, in order to create a dictionary of the technologies adopted in the domain. We collected 166 English-language documents from diverse sources, published from 2000 to 2020. The types of sources used for this purpose are:
- Research institution documents;
- National or international institution documents;
- Company documents;
- Thematic website news;
- Market surveys on C4ISTAR.
Selecting different kinds of sources allows us to reduce source bias and increase recall.

The collected documents are then pre-processed. Each document is tokenized, that is, subdivided into single units of meaning called tokens. A token can be a single word, a bigram (a pair of words), a trigram, or a chunk of words. Part-of-Speech (PoS) tagging is then applied to assign a part of speech to each token. PoS tagging is a Natural Language Processing (NLP) technique for assigning unambiguous grammatical categories to words in context.

After the pre-processing steps, the C4ISTAR documents are mined to automatically extract technologies using Named Entity Recognition techniques. The approach to NER used in this article is twofold: a gazetteer-based (or terminology-driven NER) approach and a rule-based approach.

For the gazetteer-based approach, we start from a list of relevant entities (the gazetteer). Then, with the aid of text mining tools, the occurrences of all these entries are identified in the C4ISTAR documents. In order to obtain high accuracy, several sources of knowledge are considered to create the gazetteer of technologies:
- A database of technologies related to Industry 4.0;
- O*NET;
- Wikipedia list of emerging technologies;
- Techopedia.

The second approach (the rule-based approach) defines a list of semantic rules for extracting the technologies from the C4ISTAR documents.
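
The tokenization and gazetteer lookup described above can be illustrated with a minimal sketch. It is not the tooling used to build this dataset: the function names, the three gazetteer entries, and the sample sentence are all hypothetical, the tokenizer is a simple regular expression, and PoS tagging is omitted because it would require an external NLP library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified tokenizer)."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_gazetteer(text, gazetteer, max_n=3):
    """Count occurrences of gazetteer entries among the 1..max_n-grams of the text."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if gram in gazetteer:
                counts[gram] += 1
    return counts


# Hypothetical gazetteer entries and sample sentence, for illustration only.
GAZETTEER = {"radar", "machine learning", "unmanned aerial vehicle"}
doc = ("The unmanned aerial vehicle carries a radar; "
       "machine learning improves radar tracking.")

hits = match_gazetteer(doc, GAZETTEER)
for term, count in sorted(hits.items()):
    print(term, count)
```

Looking up unigrams, bigrams and trigrams against a set keeps the match exact and fast, at the cost of missing inflected or misspelled variants; a real pipeline would likely normalize terms (for example by lemmatization) before the lookup.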
The method allows us to map the C4ISTAR domain and to identify a list of 1090 technologies.
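
The rules used in the rule-based approach are not listed in this description. As a purely hypothetical illustration of what such a rule might look like, the sketch below applies one lexical pattern ("technologies such as A, B and C") to pull candidate technology names from a sentence. The pattern, the cue words, the function name, and the sample sentence are assumptions, not the rules used to produce the dataset.

```python
import re

# Hypothetical rule: a cue noun followed by "such as" introduces a list of candidates.
PATTERN = re.compile(
    r"\b(?:technologies|systems|sensors)\s+such\s+as\s+([^.;]+)",
    re.IGNORECASE,
)

# Separators between list items: commas (optionally followed by and/or), "and", "or".
SEPARATORS = re.compile(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+")


def extract_candidates(text):
    """Return lowercased candidate terms found by the illustrative rule."""
    candidates = []
    for match in PATTERN.finditer(text):
        for item in SEPARATORS.split(match.group(1)):
            item = item.strip()
            if item:
                candidates.append(item.lower())
    return candidates


# Hypothetical sample sentence, for illustration only.
sample = ("The platform integrates technologies such as synthetic aperture radar, "
          "data fusion and satellite communication.")

candidates = extract_candidates(sample)
print(candidates)
```

A rule of this kind can surface terms that are missing from the gazetteer, which is why the two approaches complement each other; the candidates it produces would still need filtering, since not every item in such a list is a technology.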