Topic Detection and Tracking (TDT)

Name: Topic Detection and Tracking (TDT)
Creator: Sahand Vahidnia
Published: 2023-06-07T06:19:51.537Z
Keywords: Publication, Text Mining, Language Modeling

doi:10.17632/xn474x8hvf.1

Published: 7 June 2023| Version 1 | DOI: 10.17632/xn474x8hvf.1

Contributor:

Sahand Vahidnia

Description

Dataset for the TDT project (project URL: https://github.com/sahandv/TDT). This dataset contains everything required to train the models for the TDT project. It includes abstracts, keywords, mapped concepts, and citations for 194937 cleaned data points from the Scopus dataset, reduced from over 300k original data points.

FastText Model: a FastText model trained on Dimensions and Scopus data. It is used for keyword search and concept mapping.

Computer Science Ontology (CSO): The original data can be downloaded from https://cso.kmi.open.ac.uk/downloads . The uploaded version is the parent map of the CSO, obtained using depth-first search (DFS). Every node is mapped to its level 2 parent (not the level 1 root parent). This is intended to give an idea of the high-level topic abstractions for each low-level concept (node), so each node has a list of parents (topics).

Scopus Dataset (AI-related articles, from the start to 2020):
- Preprocessed:
-- `keyword pre-processed for fasttext - nov14`: preprocessed Scopus publication author keywords, lemmatised, with "|" delimiters.
-- `citations with abstracts`: citations that have documents and abstracts.
-- `citations with abstracts supernodes_str_name`: same as above, with additional supernodes based on textual clusters.
-- `mapped concepts for keywords`: concepts (level 2 from the CS ontology), mapped using author keywords. These can be used instead of keywords.
-- `abstract_title method_b_3`: preprocessed and lemmatised abstracts with Scopus IDs. This data is omitted to avoid copyright issues, but can be provided privately to fellow researchers on request, for non-commercial use.
-- `data with abstract`: contains these columns from the dataset: PY, id, eid, TI, author. Can be used to double-check the ID and PY (publication year).
- `Doc2Vec Model`: 100-dimensional doc2vec model, trained on the Scopus and Dimensions data for AI articles.
- `Node2Vec mode`: 100-dimensional node2vec model, trained on the citations. Use the Scopus ID to get the embedding for each node.
- Embeddings:
-- `concepts_node2vec_50D`: 50-dimensional concept embedding for each article. (The Scopus ID is not given as an index, but the rows are in the same order as `abstract_title method_b_3` and can be joined on position.)
-- `abstracts_doc2vec_100D`: 100-dimensional abstract embedding for each article. (The Scopus ID is not given as an index, but the rows are in the same order as `abstract_title method_b_3` and can be joined on position.)

Please note that this dataset is only for academic and personal use; commercial use of this dataset is prohibited.

3rd-party licence: some materials are recompiled Scopus and Dimensions material and may be subject to their licensing.
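
To make the CSO parent map described above more concrete, the following is a minimal sketch of how level 2 parents could be collected with a depth-first search. It is not the author's code: the function name `level2_parent_map`, the dictionary-of-children input format, and the toy hierarchy are all assumptions made for illustration, and the real CSO download uses its own file format.

```python
from collections import defaultdict


def level2_parent_map(children, root):
    """Map each node below level 2 to the level 2 topics it descends from.

    `children` is a hypothetical adjacency dict: parent -> list of children.
    The root is treated as level 1 and its direct children as level 2.
    """
    parents = defaultdict(set)
    for topic in children.get(root, []):          # each level 2 node
        stack = list(children.get(topic, []))     # start DFS below it
        seen = set()
        while stack:                              # iterative depth-first search
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            parents[node].add(topic)              # tag node with its level 2 topic
            stack.extend(children.get(node, []))
    return {node: sorted(topics) for node, topics in parents.items()}


# Toy hierarchy (hypothetical, not taken from the CSO files).
toy = {
    "computer science": ["artificial intelligence", "computer networks"],
    "artificial intelligence": ["machine learning"],
    "machine learning": ["neural networks"],
    "computer networks": ["sensor networks"],
    "sensor networks": ["neural networks"],       # illustrative cross-link
}

parent_map = level2_parent_map(toy, "computer science")
for node, topics in sorted(parent_map.items()):
    print(node, "->", topics)
```

Because a concept can be reachable from more than one level 2 node, each entry holds a list of topics, which matches the description's note that each node has a list of parents. In this sketch the level 2 nodes themselves are not given an entry; whether the uploaded map includes them is not stated in the description.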
