arXiv Scientific Research Paper Dataset

Published: 19 February 2025 | Version 1 | DOI: 10.17632/mm6kst3krj.1
Contributor:
Sumit Mishra

Description

This dataset comprises structured metadata from the arXiv repository, a widely used preprint server for scientific research. It includes paper titles, abstracts, categories (subject areas), and submission dates, making it a useful resource for research in natural language processing (NLP), bibliometrics, machine learning, and scientific trend analysis.

Content

The dataset contains the following columns:

1. id: Unique arXiv identifier for each paper.
2. title: Title of the research paper.
3. summary: Summary of the paper’s content, extracted from arXiv.
4. summary_word_count: Word count of the summary.
5. category: Subject categories assigned by arXiv.
6. category code: Category code for the research paper.
7. published_date: Publication date of the research paper.
8. updated_date: Date on which the paper was last updated.
9. authors: Authors of the research paper.
10. first_author: First author listed on the paper.
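
The column layout described above can be sanity-checked before any analysis. The following is a minimal sketch, not part of the dataset itself. It assumes the data is distributed as a CSV file whose header matches the listed column names exactly, that dates are ISO formatted (YYYY-MM-DD), and that summary_word_count is a whitespace-delimited token count. None of these details are confirmed by the description. The helper names (load_records, check_record) and the sample row are hypothetical and used only for illustration.

```python
import csv
import io
from datetime import date

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = [
    "id", "title", "summary", "summary_word_count", "category",
    "category code", "published_date", "updated_date", "authors",
    "first_author",
]


def load_records(handle):
    """Read all rows from an open CSV handle after checking the header."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def check_record(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    # Assumption: the stored count is a whitespace-delimited token count.
    if int(row["summary_word_count"]) != len(row["summary"].split()):
        problems.append("summary_word_count mismatch")
    # Assumption: both dates are ISO formatted (YYYY-MM-DD).
    published = date.fromisoformat(row["published_date"])
    updated = date.fromisoformat(row["updated_date"])
    if updated < published:
        problems.append("updated_date precedes published_date")
    return problems


# Made-up sample row for illustration only; not taken from the dataset.
SAMPLE_CSV = (
    "id,title,summary,summary_word_count,category,category code,"
    "published_date,updated_date,authors,first_author\n"
    "x0001,Example Title,A short made up summary,5,Machine Learning,cs.LG,"
    '2024-01-02,2024-01-05,"A. Author, B. Author",A. Author\n'
)

records = load_records(io.StringIO(SAMPLE_CSV))
issues = [check_record(r) for r in records]
```

To run the same checks on the real file, pass an open file handle to load_records in place of the in-memory sample. If the actual dates carry a time component or the counts were computed differently, the two assumptions in check_record would need adjusting.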

Categories

Data Science, Natural Language Processing, Machine Learning, Bidirectional Encoder Representations From Transformers

Licence