10,000+ Artificial Intelligence Papers: arXiv Titles and Abstracts

Published: 14 May 2026| Version 1 | DOI: 10.17632/jhtrncpcnm.1

Description

This dataset contains the titles and abstracts of over 17,000 research papers focused on Artificial Intelligence, retrieved from the arXiv pre-print repository. It provides a substantial textual corpus of recent academic literature, designed to support researchers, data engineers, and developers working on text mining and advanced machine learning tasks. To facilitate immediate use, the dataset is provided in two formats: a raw version preserving the original formatting of the abstracts, and a thoroughly cleaned version processed specifically for textual analysis.

Data Collection Methodology

The data was collected in May 2026 by querying the arXiv advanced search portal for papers containing the term "Artificial Intelligence" within the abstract, filtered under the Computer Science classification. The extraction was performed using an automated R-based web scraping pipeline utilizing the rvest package.

Data Files

arxiv_titles_abstracts_raw.csv: Contains the unmodified, original text scraped directly from the arXiv search results. This file is ideal for applications requiring raw, unformatted academic prose.

arxiv_titles_abstracts_clean.csv: A pre-processed version of the dataset optimized for Natural Language Processing. The text pipeline applied to this file includes lowercasing, the removal of punctuation, digits, and non-ASCII characters, automated spell-correction using hunspell, and word lemmatization.

Potential Use Cases

This corpus is highly versatile for a variety of NLP and data science applications, including:

- Building and evaluating domain-specific Retrieval-Augmented Generation (RAG) pipelines.
- Fine-tuning Large Language Models (LLMs) on academic and technical prose.
- Training models for Named Entity Recognition (NER) to extract specific algorithms, hardware, or methodologies.
- Performing unsupervised clustering (such as K-Means, Hierarchical, or DBSCAN) and topic modeling to track emerging trends in AI research.
- Conducting bibliometric analysis on the evolution of AI terminology.

Keywords: Natural Language Processing; Text Mining; Artificial Intelligence; Machine Learning; Text Corpus; Bibliometrics; Web Scraping; arXiv
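
As an illustration of the clustering use case listed in the description, the sketch below builds a TF-IDF document-term matrix with tm and runs K-Means on it. It is only a minimal example under stated assumptions: the four-row data frame is a made-up stand-in for arxiv_titles_abstracts_clean.csv (assumed to have the columns title and abstract), and the TF-IDF weighting and the choice of two clusters are illustrative, not part of the published pipeline.

```r
library(tm)

# Toy stand-in for the cleaned file (hypothetical rows, same column names).
# With the real data one would instead use:
# clean_data <- read.csv("arxiv_titles_abstracts_clean.csv", stringsAsFactors = FALSE)
clean_data <- data.frame(
  title = c("Paper A", "Paper B", "Paper C", "Paper D"),
  abstract = c(
    "reinforcement learn agent reward policy",
    "agent policy reward reinforcement environment",
    "image segmentation convolution network pixel",
    "convolution network image pixel classification"
  ),
  stringsAsFactors = FALSE
)

# Build a corpus and a TF-IDF weighted document-term matrix.
corp <- VCorpus(VectorSource(clean_data$abstract))
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))

# K-Means on the dense matrix; k = 2 is an arbitrary choice for the toy data.
set.seed(42)
km <- kmeans(as.matrix(dtm), centers = 2, nstart = 10)
clean_data$cluster <- km$cluster
```

On the full corpus the dense matrix from as.matrix() may be large, so pruning rare terms first (for example with tm's removeSparseTerms) would likely be needed.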

Steps to reproduce

Step 1: Install Required Dependencies

Ensure the following R packages are installed in your environment before running the pipeline:

    install.packages(c("tm", "textstem", "hunspell", "dbscan", "ggplot2", "rvest", "dplyr", "stringr"))

Step 2: Execute the Web Scraper (Raw Data Collection)

1. Load the rvest, dplyr, and stringr libraries.
2. Define the scrape_page function to parse the arXiv advanced search HTML, targeting the p.title and span.abstract-full CSS selectors.
3. Run the pagination loop from start = 0 to max_results = 17603 (incrementing by 200).
4. Export the resulting dataframe to generate the raw dataset:

    write.csv(all_data, "arxiv_titles_abstracts_raw.csv", row.names = FALSE)

(Note: Scraping 17,000+ results across multiple pages may take some time depending on network speed and server response times.)

Step 3: Apply Text Cleaning Pipeline (Clean Data Generation)

1. Read the newly created arxiv_titles_abstracts_raw.csv back into the R environment.
2. Filter out any empty rows or missing abstracts.
3. Load the tm, textstem, and hunspell libraries.
4. Pass the text through the clean_text function to convert to lowercase, and remove punctuation, digits, and non-ASCII characters.
5. Apply the custom spell_correct function, which builds a frequency table of words appearing 5 or more times and uses hunspell to suggest corrections for out-of-vocabulary terms.
6. Build a VCorpus object and apply the final mapping: removing English stopwords, lemmatizing words via textstem, and stripping excess whitespace.

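
The scrape_page function and pagination loop from Step 2 are described but not shown. One way they might look is sketched below. Only the p.title and span.abstract-full selectors and the start/max_results values come from the steps above; the li.arxiv-result wrapper selector, the search_url variable, and the delay between requests are assumptions made for illustration. The function is exercised on a hand-written HTML fragment so that the example runs offline.

```r
library(rvest)
library(stringr)

# Parse one page of search results into a data frame of titles and abstracts.
# `page` is an already-parsed HTML document, e.g. the result of read_html(url).
scrape_page <- function(page) {
  # Assumption: each search hit is wrapped in an <li class="arxiv-result">.
  results <- html_elements(page, "li.arxiv-result")
  data.frame(
    title = str_squish(html_text2(html_element(results, "p.title"))),
    abstract = str_squish(html_text2(html_element(results, "span.abstract-full"))),
    stringsAsFactors = FALSE
  )
}

# Offsets visited by the pagination loop: 0, 200, 400, ...
max_results <- 17603
starts <- seq(0, max_results, by = 200)

# With a real query URL (hypothetical `search_url`, not given here), the loop
# could be written roughly as:
# all_data <- do.call(rbind, lapply(starts, function(s) {
#   Sys.sleep(3)  # polite pause between requests (an assumption)
#   scrape_page(read_html(paste0(search_url, "&start=", s)))
# }))

# Offline demonstration on a made-up page fragment.
demo_page <- minimal_html('
  <ol>
    <li class="arxiv-result">
      <p class="title">  A Toy   Title </p>
      <span class="abstract-full">A toy abstract.</span>
    </li>
  </ol>')
demo_data <- scrape_page(demo_page)
```

Selecting the title and abstract per result, rather than as two independent vectors, keeps the rows aligned when a result is missing one of the two elements (the missing field becomes NA).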
Step 4: Export the Cleaned Dataset

Extract the processed text from the VCorpus object and bind it back with the titles to create the final cleaned dataset:

    # Extract cleaned text from the corpus
    cleaned_abstracts <- sapply(corp, as.character)

    # Create a new dataframe with original titles and cleaned abstracts
    clean_data <- data.frame(title = raw$title, abstract = cleaned_abstracts, stringsAsFactors = FALSE)

    # Export the clean dataset
    write.csv(clean_data, "arxiv_titles_abstracts_clean.csv", row.names = FALSE)
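
Step 4 assumes that raw and corp already exist from Step 3. The sketch below shows one plausible way to produce them. The body of clean_text is a guess at an implementation matching its description (lowercasing, then removing non-ASCII characters, punctuation, and digits), the two-row raw data frame is made up, and the hunspell-based spell_correct step is left out because its exact logic is not given.

```r
library(tm)
library(textstem)

# Hypothetical implementation of clean_text, following its description.
clean_text <- function(x) {
  x <- tolower(x)
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ")  # non-ASCII bytes -> space
  x <- gsub("[[:punct:]]", " ", x)                         # punctuation -> space
  x <- gsub("[[:digit:]]", " ", x)                         # digits -> space
  gsub("\\s+", " ", trimws(x))                             # collapse whitespace
}

# Made-up stand-in for the rows read from arxiv_titles_abstracts_raw.csv.
raw <- data.frame(
  title = c("Paper A", "Paper B"),
  abstract = c("The models are learning from 3 datasets!",
               "Deep-Learning in 2026: na\u00efve caf\u00e9!"),
  stringsAsFactors = FALSE
)
raw$abstract <- clean_text(raw$abstract)

# Build the corpus and apply the final mapping from Step 3.
corp <- VCorpus(VectorSource(raw$abstract))
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, content_transformer(lemmatize_strings))
corp <- tm_map(corp, stripWhitespace)

# Same extraction call as in Step 4.
cleaned_abstracts <- sapply(corp, as.character)
```

Replacing removed characters with a space rather than deleting them avoids fusing neighbouring words such as "deep-learning" into a single token, at the cost of splitting words that contain accented letters.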

Categories

Artificial Intelligence, Natural Language Processing, Machine Learning, Bibliometrics, Text Mining
