Hybrid Semantic Intelligence Dataset for AI-Generated and Human-Authored Research Abstract Analysis
Description
This dataset contains AI research abstracts collected from publicly available research repositories, publication platforms, and Large Language Model (LLM)-generated scientific content sources. The dataset was constructed to support research in semantic intelligence, topic modeling, clustering, and AI-generated text analysis.

The corpus consists of two categories of research abstracts: (1) LLM-generated scientific abstracts and (2) human-authored AI research abstracts. The original dataset contains 4,000 research abstracts, including:
- 2,000 LLM-generated abstracts (Label = 0)
- 2,000 human-authored abstracts (Label = 1)

The dataset was collected through web scraping and manual curation, followed by preprocessing operations including duplicate removal, lowercase normalization, punctuation filtering, URL removal, stopword elimination, and token cleaning. After duplicate removal, the final processed corpus contains 3,989 semantically valid research abstracts.

Dataset Statistics:
- Original Dataset Size: 4,000 abstracts
- Final Processed Corpus: 3,989 abstracts
- Number of Classes: 2
- Language: English
- Labels:
  - 0 = LLM-generated abstracts
  - 1 = Human-authored abstracts

The dataset is suitable for multiple Natural Language Processing (NLP) and semantic intelligence tasks, including:
- Semantic clustering
- Topic modeling
- AI-generated text analysis
- Semantic drift analysis
- NLP representation learning
- Interpretable semantic analysis
- Transformer-based text mining
- Scientific document analysis

This dataset was developed as part of the research study: “Hybrid Semantic Intelligence Framework for AI-Generated and Human-Authored Research Abstract Analysis.”
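
The preprocessing operations listed in the description (duplicate removal, lowercase normalization, punctuation filtering, URL removal, stopword elimination, and token cleaning) could look roughly like the sketch below. This is an illustration under stated assumptions, not the pipeline actually used to build the corpus: the function names `clean_abstract` and `preprocess`, the tiny `STOPWORDS` set, the regular expressions, the duplicate-matching rule, and the `(abstract, label)` record layout are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny stopword list; the list used for the
# dataset is not specified in the description.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "for",
             "is", "on", "we", "this", "that", "with"}

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")  # assumed URL shape
PUNCT_PATTERN = re.compile(r"[^\w\s]")               # anything not word/space


def clean_abstract(text):
    """Lowercase, strip URLs and punctuation, then drop stopwords."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = PUNCT_PATTERN.sub(" ", text)
    # Token cleaning (assumed rule): keep only purely alphanumeric tokens.
    tokens = [t for t in text.split() if t not in STOPWORDS and t.isalnum()]
    return " ".join(tokens)


def preprocess(records):
    """Remove duplicate abstracts, then clean the remaining ones.

    `records` is an iterable of (abstract, label) pairs, where label 0 marks
    LLM-generated text and label 1 marks human-authored text.
    """
    seen = set()
    cleaned = []
    for abstract, label in records:
        # Assumed duplicate rule: same text ignoring case and extra spaces.
        key = " ".join(abstract.lower().split())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((clean_abstract(abstract), label))
    return cleaned


sample = [
    ("We propose a Transformer for topic modeling. See https://example.org/paper!", 1),
    ("We propose a transformer for   topic modeling. See https://example.org/paper!", 1),
    ("This study explores semantic clustering of abstracts.", 0),
]
result = preprocess(sample)
for text, label in result:
    print(label, text)
```

In this sketch the second sample record is treated as a duplicate of the first, so only two cleaned records remain. Removing duplicates before cleaning mirrors the order given in the description; matching duplicates after cleaning would be stricter and could yield a different final count.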