Six small datasets for text segmentation
Description
This collection brings together six datasets designed for experiments in text segmentation:

Choi: 922 artificial documents [1]. Each document is composed of sentence blocks drawn from different sources. Because the segments are unrelated, segmentation is relatively easy, and many algorithms achieve high accuracy.

Manifesto: 6 long political speeches [2]. Each text includes a human-generated segmentation based on strict guidelines. The dataset is used to evaluate the detection of semantic topic shifts and thematic changes.

Wiki-1024: 1,024 Wikipedia articles [3]. Segmentation is defined by the natural division of documents into sections and subsections.

Abstracts: artificial documents created by merging real research abstracts into continuous texts. About 20,000 abstracts were collected from Scopus in the field of Information Retrieval. Each segment corresponds to one abstract.

SMan: artificial documents constructed by randomly sampling segments from Manifesto texts. Because of the mixing, the resulting statements vary in content, but they generally follow a slogan-like style.

PhilPapersAI: 336 philosophy articles (focused on AI) selected from philpapers.org. The source PDFs were reprocessed using the OpenAI GPT-4o-mini LLM to restore structure and add subsection divisions. The resulting texts are coherent and well-structured, while preserving the authors’ original style as closely as possible.

[1] Choi, F.Y.Y. Advances in domain independent linear text segmentation. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.
[2] Hearst, M.A. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics 1997, 23, 33–64.
[3] Koshorek, O.; Cohen, A.; Mor, N.; Rotman, M.; Berant, J. Text Segmentation as a Supervised Learning Task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers); Walker, M.; Ji, H.; Stent, A., Eds., New Orleans, Louisiana, 2018; pp. 469–473. https://doi.org/10.18653/v1/N18-2075.
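
Illustrative sketch: the artificial datasets in this collection (Abstracts, SMan) are described as concatenations of independent segments. The Python snippet below shows one way such a document could be assembled together with its boundary labels. It is only a sketch under assumptions: the function name build_document, the list-of-sentences segment representation, and the 0/1 label scheme (1 marks the last sentence of a segment) are hypothetical and do not describe the actual file format or the procedure used to build these datasets.

```python
import random


def build_document(segments, n_segments, rng):
    """Concatenate randomly sampled segments into one artificial document.

    segments   : list of segments, each a list of sentence strings (assumed format)
    n_segments : number of segments to sample without replacement
    rng        : random.Random instance, passed in for reproducibility

    Returns (sentences, labels), where labels[i] == 1 if sentence i is the
    last sentence of its segment and 0 otherwise (hypothetical label scheme).
    """
    chosen = rng.sample(segments, n_segments)
    sentences, labels = [], []
    for segment in chosen:
        for position, sentence in enumerate(segment):
            sentences.append(sentence)
            is_last = position == len(segment) - 1
            labels.append(1 if is_last else 0)
    return sentences, labels


if __name__ == "__main__":
    # Toy segments standing in for abstracts or manifesto passages.
    toy_segments = [
        ["Query expansion improves recall.", "We test it on two corpora."],
        ["Taxes must be fair."],
        ["Neural rankers are costly.", "We distill them.", "Latency drops."],
    ]
    doc, boundaries = build_document(toy_segments, 2, random.Random(0))
    for sentence, label in zip(doc, boundaries):
        print(label, sentence)
```

Marking segment ends with a per-sentence binary label keeps the reference segmentation aligned with the sentence list, which is convenient when scoring a predicted segmentation against it.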