A Structured Dataset Containing 6,640 Indonesian Pantun Stanzas with Structural Annotations for Natural Language Generation

Name: A Structured Dataset Containing 6,640 Indonesian Pantun Stanzas with Structural Annotations for Natural Language Generation
Creator: Mohammad Nazir Arifin
Published: 2026-06-11T06:50:54.110Z
Keywords: Linguistics, Computer Science, Artificial Intelligence, Natural Language Generation, Language

doi:10.17632/xcnbn5rpzn.3

Published: 11 June 2026 | Version 3 | DOI: 10.17632/xcnbn5rpzn.3

Contributor:

Mohammad Nazir Arifin

Description

Computational Context and Summary

This dataset offers a curated, structurally validated, and annotated corpus of 6,640 Indonesian pantun stanzas for text generation and computational linguistics research. The primary research objective is to quantify how well a Large Language Model (LLM) or other text generation framework adheres to strict, multi-layered formal constraints (per-line metrics, phonetic end rhyme, and the macro-structural dualism of sampiran and content) while maintaining cultural and semantic coherence in a low-resource regional language.

Key Features and What the Data Shows

The dataset is provided as a single, fully structured file, pantun_dataset.csv, consisting of 6,640 unique four-line stanzas mapped to 17 operational variables with no missing values. To make the corpus directly usable in statistical and machine learning applications, the text layout was flattened and split into two structural subcomponents: 'line_sampiran' (lines 1-2) and 'line_content' (lines 3-4). The metric annotations comprise three feature groups, each recorded per line:
- token count ('number_of_words_line_1..4');
- syllable count, restricted to between 8 and 12 syllables ('suku_kata_line_1..4');
- a string representation of the extracted phonetic end rhyme ('rima_akhir_line_1..4').

The final corpus consists of 4,945 stanzas (74.47%) with a cross rhyme pattern (a-b-a-b) and 1,695 stanzas (25.53%) with a continuous rhyme pattern (a-a-a-a).

The update from version 2 to version 3 increased the dataset size to 6,640 stanzas through a more optimized filtering algorithm. Version 3 also added a Lexical Validity Ratio filter criterion based on Indonesian root words, together with the 'rima_akhir_line_1..4' variables, bringing the total to 17 operational variables.
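As a rough illustration of the flattened layout described above, the Python sketch below splits one stanza into its sampiran and content halves and counts the words in each line. The helper name build_row, the reading of 'number_of_words_line_1..4' as four separately numbered columns, and the use of a literal "\n" to join the two lines inside each field are assumptions made for illustration. The example stanza is a well-known traditional pantun and is not claimed to be a record in pantun_dataset.csv.

```python
def build_row(stanza: str) -> dict:
    """Flatten a four-line stanza into sampiran/content fields plus word counts."""
    lines = [ln.strip() for ln in stanza.splitlines() if ln.strip()]
    if len(lines) != 4:
        raise ValueError("a pantun stanza must have exactly four lines")

    row = {
        # Lines 1-2 form the sampiran, lines 3-4 the content.
        # Joining with a literal backslash-n is an assumption about the file layout.
        "line_sampiran": "\\n".join(lines[:2]),
        "line_content": "\\n".join(lines[2:]),
    }
    # Column names below assume the "..4" notation expands to suffixes _1 to _4.
    for i, line in enumerate(lines, start=1):
        row[f"number_of_words_line_{i}"] = len(line.split())
    return row


example = (
    "Kalau ada sumur di ladang\n"
    "Boleh kita menumpang mandi\n"
    "Kalau ada umur yang panjang\n"
    "Boleh kita berjumpa lagi"
)

row = build_row(example)
for key, value in row.items():
    print(f"{key}: {value}")
```

Whitespace tokenization is the simplest possible word count; the published counts may have been produced with a different tokenizer.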
Collection Methodology and Procedure

This dataset was generated through a two-stage programmatic quality control and filtering workflow:
- Data acquisition and multilevel deduplication: A raw corpus of 11,795 multi-line text blocks was manually collected from primary digital databases, historical archives, and cultural portals. Deduplication combined token-based exact matching with sequence-aware fuzzy similarity alignment (Jaccard similarity and Python's SequenceMatcher, with optimized thresholds). This removed duplicate records while preserving legitimate dialect variations from the oral literary tradition, leaving 8,765 unique stanzas.
- Algorithmic structure filtering and error isolation: The remaining stanzas were evaluated against traditional prosodic requirements by a deterministic, rule-based Python framework. A total of 2,125 stanzas were discarded because of a low lexical validity ratio, a nonconforming rhyme scheme, or a syllable count outside the 8-12 range. This screening yielded a gold-standard corpus of 6,640 structurally valid pantun stanzas (8,765 - 2,125 = 6,640).
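The deduplication stage above can be sketched roughly as follows. This is a minimal illustration, not the author's script: the function names, the normalization step, and both threshold values (0.8 for Jaccard, 0.9 for the SequenceMatcher ratio) are assumptions, since the description only states that the thresholds were optimized.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two normalized strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(stanzas, jaccard_min=0.8, ratio_min=0.9):
    """Keep the first occurrence of each stanza; thresholds are hypothetical."""
    kept, kept_norm, seen = [], [], set()
    for stanza in stanzas:
        norm = normalize(stanza)
        if norm in seen:  # stage 1: exact match after normalization
            continue
        # Stage 2: fuzzy match, requiring both token overlap and sequence similarity.
        is_near_duplicate = any(
            jaccard(norm, other) >= jaccard_min
            and SequenceMatcher(None, norm, other, autojunk=False).ratio() >= ratio_min
            for other in kept_norm
        )
        if is_near_duplicate:
            continue
        seen.add(norm)
        kept.append(stanza)
        kept_norm.append(norm)
    return kept


a = ("Kalau ada sumur di ladang\nBoleh kita menumpang mandi\n"
     "Kalau ada umur yang panjang\nBoleh kita berjumpa lagi")
d = ("Pisang emas dibawa berlayar\nMasak sebiji di atas peti\n"
     "Hutang emas boleh dibayar\nHutang budi dibawa mati")
corpus = [a, a.upper() + "!", a + " ya", d]
print(len(deduplicate(corpus)))
```

Requiring both measures to pass is one way to avoid discarding stanzas that merely share vocabulary; where exactly to set the thresholds so that dialect variants survive would need tuning on real data. The pairwise comparison is quadratic in corpus size, which is acceptable for roughly twelve thousand entries but would need blocking or hashing for much larger collections.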

Steps to reproduce

To reproduce this dataset or build a similar constrained poetry corpus, implement the following workflow:
- Raw Data Collection: Collect raw text entries from digital platforms, literary archives, and cultural web portals to build a baseline, uncurated dataset.
- Duplicate Removal: Implement a two-step deduplication process. First, use exact string matching to remove identical stanzas. Second, use fuzzy matching (Jaccard similarity and SequenceMatcher in Python) to remove near-duplicate entries while preserving natural variations in dialect.
- Automated Quality Control: Run a Python script to enforce four strict pantun rules:
  (a) Line Rule: Each stanza must have exactly 4 lines.
  (b) Syllable Rule: Each line must contain between 8 and 12 syllables, counted with a vowel-counting script.
  (c) Rhyme Rule: Extract the final sound of each line and check that the stanza conforms to the a-b-a-b or a-a-a-a pattern.
  (d) Text Cleaning: Filter out stanzas that contain too many non-standard words or that are not Indonesian pantun.
- Table Export: Convert the final clean data into a table format. Replace actual newlines with the literal two-character sequence "\n" so that each record stays on a single CSV row and loads cleanly in Python (pandas) or R (tidyverse). Save the final data into a single file with 17 columns.
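Rules (a) to (c) and the export step can be sketched in Python as shown below. This is an illustrative approximation under stated assumptions, not the published script: every vowel letter is counted as one syllable (so diphthongs such as "au" are counted as two), the end rhyme is taken to be the last vowel of the final word plus any trailing consonants, and the CSV column names are hypothetical. Rule (d) is omitted because it depends on an Indonesian root-word list that is not specified here.

```python
import csv
import io

VOWELS = set("aiueo")


def count_syllables(line: str) -> int:
    """Approximate syllables by counting vowel letters (assumption)."""
    return sum(ch in VOWELS for ch in line.lower())


def end_rhyme(line: str) -> str:
    """Return the last vowel of the final word plus trailing consonants."""
    cleaned = "".join(ch for ch in line.lower() if ch.isalpha() or ch.isspace())
    words = cleaned.split()
    if not words:
        return ""
    word = words[-1]
    for i in range(len(word) - 1, -1, -1):
        if word[i] in VOWELS:
            return word[i:]
    return word


def validate(stanza: str):
    """Return 'a-b-a-b' or 'a-a-a-a' if rules (a)-(c) hold, else None."""
    lines = [ln.strip() for ln in stanza.splitlines() if ln.strip()]
    if len(lines) != 4:  # rule (a)
        return None
    if any(not 8 <= count_syllables(ln) <= 12 for ln in lines):  # rule (b)
        return None
    r = [end_rhyme(ln) for ln in lines]  # rule (c)
    if r[0] == r[1] == r[2] == r[3]:
        return "a-a-a-a"
    if r[0] == r[2] and r[1] == r[3]:
        return "a-b-a-b"
    return None


def to_csv_text(stanzas) -> str:
    """Write valid stanzas to CSV text with newlines escaped as literal \\n."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["pantun", "rhyme_pattern"])  # hypothetical column names
    for stanza in stanzas:
        pattern = validate(stanza)
        if pattern is not None:
            writer.writerow([stanza.strip().replace("\n", "\\n"), pattern])
    return buffer.getvalue()


sample = ("Kalau ada sumur di ladang\nBoleh kita menumpang mandi\n"
          "Kalau ada umur yang panjang\nBoleh kita berjumpa lagi")
print(validate(sample))
print(to_csv_text([sample, "satu\ndua"]))
```

When reading such a file back, the escaped sequence has to be turned into real line breaks again, for example with `text.replace("\\n", "\n")`.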

Institutions

Universitas Madura
East Java, Pamekasan
