Computational Distributional Profiling

Name: Computational Distributional Profiling
Creator: Novia Ratnasari
Published: 2026-05-20T15:05:26.225Z
Keywords: Computational Linguistics, Natural Language Processing, Corpus Linguistics, Text Mining

Ratnasari, Novia

doi:10.17632/ywrgs9wgcz.1

Published: 20 May 2026 | Version 1 | DOI: 10.17632/ywrgs9wgcz.1

Contributor: Novia Ratnasari

Description

These files are derived analytical outputs from a computational profiling of the English Nursery Rhymes corpus. They do not modify the original Kaggle dataset. Instead, they provide processed results: token frequencies, word dispersion, concordance (KWIC) records, collocation statistics, and a sample distribution of the token "little". Together, the files provide structured evidence for assessing whether the corpus shows key characteristics of children’s song lyrics, in particular lexical repetition, word distribution, and local word associations. They can be reused in corpus linguistics, computational text analysis, educational language research, and comparative nursery rhyme studies.
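As an illustration of what concordance and dispersion outputs of this kind contain, the following is a minimal Python sketch, not the author's code. It assumes each document is already a list of tokens, uses a context window of two tokens for the KWIC rows, and measures dispersion simply as the number of documents in which a token occurs. The function names `kwic` and `document_range`, the window size, the example file names, and the choice of dispersion measure are all assumptions; the dataset description does not state which dispersion statistic was used.

```python
def kwic(tokens, node, window=2):
    """Return (left context, node, right context) rows for every hit of `node`."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


def document_range(docs_tokens, node):
    """Count how many documents contain `node`, plus its per-document frequency.

    Document range is only one simple dispersion measure (an assumption here).
    """
    counts = {name: toks.count(node) for name, toks in docs_tokens.items()}
    present = sum(1 for c in counts.values() if c > 0)
    return present, counts


# Hypothetical three-document mini corpus, tokenized on whitespace.
docs_tokens = {
    "a.txt": "Mary had a little lamb little lamb".split(),
    "b.txt": "Twinkle twinkle little star".split(),
    "c.txt": "Hickory dickory dock".split(),
}

for row in kwic(docs_tokens["a.txt"], "little"):
    print(row)

present, counts = document_range(docs_tokens, "little")
print(present, "of", len(docs_tokens), "documents contain 'little'")
```

A token that is frequent but confined to a few documents would have a high frequency and a low document range, which is the kind of contrast the frequency and dispersion files make visible.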

Files

Steps to reproduce

1. Download the source corpus from the Kaggle repository terencebroad/english-nursery-rhymes.
2. Extract the 308 plain-text nursery rhyme files and treat each file as one document.
3. Load all text files into a corpus-processing environment.
4. Perform input checking to record character-level statistics, including total characters, ASCII and non-ASCII characters, whitespace characters, dash variants, and encoding anomalies.
5. Apply minimal whitespace normalization by converting line breaks and non-breaking spaces into standard spaces and collapsing repeated whitespace.
6. Tokenize the normalized texts using whitespace-based tokenization.
7. Remove punctuation-only tokens while preserving the original lexical forms of the corpus.
8. Generate token frequency rankings, dispersion statistics, concordance/KWIC records, and collocation outputs using left and right one-word windows.
9. Export the analytical outputs as CSV or Excel files for reuse and corpus-level assessment.
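Steps 3 and 5 to 9 could be sketched in Python roughly as follows. This is a hedged illustration under stated assumptions, not the author's pipeline: the function names, the UTF-8 encoding, the set of characters treated as punctuation, and the CSV column names are all hypothetical. Tokens keep their original form (no lowercasing, no stripping of attached punctuation), and collocates are counted only within a document, one word to the left and one word to the right of the node.

```python
import csv
import re
import string
from collections import Counter
from pathlib import Path

# Assumption: ASCII punctuation plus en dash, em dash, and curly quotes.
PUNCT = set(string.punctuation) | set("\u2013\u2014\u2018\u2019\u201c\u201d")


def load_corpus(folder):
    """Step 3: read every .txt file as one document (UTF-8 is an assumption)."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in sorted(Path(folder).glob("*.txt"))}


def normalize(text):
    """Step 5: non-breaking spaces and line breaks become single plain spaces."""
    text = text.replace("\u00a0", " ")
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Steps 6-7: split on whitespace, then drop punctuation-only tokens."""
    tokens = normalize(text).split(" ")
    return [t for t in tokens if t and not all(ch in PUNCT for ch in t)]


def profile(docs):
    """Step 8 (partial): token frequencies and adjacent-pair counts per corpus."""
    freq, bigrams = Counter(), Counter()
    for text in docs.values():
        tokens = tokenize(text)
        freq.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # pairs never cross documents
    return freq, bigrams


def collocates(bigrams, node):
    """Left and right one-word-window collocates of `node`."""
    left = Counter({a: n for (a, b), n in bigrams.items() if b == node})
    right = Counter({b: n for (a, b), n in bigrams.items() if a == node})
    return left, right


def write_frequency_csv(freq, path):
    """Step 9: export a ranked frequency list (column names are hypothetical)."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["rank", "token", "frequency"])
        for rank, (token, count) in enumerate(freq.most_common(), start=1):
            writer.writerow([rank, token, count])


# Tiny in-memory example instead of the real 308 files.
docs = {
    "a.txt": "Twinkle, twinkle,\nlittle star -",
    "b.txt": "Mary had a little\u00a0lamb , little lamb",
}
freq, bigrams = profile(docs)
left, right = collocates(bigrams, "little")
print(freq.most_common(3))
print("left:", dict(left), "right:", dict(right))
```

Because punctuation-only tokens are removed before pairs are counted, words separated only by a free-standing comma or dash are treated as adjacent. Whether the original analysis did the same is not stated in the steps, so this ordering is an assumption that follows the step numbering.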

Institutions

State University of Malang
East Java, Malang
