Computational Distributional Profiling
Description
These files are derived analytical outputs generated by computational profiling of the English Nursery Rhymes corpus. They do not modify the original Kaggle dataset; they provide processed analytical results: token frequency, word dispersion, concordance/KWIC records, collocation statistics, and a sample distribution of the token "little". Together, these files provide structured evidence for assessing whether the corpus reflects key characteristics of children’s song lyrics, particularly lexical repetition, word distribution, and local word associations. The files can be reused for corpus linguistics, computational text analysis, educational language research, and comparative nursery rhyme studies.
Steps to reproduce
1. Download the source corpus from the Kaggle repository terencebroad/english-nursery-rhymes.
2. Extract the 308 plain-text nursery rhyme files and treat each file as one document.
3. Load all text files into a corpus-processing environment.
4. Perform input checking to record character-level statistics, including total characters, ASCII and non-ASCII characters, whitespace characters, dash variants, and encoding anomalies.
5. Apply minimal whitespace normalization by converting line breaks and non-breaking spaces into standard spaces and collapsing repeated whitespace.
6. Tokenize the normalized texts using whitespace-based tokenization.
7. Remove punctuation-only tokens while preserving the original lexical forms of the corpus.
8. Generate token frequency rankings, dispersion statistics, concordance/KWIC records, and collocation outputs using left and right one-word windows.
9. Export the analytical outputs as CSV or Excel files for reuse and corpus-level assessment.
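The steps above could be sketched in Python roughly as follows. This is an illustrative sketch, not the script used to produce the files. Several details are assumptions that the steps do not specify: the set of characters treated as punctuation and as dash variants, the use of document range (number of documents containing a token) as the dispersion measure, case-sensitive matching, the three-token KWIC context width, and all function, column, and folder names.

```python
import csv
import re
import string
from collections import Counter
from pathlib import Path

# Assumed character sets; the dataset description does not list them.
DASHES = "-\u2010\u2011\u2012\u2013\u2014\u2015"
PUNCT = set(string.punctuation) | set("\u2018\u2019\u201c\u201d\u2026") | set(DASHES)


def load_corpus(folder):
    """Steps 2-3: read every .txt file in a folder, one document per file."""
    return {p.name: p.read_text(encoding="utf-8", errors="replace")
            for p in sorted(Path(folder).glob("*.txt"))}


def check_input(text):
    """Step 4: character-level statistics for one raw document."""
    return {
        "total_chars": len(text),
        "ascii_chars": sum(ch.isascii() for ch in text),
        "non_ascii_chars": sum(not ch.isascii() for ch in text),
        "whitespace_chars": sum(ch.isspace() for ch in text),
        "dash_variants": sum(ch in DASHES for ch in text),
    }


def normalize(text):
    """Step 5: map line breaks and non-breaking spaces to single spaces."""
    return re.sub(r"\s+", " ", text.replace("\u00a0", " ")).strip()


def tokenize(text):
    """Steps 6-7: whitespace tokens, dropping punctuation-only tokens.

    Tokens keep their original form (case and attached punctuation).
    """
    return [t for t in normalize(text).split(" ")
            if t and not all(ch in PUNCT for ch in t)]


def profile(docs, node, width=3):
    """Step 8: frequency, document range, KWIC rows, and L1/R1 collocates."""
    tokens_by_doc = {name: tokenize(text) for name, text in docs.items()}
    freq, doc_range = Counter(), Counter()
    for toks in tokens_by_doc.values():
        freq.update(toks)
        doc_range.update(set(toks))  # each document counts once per token

    left, right, kwic = Counter(), Counter(), []
    for name, toks in tokens_by_doc.items():
        for i, tok in enumerate(toks):
            if tok != node:
                continue
            if i > 0:
                left[toks[i - 1]] += 1       # one-word left window
            if i + 1 < len(toks):
                right[toks[i + 1]] += 1      # one-word right window
            kwic.append({
                "doc": name,
                "left": " ".join(toks[max(0, i - width):i]),
                "node": tok,
                "right": " ".join(toks[i + 1:i + 1 + width]),
            })
    return freq, doc_range, left, right, kwic


def write_frequency_csv(freq, doc_range, n_docs, path):
    """Step 9: export a ranked frequency table (column names are hypothetical)."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["rank", "token", "frequency", "doc_range", "doc_proportion"])
        for rank, (tok, n) in enumerate(freq.most_common(), start=1):
            writer.writerow([rank, tok, n, doc_range[tok],
                             round(doc_range[tok] / n_docs, 4)])


# Toy in-memory sample for illustration only; not taken from the dataset files.
sample = {
    "a.txt": "Mary had a little lamb,\nlittle lamb - little lamb",
    "b.txt": "Twinkle twinkle little\u00a0star",
}
freq, doc_range, left, right, kwic = profile(sample, "little")
print(freq.most_common(3))
print(right.most_common())
```

For the actual corpus, `load_corpus` would be pointed at the folder holding the extracted text files, and `write_frequency_csv` would be called with the number of loaded documents. Because tokens keep their attached punctuation and case, forms such as "lamb," and "lamb" are counted separately in this sketch.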
Institutions
- State University of Malang, East Java, Malang