InfantCry-DBL: A Two-Tier Annotated Corpus of Infant Cries Labelled with Dunstan Baby Language Categories (eairh, eh, heh, neh, owh)
Description
A per-clip manifest (metadata.csv, 1,551 rows) accompanies the audio and lists, for each WAV file, its tier, class, anonymised clip ID, gender (Tier 2), duration, sample rate, channel count, bit depth, and file size. The parent folder name is the authoritative class label in all cases. A README.md and an extended DATASET_DESCRIPTION.md are included with full methodology, known issues, recommended evaluation protocol, and citation details. Released under CC BY 4.0.
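As a minimal sketch of how the manifest might be used for a first sanity check, the snippet below tallies clips per tier and class. The column names `tier` and `class` are assumptions made for illustration; the actual header row of metadata.csv should be checked before use.

```python
import csv
from collections import Counter


def load_manifest(path):
    """Read the manifest into a list of dicts, one per clip."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def count_by_tier_and_class(rows, tier_col="tier", class_col="class"):
    """Tally clips per (tier, class) pair.

    tier_col and class_col are hypothetical column names; pass the real
    ones if the manifest header differs.
    """
    return Counter((row[tier_col], row[class_col]) for row in rows)


# Example usage (assumes metadata.csv sits in the working directory):
# rows = load_manifest("metadata.csv")
# for (tier, label), n in sorted(count_by_tier_and_class(rows).items()):
#     print(tier, label, n)
```

If the totals do not add up to 337 clips for Tier 1 and 1,214 for Tier 2, the folder names remain the authoritative labels and the manifest can be rebuilt from them.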
Steps to reproduce
TIER 1: Dunstan-Core (337 clips, studio quality)
Audio was extracted from the canonical Dunstan Baby Language instructional video corpus (~75 min of source video) and manually segmented into single-vocalisation clips. Clips with narrator speech, background music, or overlapping vocalisations were rejected at segmentation. No resampling, denoising or normalisation was applied. Final yield: 337 clips, 44.1 kHz stereo PCM, 16-bit, median duration 2.44 s.

TIER 2: InfantCry-1214 (1,214 clips, in-the-wild)
Three independent source streams were combined to expose acoustic, demographic and environmental variability:

(A) CryCeleb2023-derived (620 clips). The public CryCeleb2023 release (~29,000 cry segments, 786 infants) was filtered on duration (0.4–7 s), presence of a single isolatable vocalisation, and absence of speech overlap, yielding 1,842 candidates. Two trained annotators independently assigned a DBL label to each candidate; disagreements were discarded. 620 clips were retained (33.7% of candidates; 66.3% discarded). Cohen's kappa on this subset = 0.87.

(B) Online DBL clips (324 clips). Publicly available YouTube and parenting-channel content was scanned for DBL exemplars other than the original Dunstan instructional videos, which are reserved for Tier 1 to prevent cross-tier leakage. The same dual-annotation protocol was applied.

(C) Volunteer home recordings (270 clips). 14 families recorded their infants on consumer hardware (smartphones, USB microphones) in domestic conditions, under written parental informed consent permitting research use and redistribution. The same dual-annotation protocol was applied.

ANNOTATION PROTOCOL
Annotators were trained on a written DBL rubric and a Tier-1 calibration set before labelling. Every Tier-2 candidate received two independent labels; concordant labels were accepted as ground truth and discordant labels were discarded (no forced consensus). A third annotator independently re-labelled a stratified random sample of 50 CryCeleb-derived clips for tertiary validation.
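The dual-annotation rule and the Cohen's kappa statistic used to summarise agreement can be illustrated with a short standard-library sketch. This is not the authors' annotation tooling; the function names and the toy labels are hypothetical, and kappa is computed here in its usual unweighted two-rater form.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def concordant_only(clip_ids, labels_a, labels_b):
    """Keep clips on which both annotators agree; drop the rest."""
    return {cid: a for cid, a, b in zip(clip_ids, labels_a, labels_b) if a == b}


# Toy example with made-up labels (not taken from the dataset):
ids = ["c1", "c2", "c3", "c4"]
rater_a = ["neh", "neh", "owh", "owh"]
rater_b = ["neh", "owh", "owh", "owh"]
kept = concordant_only(ids, rater_a, rater_b)   # c2 is discarded
kappa = cohen_kappa(rater_a, rater_b)
```

Note that kappa is computed over all doubly labelled candidates, before discordant clips are removed; computing it on the retained clips alone would trivially give perfect agreement.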
INTER-RATER RELIABILITY
Global Cohen's kappa (Tier 2) = 0.89 (95% CI 0.85–0.93). Per-class kappa: owh 0.94, eairh 0.91, neh 0.85, eh 0.83. Third annotator vs consensus (n=50): kappa = 0.84.

AUDIO PROCESSING
None applied to released files; original sample rate, channels and bit depth are preserved. Users requiring uniformity should resample to 16 kHz mono before feature extraction.

RECOMMENDED EVALUATION
Use stratified, group-aware splits by clip_id, so that no infant identifier appears in both train and test, with class stratification within each split. A 70/15/15 train/val/test split at the infant level is suggested. Report macro-F1, per-class F1, balanced accuracy and the full confusion matrix. For combined Tier 1 + Tier 2 work, additionally report Tier 1 → Tier 2 zero-shot transfer to quantify the domain-shift gap.

REBUILDING THE METADATA
The included _build_metadata.py (Python >=3.8, standard library only: wave, csv, os, re) walks both tier folders, probes each WAV header, parses filenames for clip_id/gender/class, and emits metadata.csv (1,551 rows).
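The group-aware part of the recommended evaluation protocol can be sketched as follows. The mapping from clip ID to infant identifier is an assumption: how an infant identifier is recovered from clip_id is not specified here, so the caller must supply it. The sketch assigns whole infants to splits and does not attempt the additional class stratification, which would need a dedicated routine.

```python
import random
from collections import defaultdict


def group_split(clip_to_infant, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign whole infants to train/val/test so no infant spans two splits.

    clip_to_infant is a hypothetical dict {clip_id: infant_id} built by the
    caller. Ratios apply to the number of infants, not the number of clips.
    """
    groups = defaultdict(list)
    for clip_id, infant in clip_to_infant.items():
        groups[infant].append(clip_id)

    infants = sorted(groups)                 # deterministic starting order
    random.Random(seed).shuffle(infants)     # reproducible shuffle

    n_train = round(ratios[0] * len(infants))
    n_val = round(ratios[1] * len(infants))
    parts = {
        "train": infants[:n_train],
        "val": infants[n_train:n_train + n_val],
        "test": infants[n_train + n_val:],
    }
    # Expand each infant back into its clips.
    return {
        name: sorted(c for inf in members for c in groups[inf])
        for name, members in parts.items()
    }


# Toy example: 20 made-up infants with 3 clips each.
toy = {f"clip{i}_{j}": f"infant{i}" for i in range(20) for j in range(3)}
splits = group_split(toy)
```

Splitting on infants rather than clips keeps multiple cries from the same child out of both train and test, which is the leakage the protocol is meant to prevent.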
Institutions
- Sana'a University, Amanat Alasimah, Sanaa