LUMINA (Linguistic Unified Multimodal Indonesian Natural Audio-Visual)
LUMINA (Linguistic Unified Multimodal Indonesian Natural Audio-Visual) is a carefully curated, constrained dataset designed to support research on speech perception. All speech is in Indonesian. The dataset contains high-quality audio-visual recordings of 14 native speakers (9 male and 5 female). Each speaker contributes approximately 1,000 sentences, for roughly 14,000 sentences in total. The videos are framed on the speakers' faces, capturing the visual cues and expressions that accompany speech. LUMINA is intended as a resource for studying how humans perceive and process spoken language, and for developing speech recognition and speech synthesis technologies.

As an initial release, LUMINA provides data from one speaker, comprising 500 video recordings. Each recording has been split into two separate files, one for audio and one for video. Each video file is further divided into four segments covering frames 0-20, 20-40, 40-60, and 60-80, all in .mp4 format. The audio files have been split correspondingly, so that each audio segment matches the duration of its video segment, and are provided as .pkl files.
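The segment layout can be made concrete with a short Python sketch that maps each 20-frame video segment to its time span and to the matching range of audio samples. The frame rate (25 fps), the audio sample rate (16 kHz), the reading of each frame range as half-open (start inclusive, end exclusive), and the helper names `segment_table` and `load_audio_segment` are all assumptions made for illustration. The description above does not specify them, and the exact contents of the .pkl files are not documented here.

```python
import pickle
from dataclasses import dataclass

# Frame ranges of the four video segments in the initial release.
# ASSUMPTION: ranges are half-open, i.e. (0, 20) means frames 0..19.
SEGMENT_FRAME_RANGES = [(0, 20), (20, 40), (40, 60), (60, 80)]

# ASSUMPTIONS: not stated in the dataset description; replace with the
# actual values of the released files.
ASSUMED_FPS = 25.0             # video frames per second
ASSUMED_SAMPLE_RATE = 16_000   # audio samples per second


@dataclass(frozen=True)
class Segment:
    index: int
    start_frame: int
    end_frame: int      # exclusive
    start_time: float   # seconds
    end_time: float     # seconds
    start_sample: int
    end_sample: int     # exclusive


def segment_table(fps: float = ASSUMED_FPS,
                  sample_rate: int = ASSUMED_SAMPLE_RATE) -> list[Segment]:
    """Hypothetical helper: align each video frame range with an audio sample range."""
    segments = []
    for i, (f0, f1) in enumerate(SEGMENT_FRAME_RANGES):
        segments.append(Segment(
            index=i,
            start_frame=f0,
            end_frame=f1,
            start_time=f0 / fps,
            end_time=f1 / fps,
            # Multiply before dividing to keep the arithmetic exact for integer inputs.
            start_sample=round(f0 * sample_rate / fps),
            end_sample=round(f1 * sample_rate / fps),
        ))
    return segments


def load_audio_segment(path: str) -> object:
    """Hypothetical loader: assumes each .pkl file holds a single pickled object.

    Only unpickle files from a trusted source: pickle can execute
    arbitrary code while loading.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    for s in segment_table():
        print(f"segment {s.index}: frames {s.start_frame}-{s.end_frame}, "
              f"{s.start_time:.2f}-{s.end_time:.2f} s, "
              f"samples {s.start_sample}-{s.end_sample}")
```

With these assumed values, each segment spans 0.8 s and 12,800 audio samples. If the released files use a different frame rate or sample rate, pass the actual values to `segment_table`.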