LUMINA (Linguistic Unified Multimodal Indonesian Natural Audio-Visual)

Published: 5 February 2024 | Version 4 | DOI: 10.17632/8fw93k4rny.4


LUMINA (Linguistic Unified Multimodal Indonesian Natural Audio-Visual) is a carefully curated, constrained dataset designed to support research in speech perception. Spoken exclusively in Indonesian, LUMINA contains high-quality audio-visual recordings of 14 native speakers (9 male and 5 female). Each speaker contributes approximately 1,000 sentences. The videos are facial recordings that capture the visual cues and expressions accompanying speech. The dataset is a resource for studying how humans perceive and process spoken language, and for advancing speech recognition and synthesis technologies. It belongs to the class known in the relevant literature as a 'Constrained Audio-Visual Dataset', which is widely used in lip reading and speech synthesis.

The dataset is stored in two separate folders, one for male and one for female speakers. Each folder contains audio files (.wav), resampled and trimmed to a consistent sampling rate of 16,000 Hz, and video files (.mp4), compressed at CRF 28 and cropped to 250 pixels wide by 150 pixels high, with the crop centered on the mouth. Every audio and video file is named in the format P<speaker number>_S<sentence number>. Also included is an Excel (.xlsx) file listing the word combinations, out of 2,500, used during the compilation of the LUMINA dataset.
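As an illustration of the naming format described above, the following Python sketch pairs each audio file with its matching video file. It is a minimal example under stated assumptions, not part of the dataset's tooling: the helper names `parse_name` and `pair_recordings` are hypothetical, the speaker and sentence numbers are assumed to be plain digits (zero-padding, if any, is ignored), and the folder passed in is whatever local path the dataset was extracted to.

```python
import re
from pathlib import Path

# Assumed pattern: "P<speaker number>_S<sentence number>" followed by .wav or .mp4.
# The exact digit width is not stated in the description, so any number of digits is accepted.
NAME_RE = re.compile(r"^P(\d+)_S(\d+)\.(wav|mp4)$", re.IGNORECASE)


def parse_name(filename):
    """Return (speaker, sentence, extension), or None if the name does not match."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3).lower()


def pair_recordings(folder):
    """Map (speaker, sentence) to a dict of available files, keyed by "wav" and "mp4"."""
    pairs = {}
    for path in Path(folder).rglob("*"):
        if not path.is_file():
            continue
        parsed = parse_name(path.name)
        if parsed is None:
            continue  # skip files such as the .xlsx word list
        speaker, sentence, ext = parsed
        pairs.setdefault((speaker, sentence), {})[ext] = path
    return pairs
```

Calling `pair_recordings` on the extracted dataset root would then give one entry per speaker and sentence; entries missing either the "wav" or the "mp4" key indicate an utterance without its counterpart.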



Institut Sains Terapan dan Teknologi Surabaya, Universitas Negeri Malang


Audio Recording, Audio Synthesis, Video Recording