EEG Dataset for Second Language Acquisition Arabic and Hindi

Published: 26 May 2026 | Version 1 | DOI: 10.17632/7cdyp2r5cz.1

Description

This dataset contains electroencephalography (EEG) recordings from 20 healthy adult volunteers (10 Yemeni native Arabic speakers and 10 Indian native Hindi speakers) during a controlled second language (L2) acquisition task. EEG was recorded with a 40-electrode Virgo EEG system (Allengers), with electrodes positioned according to the international 10–20 system, impedance kept below 15 kΩ, and a sampling rate of 256 Hz. Stimuli consisted of 24 single words (12 Arabic + 12 Hindi, semantically matched), each presented visually for 10 s and followed by a 5 s rest interval.

The dataset is released in two complementary forms:

(1) RAW: 300 European Data Format (.edf) files, organised per participant and per language. Each participant has their own folder containing an "Arabic" and a "Hindi" subfolder. File-naming convention: S<participant>_<language>_SEE<session>.edf, e.g. S4_A_SEE1.edf (participant 4, Arabic, session 1) or S4_H_SEE4.edf (participant 4, Hindi, session 4). Each participant completed 15 sessions in total, split between the two languages.

(2) PROCESSED: a single consolidated CSV file (~1.47 GB) ready for machine-learning pipelines, containing 18 columns: 17 EEG channels (FP1, FP2, F7, F3, F4, F8, A1, T3, CZ, T4, A2, T5, P3, P4, T6, O1, O2) followed by a "label" column indicating the stimulus language (Arabic / Hindi).

Participants: 12 male, 8 female; ages 18–46 years (mean 29.26 ± 7.72). All participants reported normal hearing and no chronic disease or neurological disorder, and provided written informed consent for public release of de-identified EEG data.

Recordings were performed at Medicover Hospital, Aurangabad, Maharashtra, India, between 26 May 2022 and 09 August 2022. Ethics approval: Department of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, in cooperation with Medicover Hospital (Ref. 91/136/2020).
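As an illustration of the RAW file-naming convention described above, the following minimal Python sketch splits a file name into participant, language, and session. The names parse_recording_name, Recording, and LANGUAGES are hypothetical helpers introduced here, and the sketch assumes the language code is always an upper-case A or H and the extension is lower-case .edf, as in the two examples given.

```python
import re
from typing import NamedTuple, Optional

# Convention from the description: S<participant>_<language>_SEE<session>.edf
# Assumption: language code is exactly "A" or "H"; extension is lower-case.
_EDF_NAME = re.compile(
    r"^S(?P<participant>\d+)_(?P<language>[AH])_SEE(?P<session>\d+)\.edf$"
)

LANGUAGES = {"A": "Arabic", "H": "Hindi"}


class Recording(NamedTuple):
    participant: int
    language: str
    session: int


def parse_recording_name(filename: str) -> Optional[Recording]:
    """Return the parsed fields, or None if the name does not follow the convention."""
    match = _EDF_NAME.match(filename)
    if match is None:
        return None
    return Recording(
        participant=int(match["participant"]),
        language=LANGUAGES[match["language"]],
        session=int(match["session"]),
    )


print(parse_recording_name("S4_A_SEE1.edf"))
print(parse_recording_name("S4_H_SEE4.edf"))
```

Returning None for non-matching names makes it easy to skip unrelated files when walking the participant folders.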
The dataset supports research in neurolinguistics, brain–computer interfaces, EEG signal processing, feature engineering, and machine/deep learning for cross-language neural pattern recognition, with a unique focus on Arabic vs. Hindi as L2.

If you use this dataset, please cite BOTH of the following:

Aldhaheri, T. A., Kulkarni, S. B., Bhise, P. R., & Tawfik, M. (2025). Utilizing machine and deep learning algorithms to identify learning-related features in electroencephalography data during second language acquisition. Cogent Arts and Humanities, 12(1). https://doi.org/10.1080/23311983.2025.2485696

Aldhaheri, T. A., Kulkarni, S. B., & Al-Zidi, N. M. (2026). Optimizing machine learning models with multi-feature selection for EEG analysis in second language acquisition research. Discover Artificial Intelligence, 6, 99. https://doi.org/10.1007/s44163-025-00801-z
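Because the processed CSV is about 1.47 GB, it may be preferable to stream it row by row rather than load it into memory at once. The sketch below uses only the Python standard library to count rows per stimulus language and convert the counts to seconds at 256 Hz. It assumes the file has a header row whose last column is literally named "label"; count_labels is a hypothetical helper, and the tiny in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter
from typing import TextIO

SAMPLING_RATE_HZ = 256  # sampling rate stated in the dataset description


def count_labels(handle: TextIO, label_column: str = "label") -> Counter:
    """Stream a CSV and count rows per label without loading the whole file."""
    reader = csv.DictReader(handle)
    # Assumption: the CSV has a header row that includes the label column.
    if reader.fieldnames is None or label_column not in reader.fieldnames:
        raise ValueError(f"no {label_column!r} column in header: {reader.fieldnames}")
    counts: Counter = Counter()
    for row in reader:
        counts[row[label_column]] += 1
    return counts


# Tiny stand-in for the real file (two channels only, made-up values).
sample = io.StringIO(
    "FP1,FP2,label\n0.1,0.2,Arabic\n0.3,0.4,Arabic\n0.5,0.6,Hindi\n"
)
for label, n_rows in count_labels(sample).items():
    # Each row is one sample, so rows / 256 gives seconds of recording.
    print(label, n_rows, n_rows / SAMPLING_RATE_HZ)
```

For the real file, the same function can be called on a handle opened with open(path, newline=""), where path is wherever the CSV was downloaded.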

Files

Steps to reproduce

EQUIPMENT
- Virgo EEG system (Allengers) with 40 active electrodes; impedance kept below 15 kΩ; sampling rate 256 Hz.
- 15.6-inch laptop (resolution 1024 × 768) for visual stimulus presentation.
- Conductive gel for electrode contact.

ENVIRONMENT
Closed, isolated, low-noise recording room. Participant seated on a chair 90 cm from the stimulus display.

PARTICIPANTS
20 healthy adults: 10 Yemeni native Arabic speakers + 10 Indian native Hindi speakers (12 male, 8 female; ages 18–46). Pre-screening confirmed normal hearing and absence of chronic disease or neurological disorder. Written informed consent obtained before recording.

ELECTRODE PLACEMENT
International 10–20 system. Scalp channels included in the processed CSV: FP1, FP2, F7, F3, F4, F8, T3, CZ, T4, T5, P3, P4, T6, O1, O2. A1 and A2 correspond to mastoid reference positions.

STIMULI
24 single words: 12 Arabic + 12 Hindi, semantically matched (car, mother, house, book, teacher, father, day, sleeping, sea, week, sister, no). Each word shown for 10 s with an accompanying picture clarifying its meaning. English glosses provided. Stimuli alternated between Arabic and Hindi.

PER-SESSION PROCEDURE
1. Participant focuses on the on-screen word stimulus for 10 s.
2. 5 s rest interval with no stimulus displayed.
3. Steps 1–2 are repeated until all words in both languages have been presented.
Each participant completed 15 recording sessions, distributed across Arabic and Hindi (the exact split varies per participant, e.g. 8 Arabic / 7 Hindi or vice versa). Across the 20 participants this produced 300 raw .edf recordings.

DATA STORAGE AND INITIAL INSPECTION
Raw recordings stored in European Data Format (.edf). The EDFbrowser tool (https://www.teuniz.net/edfbrowser/) was used to open the files and perform initial visual inspection.

PROCESSED CSV
The .edf files were read sample-wise and concatenated into a single CSV (~1.47 GB). Each row is one multi-channel EEG sample, and a "label" column was appended indicating the stimulus language (Arabic or Hindi) at the time of recording. Researchers requiring filtered, artifact-removed, or epoched data are advised to re-run their preferred preprocessing pipeline (e.g. MNE-Python, EEGLAB) directly on the original .edf files.

ETHICS
Study approved by the Department of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, in cooperation with Medicover Hospital, Aurangabad, India (Ref. 91/136/2020). Written informed consent obtained from all participants.
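For readers who want to epoch the recordings themselves, the timing above (10 s stimulus, 5 s rest, 256 Hz) translates into sample indices as in the following Python sketch. It is only an approximation under strong assumptions that are not confirmed by the description: that each recording starts exactly at the first stimulus onset, that timing is exact with no drift or lead-in, and that the number of words per recording is known to the user. stimulus_windows is a hypothetical helper.

```python
from typing import List, Tuple

SAMPLING_RATE_HZ = 256  # stated sampling rate
STIMULUS_S = 10         # each word is shown for 10 s
REST_S = 5              # followed by a 5 s rest


def stimulus_windows(
    n_words: int,
    fs: int = SAMPLING_RATE_HZ,
    stimulus_s: int = STIMULUS_S,
    rest_s: int = REST_S,
) -> List[Tuple[int, int]]:
    """Return (start, stop) sample indices of each stimulus period, stop exclusive.

    Assumption: the recording begins at the first stimulus onset and every
    trial lasts exactly stimulus_s + rest_s seconds.
    """
    trial_samples = (stimulus_s + rest_s) * fs  # 15 s * 256 Hz = 3840 samples
    stimulus_samples = stimulus_s * fs          # 10 s * 256 Hz = 2560 samples
    windows = []
    for i in range(n_words):
        start = i * trial_samples
        windows.append((start, start + stimulus_samples))
    return windows


# Three trials, purely for illustration; the real count depends on the recording.
for start, stop in stimulus_windows(3):
    print(start, stop)
```

The resulting windows should be checked against the actual length of each .edf file before use, since any offset at the start of a recording would shift every window.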

Institutions

Categories

Electroencephalography in Neurosurgery, Brain Computer Interface in Rehabilitation

Licence