107-Person Survey Dataset: Public Acceptance of Computer-Vision Sign Language Recognition in South Asia (Hearing the Unheard)

Creator: Aonyendo Paul Neteish
Published: 2026-06-25T20:26:23.840Z
Keywords: Social Sciences, Computer Science Applications, Computer Vision, Information System, Data Science, Machine Learning, Artificial Intelligence Applications, Computer Science Public Policy, Sign Language, Deep Learning, AI-Human Interaction

Neteish, Aonyendo Paul; Biswas, Barno

doi:10.17632/mw9sn8m4n5.1

Published: 25 June 2026 | Version 1 | DOI: 10.17632/mw9sn8m4n5.1

Contributors: Aonyendo Paul Neteish, Barno Biswas

Description

"Hearing the Unheard" is a survey dataset capturing the general public's awareness, attitudes, willingness to adopt, and concerns toward computer-vision-based Sign Language Recognition (SLR) technology used to communicate with deaf and hard-of-hearing (DHH) people. It contains 107 anonymous responses collected from individual members of the public through an online Google Forms questionnaire between February and April 2024. The dataset supports the peer-reviewed paper "Hearing the Unheard" (ACM Digital Library, DOI: 10.1145/3723178.3723277) and is shared to enable replication and new research on technology acceptance, accessibility, and inclusive design.

Each of the 17 questions covers one of several themes: demographics (gender, age group, occupation), prior exposure to deaf individuals or sign language, familiarity with SLR technology, perceived benefits, likelihood of using a translation app, willingness to learn sign language, support for integrating SLR into public services (e.g., hospitals, police stations), useful application contexts, worries (cost, accuracy, privacy, over-reliance), the importance of privacy and consent during development, adoption factors, perceived drawbacks, and open-ended feedback. Items include single-choice Likert scales, categorical questions, multiple-choice ("select all that apply") questions, and optional free text.

The dataset is provided in three progressively processed versions so that it serves both social-science analysis and machine learning / deep learning model training:
1. Raw - a faithful, value-for-value copy of the original export, with a stable respondent_id and machine-friendly column codes; nothing is recoded or dropped.
2. Cleaned - a tidy, human-readable table with trimmed whitespace and parsed ISO 8601 timestamps (collected in GMT+6), checked for duplicates, with no responses removed or imputed.
3. ML/DL-ready - a fully numeric feature matrix (107 rows x 41 columns, zero missing values). Ordinal/Likert answers are encoded as rank-preserving integers, nominal variables are one-hot encoded, multi-select questions are expanded into multi-hot indicator columns, and binary flags summarize open text. This version loads directly into scikit-learn, XGBoost, PyTorch, or TensorFlow/Keras.

A complete data dictionary ("data_dictionary.csv") documents every variable and its encoding, and a single reproducible Python script regenerates all files deterministically from the original export. No personally identifiable information was collected; participation was voluntary and anonymous. Suggested uses include modelling adoption likelihood or support for public-service integration, segmenting attitude profiles, and analyzing how concerns relate to willingness to adopt.

Keywords: sign language recognition, public perception, technology acceptance, deaf and hard-of-hearing, accessibility, computer vision, survey dataset, human-computer interaction.
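The three encoding styles described for the ML/DL-ready version can be illustrated with a minimal pandas sketch. This is not the authors' pipeline: the column names, answer labels, and the ", " separator for multi-select answers are hypothetical stand-ins, and the real codes are the ones documented in data_dictionary.csv.

```python
import pandas as pd

# Hypothetical toy responses; labels and column names are illustrative only.
toy = pd.DataFrame({
    "familiarity": ["Not at all", "Extremely", "Moderately"],
    "occupation": ["Student", "Employed", "Student"],
    "worries": ["Cost, Privacy", "Accuracy", "Cost, Accuracy, Privacy"],
})

# Ordinal item: map each label to its rank so the order is preserved (0-4).
familiarity_rank = {
    "Not at all": 0, "Slightly": 1, "Moderately": 2, "Very": 3, "Extremely": 4,
}
ordinal = toy["familiarity"].map(familiarity_rank).to_frame("familiarity")

# Nominal item: one-hot columns, one per category, with no implied order.
one_hot = pd.get_dummies(toy["occupation"], prefix="occupation", dtype=int)

# Multi-select item: split each answer and set one indicator per option
# (assumes options are joined with ", " in the export).
multi_hot = toy["worries"].str.get_dummies(sep=", ").add_prefix("worries_")

# Assemble the fully numeric feature matrix.
matrix = pd.concat([ordinal, one_hot, multi_hot], axis=1)
print(matrix.shape)
print(matrix)
```

A respondent who ticks several options gets a 1 in each matching indicator column, which is why multi-select questions widen the matrix more than single-choice ones.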

Files

Steps to reproduce

The published raw file is the reproduction source. From it, a single Python script ("scripts/prepare_dataset.py") regenerates every derived file deterministically: re-running it on the same input always produces byte-identical outputs. The original raw responses are never altered.

Uploaded files:
- 01_raw/hearing_the_unheard_raw.csv (107 responses; verbatim answers plus a stable respondent_id and short, machine-friendly column codes)
- 02_cleaned/hearing_the_unheard_cleaned.csv (cleaned, human-readable)
- 03_ml_ready/hearing_the_unheard_ml_ready.csv (fully numeric feature matrix, 107 x 41)
- data_dictionary.csv (description and encoding of every variable)
- scripts/prepare_dataset.py (the reproduction pipeline)

Requirements:
- Python 3.11 or newer
- pandas 2.0 or newer (developed and tested with pandas 3.0):

    pip install "pandas>=2.0"

Step 1 - Arrange the files exactly as uploaded, keeping the folder names: place 01_raw, 02_cleaned, 03_ml_ready, and scripts side by side in one project root, with data_dictionary.csv in that root. The script resolves all paths relative to this structure, so no editing is needed. Only 01_raw and scripts are strictly required as inputs; the other files are regenerated.

Step 2 - Run the pipeline:

    cd scripts
    python prepare_dataset.py

Step 3 - Collect the outputs. The script reads 01_raw/hearing_the_unheard_raw.csv and writes/overwrites:
- 02_cleaned/hearing_the_unheard_cleaned.csv
- 03_ml_ready/hearing_the_unheard_ml_ready.csv
- data_dictionary.csv

What the script does internally:
1. CLEANED: strips leading/trailing whitespace from every text cell, parses the Google Forms timestamp (e.g., "2024/02/22 9:16:49 PM GMT+6") into ISO 8601 ("2024-02-22 21:16:49", collected in GMT+6), checks for and removes exact duplicate submissions (none are present), and preserves the original respondent_id.
2. ML/DL-READY: encodes answers numerically - ordinal/Likert items as rank-preserving integers (e.g., familiarity 0-4, importance 0-4, age 0-2; Yes/No/Not sure as 0/1/2); gender and open-text presence as binary flags; nominal variables (occupation, prior contact) as one-hot columns; and multi-select questions (useful contexts, worries, adoption factors) as multi-hot indicator columns with an "other" flag for free-text answers.
3. Validates that the numeric matrix has no missing values, then writes the data dictionary.

Step 4 - Verify (optional). The console should report "107 rows" for each file and "missing cells in ML matrix: 0". You can also load the matrix yourself, running from the project root so that the relative path resolves:

    import pandas as pd
    df = pd.read_csv("03_ml_ready/hearing_the_unheard_ml_ready.csv")
    print(df.shape)  # (107, 41)
    print(df.isna().sum().sum())  # 0

No internet access, API keys, or personal data are required; the process is fully offline and reproducible.
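Two of the claims above lend themselves to a quick independent check: the timestamp conversion and the byte-identical outputs. The sketch below uses only the standard library; parse_forms_timestamp and file_sha256 are hypothetical helper names, not functions from scripts/prepare_dataset.py, and the parsing approach is an assumption about one reasonable way to obtain the documented result.

```python
import hashlib
from datetime import datetime


def parse_forms_timestamp(raw: str) -> str:
    """Convert a Google Forms timestamp to ISO 8601, keeping GMT+6 wall-clock time."""
    # Drop the trailing zone label ("GMT+6"); no timezone conversion is applied.
    stamp = raw.strip().rsplit(" GMT", 1)[0]
    parsed = datetime.strptime(stamp, "%Y/%m/%d %I:%M:%S %p")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


def file_sha256(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# The example timestamp quoted in the cleaning step.
print(parse_forms_timestamp("2024/02/22 9:16:49 PM GMT+6"))  # 2024-02-22 21:16:49
```

To test determinism, one could record file_sha256 for each derived file, re-run the pipeline, and compare the digests; identical digests mean the outputs are byte-identical.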

Institutions

American International University-Bangladesh
Dhaka Division, Dhaka
