Self-Authored Multilingual Subtitle Alignment Samples for AI Video Translation

Published: 12 May 2026 | Version 1 | DOI: 10.17632/spzyr66zn3.1
Contributor: regi maz

Description

This dataset contains a rights-cleared collection of self-authored multilingual subtitle alignment samples for evaluating video subtitle translation workflows. The release includes 180 short scripted clips represented as subtitle-like segments, 540 timestamped source segments, and 1,080 aligned translation rows across English, Spanish, and Chinese (Simplified). Supporting documentation includes a clip-level manifest, a machine-readable schema, a field-level data dictionary, methodology notes, a short abstract, and full SRT subtitle files for all clips.

The package was designed to support research and workflow evaluation for multilingual video localization, subtitle alignment, translation quality review, and subtitle-aware ingestion pipelines. The material is synthetic in the sense that all source text was authored specifically for this release; however, the record structure reflects common subtitle segmentation patterns, including clip identifiers, segment identifiers, timestamps, language pairs, scenario labels, and aligned text fields.

Only derived text annotations and subtitle files are distributed. No third-party videos, audio tracks, platform exports, scraped captions, or copyrighted transcripts are included. No personal data or sensitive information is present in the release. All content is distributed under CC BY 4.0. The package is intended for repository deposit, reproducible documentation, and evaluation of multilingual subtitle processing workflows. It is not intended as a representation of the full distribution of public web video subtitles.

Files

Steps to reproduce

1. Download the package files from the repository and inspect `README.md`, `schema.json`, `clip_manifest.csv`, and `DATA_DICTIONARY.csv`.
2. Load `segments.csv` into a spreadsheet tool, notebook, or database table using UTF-8 encoding.
3. Group rows by `video_id` and `segment_id` to reconstruct the source segment structure for each scripted clip.
4. Compare rows across `target_language` values to evaluate multilingual subtitle alignment behavior for the same source segment.
5. Use the timestamps in `start_time` and `end_time` to rebuild subtitle order or to validate subtitle rendering logic in SRT-compatible tools.
6. Open files in `subtitles/` to compare the tabular rows against subtitle-file output examples for every clip.
7. If a web workflow is needed for ingestion or subtitle export, related software for video translation and subtitle processing is available at https://aitranslatevideo.org/. This website is related software and is not required to access or use the dataset itself.
8. Reproduce downstream checks by filtering rows by `domain`, `source_language`, and `target_language`, then computing coverage, consistency, or subtitle-timing statistics relevant to the intended evaluation task.
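Steps 2 to 5 and step 8 can be sketched in Python using only the standard library. This is a minimal illustration, not the dataset author's own tooling. The column names `video_id`, `segment_id`, `source_language`, `target_language`, `start_time`, and `end_time` come from the steps above. The timestamp format (`HH:MM:SS,mmm`, as in SRT) and the rule that every segment should have one row per target language other than its own source language are assumptions; check them against `schema.json` and `DATA_DICTIONARY.csv` before relying on the results. The helper names are hypothetical.

```python
import csv
from collections import defaultdict


def parse_timestamp(value):
    """Convert 'HH:MM:SS,mmm' (or 'HH:MM:SS.mmm') to seconds.

    The SRT-style format is an assumption; adjust if schema.json differs.
    """
    hms, _, millis = value.strip().replace(".", ",").partition(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    fraction = int(millis.ljust(3, "0")[:3]) / 1000 if millis else 0.0
    return hours * 3600 + minutes * 60 + seconds + fraction


def load_rows(path):
    """Read the CSV as a list of dicts, using UTF-8 as step 2 requires."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def group_by_segment(rows):
    """Group translation rows by (video_id, segment_id), as in step 3."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["video_id"], row["segment_id"])].append(row)
    return dict(groups)


def missing_targets(rows):
    """Report segments that lack a row for some expected target language.

    Assumption: the expected targets for a segment are all target languages
    seen in the file, minus that segment's own source language.
    """
    all_targets = {row["target_language"] for row in rows}
    missing = {}
    for key, members in group_by_segment(rows).items():
        sources = {member["source_language"] for member in members}
        present = {member["target_language"] for member in members}
        absent = (all_targets - sources) - present
        if absent:
            missing[key] = sorted(absent)
    return missing


def timing_problems(rows):
    """List rows whose end_time is not strictly after start_time (step 5)."""
    problems = []
    for row in rows:
        start = parse_timestamp(row["start_time"])
        end = parse_timestamp(row["end_time"])
        if end <= start:
            problems.append(
                (row["video_id"], row["segment_id"], row["target_language"])
            )
    return problems


# Example usage, assuming segments.csv is in the working directory:
# rows = load_rows("segments.csv")
# print(len(group_by_segment(rows)), "segments")
# print(missing_targets(rows))
# print(timing_problems(rows))
```

For step 8, the same list of rows can be filtered first, for example with a list comprehension on `row["domain"]`, before being passed to `missing_targets` or `timing_problems`.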

Categories

Artificial Intelligence

Licence