Bengali to English Word Alignment Dataset

Published: 16 May 2022 | Version 1 | DOI: 10.17632/wzgcyc643k.1
Md. Musfiqur Rahaman


The dataset is in XML format and contains 2,000 manually annotated, word-aligned Bengali-English parallel sentences. The sentences were collected from news articles and encyclopedias. The translations of some sentences were improved using Google Translate, and all punctuation marks were removed from the parallel sentences to avoid tokenization issues. A sample representation of the dataset is given below:

Bengali: বাংলাদেশের জলবায়ু তাপমাত্রায় মৃদু
English: Climate of Bangladesh is mild in temperature
Alignments: 0-1 0-2 1-0 2-5 2-6 3-3 3-4
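The alignment notation in the sample can be read programmatically. The sketch below assumes, based on the sample alone, that each pair is written as Bengali index, hyphen, English index, that indices are zero-based, and that tokens are separated by whitespace. The function names `parse_alignments` and `aligned_words` are hypothetical helpers, not part of the dataset or the aligner tool, and the sketch does not cover reading the XML files, whose schema is not described here.

```python
def parse_alignments(alignment_str):
    """Turn a string such as "0-1 0-2" into a list of (bengali_idx, english_idx) tuples."""
    pairs = []
    for item in alignment_str.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def aligned_words(bengali, english, alignment_str):
    """Group the English tokens aligned to each Bengali token.

    Assumes whitespace tokenization and zero-based indices, as the sample suggests.
    """
    bn_tokens = bengali.split()
    en_tokens = english.split()
    grouped = {}
    for bn_i, en_i in parse_alignments(alignment_str):
        grouped.setdefault(bn_i, []).append(en_i)
    return [
        (bn_tokens[bn_i], [en_tokens[en_i] for en_i in sorted(en_indices)])
        for bn_i, en_indices in sorted(grouped.items())
    ]


# Sample sentence pair taken from the dataset description.
bengali = "বাংলাদেশের জলবায়ু তাপমাত্রায় মৃদু"
english = "Climate of Bangladesh is mild in temperature"
alignments = "0-1 0-2 1-0 2-5 2-6 3-3 3-4"

for bn_word, en_words in aligned_words(bengali, english, alignments):
    print(bn_word, "->", " ".join(en_words))
```

Under these assumptions, the first Bengali token maps to "of Bangladesh" and the last to "is mild", which shows that a single Bengali word can align to several English words.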


Steps to reproduce

We developed a word aligner tool and used it to create this dataset. A link to the tool is provided in the "Related Links" section.


Institution: North South University


Categories: Computer Science, Artificial Intelligence, English Language, Natural Language Processing, Machine Translation, Machine Learning, Bengali Language, Aligner, Neural Network