Bengali to English Word Alignment Dataset
The dataset is in XML format and contains 2,000 manually word-aligned Bengali-English parallel sentence pairs. The sentences were collected from news articles and encyclopedias. The translations of some sentences were improved using Google Translate, and all punctuation marks were removed from the parallel sentences to avoid tokenization issues. A sample from the dataset is given below:
Bengali: বাংলাদেশের জলবায়ু তাপমাত্রায় মৃদু
English: Climate of Bangladesh is mild in temperature
Alignments: 0-1 0-2 1-0 2-5 2-6 3-3 3-4
In this sample, each alignment pair i-j links the Bengali token at position i to the English token at position j, counting from 0. For example, 1-0 links জলবায়ু to Climate, and 0-1 0-2 link বাংলাদেশের to both "of" and "Bangladesh".
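To make the alignment notation concrete, the following minimal Python sketch decodes the sample above into word-level links. It assumes that tokens are separated by whitespace and that each pair is written as Bengali index, hyphen, English index, both 0-based, which is what the sample suggests. The function names parse_alignments and aligned_words are hypothetical and are not part of the dataset or the aligner tool. The sketch works on plain strings only, because the XML element names are not described here.

```python
def parse_alignments(alignment_str):
    """Turn a string such as '0-1 0-2 1-0' into a list of (source, target) index pairs."""
    pairs = []
    for item in alignment_str.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def aligned_words(bengali, english, alignment_str):
    """Group the English tokens linked to each Bengali token.

    Assumes whitespace tokenization and 0-based Bengali-English index pairs.
    """
    bn_tokens = bengali.split()
    en_tokens = english.split()
    grouped = {}
    for i, j in parse_alignments(alignment_str):
        grouped.setdefault(i, []).append(j)
    return [
        (bn_tokens[i], [en_tokens[j] for j in sorted(js)])
        for i, js in sorted(grouped.items())
    ]


# The sample sentence pair from the dataset description.
sample_bengali = "বাংলাদেশের জলবায়ু তাপমাত্রায় মৃদু"
sample_english = "Climate of Bangladesh is mild in temperature"
sample_alignments = "0-1 0-2 1-0 2-5 2-6 3-3 3-4"

sample_pairs = aligned_words(sample_bengali, sample_english, sample_alignments)
for bn_word, en_words in sample_pairs:
    print(bn_word, "->", " ".join(en_words))
```

Grouping by the Bengali index keeps one-to-many links together, so a single inflected Bengali word is shown next to the English phrase it corresponds to.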
Steps to reproduce
We developed a word aligner tool and used it to create this dataset. A link to the tool is provided in the "Related Links" section.