Multilingual Image-Text Dataset for Cross-Lingual NLP and Sentiment Analysis

Name: Multilingual Image-Text Dataset for Cross-Lingual NLP and Sentiment Analysis
Creator: GRANTHA SAHA
Published: 2026-03-12T10:29:19.270Z
Keywords: Social Sciences, Linguistics, Psychology, Computer Science, Artificial Intelligence, Data Science, Natural Language Processing, Multimodality, Text Processing, Sentiment Analysis

doi:10.17632/r6z3xydbzz.3

Published: 12 March 2026 | Version 3 | DOI: 10.17632/r6z3xydbzz.3

Description

The Multilingual Image-Text Dataset for Cross-Lingual NLP and Sentiment Analysis contains 2,860 images with text in Banglish (Romanised Bangla), Romanised Hindi, and English. Each record combines visual and textual information, which makes the dataset well suited to multimodal research. It can be applied to cross-lingual NLP, sentiment analysis, humor and sarcasm detection, political content analysis, and social media research. The dataset supports text-only, image-only, and multimodal learning, in both binary and multi-class settings. Images are provided in their original form, with no preprocessing or annotations, which leaves researchers free to extract features and build models in whatever way they choose. The dataset is ethically sound and contains no personally identifiable information. This flexibility makes it useful for studying multimodal understanding, sentiment prediction, and cross-lingual AI models.

Steps to reproduce

1. Download and check: Unzip the ZIP file that contains all images and texts.
2. Preprocess (optional): Standardize and tokenize the text; scale or normalize the images where necessary.
3. Feature extraction: Use TF-IDF or BERT for text features, CNNs for image features, or other pre-trained models.
4. Data split: Split the data into training, validation, and test sets.
5. Model training: Train text, image, or multimodal models for binary or multi-class classification.
6. Assessment: Evaluate with accuracy, precision, recall, F1-score, or other appropriate measures.
7. Documentation: Record the preprocessing, features, model settings, and random seeds for reproducibility.
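
The preprocessing, data split, and assessment steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the dataset author's pipeline: the function names, the 70/15/15 split ratios, and the seed value 42 are all assumptions chosen for the example, and the record does not specify the label format, so labels are treated here as plain strings.

```python
import random
import re
from collections import defaultdict

SEED = 42  # assumed value; record whichever seed you actually use (step 7)


def normalize_text(text):
    """Lowercase and collapse whitespace: a deliberately light standardization (step 2)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def stratified_split(labels, ratios=(0.7, 0.15, 0.15), seed=SEED):
    """Return (train, val, test) index lists with similar class proportions (step 4).

    The ratios are hypothetical defaults; the remainder after the train and
    validation cuts goes to the test set.
    """
    rng = random.Random(seed)  # local generator keeps the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):  # sorted so the result does not depend on input order of classes
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = int(len(idxs) * ratios[0])
        n_val = int(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (step 6)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Splitting per class keeps minority classes represented in every subset, and macro-averaged F1 weights each class equally, which may matter if the sentiment or sarcasm labels turn out to be imbalanced. In practice, a library such as scikit-learn offers equivalent, better-tested utilities.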

Institutions

University of Frontier Technology, Bangladesh
Dhaka Division, Gazipur
