Multilingual Image-Text Dataset for Cross-Lingual NLP and Sentiment Analysis

Published: 12 March 2026 | Version 3 | DOI: 10.17632/r6z3xydbzz.3
Contributor:

Description

The Multilingual Image-Text Dataset for Cross-Lingual NLP and Sentiment Analysis contains 2,860 images with text in Banglish (Romanised Bangla), Romanised Hindi and English. Each record combines visual and textual information, which makes the dataset well suited to multimodal research. Applications include cross-lingual NLP, sentiment analysis, humor and sarcasm detection, and the study of political content and social media. The dataset supports text-only, image-only and multimodal learning, for both binary and multi-class classification. Images are provided in their original form, with no preprocessing or annotations, so researchers are free to extract features and build models as they see fit. The dataset is ethically compliant and contains no personally identifiable information. It is a flexible resource for studying multimodal understanding, sentiment prediction, cross-lingual AI models and related tasks.

Files

Steps to reproduce

1. Download & Check: Unzip the ZIP file, which contains all images and texts.
2. Preprocess (Optional): Standardize and tokenize the text; scale or normalize the images where necessary.
3. Feature Extraction: Use TF-IDF or BERT for text features, and CNNs or other pre-trained models for image features.
4. Data Split: Split the data into training, validation and test sets.
5. Model Training: Train text, image or multimodal models for classification (binary or multi-class).
6. Assessment: Evaluate with accuracy, precision, recall, F1-score or any other appropriate measure.
7. Documentation: Record the preprocessing steps, features, model settings and random seeds to ensure reproducibility.
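
The download, data split and documentation steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the contributors' own pipeline. The image extensions, the 80/10/10 split ratio, the seed value and the function names (extract_archive, split_items, save_manifest) are all assumptions made for the example; the archive's actual layout is not described on this page.

```python
import json
import random
import zipfile
from pathlib import Path

# Assumption: images use one of these common extensions.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def extract_archive(zip_path, out_dir):
    """Step 1: unzip the archive and return the sorted image paths found."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    return sorted(p for p in out_dir.rglob("*") if p.suffix.lower() in IMAGE_EXTS)


def split_items(items, seed=42, train=0.8, val=0.1):
    """Step 4: shuffle with a fixed seed, then cut into train/val/test.

    The 80/10/10 ratio is an illustrative default, not a recommendation
    taken from the dataset description.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }


def save_manifest(splits, seed, path):
    """Step 7: record the seed and split membership for reproducibility."""
    manifest = {
        "seed": seed,
        "splits": {name: [str(p) for p in members] for name, members in splits.items()},
    }
    Path(path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

A typical call sequence would be extract_archive("dataset.zip", "data") followed by split_items on the returned paths and save_manifest on the result, where "dataset.zip" is a placeholder for the downloaded file name. Because the shuffle uses its own seeded random.Random instance, repeating the split with the same seed and the same input order gives the same partition.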

Institutions

Categories

Social Sciences, Linguistics, Psychology, Computer Science, Artificial Intelligence, Data Science, Natural Language Processing, Multimodality, Text Processing, Sentiment Analysis

Licence