Multilingual Image-Text Dataset for Cross-Lingual NLP and Sentiment Analysis
Description
Multilingual Image-Text Dataset for Cross-lingual NLP and Sentiment Analysis has 2,860 images with text in Banglish (Romanised Bangla), Romanised Hindi and English. Each record combines visual and textual information, which makes the dataset well suited to multimodal research. Applications include cross-lingual NLP, sentiment analysis, humor and sarcasm detection, political content analysis and social media research. The dataset supports text-only, image-only and multimodal learning for both binary and multi-class classification. Images are provided in their original form, with no preprocessing or annotations, leaving researchers free to extract features and build models as they see fit. The dataset is ethical and contains no personally identifiable information. It is a flexible resource for studying multimodal understanding, sentiment prediction, cross-lingual AI models, etc.
Files
Steps to reproduce
1. Download & Check: Unzip the ZIP file that contains all images and texts.
2. Preprocess (Optional): Standardize and tokenize the text; scale/normalize the images (when necessary).
3. Feature Extraction: Extract text features with TF-IDF or BERT, and image features with CNNs or other pre-trained models.
4. Data Split: Split the data into training, validation and test sets.
5. Model Training: Train text, image or multimodal models for classification (binary or multi-class).
6. Assessment: Evaluate with accuracy, precision, recall, F1-score or any other appropriate evaluation measure.
7. Documentation: Document the preprocessing, features, model settings and random seeds for reproducibility.
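As an illustration of steps 4, 6 and 7, the following is a minimal sketch in standard-library Python of a seeded, stratified train/validation/test split and a macro-averaged F1 score. The function names, the 70/15/15 ratio, the seed value and the label strings are assumptions made for this example; the dataset does not prescribe a split, a metric implementation or a label vocabulary, and the demo labels are synthetic placeholders rather than values read from the ZIP file.

```python
import random
from collections import defaultdict


def stratified_split(labels, percents=(70, 15, 15), seed=42):
    """Split sample indices into train/val/test, preserving class proportions.

    `percents` are integer percentages (assumed 70/15/15 here); any rounding
    remainder within a class goes to the test portion.
    """
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):  # sorted for a deterministic class order
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = len(idxs) * percents[0] // 100
        n_val = len(idxs) * percents[1] // 100
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Synthetic placeholder labels (hypothetical names, not taken from the dataset).
demo_labels = ["positive"] * 60 + ["negative"] * 40
train_idx, val_idx, test_idx = stratified_split(demo_labels, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))
```

Recording the seed and the split percentages alongside the model settings is one way to satisfy the documentation step, since calling `stratified_split` again with the same arguments returns the same index lists.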
Institutions
- University of Frontier Technology, Bangladesh, Dhaka Division, Gazipur