MVFR: Multilingual Visual Font Recognition Synthetic Dataset

Published: 29 January 2024 | Version 1 | DOI: 10.17632/cnd2wh65my.1
Contributor:
Moshiur Rahman Tonmoy

Description

MVFR is a synthetic dataset for multilingual visual font recognition covering four widely used languages: Bangla, Hindi, Spanish, and Russian. It was created in three steps. First, lists of common words for each of the four languages were collected from Kaggle, the open data science platform. Second, the 10 most popular fonts for each language were sourced from open-source font-sharing platforms. Third, a data generator written in Python with the Pillow library rendered the words in each font onto synthetic 400x200 images with a white background. Each language comprises 50,000 images, 5,000 for each of its 10 fonts, for a total of 200,000 images across the four languages. The dataset also includes the Python generator script, which can be used to produce visual font recognition data for other languages. Researchers can use the MVFR dataset and the generator script to train and evaluate AI models for font recognition across multiple languages.
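The generation process described above can be sketched roughly as follows. This is a minimal illustration, not the generator script shipped with the dataset: the function names (`plan_samples`, `render_word`, `generate`), the font size, the centred text placement, the random word selection, and the one-folder-per-font output layout are all assumptions. Only the 400x200 white canvas, the use of Pillow, and the 5,000 images per font come from the description.

```python
import random
from pathlib import Path

IMAGE_SIZE = (400, 200)   # width x height, as stated in the description
IMAGES_PER_FONT = 5_000   # as stated in the description


def plan_samples(words, font_paths, images_per_font=IMAGES_PER_FONT, seed=0):
    """Yield (font_path, word, index) triples, one per image to render.

    Assumption: each image shows one word drawn at random from the list.
    """
    rng = random.Random(seed)
    for font_path in font_paths:
        for index in range(images_per_font):
            yield font_path, rng.choice(words), index


def render_word(word, font_path, font_size=48):
    """Render one word in black on a white 400x200 canvas.

    Pillow is imported here so that plan_samples works without it.
    font_size and the centred placement are assumptions.
    """
    from PIL import Image, ImageDraw, ImageFont

    font = ImageFont.truetype(font_path, font_size)
    image = Image.new("RGB", IMAGE_SIZE, color="white")
    draw = ImageDraw.Draw(image)
    centre = (IMAGE_SIZE[0] // 2, IMAGE_SIZE[1] // 2)
    # anchor="mm" centres the text on the given point (Pillow 8.0 or later).
    draw.text(centre, word, font=font, fill="black", anchor="mm")
    return image


def generate(words, font_paths, out_dir, images_per_font=IMAGES_PER_FONT):
    """Write images to out_dir/<font name>/<index>.png (hypothetical layout)."""
    out_root = Path(out_dir)
    count = 0
    for font_path, word, index in plan_samples(words, font_paths, images_per_font):
        font_dir = out_root / Path(font_path).stem
        font_dir.mkdir(parents=True, exist_ok=True)
        render_word(word, font_path).save(font_dir / f"{index:05d}.png")
        count += 1
    return count
```

With 10 font files, `generate` would write 10 x 5,000 = 50,000 images, matching the per-language count above. Bangla and Hindi generally need complex text shaping, which in Pillow depends on the optional Raqm layout engine; whether the original script relies on it is not stated in the description.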

Files

Categories

Computer Vision, Image Processing, Image Classification, Pattern Recognition

Licence