Synthetic datasets of adversarial images
Published: 9 February 2021 | Version 1 | DOI: 10.17632/2g3c836mh3.1
NIDDAL IMAM
Description
We build synthetic datasets of images with embedded adversarial text to improve the robustness of OCR-based spam detectors. The datasets were used in our project (
Steps to reproduce
We selected the most frequent spam words in the SMS Spam dataset, toxic words in the Jigsaw dataset, and offensive words in the OffensEval 2019 dataset using Term Frequency-Inverse Document Frequency (TF-IDF). Then, we used a synthetic data generator ( for embedding the perturbed text into images.
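The word-selection step can be illustrated with a minimal sketch. It assumes whitespace tokenization and ranks terms by their TF-IDF score summed over all documents; the function name `top_tfidf_terms`, the aggregation rule, and the toy messages are illustrative assumptions, not the configuration used to build the datasets.

```python
import math
from collections import Counter, defaultdict


def top_tfidf_terms(documents, k=5):
    """Return the k terms with the highest TF-IDF score summed over documents.

    Assumption: plain lowercase whitespace tokenization and an unsmoothed
    IDF of log(N / df). The original pipeline may differ.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Sum TF * IDF for each term across all documents.
    scores = defaultdict(float)
    for tokens in tokenized:
        if not tokens:
            continue
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy messages standing in for a labeled spam corpus (hypothetical data).
    messages = [
        "win free prize now",
        "free entry win cash",
        "call now to claim free prize",
    ]
    print(top_tfidf_terms(messages, k=3))
```

With this unsmoothed IDF, a term that occurs in every document scores zero, so very common words drop to the bottom of the ranking.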
University of York
Optical Character Recognition