Synthetic datasets of adversarial images

Name: Synthetic datasets of adversarial images
Creator: NIDDAL IMAM
Published: 2021-02-09T14:23:08.657Z
Keywords: Optical Character Recognition

IMAM, NIDDAL

doi:10.17632/2g3c836mh3.1

Published: 9 February 2021| Version 1 | DOI: 10.17632/2g3c836mh3.1

Contributor:

NIDDAL IMAM

Description

We built synthetic datasets of images with embedded adversarial text to improve the robustness of OCR-based spam detectors. The datasets were used in our project (https://github.com/niddal-imam/Post-OCR-Correction).

Steps to reproduce

We selected the most frequent spam words in the SMS Spam dataset, toxic words in the Jigsaw dataset, and offensive words in the OffensEval 2019 dataset using Term Frequency-Inverse Document Frequency (TF-IDF). We then used a synthetic data generator (https://github.com/Belval/TextRecognitionDataGenerator) to embed the perturbed text into images.
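
The word-selection step can be illustrated with a minimal sketch. It assumes each message is treated as one document and that terms are ranked by their TF-IDF score summed over all documents. The function name top_tfidf_terms, the tokenizer, and the toy messages are hypothetical, and the exact TF-IDF variant used to build these datasets is not stated, so this is only one plausible reading.

```python
import math
import re
from collections import Counter


def top_tfidf_terms(docs, k=10):
    """Return the k terms with the highest summed TF-IDF score.

    Assumption: plain TF (count / document length) times IDF log(N / df),
    summed over documents. Other weightings would give other rankings.
    """
    # Lowercase and split each document into simple word tokens.
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue  # skip empty documents to avoid dividing by zero
        tf = Counter(tokens)
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            scores[term] += (count / len(tokens)) * idf

    return [term for term, _ in scores.most_common(k)]


# Toy messages standing in for a labelled spam corpus (hypothetical data).
messages = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "call now to claim",
]
print(top_tfidf_terms(messages, k=5))
```

The selected words could then be written to a text file, one per line, and passed to the image generator after the perturbation step.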

Institutions

University of York
