Nigerian Language Dataset (Wa-Zo-Bia)

Published: 5 September 2022 | Version 2 | DOI: 10.17632/jccjsk6pd3.2


This dataset contains the complete alphabets of the most common Nigerian languages and can be used for character recognition. The handwriting was recorded physically; some of the images have been binarized, while others have not. The handwriting of 50 students was captured in both uppercase and lowercase for each of the languages.

The dataset file: This file contains the raw images of the dataset, which is why it is the largest file.

The binary file: This contains the raw data converted into binary format with a threshold of 210, which is why it is the smallest file.

The sorted file: This file contains the sorted images, i.e., a folder was created for all the 'A' letters, and so on through 'Z'. That is why it is different from the binary file.

The resized file: This contains all the images that have been resized to a specific dimension.

All you have to do is download the file you choose to use and unzip it. Because several contributors worked on the dataset, the files and images vary. Have fun making use of it ;-)
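The binarization mentioned for the binary file can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: it assumes 8-bit grayscale pixels given as rows of integers, and it assumes that pixels brighter than the threshold become white and all others black, since the description states only the threshold value (210). The function name `binarize` is hypothetical.

```python
# Assumption: the image is 8-bit grayscale, given as rows of integers 0-255.
# Assumption: pixels brighter than the threshold become white (255) and all
# others become black (0); the description only states the threshold value.
THRESHOLD = 210  # value given in the dataset description


def binarize(gray_rows, threshold=THRESHOLD):
    """Return a new image in which every pixel is either 0 or 255."""
    return [[255 if pixel > threshold else 0 for pixel in row]
            for row in gray_rows]


# Tiny 2 x 3 example: light paper with a few darker pen strokes.
sample = [
    [240, 35, 211],
    [210, 120, 255],
]
print(binarize(sample))  # [[255, 0, 255], [0, 0, 255]]
```

With a threshold as high as 210, only very light pixels stay white, so faint pen strokes are likely to be kept as black.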


Steps to reproduce

An 8 x 9 table of equal squares was created, and one letter was written in each box. In total, the handwriting of 50 students was recorded. After that, I cropped each square using CorelDRAW to speed up the process. The Snipping Tool on your computer can be used instead if CorelDRAW cannot be installed. Labelling each image was crucial to the next step: I wrote an algorithm that sorts the images by label and groups them together. After the grouping, I wrote another algorithm that binarizes all the images in each folder.
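The sorting step described above might look like the sketch below. It is not the author's algorithm: the file naming scheme (`<label>_<anything>.png`, for example `A_student07.png`), the `.png` extension, and the function names `label_of` and `sort_by_label` are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def label_of(image_path):
    # Hypothetical naming scheme: "<label>_<anything>.png",
    # so the label is the part of the file name before the first underscore.
    return image_path.stem.split("_")[0]


def sort_by_label(source_dir, target_dir):
    """Copy each image into target_dir/<label>/ and return counts per label."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    counts = {}
    for image_path in sorted(source_dir.glob("*.png")):
        label = label_of(image_path)
        label_dir = target_dir / label
        label_dir.mkdir(parents=True, exist_ok=True)  # one folder per label
        shutil.copy2(image_path, label_dir / image_path.name)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling `sort_by_label("cropped", "sorted")` (both folder names are placeholders) would leave the originals untouched and build one folder per label. On a case-insensitive file system, folders named 'A' and 'a' would collide, so uppercase and lowercase letters may need distinct folder names.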


Kwara State University


Optical Character Recognition