Dataset for pest classification in Mango farms from Indonesia

Published: 27-02-2020 | Version 1 | DOI: 10.17632/94jf97jzc8.1
Kusrini Kusrini,
Suputa Suputa,
Arief Setyanto,
I Made Artha Agastya,
Herlambang Priantoro,
Krishna Chandramouli,
Ebroul Izquierdo


The infestation of pests affecting Mango cultivation in Indonesia has an economic impact in the region. Following recent developments in machine learning, the application of deep-learning models to multi-class pest classification requires a large collection of image samples on which the algorithms can be trained. To address this requirement, the paper presents a detailed outline of the dataset collected from Mango farms in Indonesia. The data consist of images captured on farms affected by 15 categories of pests, which are identifiable through the structural and visual deformities exhibited in the Mango leaves. The images were collected with low-cost sensing equipment of the kind commonly used by farmers to capture images on the farm. The collected data are subjected to two processes: data augmentation and training of the classification model.
The dataset consists of 510 images, collected over a period of 6 months, covering 15 categories of pests that affect Mango leaves together with the original appearance of unaffected Mango leaves, giving 16 classes in total. For the purpose of training the deep-learning neural network, the images are subjected to data augmentation to expand the dataset and to emulate closely the large-scale data collection process carried out by farmers. The data augmentation process yields a total of 62,047 image samples, which are used to train the network. The multi-class classification framework presented in the paper builds on the VGG-16 feature extractor and replaces the last three layers of the network with fully connected layers that produce 16 output classes. The dataset includes annotations for both the original images captured in the field and the augmented image samples. Both the original and the augmented data have been split into training, validation and testing sets.
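
The classification framework described above can be sketched in Keras as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the description only says that the last three layers of VGG-16 are replaced by fully connected layers with 16 outputs, so the hidden layer width (`hidden_units`), the 224x224 input size, the frozen convolutional base, and the optimizer are all hypothetical choices made here for illustration.

```python
NUM_CLASSES = 16             # 15 pest categories plus unaffected leaves, per the description
INPUT_SHAPE = (224, 224, 3)  # assumption: the standard VGG-16 input size


def build_classifier(num_classes=NUM_CLASSES, hidden_units=256, weights="imagenet"):
    """Build a VGG-16 feature extractor with a new fully connected head.

    hidden_units is a hypothetical value; the description does not give layer sizes.
    """
    # Imported lazily so the constants above can be reused without TensorFlow.
    import tensorflow as tf

    # include_top=False drops the three fully connected layers of the original VGG-16.
    base = tf.keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=INPUT_SHAPE
    )
    base.trainable = False  # assumption: the convolutional features are kept frozen

    inputs = tf.keras.Input(shape=INPUT_SHAPE)
    x = base(inputs, training=False)
    x = tf.keras.layers.Flatten()(x)
    # Three new fully connected layers stand in for the three removed ones.
    x = tf.keras.layers.Dense(hidden_units, activation="relu")(x)
    x = tf.keras.layers.Dense(hidden_units, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
    )
    return model
```

Calling `build_classifier()` downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture with random initialisation, which is enough to check the output shape.
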
The overall dataset is divided into three parts: version 0, version 1 and version 2. Version 0 consists of the original dataset, with 310 images for training, 103 images for validation and 97 images for testing. Version 1 includes 46,500 image samples for training, produced by the data augmentation process, together with the 103 original images for validation and 97 images for testing. Finally, version 2 uses 47,500 images for training, 15,450 images for validation and 97 images for testing. All three versions provide the images in JPEG format. The visual appearance of the pests captured in the dataset provides a testbed for benchmarking the performance of deep-learning networks trained to detect specific categories of pests. In addition, the dataset provides an opportunity to evaluate the impact of data augmentation techniques on the original dataset.
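
The split sizes of the three versions can be tabulated with a few lines of Python. The sketch below only restates the counts given in the description; the `Split` class and the `VERSIONS` name are hypothetical helpers introduced here. The version 0 counts sum to the 510 original images. The version 2 counts as stated sum to 63,047, whereas the description quotes 62,047 augmented samples; the sketch uses the figures exactly as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Number of images in each subset of one dataset version."""
    train: int
    validation: int
    test: int

    @property
    def total(self) -> int:
        return self.train + self.validation + self.test


# Counts copied from the dataset description.
VERSIONS = {
    "version 0": Split(train=310, validation=103, test=97),
    "version 1": Split(train=46_500, validation=103, test=97),
    "version 2": Split(train=47_500, validation=15_450, test=97),
}

for name, split in VERSIONS.items():
    print(f"{name}: train={split.train}, validation={split.validation}, "
          f"test={split.test}, total={split.total}")
```
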