A Balanced Multi-Class Image Collection for Deep Learning-Based Horticulture Disease Diagnosis
Description
This dataset contains 16,000 high-resolution leaf images of two economically important horticultural crops, Lychee (Litchi chinensis) and Jackfruit (Artocarpus heterophyllus), collected from multiple agricultural hubs in Bangladesh between November 5 and 27, 2025. Images were captured under diverse field conditions with Poco F5, OnePlus Nord CE 2, and Google Pixel 7 smartphones, so the collection covers a range of environments and camera sensors.
The dataset is distributed across eight diagnostic categories, totaling 8,531 original field-collected images:
- Jackfruit Healthy (1,119 original images)
- Jackfruit Leaf Senescence (1,372 original images)
- Jackfruit Leaf Spot (965 original images)
- Jackfruit Pest Damage (987 original images)
- Lychee Healthy (1,005 original images)
- Lychee Leaf Blight (1,041 original images)
- Lychee Pest Damage (1,030 original images)
- Lychee Erinose Mite (1,012 original images)
A zero-leakage split was applied: 300 original images per class were reserved for validation and another 300 per class for testing, and only the remaining originals were augmented, expanding each training class to exactly 1,400 samples. The final dataset comprises 11,200 training, 2,400 validation, and 2,400 testing images across the eight categories, for 16,000 images in total. All images were standardized to 560 × 420 pixels and organized into train/val/test splits with class-wise subdirectories. Ground truth labels were validated by expert agronomists from Daffodil International University and Sylhet Agricultural University. The dataset is intended to support research in automated plant disease detection, deep learning-based image classification, and precision agriculture for tropical horticultural crops.
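The split arithmetic in the description can be checked with a short script. The sketch below is illustrative only: the dictionary keys reuse the category names from the description (the actual subdirectory names may differ), the helper names `split_plan` and `count_images` are hypothetical, the set of image file extensions is an assumption, and it assumes the training originals are kept alongside their augmented copies.

```python
from pathlib import Path

# Original image counts per class, as listed in the description.
ORIGINALS = {
    "Jackfruit Healthy": 1119,
    "Jackfruit Leaf Senescence": 1372,
    "Jackfruit Leaf Spot": 965,
    "Jackfruit Pest Damage": 987,
    "Lychee Healthy": 1005,
    "Lychee Leaf Blight": 1041,
    "Lychee Pest Damage": 1030,
    "Lychee Erinose Mite": 1012,
}
VAL_PER_CLASS = 300    # originals held out for validation
TEST_PER_CLASS = 300   # originals held out for testing
TRAIN_TARGET = 1400    # training images per class after augmentation

# Assumption: the description does not state the image file format.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def split_plan(originals):
    """Per class: originals left for training and the gap up to TRAIN_TARGET."""
    plan = {}
    for name, n in originals.items():
        train_originals = n - VAL_PER_CLASS - TEST_PER_CLASS
        plan[name] = {
            "train_originals": train_originals,
            "shortfall_to_target": TRAIN_TARGET - train_originals,
        }
    return plan


def count_images(root):
    """Count image files per split and class directory under an extracted copy."""
    counts = {}
    for split in ("train", "val", "test"):
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            continue  # skip splits that are not present
        counts[split] = {
            class_dir.name: sum(
                1 for f in class_dir.iterdir()
                if f.suffix.lower() in IMAGE_SUFFIXES
            )
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts


plan = split_plan(ORIGINALS)
n_classes = len(ORIGINALS)
totals = {
    "originals": sum(ORIGINALS.values()),
    "train": n_classes * TRAIN_TARGET,
    "val": n_classes * VAL_PER_CLASS,
    "test": n_classes * TEST_PER_CLASS,
}
totals["all"] = totals["train"] + totals["val"] + totals["test"]
```

Under these assumptions the smallest class, Jackfruit Leaf Spot, keeps 365 originals for training after the 600 held-out images are removed, so it relies most heavily on augmentation. Running `count_images` on the extracted dataset should report 1,400, 300, and 300 images per class for train, val, and test respectively.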
Steps to reproduce
1. Download and extract the dataset. The directory structure contains three splits, train/, val/, and test/, each with eight class-specific subdirectories named after the disease categories.
2. Apply the preprocessing pipeline in this order: Bilateral Filter → Gamma Correction (γ=1.2) → CLAHE (clipLimit=2.0) → Normalization to [0,1]. Resize images to 224×224 before feeding them into any model.
3. The dataset follows a strict zero-leakage split: 300 original images per class for validation, 300 original images per class for testing, and 1,400 images per class for training after augmentation. Augmentation includes rotation (±20°), horizontal flipping, shifting (0.1), zoom (10–20%), and brightness adjustment [0.8–1.2].
4. Ground truth labels correspond directly to subdirectory names. Set the random seed to 42 for full reproducibility.
Institutions
- Daffodil International University, Dhaka Division, Dhaka