Coffee leaf dataset by phytosanitary class

Published: 7 January 2026 | Version 1 | DOI: 10.17632/mfpxg4y65r.1
Contributor:
John Clark Santa Maria Pinedo

Description

This dataset consists of 1,500 digital images of coffee (Coffea arabica) leaves collected from plantations in the city of Saposoa. The images cover three distinct phytosanitary conditions: healthy leaves, leaves affected by coffee leaf rust (Hemileia vastatrix), and leaves affected by coffee leaf spot (Mycena citricolor). Each class contains 500 images, so the three categories are evenly balanced.

Steps to reproduce

The data were collected from healthy and diseased coffee (Coffea arabica) leaves obtained from plantations in the city of Saposoa. The study population consisted of leaves suitable for digital image acquisition, showing either visible symptoms of foliar disease or no symptoms at all.

Images were captured under a uniform protocol that controlled camera-to-leaf distance, used homogeneous natural lighting, and required proper focus, to ensure visual consistency and adequate technical quality across the dataset. Plant health specialists performed an initial diagnosis of each image and assigned the leaf to one of three predefined categories: healthy leaf, leaf affected by coffee leaf spot (Mycena citricolor), or leaf affected by coffee rust (Hemileia vastatrix).

The dataset was built using non-probabilistic convenience sampling, stratified by phytosanitary condition. Within each stratum, priority was given to images showing visual diversity in the shape, size, coloration, texture, and severity of lesions.

Explicit inclusion and exclusion criteria were established to ensure data quality and reproducibility. An image was included only if it originated from a coffee plantation in Saposoa, was captured under the defined protocol, had a specialist-confirmed classification, and was of sufficient technical quality (adequate resolution, proper focus, and no excessive shadows). An image was excluded if it was blurred, incomplete, overexposed, or underexposed, if the leaf was occluded by external objects, if it lacked expert diagnostic confirmation, or if it showed an intermediate state without clear disease differentiation.

Finally, a total sample of 1,500 images was selected and distributed evenly across the three strata: 500 healthy leaves, 500 leaves with coffee leaf spot, and 500 leaves with coffee rust. The images were manually reviewed to verify label consistency and organized into class-specific folders, so that other researchers can reuse the dataset in computer vision and machine learning studies on the automatic detection of coffee leaf diseases.
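Because the images are organized into class-specific folders, a short script can confirm the stated balance of 500 images per class and produce a split that keeps the classes balanced. The following is a minimal sketch, not part of the published dataset: the folder names, the accepted file extensions, and the function names are all assumptions made for illustration, since the record does not specify the archive layout.

```python
from pathlib import Path
import random

# Assumption: images are stored as JPEG or PNG; the record does not state the format.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_images_by_class(root):
    """Map each class folder name under `root` to its sorted list of image files."""
    root = Path(root)
    images = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        images[class_dir.name] = sorted(
            f for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return images


def check_balance(images, expected_per_class=500):
    """Return {class_name: count} for classes whose count differs from the expected one."""
    return {
        name: len(files)
        for name, files in images.items()
        if len(files) != expected_per_class
    }


def stratified_split(images, train_fraction=0.8, seed=0):
    """Split each class separately so both subsets keep the same class proportions."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for name, files in images.items():
        shuffled = list(files)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train.extend((path, name) for path in shuffled[:cut])
        test.extend((path, name) for path in shuffled[cut:])
    return train, test
```

Calling `list_images_by_class` on the extracted dataset directory (a hypothetical path such as `coffee_leaves/`) and passing the result to `check_balance` should return an empty dictionary if every class folder holds exactly 500 images. The 80/20 split ratio is an arbitrary choice for the example; the dataset description does not prescribe one.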

Institutions

  • Universidad Nacional Mayor de San Marcos

Categories

Agronomy, Artificial Intelligence, Computer Vision

Licence