SEED-ML: A Multi-Parametric Clinical Dataset on Male Infertility for Predictive Modeling and AI Research.

Published: 8 January 2026 | Version 2 | DOI: 10.17632/sc8rsz2vd7.2

Description

Authors: N. Sánchez-Gómez [1] (nicolassg@us.es), J.A. García-García [*, 1] (juliangg@us.es), J. Navarro-Pando [2,3,4,5] (jose.navarro@inebir.com), M.J. Escalona-Cuaresma [1] (mjescalona@us.es).

Affiliations:
[1] ES3 Group (Engineering and Science for Software Systems group), University of Seville, Avenida Reina Mercedes, s/n, 41012 Seville, Spain.
[2] Cátedra de Reproducción y Genética Humana del Instituto para el Estudio de la Biología de la Reproducción Humana (INEBIR), Seville, Spain.
[3] Universidad Europea del Atlántico (UNEATLANTICO), Santander, Spain.
[4] Fundación Universitaria Iberoamericana (FUNIBER), Seville, Spain.
[5] San Juan de Dios Hospital, Seville, Spain.

Abstract: SEED-ML (Semen Examination and Evaluation Dataset for Machine Learning) is an openly available, multi-parametric clinical dataset designed to support research in male infertility diagnostics and prediction. The dataset comprises records from 10,124 patients, including detailed semen analysis parameters (pre- and post-treatment), morphological classifications, and clinical alterations. Infertility diagnosis is categorized into nine clinically relevant classes, ranging from normal fertility to complex multi-factor conditions such as oligoasthenoteratozoospermia. All data were anonymized and curated following strict ethical and privacy guidelines to ensure compliance with applicable medical data protection regulations. The dataset reflects real-world clinical distributions, with class frequencies ranging from 62.7% (Normozoospermia) to 0.16% (Azoospermia), which makes it a high-fidelity benchmark for testing machine learning algorithms under significant class imbalance. SEED-ML is a resource for developing and benchmarking machine learning models, enabling research in predictive analytics, decision support systems, and computational andrology.
This dataset aims to facilitate interdisciplinary collaboration between clinicians, data scientists, and artificial intelligence (AI) researchers, accelerating the development of data-driven solutions in reproductive medicine. The dataset is publicly available on Mendeley Data under a CC BY 4.0 license.
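
To give a sense of the class imbalance described in the abstract, the following sketch turns the two reported class shares into approximate record counts and an imbalance ratio. It uses only the figures quoted above (10,124 patients, nine classes, 62.7% and 0.16%); the counts are approximate because the published percentages are rounded, and the shares of the other seven classes are not given here. The inverse-frequency weight n / (k * n_c) is a common heuristic for imbalanced classification, shown purely as an illustration and not as a method used by the dataset authors. All function and variable names are hypothetical.

```python
# Figures quoted in the abstract; everything else below is illustrative.
N_PATIENTS = 10_124
N_CLASSES = 9
SHARE = {"Normozoospermia": 0.627, "Azoospermia": 0.0016}


def approx_count(share: float, total: int = N_PATIENTS) -> int:
    """Approximate number of records in a class, given its share of the dataset."""
    return round(share * total)


def balanced_weight(share: float, n_classes: int = N_CLASSES) -> float:
    """Inverse-frequency weight n / (k * n_c), expressed via the class share n_c / n."""
    return 1.0 / (n_classes * share)


counts = {name: approx_count(s) for name, s in SHARE.items()}
weights = {name: balanced_weight(s) for name, s in SHARE.items()}

# Ratio between the most and least frequent classes.
imbalance_ratio = SHARE["Normozoospermia"] / SHARE["Azoospermia"]

for name in SHARE:
    print(f"{name}: ~{counts[name]} records, weight {weights[name]:.2f}")
print(f"Imbalance ratio: ~{imbalance_ratio:.0f}:1")
```

With roughly 16 Azoospermia records against several thousand Normozoospermia records, overall accuracy alone says little about minority-class performance, so stratified splits and per-class metrics are a reasonable default when benchmarking on data with this kind of distribution.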

Institutions

Universidad de Sevilla, Bionac Laboratorio SL

Categories

Analysis of Covariance, Male Reproductive Health, Meta Dataset, Applied Machine Learning

Licence

CC BY 4.0