Global Synthetic Crop Yield, Meteorological, and Climate Teleconnection Dataset for Machine Learning Benchmarking

Published: 5 December 2025 | Version 1 | DOI: 10.17632/y7hkz2zfcc.1
Contributor:
Raza Hasan

Description

This dataset contains high-fidelity synthetic data representing global agricultural production, local meteorological conditions, and large-scale climate teleconnection indices for the period 1990–2023. It was generated to benchmark HESE-GNN-CP, a machine learning architecture that uses Graph Neural Networks to capture teleconnections (long-distance climate links). Unlike standard agricultural datasets, it explicitly models the physical correlations between global climate drivers (the El Niño-Southern Oscillation, ENSO, and the North Atlantic Oscillation, NAO) and regional weather patterns, which makes it well suited to testing Spatial-Temporal Graph Neural Networks (ST-GNNs).

Files

Steps to reproduce

The data was generated using a physics-informed Python simulation script (GlobalSyntheticDataGenerator). The generation process followed these steps:

1) Global Signal Generation: A sinusoidal function with added Gaussian noise was used to simulate the periodic nature of ENSO (approx. 5-7 year cycles) and NAO.
2) Teleconnection Coupling: Each of the 15 regions was assigned a "Coupling Coefficient" based on real-world atmospheric physics (e.g., Peru = +0.8 coupling with ENSO; Indonesia = -0.8 coupling with ENSO).
3) Local Weather Simulation: Local temperature and rainfall were generated by modulating each region's baseline climate (determined by Latitude/Longitude centroids) with the weighted Global Signal.
4) Yield Calculation: Final yield was computed using a non-linear biological stress function: Yield = Base_Yield * Temp_Stress * Water_Stress * Tech_Trend + Noise.
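The four steps above can be illustrated with a minimal Python sketch. It is not the GlobalSyntheticDataGenerator script itself, which is not reproduced here. Only the sinusoid-plus-noise signal, the +0.8 and -0.8 ENSO couplings for Peru and Indonesia, and the form of the yield formula come from the description. The function names, the 6-year period, the noise scales, the baseline temperature, rainfall, and yield values, the bell-shaped stress curve, and the linear technology trend are all assumptions chosen for illustration, and NAO is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
years = np.arange(1990, 2024)    # 1990-2023 inclusive

def global_signal(period_years, noise_sd):
    """Step 1: sinusoid plus Gaussian noise (period and noise scale assumed)."""
    phase = 2.0 * np.pi * (years - years[0]) / period_years
    return np.sin(phase) + rng.normal(0.0, noise_sd, years.size)

def stress(value, optimum, tolerance):
    """Bell-shaped factor in (0, 1], equal to 1 at the optimum (assumed form)."""
    return np.exp(-0.5 * ((value - optimum) / tolerance) ** 2)

def simulate_region(signal, coupling, base_temp, base_rain, base_yield):
    """Steps 3 and 4 for a single region."""
    # Step 3: modulate the baseline climate with the coupling-weighted signal.
    # The 1.5 degC and 20% sensitivities are hypothetical.
    temp = base_temp + 1.5 * coupling * signal + rng.normal(0.0, 0.3, years.size)
    rain = base_rain * (1.0 + 0.2 * coupling * signal)

    # Step 4: Yield = Base_Yield * Temp_Stress * Water_Stress * Tech_Trend + Noise
    temp_stress = stress(temp, base_temp, 5.0)
    water_stress = stress(rain, base_rain, 0.5 * base_rain)
    tech_trend = 1.0 + 0.01 * (years - years[0])   # assumed 1% gain per year
    noise = rng.normal(0.0, 0.05, years.size)
    crop_yield = base_yield * temp_stress * water_stress * tech_trend + noise
    return temp, rain, crop_yield

enso = global_signal(period_years=6.0, noise_sd=0.2)

# Step 2: coupling coefficients. The two values below are the ones quoted in
# the description; the other 13 regions are not listed there.
coupling = {"Peru": 0.8, "Indonesia": -0.8}

# Hypothetical baselines: (mean temp in degC, annual rain in mm, yield in t/ha).
baselines = {"Peru": (19.0, 1500.0, 4.0), "Indonesia": (26.0, 2700.0, 5.0)}

results = {
    name: simulate_region(enso, coupling[name], *baselines[name])
    for name in coupling
}
```

Because the two regions carry couplings of opposite sign, their simulated temperature anomalies move in opposite directions for the same ENSO phase, which is the kind of long-distance dependency a graph-based model would be expected to recover.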

Categories

Agricultural Science, Machine Learning, Climate Change, Food Security

Licence