A Near-Infrared Spectroscopy Dataset for Chemical Composition Prediction and Origin Identification of Tobacco Leaves

Name: A Near-Infrared Spectroscopy Dataset for Chemical Composition Prediction and Origin Identification of Tobacco Leaves
Creator: hexin chen
Published: 2025-12-08T11:22:50.405Z
Keywords: Near Infrared Spectroscopy, Tobacco

chen, hexin; Guo, Junwei; Wang, Hongbo; Zhao, Le

doi:10.17632/9z7dgdtggk.2


Published: 8 December 2025 | Version 2 | DOI: 10.17632/9z7dgdtggk.2

Contributors:

hexin chen, Junwei Guo, Hongbo Wang, Le Zhao

Description

This database contains two core asset types: Data Files and Model Files.

1. Data Files

The dataset is provided in two separate .xlsx files:

Raw-nir-spectra-data: This file contains the raw near-infrared spectral dataset. It records the spectral information for all 347 tobacco samples and includes metadata such as each sample's unique ID, cultivation year, and country of origin.

13-Chemical-Components-data: This file contains the reference dataset for the chemical constituents. It includes the quantitative analysis results for the 13 key chemical components for all 347 samples, corresponding one-to-one with the spectral data.

2. Model Files

The database provides 99 pre-trained prediction and classification models in .joblib format. All models were built in a Python 3.9 environment and can be loaded and called directly. To facilitate easy identification and use, the model files adhere to the following naming conventions:

A. Quantitative Models (Chemical Prediction)

This naming format is used for the quantitative prediction models of the 13 chemical constituents.

Format: [Chemical_Component]_[Preprocessing_Method]_[Modeling_Method].joblib

Example: TotalSugars_MSC_PLS.joblib represents a PLS model for predicting Total Sugars using MSC preprocessing.

B. Classification Models (Origin Prediction)

This naming format is used for classification models built with different types of input data.

Format (based on spectral data): [Preprocessing_Method]_[Modeling_Method].joblib

Example: SecondDerivative_RF.joblib represents a Random Forest (RF) classification model built using second-derivative spectral data.

Special Note: The file Thirteen_chemical_components-RF.joblib is a special classification model. It does not use spectral data; instead, it is built using the quantitative results of the 13 chemical components directly as its input features.
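The naming conventions above can be turned into a small helper for sorting the model files before loading them. The following is a minimal sketch, not part of the dataset: the function names parse_model_filename and load_model are hypothetical, and it assumes that component and preprocessing names contain no underscores, as in the two documented examples. The special chemical-components model is matched by its exact name because it does not follow either pattern.

```python
from pathlib import Path

# Exact stem of the special model described in the dataset notes.
SPECIAL_MODEL = "Thirteen_chemical_components-RF"


def parse_model_filename(filename):
    """Split a model file name into its documented parts.

    Assumes (hypothetically) that no component or preprocessing
    name contains an underscore.
    """
    name = Path(filename).name
    suffix = ".joblib"
    if not name.endswith(suffix):
        raise ValueError(f"not a .joblib file: {name}")
    stem = name[: -len(suffix)]

    # The special model takes the 13 chemical components as input.
    if stem == SPECIAL_MODEL:
        return {"task": "classification", "input": "chemical_components",
                "component": None, "preprocessing": None, "method": "RF"}

    parts = stem.split("_")
    if len(parts) == 3:  # [Chemical_Component]_[Preprocessing]_[Method]
        component, preprocessing, method = parts
        return {"task": "quantitative", "input": "spectra",
                "component": component, "preprocessing": preprocessing,
                "method": method}
    if len(parts) == 2:  # [Preprocessing]_[Method]
        preprocessing, method = parts
        return {"task": "classification", "input": "spectra",
                "component": None, "preprocessing": preprocessing,
                "method": method}
    raise ValueError(f"unrecognised model file name: {name}")


def load_model(path):
    """Load one model file; requires the third-party joblib package."""
    import joblib  # imported lazily so parsing works without it
    return joblib.load(path)
```

For example, parse_model_filename("TotalSugars_MSC_PLS.joblib") reports a quantitative model for TotalSugars with MSC preprocessing and the PLS method. The description does not say whether the preprocessing step is stored inside each model file, so it is safest to check the loaded object before passing raw spectra to it.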

Files

Steps to reproduce

This study established a high-quality dataset comprising 347 tobacco leaf samples procured between 2022 and 2024 from six countries: the United States, Brazil, Zimbabwe, Argentina, Tanzania, and Zambia.

During data collection, all samples underwent a standardized preparation procedure: they were first dried in an FD240 oven (Binder GmbH, Germany) at 40°C to a moisture content of 6–8%, then ground using a ZM200 grinder (Retsch GmbH, Germany) and passed through a 60-mesh (0.25 mm) sieve.

Spectral data were acquired using an Antaris II Fourier Transform Near-Infrared (FT-NIR) spectrometer (Thermo Nicolet, USA) over a wavenumber range of 4000–10000 cm⁻¹. Each final spectrum, comprising 1557 data points, was the average of 64 scans. Concurrently, the reference values for 13 key chemical constituents (including total alkaloids, sugars, total nitrogen, and various phenolic compounds) were determined for each sample using standard laboratory methods such as High-Performance Liquid Chromatography (HPLC).

For data analysis, six preprocessing techniques were applied: Savitzky-Golay smoothing, Multiplicative Scatter Correction (MSC), Standard Normal Variate (SNV), first derivative, second derivative, and mean centering.

In the quantitative analysis of chemical constituents, Partial Least Squares (PLS) regression was used to establish predictive models for 13 indicators: total alkaloids, reducing sugars, total sugar, total nitrogen, K, Cl, pH, starch, neochlorogenic acid, chlorogenic acid, cryptochlorogenic acid, scopoletin, and rutin. For geographical origin traceability, a classification model was constructed using the Random Forest (RF) algorithm. All models were developed by partitioning the data into a calibration set (70%) and a validation set (30%).
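Two of the listed preprocessing techniques, SNV and MSC, are simple enough to illustrate without any dependencies. The sketch below uses the textbook definitions and is not the authors' implementation: the function names are hypothetical, the sample standard deviation is an assumed choice, and the mean spectrum is assumed as the default MSC reference, none of which the description specifies. Each spectrum is a plain list of absorbance values.

```python
from statistics import fmean, stdev


def snv(spectrum):
    """Standard Normal Variate: centre and scale one spectrum.

    Uses the sample standard deviation (an assumption; the dataset
    description does not state which form was used).
    """
    mu = fmean(spectrum)
    sd = stdev(spectrum)
    return [(x - mu) / sd for x in spectrum]


def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum.

    Each spectrum s is fitted as s = intercept + slope * reference by
    least squares, then corrected to (s - intercept) / slope. The mean
    spectrum is used when no reference is given (an assumption).
    """
    n_points = len(spectra[0])
    if reference is None:
        reference = [fmean([s[i] for s in spectra]) for i in range(n_points)]
    ref_mean = fmean(reference)
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)

    corrected = []
    for s in spectra:
        s_mean = fmean(s)
        cov = sum((r - ref_mean) * (x - s_mean) for r, x in zip(reference, s))
        slope = cov / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(x - intercept) / slope for x in s])
    return corrected
```

A spectrum that is an exact scaled and shifted copy of the reference is mapped back onto the reference by msc, which is the behaviour the correction is designed to produce. For the full 347 by 1557 data matrix, an array library would be the more practical choice; the list-based version here is only meant to make the arithmetic explicit.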

Institutions

Zhengzhou Tobacco Research Institute
