Refined NHANES 2007-2012 spirometry dataset for comparing segmented (piecewise) linear models with GAMLSS
Description
Background: Current guidelines recommend using Generalized Additive Models for Location, Scale, and Shape (GAMLSS) to create reference equations for lung function. However, these models are complex and require additional spline tables for use. This study aimed to demonstrate that simpler methods, namely simple linear regression for the ratio of forced expiratory volume in 1 second to forced vital capacity (FEV1/FVC ratio) and segmented linear regression (SLR) for forced expiratory volume in 1 second (FEV1) and forced vital capacity (FVC), can achieve prediction accuracy similar to that of GAMLSS in pulmonary function diagnostics.
Data: This study used secondary data from the National Health and Nutrition Examination Survey (NHANES) conducted between 2007 and 2012. The dataset includes spirometry measurements from 16,596 participants (from an initial pool of 31,451) aged 6-80 years, representing diverse racial and ethnic backgrounds. Participants' weights ranged from 16.4 to 218.2 kg, heights from 104.6 to 203.8 cm, and BMI from 12.5 to 84.9 kg/m². The refined dataset includes only participants who met the minimum technical quality standards for spirometry maneuvers outlined by the American Thoracic Society (2005), specifically those performing A and B quality maneuvers. The data file also provides a secondary analysis of the z-scores calculated from the developed GAMLSS and piecewise regression models, including whether each participant was classified as having a restrictive respiratory pattern or an airway obstruction based on the calculated lower limit of normal.
Methods: Reference equations for FEV1, FVC, and the FEV1/FVC ratio were developed by G.Z. using different modeling techniques: simple linear regression for the FEV1/FVC ratio and segmented linear regression (SLR) for FEV1 and FVC. Initially, all races/ethnicities were grouped together because the primary hypothesis was to compare GAMLSS with SLR models, not to compare different biological ancestries.
K-fold cross-validation was applied to calculate the 95% confidence interval (CI) for the root-mean-square error (RMSE), which served as an indicator of prediction accuracy. Additionally, the agreement between both modeling approaches in classifying spirometric patterns [normal, airflow obstruction, restrictive, mixed disorder, or preserved ratio impaired spirometry (PRISm)] was assessed using an unweighted kappa statistic.
Results: The RMSE values and correlation coefficients for FEV1, FVC, and the FEV1/FVC ratio were similar between the two modeling techniques. The agreement between the models in classifying spirometric patterns was also high, with kappa values ranging from 0.78 to 0.80 (95% CI).
Conclusions: Simple linear regression (FEV1/FVC ratio) and segmented linear regression (FEV1, FVC) provide prediction accuracies comparable to those of GAMLSS models. These simpler methods are more straightforward and accessible, making them a practical alternative for broader use in pulmonary function diagnostics.
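The unweighted kappa statistic mentioned above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' R code; the function name `cohen_kappa` and the example labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of participants given the same label by both models.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label counts, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both models used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical classifications from two models for six participants.
gamlss_labels = ["normal", "normal", "obstruction", "restrictive", "normal", "PRISm"]
slr_labels = ["normal", "normal", "obstruction", "normal", "normal", "PRISm"]
print(round(cohen_kappa(gamlss_labels, slr_labels), 3))
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which matters here because most participants fall in the "normal" category under either model.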
Steps to reproduce
Data Extraction: Only NHANES participants meeting "A" and "B" grade acceptability standards per NIOSH were included. The criteria for spirometry values were as follows. Grade "A" required three acceptable curves, with the largest and second-largest values within 100 mL and no more than 50 mL difference from the last maneuver. Grade "B" required three acceptable curves, with the largest and second-largest values within 150 mL, meeting the minimum American Thoracic Society 2005 criteria.
Software Used: Analyses were conducted using R (version 4.3.2) and RStudio, with the packages "segmented" (2.0-0), "gamlss" (5.4-20), and "caret" (6.0-94). Statistical significance was set at p < 0.05.
Segmented Linear Regression Model: Reference equations for FEV1, FVC, and FEV1/FVC ratios were developed using the "segmented" package in R, incorporating an age-squared variable to estimate breakpoints across ages 5-80. Breakpoints and their 95% confidence intervals were determined through visual analysis and iterative procedures. LASSO regression was used to select key predictors (age, height, weight, interactions, and squared terms), guided by the Bayesian Information Criterion (BIC). Models were evaluated for multicollinearity, outliers, and violations of model assumptions.
Generalized Additive Models for Location, Scale, and Shape (GAMLSS): GAMLSS models, implemented in R, used the Lambda-Mu-Sigma (LMS) method to adjust for skewness. Performance was assessed using Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC), Q-Q plots, and worm plots.
Between-Individual Variability: Variability among individuals by age was assessed as the predicted standard deviation divided by the predicted mean, expressed as a percentage. The predicted mean was based on median height by age from the white U.S. population, while the predicted standard deviation was derived from the segmented regression models and from the sigma values of the GAMLSS models.
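To make the LMS method and the variability measure more concrete, the sketch below shows the standard LMS z-score transformation, its inverse (used to obtain a lower limit of normal), and the standard deviation to mean ratio as a percentage. It is a Python illustration, not the authors' R code. The function names, the example parameter values, and the choice of z = -1.645 (the 5th percentile) as the lower limit of normal are assumptions, since the text does not state the cut-off.

```python
import math

Z_LLN = -1.645  # assumption: lower limit of normal taken as the 5th percentile


def lms_z_score(measured, L, M, S):
    """z-score of a measurement given LMS parameters (skewness L, median M, CV S)."""
    if measured <= 0 or M <= 0 or S <= 0:
        raise ValueError("measured, M and S must be positive")
    if abs(L) < 1e-12:
        # Limiting case L -> 0 is a log-normal distribution.
        return math.log(measured / M) / S
    return ((measured / M) ** L - 1.0) / (L * S)


def lms_value_at_z(z, L, M, S):
    """Inverse transformation: the measurement corresponding to a given z-score."""
    if abs(L) < 1e-12:
        return M * math.exp(S * z)
    base = 1.0 + L * S * z
    if base <= 0:
        raise ValueError("z is outside the range supported by these LMS values")
    return M * base ** (1.0 / L)


def variability_pct(predicted_sd, predicted_mean):
    """Between-individual variability: predicted SD over predicted mean, in percent."""
    return 100.0 * predicted_sd / predicted_mean


# Hypothetical LMS values for FEV1 in litres; not taken from the fitted models.
L_hyp, M_hyp, S_hyp = 1.0, 4.0, 0.12
lln = lms_value_at_z(Z_LLN, L_hyp, M_hyp, S_hyp)
print(lln, lms_z_score(3.5, L_hyp, M_hyp, S_hyp), variability_pct(0.5, 4.0))
```

When L equals 1 the distribution is symmetric and the z-score reduces to the usual (measured - M) / (M × S), which is why S is read as a coefficient of variation.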
Prediction Accuracy Comparison: K-fold cross-validation (10 folds) was used to evaluate predictive accuracy for FEV1, FVC, and FEV1/FVC ratios with both models. Each fold served as a validation set once, with the others forming the training set. Performance was measured using root-mean-square error (RMSE) and correlation, reported as mean scores with a 95% confidence interval (CI).
Identification of Pathophysiology from Spirometry: Both GAMLSS and segmented regression identified four physiologic disorders based on z-scores below the lower limit of normal (LLN): (A) airflow obstruction, (B) restrictive spirometry pattern, (C) mixed obstructive and restrictive disorder, and (D) preserved ratio impaired spirometry (FEV1/FVC > LLN and FEV1 < LLN). If none were present, spirometry was classified as normal. Agreement between models was assessed using the kappa statistic.
Additional Analyses: Correlations, paired t-tests, and McNemar's test with continuity correction were used to compare models, with multiple comparison corrections applied via the Benjamini-Hochberg procedure.
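The pattern classification described above can be sketched as a small decision rule on the three z-scores. This is a Python illustration under stated assumptions, not the authors' R code: the text gives an explicit rule only for PRISm, so the rules for the other patterns, the order in which they are checked, the function name `classify_spirometry`, and the LLN cut-off of z = -1.645 are all assumptions made to keep the categories mutually exclusive.

```python
Z_LLN = -1.645  # assumption: lower limit of normal taken as the 5th percentile


def classify_spirometry(z_fev1, z_fvc, z_ratio, z_lln=Z_LLN):
    """Assign one spirometric pattern from z-scores for FEV1, FVC and FEV1/FVC."""
    ratio_low = z_ratio < z_lln
    fvc_low = z_fvc < z_lln
    fev1_low = z_fev1 < z_lln
    if ratio_low and fvc_low:
        return "mixed"                # assumed rule: low ratio with low FVC
    if ratio_low:
        return "airflow obstruction"  # assumed rule: low ratio, FVC preserved
    if fvc_low:
        return "restrictive"          # assumed rule: preserved ratio, low FVC
    if fev1_low:
        return "PRISm"                # from the text: preserved ratio, low FEV1
    return "normal"


# Hypothetical z-scores (FEV1, FVC, FEV1/FVC); not taken from the dataset.
examples = [
    (-0.3, 0.1, -0.5),
    (-2.1, -0.4, -2.3),
    (-2.0, -2.2, -0.2),
    (-2.5, -2.0, -1.9),
    (-1.8, -1.2, -0.4),
]
for z in examples:
    print(z, classify_spirometry(*z))
```

Applying the same function to the z-scores from each model gives two label vectors per participant, which is the input an agreement statistic such as kappa needs.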