Modified circular dichroism data set for validation of secondary structure estimate algorithms
Description
This data set, supplied as an Excel file, was created by simulating nine far-UV CD spectra from the Protein Circular Dichroism Data Bank (PCDDB). Each spectrum was fitted as the sum of several (usually four) Gaussian peaks. These noise-free representations of the CD spectra were then modified to create "imperfect data" by shifting the spectral wavelength, intensity and offset, by varying the noise level, or by varying the ratio of the peaks at 192.5 and 290 nm for a camphor-10-sulphonic acid (CSA) reference standard. The extent of each modification was based on literature data. The modified spectra can be used to test the robustness of algorithms that estimate the secondary structure content of proteins, as demonstrated for the BeStSel algorithm in the associated publication. Understanding how robust a method is to imperfect calibration and spectral factors is critical when validating analytical methods under the ICH Q2(R1) guideline, the standard for the pharmaceutical and biopharmaceutical industries.
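As a minimal sketch of the "sum of Gaussians" representation described above, the Python snippet below evaluates a noise-free spectrum from a list of Gaussian bands. It assumes the bands are Gaussian in wavenumber, as stated under Steps to reproduce. The function name `gaussian_sum`, the `(amplitude, centre, width)` parameterisation and all values in `PEAKS` are hypothetical and are not taken from the data set.

```python
import math

def gaussian_sum(wavelength_nm, peaks):
    """Evaluate a sum of Gaussian bands at one wavelength.

    Each peak is (amplitude, centre_nm, width_cm1). The bands are Gaussian
    in wavenumber (cm^-1), so they look slightly asymmetric when plotted in nm.
    """
    nu = 1.0e7 / wavelength_nm  # convert nm to cm^-1
    total = 0.0
    for amplitude, centre_nm, width_cm1 in peaks:
        nu0 = 1.0e7 / centre_nm
        total += amplitude * math.exp(-0.5 * ((nu - nu0) / width_cm1) ** 2)
    return total

# Hypothetical four-band parameter set (illustrative values only)
PEAKS = [
    (60.0, 192.0, 1500.0),
    (-30.0, 208.0, 1200.0),
    (-28.0, 222.0, 1300.0),
    (5.0, 240.0, 1500.0),
]

# Noise-free simulated spectrum on a 0.5 nm grid from 180 to 260 nm
wavelengths = [180.0 + 0.5 * i for i in range(161)]
spectrum = [gaussian_sum(w, PEAKS) for w in wavelengths]
```

In practice the band parameters would come from a least-squares fit to each experimental spectrum; the fitting step itself is not shown here.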
Steps to reproduce
Nine spectra downloaded from the PCDDB were each simulated as the sum of several Gaussians (in wavenumber space), chosen so that the residuals were symmetrical about the wavelength axis (that is, about zero), no visible spectral peaks remained in the residuals, and the correlation between the experimental and simulated data was maximised. The quality of this fit, expressed as the Pearson R², depended on the noise level in the experimental spectrum.

These simulated spectra were then modified in an Excel spreadsheet. Wavelength calibration was changed by offsetting spectra by between -2 and +2 nm in 0.4 nm increments. Wavelength-independent spectral intensity was changed by multiplying the noise-free spectra by factors between 0.5 and 2.0. Spectral noise, taken as the residuals between the experimental and simulated data, was added back to the simulated spectra in increasing amounts, up to twice the level in the experimental data. Simulated spectra were offset on the intensity axis by an amount linked to the intensity of the most intense peak in the spectrum. Wavelength-dependent spectral intensity was changed to simulate variation in the Δε192.5/Δε290 ratio of a CSA standard, using a factor that pivots around the data at 290 nm. This approach is used in CDToolX.

In the associated publication, series of modified spectra were submitted to the online secondary structure estimation (SSE) program BeStSel and the SSE outputs were recovered. This showed how imperfect data affect the SSEs. The data set can be submitted to other SSE algorithms to understand how they respond to imperfections in the data.
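The modifications described above were made in Excel; the Python sketch below only illustrates the arithmetic behind each one. All function names are hypothetical. Two points are assumptions rather than details given in the text: the intensity offset is taken as a fixed fraction of the largest absolute intensity, and the CSA ratio factor is taken to vary linearly with wavelength, equal to 1 at 290 nm and to the chosen ratio change at 192.5 nm. The exact forms used for the data set and in CDToolX may differ.

```python
CSA_PIVOT_NM = 290.0   # wavelength at which the ratio factor is 1
CSA_PEAK_NM = 192.5    # wavelength at which the full ratio change applies

def wavelength_shifts(start=-2.0, stop=2.0, step=0.4):
    """Wavelength offsets in nm: -2.0, -1.6, ..., +2.0."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 1) for i in range(n)]

def shift_wavelength(wavelengths_nm, shift_nm):
    """Mimic a wavelength calibration error by shifting the whole axis."""
    return [w + shift_nm for w in wavelengths_nm]

def scale_intensity(spectrum, factor):
    """Wavelength-independent intensity change (factors of 0.5 to 2.0)."""
    return [v * factor for v in spectrum]

def add_noise(simulated, residuals, ratio):
    """Add the fit residuals back in; ratio=1 restores experimental noise."""
    return [s + ratio * r for s, r in zip(simulated, residuals)]

def offset_intensity(spectrum, fraction):
    """Offset by a fraction of the most intense peak (assumed form)."""
    peak = max(abs(v) for v in spectrum)
    return [v + fraction * peak for v in spectrum]

def csa_ratio_factor(wavelength_nm, ratio_change):
    """Assumed linear factor: 1 at 290 nm, ratio_change at 192.5 nm."""
    span = CSA_PIVOT_NM - CSA_PEAK_NM
    return 1.0 + (ratio_change - 1.0) * (CSA_PIVOT_NM - wavelength_nm) / span

def apply_csa_ratio(wavelengths_nm, spectrum, ratio_change):
    """Wavelength-dependent intensity change pivoting around 290 nm."""
    return [v * csa_ratio_factor(w, ratio_change)
            for w, v in zip(wavelengths_nm, spectrum)]

# Example: a series of wavelength-shifted copies of one toy spectrum
toy_wavelengths = [190.0, 200.0, 210.0, 220.0]
toy_spectrum = [40.0, -5.0, -28.0, -26.0]
shifted_series = [
    (shift, shift_wavelength(toy_wavelengths, shift), toy_spectrum)
    for shift in wavelength_shifts()
]
```

Each entry in `shifted_series` pairs one shift value with the shifted wavelength axis and the unchanged intensities, which is the form in which a series of modified spectra could be passed to an SSE program.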