This database studies the performance inconsistency of biomass HHV models based on proximate analysis. The research null hypothesis is that the rank of a biomass HHV model is consistent across datasets. Fifteen biomass models are trained and tested on four datasets; in each dataset, the rank invariability of these 15 models indicates performance consistency. The database includes the datasets and the source code used to analyze the performance consistency of the biomass HHV models. The datasets are stored as tables in an Excel workbook. The source code implements the biomass HHV machine learning models in MATLAB using object-oriented programming (OOP). The models comprise eight regressions, four supervised learning methods, and three neural networks.

The Excel workbook "BiomassDataSetProximate.xlsx" collects the research datasets in six worksheets. The first worksheet, "Proximate," contains 803 HHV records gathered from 17 pieces of literature. The column names indicate the elements of the proximate analysis on a % dry basis, and the HHV column gives the higher heating value in MJ/kg. The next worksheet, "Full Residuals," stores the models' testing residuals from the 20-fold cross-validations; the article verifies the performance consistency through these residuals. The remaining worksheets present the literature datasets used to train and test model performance in previous studies.

The file "SourceCodeProximate.rar" collects the MATLAB machine learning models implemented in the article. The folder hierarchy in this archive mirrors the class structure of the machine learning models. These classes extend MATLAB's Statistics and Machine Learning Toolbox to support, e.g., k-fold cross-validation. The MATLAB script "runStudyProximate.m" is the article's main program (Kijkarncharoensin & Innet, 2021) for analyzing the performance consistency of the biomass HHV models derived from proximate analysis.
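The layout of the "Proximate" worksheet can be sketched as follows. This is an illustrative Python/pandas example, not the MATLAB code shipped in the archive; the column names (FC, VM, ASH) and the sample values are assumptions, so the actual workbook headers should be checked before use.

```python
import pandas as pd

# Hypothetical sample mimicking the "Proximate" worksheet schema:
# proximate fractions in % dry basis and HHV in MJ/kg.
# (Column names and values are invented for illustration.)
sample = pd.DataFrame({
    "FC":  [17.1, 20.5],   # fixed carbon, % dry basis
    "VM":  [80.1, 72.3],   # volatile matter, % dry basis
    "ASH": [2.8, 7.2],     # ash, % dry basis
    "HHV": [18.6, 17.9],   # higher heating value, MJ/kg
})

# In a real run, the full 803-record dataset would be loaded with:
# data = pd.read_excel("BiomassDataSetProximate.xlsx", sheet_name="Proximate")

# Basic sanity check: proximate fractions should sum to ~100 % on a dry basis.
print(bool((sample[["FC", "VM", "ASH"]].sum(axis=1).round(1) == 100.0).all()))  # → True
```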
The script loads the datasets from the Excel workbook and fits the biomass models through the OOP classes. The first section of the script generates the most accurate model of each type by tuning the models' hyperparameters; the first run takes a few hours to train the machine learning models through this trial-and-error process. The trained models can be saved to a MATLAB .mat file and loaded back into the MATLAB workspace. The remainder of the script, separated by a section break, performs the residual analysis that inspects the performance consistency. In addition, the script produces a 3D scatter plot of the biomass data and box plots of the prediction residuals. The interpretation of these results is examined in the authors' article.

Reference: Kijkarncharoensin, A., & Innet, S. (2021). Performance inconsistency of the biomass higher heating value (HHV) models derived from proximate analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.
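The out-of-fold residual collection behind the "Full Residuals" worksheet can be sketched as below. This is a minimal Python/scikit-learn illustration of 20-fold cross-validation residuals, not the article's MATLAB toolbox classes; the synthetic data and the linear model stand in for the real proximate dataset and the fifteen studied models.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the proximate dataset (the workbook holds 803 records).
rng = np.random.default_rng(0)
X = rng.uniform([5, 50, 0], [35, 90, 25], size=(200, 3))   # FC, VM, ASH (assumed)
y = 0.35 * X[:, 0] + 0.18 * X[:, 1] - 0.05 * X[:, 2] + rng.normal(0, 0.5, 200)

# Collect one out-of-fold residual per record over a 20-fold split,
# analogous to the residuals stored in the "Full Residuals" worksheet.
residuals = np.empty_like(y)
for train_idx, test_idx in KFold(n_splits=20, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    residuals[test_idx] = y[test_idx] - model.predict(X[test_idx])

print(residuals.shape)  # one residual per record: (200,)
```

Ranking several models by summary statistics of such residuals, dataset by dataset, is what the consistency analysis compares.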
Steps to reproduce
The database collects biomass HHV data based on proximate analysis from the 17 pieces of literature listed in the article. Three of them are the key sources of the article's dataset; the remaining ones are stacked as supplementary data. Three issues arise from stacking up literature data and must be resolved during data collection. The first is duplicate records: redundant data bias the distribution of the biomass data, and the similarity between train and test records positively biases the apparent model performance. The second is mismatched records: incorrect values may be reported unintentionally during publication, so a wrong biomass value can propagate into follow-up literature. The third is unit variation: mixed units within a data field lead to erroneous analysis, so the units of the stacked data must be identical. The procedure to generate the biomass HHV dataset is the following.

1. Stack up the datasets in the three worksheets "Qian2016", "GhugareS1", and "NhuchhenS2" as the primary source.
2. Stack up the other datasets listed in the authors' article onto those three datasets.
3. Inspect the duplicate records for data mismatches; keep the values reported in the older publication.
4. After the corrections, filter out the duplicate records.
5. Convert the proximate data to a % dry basis and the HHV to MJ/kg.

Running the MATLAB script generates the article's results as follows. First, extract the file "SourceCodeProximate.rar" into the same folder as the Excel workbook "BiomassDataSetProximate.xlsx." Then run the MATLAB script "runStudyProximate.m" to train the models and test the consistency of their performance.
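The unit-conversion and deduplication steps above can be sketched as follows. This is an illustrative Python/pandas example under assumed column names and invented values, not the procedure's actual implementation; in particular, the moisture column and the wet-basis records are hypothetical.

```python
import pandas as pd

# Hypothetical stacked literature records; the first two rows are duplicates,
# and the third is reported on an as-received (wet) basis.
stacked = pd.DataFrame({
    "FC":  [17.1, 17.1, 12.0],
    "VM":  [80.1, 80.1, 60.0],
    "ASH": [2.8, 2.8, 3.0],
    "HHV": [18.6, 18.6, 14.0],
    "moisture": [0.0, 0.0, 25.0],   # % as-received (assumed field)
})

# Convert as-received proximate values to a % dry basis.
# (HHV unit conversions, e.g. cal/g to MJ/kg, would be handled similarly.)
dry_factor = 100.0 / (100.0 - stacked["moisture"])
for col in ["FC", "VM", "ASH"]:
    stacked[col] = (stacked[col] * dry_factor).round(1)

# After corrections, filter out the duplicate records.
cleaned = stacked.drop_duplicates(subset=["FC", "VM", "ASH", "HHV"]).reset_index(drop=True)
print(len(cleaned))  # → 2
```

Deduplicating only after the unit conversion matters: two records that look distinct on different bases can collapse into one once expressed on the same % dry basis.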