Benchmark datasets for seriation and patch seriation code

Published: 13 April 2023 | Version 1 | DOI: 10.17632/b96s5bcfc2.1
Contributor:
Gergely Tóth

Description

These benchmark datasets are intended for testing seriation methods. We used them to test diagonal and patch seriation. The C code we used is also included.

SIM dataset
The SIM dataset is semi-randomly simulated (created by Gergely Tóth). It is a good example of a data structure in which a different set of variables is responsible for each cluster, while the other variables of a given cluster are random. Seriating this type of data seems to be a hard task for most methods. The set contains 50 objects and 20 variables; the objects are arranged in 4 clusters and one random group. Members of a cluster have similar values for some selected variables, but their other values are random. Some of the selected variables are shared with other clusters. We first generated random numbers in [0,1) for all data; the groups were then recalculated by adding a given random number, biased with white noise, for each selected variable of the group.
No. of rows: 50 (A, B, C, D = clusters, R = non-clustered elements)
No. of columns: 20 (the characters A-D indicate the involvement of a variable in a given cluster)

RETSIM dataset
The RETSIM dataset is simulated (created by Gergely Tóth). We defined three functional groups and created 4 compounds as random linear combinations of the three groups. We set up 6 mixtures of the 4 compounds, and 6 chromatographic columns with differently randomized partial retention times for the functional groups. The retention time of each compound was calculated as a linear combination over the functional groups it contains. Finally, we added uniform broadening to each compound, with integrals related to the concentrations. This gave 36 chromatograms: the 6 mixtures on each of the 6 columns.
No. of rows: 36 = 6*6 (A-F denote the chromatographic columns, 1-6 the mixtures)
No. of columns: 100
The dataset can be used in two dimensions (36*100) or in three dimensions (6*6*100).
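The RETSIM construction described above can be illustrated with a short Python sketch. It is not the generator used for the published file. The function names, the seed, the rectangular peak shape (one reading of "uniform broadening"), the peak half-width, the choice of peak area equal to concentration, and the row order A1..A6, B1..B6, ... are all assumptions made for illustration.

```python
import random

N_GROUPS, N_COMPOUNDS, N_MIXTURES, N_COLUMNS, N_POINTS = 3, 4, 6, 6, 100
HALF_WIDTH = 2  # assumed half-width of a rectangular peak, in time points


def simulate_retsim_like(seed=0):
    """Return (cube, conc): a 6 x 6 x 100 nested list and the concentrations."""
    rng = random.Random(seed)
    # Each compound is a random linear combination of the functional groups.
    composition = [[rng.random() for _ in range(N_GROUPS)]
                   for _ in range(N_COMPOUNDS)]
    # Concentration of every compound in every mixture.
    conc = [[rng.random() for _ in range(N_COMPOUNDS)]
            for _ in range(N_MIXTURES)]
    # Partial retention time of each functional group on each column.
    partial = [[rng.random() for _ in range(N_GROUPS)]
               for _ in range(N_COLUMNS)]

    # Retention time = linear combination of the group contributions.
    retention = [[sum(p * a for p, a in zip(partial[c], composition[k]))
                  for k in range(N_COMPOUNDS)]
                 for c in range(N_COLUMNS)]
    # Scale so that every peak fits on the 100-point time axis.
    scale = (N_POINTS - 1 - 2 * HALF_WIDTH) / max(max(r) for r in retention)
    width = 2 * HALF_WIDTH + 1

    cube = [[[0.0] * N_POINTS for _ in range(N_MIXTURES)]
            for _ in range(N_COLUMNS)]
    for c in range(N_COLUMNS):
        for m in range(N_MIXTURES):
            for k in range(N_COMPOUNDS):
                start = round(retention[c][k] * scale)
                # Rectangular peak whose area equals the concentration.
                for t in range(start, start + width):
                    cube[c][m][t] += conc[m][k] / width
    return cube, conc


def flatten(cube):
    """View the 6 x 6 x 100 cube as a 36 x 100 table labelled A1..F6."""
    labels = [f"{col}{mix}" for col in "ABCDEF"
              for mix in range(1, N_MIXTURES + 1)]
    rows = [chromatogram for per_column in cube for chromatogram in per_column]
    return labels, rows
```

Calling `flatten(simulate_retsim_like()[0])` gives the two-dimensional view, while the nested `cube[column][mixture][time]` is the three-dimensional one. Both hold the same chromatograms, so a method can be tested on either layout.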
REAC dataset
95 reactions of gasoline combustion used in the thesis work of G. Juhász [1]. The table was created by Gergely Tóth.
No. of rows: 95 (reactions)
No. of columns: 32 (reactants or products)
An entry of 0 or 1 indicates whether a compound takes part in the reaction (irrespective of the stoichiometry or of its role as reactant or product).

[1] Juhász G. Reduction of a biodiesel combustion reaction mechanism. BSc thesis. Budapest: Eötvös Loránd University, Institute of Chemistry, Department of Physical Chemistry, 2015.

Seriation code in C: see details in the header of the code.
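To make the 0/1 encoding of the REAC table concrete, the following Python sketch builds such a participation matrix from a list of reactions. The function name and the four example reactions are illustrative assumptions and are not taken from the dataset. The column order (alphabetical) is also an arbitrary choice.

```python
def participation_matrix(reactions):
    """Build a reactions x species 0/1 table.

    `reactions` is a list of (reactants, products) pairs, each a list of
    species names. Stoichiometry and the reactant/product role are ignored.
    """
    species = sorted({s for reactants, products in reactions
                      for s in reactants + products})
    column = {name: j for j, name in enumerate(species)}
    matrix = [[0] * len(species) for _ in reactions]
    for i, (reactants, products) in enumerate(reactions):
        for name in reactants + products:
            matrix[i][column[name]] = 1  # 1 = the species takes part
    return species, matrix


# Hypothetical example input, not rows of the REAC file.
EXAMPLE_REACTIONS = [
    (["H", "O2"], ["OH", "O"]),
    (["O", "H2"], ["OH", "H"]),
    (["OH", "H2"], ["H2O", "H"]),
    (["OH", "OH"], ["H2O", "O"]),  # OH appears twice but is still coded 1
]

species, matrix = participation_matrix(EXAMPLE_REACTIONS)
for row in matrix:
    print(" ".join(str(v) for v in row))
```

In the last example reaction, OH enters with a stoichiometric coefficient of 2, yet its entry is 1. This is the sense in which the table is independent of stoichiometry.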

Files

Steps to reproduce

Details can be found in the files.

Institutions

Eötvös Loránd Tudományegyetem, Kémiai Intézet

Categories

Chemometrics, Pattern Recognition, Chemometrics Software

Licence