TabbyXL: Experiment Data

Published: 25-06-2017 | Version 1 | DOI: 10.17632/448jdx7gcr.1
Contributor:
Alexey Shigarov

Description

The data are designed to evaluate TabbyXL, a system for rule-based transformation of spreadsheet data from arbitrary to relational tables, which is freely available on GitHub (https://github.com/cellsrg/tabbyxl).

Our data are based on the existing table dataset Troy_200 [1]. It contains 200 arbitrary tables as CSV files collected from 10 different government statistical websites. The tables were collected for the experiment on data extraction from tables presented in [2]. We use an earlier version of the dataset that stores the original tables with style features (fonts, alignment, and indentation) as Excel spreadsheets, available at http://tango.byu.edu/data.

We have put all of these tables, with their style features, into a single spreadsheet file (data/TangoDataset.xlsx). Each of the 200 tables is located in a separate sheet, and a pair of tags, $START and $END, marks its location inside the sheet. We first used this file in our previous experiment, described in [3].

We have automatically transformed all tables of this spreadsheet into relational form using TabbyXL and the ruleset (data/rules.dslr). The folder data/results contains the obtained results. The folder data/gt contains the ground-truth data for automated performance evaluation of TabbyXL at the role and structural stages of table analysis.

Each table in data/results and data/gt is accompanied by two recordsets: ENTRIES and LABELS. In ENTRIES, each record presents an entry as a triple <value, provenance, set of associated labels>. In LABELS, each record presents a label as a triple <value, provenance, parent reference>.

We have also stored the log files: results.log with the results of the run and eval.log with the results of the performance evaluation of TabbyXL.

REFERENCES

[1] Nagy G. TANGO-DocLab web tables from international statistical sites (Troy_200), 1, ID: Troy_200_1. URL: http://tc11.cvc.uab.es/datasets/Troy_200_1.

[2] Embley D., Krishnamoorthy M., Nagy G., & Seth S. (2016). Converting heterogeneous statistical tables on the web to searchable databases. Int. J. on Document Analysis and Recognition, 19(2), 119-138. URL: https://link.springer.com/article/10.1007/s10032-016-0259-1.

[3] Shigarov A., Paramonov V., Belykh P., & Bondarev A. (2016). Rule-Based Canonicalization of Arbitrary Tables in Spreadsheets. Proc. 22nd Int. Conf. on Information and Software Technologies, pp. 78-91. URL: http://link.springer.com/chapter/10.1007/978-3-319-46254-7_7.
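
The two record shapes named in the description, ENTRIES and LABELS, can be pictured with a small in-memory model. The following is only a sketch under stated assumptions: the class and field names, the use of a plain string for provenance, and the sample values are all hypothetical and are not taken from the dataset files or from the TabbyXL code.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Label:
    """One LABELS record: <value, provenance, parent reference>."""
    value: str
    provenance: str                   # assumption: a cell reference such as "B2"
    parent: Optional["Label"] = None  # None when the label has no parent


@dataclass(frozen=True)
class Entry:
    """One ENTRIES record: <value, provenance, set of associated labels>."""
    value: str
    provenance: str
    labels: FrozenSet[Label] = frozenset()


def label_path(label: Label) -> List[str]:
    """Return the label values from the topmost ancestor down to `label`."""
    path = []
    node: Optional[Label] = label
    while node is not None:
        path.append(node.value)
        node = node.parent
    return list(reversed(path))


# Hypothetical sample data, invented for illustration only.
year = Label("Year", "B1")
y2010 = Label("2010", "B2", parent=year)
region = Label("Region A", "A3")
entry = Entry("42", "B3", labels=frozenset({y2010, region}))
```

Here `label_path(y2010)` walks the parent references and yields the chain of label values that qualifies an entry. Frozen dataclasses are used so that labels are hashable and can be stored in a set, matching the "set of associated labels" wording.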

Steps to reproduce

1. Download and install Java SE Runtime Environment 8 or later (http://www.oracle.com/technetwork/java/javase/downloads).

2. Download the zip archive that contains the experiment data files and unpack it into a directory of your choice.

3. Download TabbyXL.v0.1.zip (https://github.com/cellsrg/tabbyxl/releases) and unpack it into the same directory.

4. Run TabbyXL in your console as follows:

4.1. Change to the directory that contains the unpacked data and TabbyXL:

    cd <path to your directory>

4.2. To obtain the results, run the executable JAR with the following command:

    java -jar TabbyXL-0.1-jar-with-dependencies.jar -input data/TangoDataset.xlsx -ruleset data/rules.dslr -ignoreSuperscript true -useCellText false -debuggingMode false -output data/results -useShortNames true

4.3. To evaluate the obtained results, run the executable JAR with the following command:

    java -cp TabbyXL-0.1-jar-with-dependencies.jar ru.icc.cells.ssdc.evaluation.Evaluator data/results data/gt
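
Steps 4.2 and 4.3 can also be scripted so that the evaluation runs only if the transformation succeeds. The sketch below is a convenience wrapper, not part of TabbyXL: the function names are hypothetical, the command-line arguments are copied from the steps above, and it assumes that `java` is on the PATH and that the script is started from the directory containing the JAR and the data folder.

```python
import subprocess

JAR = "TabbyXL-0.1-jar-with-dependencies.jar"


def transform_command(jar=JAR):
    """Command from step 4.2: transform the tables into relational form."""
    return [
        "java", "-jar", jar,
        "-input", "data/TangoDataset.xlsx",
        "-ruleset", "data/rules.dslr",
        "-ignoreSuperscript", "true",
        "-useCellText", "false",
        "-debuggingMode", "false",
        "-output", "data/results",
        "-useShortNames", "true",
    ]


def evaluate_command(jar=JAR):
    """Command from step 4.3: compare data/results with data/gt."""
    return [
        "java", "-cp", jar,
        "ru.icc.cells.ssdc.evaluation.Evaluator",
        "data/results", "data/gt",
    ]


def run_experiment(run=subprocess.run):
    """Run both steps in order; check=True raises if a step exits with an error."""
    for command in (transform_command(), evaluate_command()):
        run(command, check=True)
```

Calling `run_experiment()` from the data directory executes both commands in sequence. The `run` parameter exists only so that the command lists can be inspected or tested without starting Java.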