HESML_vs_SML: scalability and performance benchmarks between the HESML V1R2 and SML 0.9 semantic measures libraries
Description
This dataset provides HESML_vs_SML_test.jar, a companion Java console program for reproducing the work of Lastra-Díaz and García-Serrano [1]. That work introduces the Half-Edge Semantic Measures Library (HESML) and carries out an experimental survey comparing the HESML V1R2 [3], Semantic Measures Library (SML) 0.9 [2] and WNetSS [4] semantic measures libraries. The HESML_vs_SML_test.jar program runs the set of performance and scalability benchmarks detailed in [1] and generates the figures and tables of results reported in that work, which are also enclosed as complementary files of this dataset (see files below).

Licensing note: The 'HESML_vs_SML_test.jar' program is based on the HESML V1R2 [3], SML 0.9 [2] and WNetSS [4] semantic measures libraries, and its distribution includes these libraries, as well as WordNet 3.0 [6] and the SimLex665 [5] dataset. Thus, if you use this dataset, you should also cite the works related to these resources.

References:
[1] Lastra-Díaz, J. J., & García-Serrano, A. (2016). HESML: a scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset. To appear in Information Systems Journal.
[2] Harispe, S., Ranwez, S., Janaqi, S., & Montmain, J. (2014). The Semantic Measures Library: Assessing Semantic Similarity from Knowledge Representation Analysis. In E. Métais, M. Roche, & M. Teisseire (Eds.), Proc. of the 19th International Conference on Applications of Natural Language to Information Systems (NLDB 2014) (Vol. 8455, pp. 254–257). Montpellier, France: Springer. http://dx.doi.org/10.1007/978-3-319-07983-7_37
[3] Lastra-Díaz, J. J., & García-Serrano, A. (2016). HESML V1R2 Java software library of ontology-based semantic similarity measures and information content models. Mendeley Data, v2. https://doi.org/10.17632/t87s78dg78.2
[4] Ben Aouicha, M., Taieb, M. A. H., & Ben Hamadou, A. (2016). SISR: System for integrating semantic relatedness and similarity measures. Soft Computing, 1–25. http://dx.doi.org/10.1007/s00500-016-2438-x
[5] Hill, F., Reichart, R., & Korhonen, A. (2015). SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computational Linguistics, 41(4), 665–695. http://dx.doi.org/10.1162/COLI_a_00237
[6] Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39–41. http://dx.doi.org/10.1145/219717.219748
Files
Steps to reproduce
System requirements: a Java 8-compliant workstation with at least 8 GB of RAM. The HESML_vs_SML_test.zip file contains the source files and compiled versions of HESML_vs_SML_test.jar and all the aforementioned semantic measures libraries, so you only need to run the program. However, to compile HESML_vs_SML_test from its source files, you need to install NetBeans 8.0 or higher and the Java SDK 8.0.

Running the benchmarks: The first group of benchmarks evaluates the running time and caching ratio in a side-by-side comparison between the most significant topological algorithms implemented by HESML and SML.
(1) Download the HESML_VS_SML_test.zip file above and extract it onto your hard drive, then follow steps 2-4 below.
(2) Open a Linux or Windows command console in the main HESML_VS_SML_test directory and run the following command (the path uses the Windows separator; on Linux, write dist/HESML_VS_SML_test.jar instead):
$prompt:> java -Xms4096m -Xmx4096m -jar dist\HESML_VS_SML_test.jar <output_results.csv>
(3) Import the raw output file into LibreOffice or MS Excel to obtain the data as shown in the benchmarks_HESML_vs_SML.csv file above.
(4) Install and open the R statistics package, then proceed as follows: (a) select the "File->Open script" menu and load the 'IS_HESML_figure3_and_table18.r' script file above; (b) edit the first two lines of the script to set the path of the input directory and the name of the input 'output_results.csv' file generated in step 2; and finally, (c) select the "Edit->Run all" menu to generate the figure in the HESML_vs_SML.pdf file above.
The output CSV file obtained in step 2 will match the complementary 'benchmarks_HESML_vs_SML.csv' file, except that it will report the running times measured on your own experimental platform.

The second benchmark evaluates the running time of HESML, SML and WNetSS in the evaluation of the Jiang-Conrath similarity measure with the Seco et al. IC model on the SimLex665 dataset.
In order to reproduce the WordNet-based similarity benchmark reported in Table 19 of [1] and the 'final_results-SimLex665.csv' file above, follow steps 5-8 below:
(5) Install MySQL Community Edition on your workstation (required by WNetSS).
(6) Open a Linux or Windows command console in the HESML_VS_SML_test directory and run the command below, which carries out the offline pre-processing tasks of WNetSS in order to load WordNet 3.0 and all its topological information into the MySQL server. This task could take a few hours on a modern workstation.
$prompt:> java -Xms4096m -Xmx4096m -jar dist\HESML_VS_SML_test.jar -WNetSS_Setup mySqlRootPassword
(7) From the same Linux or Windows command console, run the following command:
$prompt:> java -Xms4096m -Xmx4096m -jar dist\HESML_VS_SML_test.jar -WNetSS mySqlRootPassword <output_results.csv>
(8) Import the output file into LibreOffice or MS Excel to obtain the data shown in the final_results_SimLex665.csv file above.
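For readers unfamiliar with the measure timed in the second benchmark, the following is a minimal Python sketch of the Jiang-Conrath similarity combined with the Seco et al. intrinsic IC model. It is an illustration under stated assumptions, not the code of HESML, SML or WNetSS: the toy taxonomy, the function names and the mapping from distance to similarity (1 - distance / 2, a common choice when IC values lie in [0, 1]) are all assumptions made here, and the libraries may differ in these details.

```python
import math

# Hypothetical toy taxonomy (child -> list of parents); this is NOT WordNet data.
PARENTS = {
    "entity": [],
    "animal": ["entity"],
    "plant": ["entity"],
    "dog": ["animal"],
    "cat": ["animal"],
    "tree": ["plant"],
}


def ancestors(node):
    """Return the set containing the node itself and all of its ancestors."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def hyponym_count(node):
    """Count the proper descendants (hyponyms) of a node."""
    return sum(
        1 for other in PARENTS if other != node and node in ancestors(other)
    )


def seco_ic(node):
    """Seco et al. intrinsic IC: 1 - log(hypo(c) + 1) / log(N)."""
    return 1.0 - math.log(hyponym_count(node) + 1) / math.log(len(PARENTS))


def jc_similarity(a, b):
    """Jiang-Conrath distance, mapped to a similarity as 1 - distance / 2."""
    common = ancestors(a) & ancestors(b)
    # IC of the most informative common ancestor of both concepts.
    mica_ic = max(seco_ic(c) for c in common)
    distance = seco_ic(a) + seco_ic(b) - 2.0 * mica_ic
    return 1.0 - distance / 2.0


if __name__ == "__main__":
    for pair in [("dog", "dog"), ("dog", "cat"), ("dog", "tree")]:
        print(pair, round(jc_similarity(*pair), 4))
```

This sketch recomputes ancestor sets on every call, which is acceptable for six concepts but not for a taxonomy of the size of WordNet. The cost of such topological queries is the kind of overhead that the benchmarks in this dataset measure.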