Experimental data for computing semantic similarity between concepts using multiple Inheritances in Wikipedia category graph

Published: 25 February 2020 | Version 1 | DOI: 10.17632/hnmb43sj5s.1


In this data article, we provide experimental data for computing the semantic similarity between concepts (words) taken from the gold-standard word similarity benchmarks MC30 (English), RG65 (Spanish), and RG65 (French). This data is related to the multiple inheritance-based semantic similarity methods proposed in M. J. Hussain et al.

The dataset contains four folders named “Benchmarks_results_graphs”, “French_RG65”, “MC30”, and “Spanish_RG65”. The folder “Benchmarks_results_graphs” contains the Pearson correlation values of the experimental results for the English (MC30), French (RG65), and Spanish (RG65) benchmarks. The folders “French_RG65”, “MC30”, and “Spanish_RG65” contain all the pre-processed data files needed to run the Python-based program that computes the semantic similarity between French, English, and Spanish Wikipedia concepts, respectively, according to our methods.

For example, the folder “French_RG65” contains:
(1) the experiments on the RG65 (French) benchmark, in the sub-folder “French_RG65_results”;
(2) the data required to compute Information Content (IC) with respect to category hyponyms and category pages, in the sub-folder “predate_fr”;
(3) the disambiguated French Wikipedia concepts, in the file “disambiguated_benchmark.csv”;
(4) the page ids of the French Wikipedia concepts, in the file “fr_RG65_pageid.csv”;
(5) the categories associated with the French Wikipedia pages, in the file “fr_RG65_page_categories.txt”;
(6) the source code to compute the semantic similarity between French Wikipedia concepts using IC with respect to category hyponyms, in the file “RG_French_Sim_IC_hypos.txt”;
(7) the source code to compute the semantic similarity between French Wikipedia concepts using IC with respect to category pages, in the file “RG_French_Sim_IC_pages.txt”; and
(8) the source code to reproduce the data associated with Table 3, in the file “Table3_French.txt”.

Together, these folders provide all the pre-processed data files needed to run the Python-based program, reproduce the experimental results of our semantic similarity methods, and further analyze the structure of the different Wikipedia category graphs.
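
The Pearson correlation values in the results folder measure how closely computed similarity scores agree with the human ratings of a benchmark. The following is a minimal sketch of that calculation, not the authors' program: the function name `pearson` and the two lists are hypothetical, and the numbers are toy values for illustration only, not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only; NOT taken from the dataset.
human_ratings = [3.9, 3.1, 0.4, 1.2]      # hypothetical benchmark ratings
computed_sims = [0.92, 0.75, 0.10, 0.35]  # hypothetical similarity scores

r = pearson(human_ratings, computed_sims)
print(f"Pearson r = {r:.3f}")
```

In practice the two lists would be read from the benchmark and result files in the corresponding folder; their exact column layout is not described here, so the loading step is left out.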



South China Normal University


Natural Language Processing
