Contributors: Chong-kwon Kim, Sihyun Jeong, Boyeon Jang
... The dataset includes legitimate users and customers of fake follower markets. Please Read Readme.txt file that has an description of each files. The files in the dataset includes the information about labled users, follower relationship, following relationship, and follower distance from users. Included files are legitimate_id.txt, buyer_id.txt, ffollower_id.txt, follower.txt, followee.txt, legitimate_distance.txt, buyer_distance.txt, and Readme.txt
Contributors: Antonio Fariña, Miguel A. Martínez-Prieto, Francisco Claude, Gonzalo Navarro
... uiHRDC (universal indexes for Highly Repetitive Document Collections) is a replication framework licensed under the GNU Lesser General Public License v2.1 (GNU LGPL). It includes all the required elements to reproduce the main experiments of the paper , including datasets, query patterns, source code and scripts. The general structure of the uiHRDC repository includes: i) a directory benchmark which contains a LATEX formatted report and a script that will collect all the data files resulting from running all the experiments and will generate a PDF report with all the most relevant figures; ii) a directory data, which includes the text collections (7z compressed), and the query patterns. iii) directories indexes and self-indexes that contain the source code for each indexing alternative, and scripts that permit to run all the experiments for each technique (it includes the construction of each compressed index of interest (using a builder program) and then performing both locate and extract operations over that index (using the corresponding searcher program). Each experiment will output relevant data to a results-data file); and iv) a script doAll.sh that will drive all the process of decompressing the source collections; compiling the sources for each index and running the experiments with it; and finally, generating the final report.  F. Claude, A. Fariña, M. A. Martínez-Prieto, and G. Navarro. Universal Indexes for Highly Repetitive Document Collections. Information Systems, 61:1–23, 2016.
Contributors: Mohamed L. Chouder, Stefano Rizzi, Rachid Chalal
... These datasets has been used to evaluate the EXODuS approach: EXploratory OLAP over Document Stores. - The games dataset has been collected by Sports Reference LLC. It contains around 32K nested documents representing NBA games in the period 1985-2013. Each document represents a game between two teams with at least 11 players each. It contains 47 attributes; 40 of them are numeric and represent team and player results. - The DBLP dataset contains 2M documents scraped from DBLP in XML format and converted into JSON. Documents are flat and represent eight kinds of publications including conference proceedings, journal articles, books, thesis, etc. The third portion of the dataset represent author pages, containing half the number of fields compared to other kinds. So, documents have shared attributes such as title, author, type, year and unshared ones such as journal and booktitle. - The Twitter dataset contains 2M tweets scraped from the Twitter API. Each document represents a tweet message and its metadata, which contains some nested objects: a user object that represent the author of the tweet, a place object that gives its location and a retweet object if it is a reply. The dataset is heterogeneous and mixes between tweets and documents of an API call for tweet deletes. The sources of the datasets are listed in the Related links Section.
Top results from Data Repository sources. Show only results like these.
HESML_vs_SML: scalability and performance benchmarks between the HESML V1R2 and SML 0.9 semantic measures libraries
Contributors: Juan J. Lastra-Diaz, Ana Garcia-Serrano
... This dataset introduces a companion reproducibility Java console program, called HESML_vs_SML_test.jar, of the work introduced by Lastra-Díaz and García-Serrano . This latter work introduces the Half-Edge Semantic Measures Library (HESML), and carries-out an experimental survey between HESML V1R2, the Semantic Measures Library (SML) 0.9  and the WNetSS  semantic measures libraries. The HESML_vs_SML_test.jar program runs the set of performance and scalability benchmarks detailed in  and generates the figures and tables of results reported in the aforementioned work, which are also enclosed as complementary files of this dataset (see files below). Licensing note: The 'HESML_vs_SML_test.jar' program is based on the HESML V1R2 , SML 0.9  and WNetSS  semantic measures libraries, and it includes these libraries in its distribution, as well as WordNet 3.0  and the SimLex665  dataset. Thus, if you use this dataset, you should also cite the works related to these resources. References:  Lastra-Díaz, J. J., and García-Serrano, A. (2016). HESML: a scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset. To appear in Information Systems Journal.  Harispe, S., Ranwez, S., Janaqi, S., and Montmain, J. (2014). The Semantic Measures Library: Assessing Semantic Similarity from Knowledge Representation Analysis. In E. Métais, M. Roche, & M. Teisseire (Eds.), Proc. of the 19th International Conference on Applications of Natural Language to Information Systems (NLDB 2014) (Vol. 8455, pp. 254–257). Montpelier, France: Springer. http://dx.doi.org/10.1007/978-3-319-07983-7_37  Lastra-Díaz, J. J., & García-Serrano, A. (2016). HESML V1R2 Java software library of ontology-based semantic similarity measures and information content models. Mendeley Data, v2. https://doi.org/10.17632/t87s78dg78.2  Ben Aouicha, M., Taieb, M. A. H., and Ben Hamadou, A. (2016). SISR: System for integrating semantic relatedness and similarity measures. Soft Computing, 1–25. http://dx.doi.org/10.1007/s00500-016-2438-x  Hill, F., Reichart, R., & Korhonen, A. (2015). SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computational Linguistics, 41(4), 665–695. http://dx.doi.org/10.1162/COLI_a_00237  Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39–41. http://dx.doi.org/10.1145/219717.219748
Contributors: Juan J. Lastra-Díaz, Ana Garcia-Serrano
... This dataset is provided as supplementary material of the paper by Lastra-Díaz, J. J., & García-Serrano, A. (2016). HESML: a scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset. Information Systems. This dataset contains a ReproZip reproducible experiment file, called "HESMLv1r1_reproducible_exps.rpz", which allows the experimental surveys on word similarity on WordNet introduced in the three papers below to be reproduced exactly.  Lastra-Díaz, J. J., & García-Serrano, A. (2015). A novel family of IC-based similarity measures with a detailed experimental survey on WordNet. Engineering Applications of Artificial Intelligence Journal, 46, 140–153. http://dx.doi.org/10.1016/j.engappai.2015.09.006  Lastra-Díaz, J. J., & García-Serrano, A. (2015). A new family of information content models with an experimental survey on WordNet. Knowledge-Based Systems, 89, 509–526. http://dx.doi.org/10.1016/j.knosys.2015.08.019  Lastra-Díaz, J. J., & García-Serrano, A. (2016). A refinement of the well-founded Information Content models with a very detailed experimental survey on WordNet (No. TR-2016-01). NLP and IR Research Group. ETSI Informática. Universidad Nacional de Educación a Distancia (UNED). http://e-spacio.uned.es/fez/view/bibliuned:DptoLSI-ETSI-Informes-Jlastra-refinement
Contributors: Andreas Wolke
... In Wolke et al. we compare the efficiency of different resource allocation strategies experimentally. We focused on dynamic environments where virtual machines need to be allocated and deallocated to servers over time. In this companion paper, we describe the simulation framework and how to run simulations to replicate experiments or run new experiments within the framework.
Contributors: Andreas Wolke
... Data produced by simulations and experiments that were used in our paper "More than bin packing: Dynamic resource allocation strategies in cloud data centers" (Information Systems)