Noisy name datasets

Published: 15-09-2016 | Version 2 | DOI: 10.17632/hp3my8rv3m.2
Contributors:
Lisandra Díaz de la Paz,
Yaisel Nuñez Arcia,
Juan Luis García Mendoza

Description

These test datasets focus on string data representing persons’ names. They were generated using the NSDGen (Noisy String Data Generator) tool. The total number of elements in each dataset is the product of the k value (the number of different names) and the number of exact duplicates per name. Once noise is introduced, the duplicates are no longer exact, so the number of duplicates is referred to from now on as the number of observations per group. By noise we mean common typos such as insertions, deletions, substitutions and/or transpositions of characters in strings. To introduce such typos, we consider the graph of distances among keys on QWERTY keyboards. The datasets have been used as test data to evaluate partition-based clustering algorithms in the ambiguous name problem, record linkage and authority control files.
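As an illustration of the kind of noise described above, the following minimal sketch applies one random typo to a name, drawing inserted and substituted characters from neighbouring QWERTY keys. It is not NSDGen's implementation: the names `QWERTY_ROWS`, `build_neighbors`, `NEIGHBORS` and `add_typo` are hypothetical, and the adjacency map is a crude grid approximation that ignores key stagger and stands in for the tool's graph of key distances.

```python
import random

# Hypothetical, letters-only keyboard layout; NOT the graph used by NSDGen.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows):
    """Map each key to the keys within one grid step of it (stagger ignored)."""
    neighbors = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            adjacent = set()
            for dr in (-1, 0, 1):
                rr = r + dr
                if not 0 <= rr < len(rows):
                    continue
                for dc in (-1, 0, 1):
                    cc = c + dc
                    if (dr, dc) != (0, 0) and 0 <= cc < len(rows[rr]):
                        adjacent.add(rows[rr][cc])
            neighbors[key] = sorted(adjacent)
    return neighbors


NEIGHBORS = build_neighbors(QWERTY_ROWS)


def add_typo(name, rng):
    """Apply one random insertion, deletion, substitution or transposition."""
    if len(name) < 2:
        return name
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    i = rng.randrange(len(name) - 1)  # keeps i + 1 valid for transposition
    if op == "delete":
        return name[:i] + name[i + 1:]
    if op == "transpose":
        return name[:i] + name[i + 1] + name[i] + name[i + 2:]
    near = NEIGHBORS.get(name[i].lower())
    if not near:
        # Assumption: characters outside the map (spaces, accents) are deleted.
        return name[:i] + name[i + 1:]
    typo = rng.choice(near)
    if op == "insert":
        return name[:i + 1] + typo + name[i + 1:]
    return name[:i] + typo + name[i + 1:]  # substitution


if __name__ == "__main__":
    rng = random.Random(2016)
    print([add_typo("lisandra", rng) for _ in range(5)])
```

Each call changes the length of the string by at most one character; how a noise percentage maps to the number of typos applied is not specified in this description, so the sketch leaves that choice to the caller.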

Steps to reproduce

Test datasets are obtained using the NSDGen tool as follows:
1) Initial state
   a) A txt file with 10 different names (the k value) is loaded; it is called primary data set 1.
2) Exact duplicates
   a) 15 exact duplicates are introduced for each different name in primary data set 1.
3) Percent of noise
   a) The following noise interval is specified: initial percent 5%, final percent 95% and step 10. Thus 10 artificial data sets are created, each with a total of 150 items.
4) Final state
   a) The tool is run twice more with the following inputs:
      i) In the first iteration, a txt file with 15 different names, called primary data set 2.
      ii) In the second iteration, another txt file with 30 different names, called primary data set 3. Then 10 and 5 exact duplicates are introduced for each different name in primary data sets 2 and 3, respectively.
      iii) The same noise interval used for primary data set 1 is selected for primary data sets 2 and 3.
   b) Finally, 10 data sets are obtained for each of the three primary data sets, giving a total of 30 data sets with 150 elements each.
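The counts in the steps above can be checked with a short sketch. It only reproduces the arithmetic (k value times exact duplicates, and the noise levels from 5% to 95% in steps of 10); the names `PRIMARY_SETS`, `noise_levels` and `plan` are hypothetical and are not part of NSDGen.

```python
# (k different names, exact duplicates per name) for primary data sets 1, 2, 3,
# taken from the steps above.
PRIMARY_SETS = {1: (10, 15), 2: (15, 10), 3: (30, 5)}


def noise_levels(start=5, stop=95, step=10):
    """Noise percentages from start to stop inclusive."""
    return list(range(start, stop + 1, step))


def plan():
    """List (primary set id, noise percent, total elements) for every data set."""
    rows = []
    for set_id, (k, duplicates) in PRIMARY_SETS.items():
        for percent in noise_levels():
            rows.append((set_id, percent, k * duplicates))
    return rows


if __name__ == "__main__":
    for set_id, percent, size in plan():
        print(f"primary data set {set_id}: noise {percent}%, {size} elements")
```

Under these values every combination has 150 elements, and three primary data sets times ten noise levels gives the 30 data sets stated above.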