SM01: Research sample subsets
Description of this data
Research project SM01 (Parallel Semantic Crawler for manufacturing multilingual web...)
The research sample sets mentioned in the "Evaluation" section of the paper, provided in spreadsheet and plain-text formats,
together with some extra information.
Origin of the initial research data set: the research sample set was extracted from the CRM system of a company doing business in the application domain.
Experiment data files
Reviewed sample subsets: Sp, Sa and Sb. The spreadsheet contains only the domain URLs (seed URLs).
300 domains of Sall (reviewed list) that were deeply scanned.
Sall domain index export, with some additional sheets made during subset creation
Spreadsheet with the Sc, Sd and Sn sample subsets - domain URLs only
Spreadsheet file with all steps leading to the creation of the Sn subset. The first sheet describes the steps; the other sheets contain the domain list for each filtration step.
130 domains, plain text list, "Challenging" sample subset
Initial Sc subset (130 domains) - Domain Index data table report - this crawl was performed for subset review, i.e. to create the "reviewed" Sc subset from the initial one
83 domains of Sc reviewed subset (initial Sc had 130 domains)
Plain-text list of domain URLs of the Sd subset (internally called Sc_sub): "New sample subset SD with 50 domains is created as refined version of Sc"
Subset Sn: domains whose landing page is in a language other than the primary one
Steps to reproduce
If you want to run your crawler over the sample sets used in our research, the files in plain-text format are probably the ones you want to download.
The spreadsheet documents and the zip with internal reports contain some extra information you might want to dive into later.
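As a minimal sketch of how one of the plain-text lists could be turned into crawler seeds, the Python snippet below reads a file and returns de-duplicated seed URLs. It assumes one domain or URL per line, which this page does not confirm, and it assumes that entries without a scheme should be prefixed with http://. The function names and the file name in the usage note are hypothetical.

```python
from urllib.parse import urlsplit


def parse_seed_list(text):
    """Return de-duplicated seed URLs from a listing with one entry per line."""
    seeds = []
    seen_hosts = set()
    for raw_line in text.splitlines():
        entry = raw_line.strip()
        if not entry:
            continue  # skip blank lines
        # Assumption: an entry without a scheme is a bare host name.
        url = entry if "://" in entry else "http://" + entry
        host = urlsplit(url).netloc.lower()
        # Keep only the first entry seen for each host.
        if host and host not in seen_hosts:
            seen_hosts.add(host)
            seeds.append(url)
    return seeds


def load_seed_file(path, encoding="utf-8"):
    """Read a plain-text domain list from disk and parse it into seed URLs."""
    with open(path, encoding=encoding) as handle:
        return parse_seed_list(handle.read())
```

For example, `load_seed_file("sample_subset.txt")` (a placeholder name, not one of the dataset's files) would return a list that can be handed to a crawler's frontier. If the downloaded files use a different encoding or layout, the parsing step would need to be adjusted.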
Cite this dataset
Grubić, Goran (2017), “SM01: Research sample subsets”, Mendeley Data, v1 http://dx.doi.org/10.17632/b4cs4rky9s.1
The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence.