DMOZ 2006 Dataset and its Wikification

Published: 2 May 2019 | Version 1 | DOI: 10.17632/9mpgz8z257.1
Contributor(s): Carlos Lorenzetti, Ana Maguitman, Cecilia Baggio

Description of this data

This dataset was retrieved with a crawler in 2006 from the Open Directory Project (ODP) (http://dmoz.org, https://en.wikipedia.org/wiki/DMOZ), which closed in 2017 and was reborn as Curlie (https://curlie.org/).
The topics were selected from the third level of the ODP hierarchy. Some constraints were imposed on this selection to ensure the quality of the dataset. The minimum size for each selected topic was 100 URLs, and the language was restricted to English. For each topic, we collected all of its URLs as well as those in its subtopics.
The retrieved HTML was parsed and cleaned to remove empty files, PDF and Flash content, and other unusable files.
In total, more than 350,000 pages were collected from 448 topics.
In 2018 the data was wikified, that is, its text was annotated with links to Wikipedia concepts.
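
The topic selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the crawler that built the dataset: the `Topic` class, the function names, and the convention that the root sits at level 0 are all hypothetical. It also assumes that the 100-URL minimum is counted after pooling a topic's URLs with those of its subtopics, which the description does not state explicitly, and it leaves out the English-language restriction.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List

MIN_URLS = 100      # minimum topic size stated in the description
TARGET_LEVEL = 3    # topics are taken from the third level of the hierarchy


@dataclass
class Topic:
    """Hypothetical in-memory node of the directory tree."""
    name: str
    urls: List[str] = field(default_factory=list)
    children: List["Topic"] = field(default_factory=list)


def all_urls(topic: Topic) -> List[str]:
    """Collect the URLs of a topic together with those of all its subtopics."""
    collected = list(topic.urls)
    for child in topic.children:
        collected.extend(all_urls(child))
    return collected


def topics_at_level(node: Topic, level: int, current: int = 0) -> Iterator[Topic]:
    """Yield the nodes that sit `level` steps below the starting node."""
    if current == level:
        yield node
        return
    for child in node.children:
        yield from topics_at_level(child, level, current + 1)


def select_topics(root: Topic) -> Dict[str, List[str]]:
    """Map each qualifying third-level topic name to its pooled URL list.

    Assumes topic names are unique at the target level.
    """
    selected = {}
    for topic in topics_at_level(root, TARGET_LEVEL):
        urls = all_urls(topic)
        if len(urls) >= MIN_URLS:  # assumed to be counted after pooling
            selected[topic.name] = urls
    return selected
```

Pooling before applying the threshold mirrors the statement that each topic's URLs were collected together with those of its subtopics; if the minimum was applied to a topic's own URLs only, the check would move before the call to `all_urls`.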

Steps to reproduce

Websites were retrieved using the lists of URLs available in urls.tar.gz, but some of the listed pages are missing from the dataset.
The reasons include: the website was not available at crawling time (around December 2006), the retrieved file was not valid HTML (e.g., PDF or Flash), or the parsed HTML contained no readable text. For example, the urls/1.txt file lists 812 URLs, while the texts/1/ directory contains only 592 files.
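
The gap between listed URLs and retrieved texts can be checked per topic with a short script. The following is a minimal sketch, assuming the archives have been extracted under a common root directory with the `urls/<id>.txt` and `texts/<id>/` layout mentioned above, and assuming one URL per line in each list; the function names are illustrative.

```python
from pathlib import Path
from typing import Tuple


def count_urls(url_file: Path) -> int:
    """Count non-empty lines in a URL list (assumes one URL per line)."""
    with url_file.open(encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())


def count_texts(text_dir: Path) -> int:
    """Count regular files in a topic's text directory."""
    return sum(1 for entry in text_dir.iterdir() if entry.is_file())


def coverage(root: Path, topic_id: int) -> Tuple[int, int, float]:
    """Return (listed URLs, retrieved texts, fraction retrieved) for one topic."""
    listed = count_urls(root / "urls" / f"{topic_id}.txt")
    retrieved = count_texts(root / "texts" / str(topic_id))
    fraction = retrieved / listed if listed else 0.0
    return listed, retrieved, fraction
```

With the figures quoted for topic 1 (812 listed URLs and 592 text files), the fraction would be 592/812, or roughly 0.73.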
Wikification was done using TagMe (https://tagme.d4science.org/tagme/) in May 2018.
To limit the number of concepts, wikified short phrases with scores below 0.1 were discarded.
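
The score cutoff can be illustrated as a simple filter over annotation records. This is a sketch under assumptions rather than the authors' script: it supposes each annotation is a dictionary resembling the entries in TagMe's JSON output, and that the score in question is the `rho` field. The description only says "scores", so the field name is exposed as a parameter. The sample values are invented for illustration.

```python
from typing import Dict, Iterable, List

SCORE_THRESHOLD = 0.1  # cutoff stated in the description


def keep_confident(annotations: Iterable[Dict],
                   threshold: float = SCORE_THRESHOLD,
                   score_key: str = "rho") -> List[Dict]:
    """Keep annotations whose score is at least `threshold`.

    `score_key` defaults to "rho" as an assumption about which score was
    used; annotations without that field are treated as scoring 0.0.
    """
    return [a for a in annotations
            if float(a.get(score_key, 0.0)) >= threshold]


# Invented sample annotations, for illustration only.
sample = [
    {"spot": "open directory", "title": "DMOZ", "rho": 0.42},
    {"spot": "project", "title": "Project", "rho": 0.03},
]
print([a["title"] for a in keep_confident(sample)])  # prints ['DMOZ']
```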

Cite this dataset

Lorenzetti, Carlos; Maguitman, Ana; Baggio, Cecilia (2019), “DMOZ 2006 Dataset and its Wikification”, Mendeley Data, v1, http://dx.doi.org/10.17632/9mpgz8z257.1

Institutions

Universidad Nacional del Sur, Consejo Nacional de Investigaciones Científicas y Técnicas

Categories

Information Retrieval, World Wide Web, Classifier Evaluation

Licence

CC BY 4.0

The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence.

What does this mean?
You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY license, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.