DMOZ 2006 Dataset and its Wikification

Published: 2 May 2019 | Version 1 | DOI: 10.17632/9mpgz8z257.1

Description

This dataset was crawled in 2006 from the Open Directory Project (ODP) (http://dmoz.org, https://en.wikipedia.org/wiki/DMOZ), which closed in 2017 and was relaunched as Curlie (https://curlie.org/). Topics were selected from the third level of the ODP hierarchy, subject to two constraints intended to ensure the quality of the dataset: each selected topic had to contain at least 100 URLs, and the language was restricted to English. For each topic, we collected all of its URLs as well as those in its subtopics. The retrieved HTML was parsed and cleaned to remove empty files, PDF and Flash files, and other unusable content. In total, more than 350K pages were collected from 448 topics. In 2018 the data was wikified.
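The topic selection rule described above can be illustrated with a minimal sketch. It is not the original crawler code: the input format (pairs of a slash-separated topic path and a URL), the function name select_topics, and the convention that a path such as "Arts/Music/Bands" is a third-level topic are all assumptions made for illustration. The English-only restriction is not modelled here.

```python
from collections import defaultdict

MIN_URLS = 100  # minimum topic size stated in the description
LEVEL = 3       # topics are taken from the third level of the hierarchy


def select_topics(entries, min_urls=MIN_URLS, level=LEVEL):
    """Group URLs under their third-level ancestor and keep large topics.

    `entries` is a hypothetical iterable of (topic_path, url) pairs, where
    topic_path is assumed to look like "Arts/Music/Bands/Rock".
    """
    urls_by_topic = defaultdict(set)
    for path, url in entries:
        parts = path.split("/")
        if len(parts) < level:
            # Categories above the third level are not selectable topics.
            continue
        # URLs of subtopics are credited to their third-level ancestor.
        topic = "/".join(parts[:level])
        urls_by_topic[topic].add(url)
    # Keep only topics that reach the minimum size.
    return {t: u for t, u in urls_by_topic.items() if len(u) >= min_urls}
```

Using a set per topic means a URL listed under several subtopics of the same topic is counted once; whether the original selection deduplicated URLs this way is not stated.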

Steps to reproduce

Websites were retrieved using the lists of URLs available in urls.tar.gz, but not every listed page is in the dataset. A page may be missing because the website was not available at crawling time (around December 2006), the retrieved file was not valid HTML (e.g., PDF or Flash), or the parsed HTML contained no readable text, among other reasons. For example, the urls/1.txt file contains 812 URLs, whereas the texts/1/ directory contains only 592 files. Wikification was done using TagMe (https://tagme.d4science.org/tagme/) in May 2018. To limit the number of concepts, wikified short phrases with scores below 0.1 were discarded.
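As a rough way to check the gap between listed URLs and retrieved texts, the following sketch counts both for one topic. It assumes the archives have been extracted under a common root directory, that each urls/N.txt file holds one URL per line, and that each page is stored as one file in texts/N/; the function names and the "score" field of an annotation record are hypothetical, since the on-disk format of the wikified data is not described here.

```python
from pathlib import Path

SCORE_THRESHOLD = 0.1  # cut-off stated in the description


def coverage(root, topic_id):
    """Return (number of listed URLs, number of text files) for one topic.

    Assumes <root>/urls/<topic_id>.txt has one URL per line and
    <root>/texts/<topic_id>/ has one file per retrieved page.
    """
    root = Path(root)
    url_file = root / "urls" / f"{topic_id}.txt"
    text_dir = root / "texts" / str(topic_id)
    with url_file.open(encoding="utf-8", errors="replace") as fh:
        n_urls = sum(1 for line in fh if line.strip())
    if text_dir.is_dir():
        n_texts = sum(1 for p in text_dir.iterdir() if p.is_file())
    else:
        n_texts = 0
    return n_urls, n_texts


def keep_confident(annotations, threshold=SCORE_THRESHOLD):
    """Discard annotations scoring below the threshold.

    `annotations` is a hypothetical list of dicts with a "score" key.
    """
    return [a for a in annotations if a["score"] >= threshold]
```

For topic 1, the description reports 812 listed URLs and 592 text files, so a call such as coverage("dmoz", 1) on an extracted copy (with "dmoz" as a placeholder root) would be expected to show a similar gap.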

Institutions

Universidad Nacional del Sur, Consejo Nacional de Investigaciones Científicas y Técnicas

Categories

Information Retrieval, World Wide Web, Classifier Evaluation

Licence