DMOZ 2006 Dataset and its Wikification

Published: 02-05-2019 | Version 1 | DOI: 10.17632/9mpgz8z257.1
Contributors:
Carlos Lorenzetti,
Ana Maguitman,
Cecilia Baggio

Description

This dataset was retrieved with a crawler in 2006 from the Open Directory Project (ODP) (http://dmoz.org, https://en.wikipedia.org/wiki/DMOZ), which closed in 2017 and was relaunched as Curlie (https://curlie.org/). Topics were selected from the third level of the ODP hierarchy, subject to two constraints intended to ensure the quality of the dataset: each selected topic had to contain at least 100 URLs, and the language was restricted to English. For each topic, we collected all of its URLs as well as those in its subtopics. The retrieved HTML was parsed and cleaned to remove empty files, PDF and Flash files, and other unusable content. In total, more than 350K pages were collected from 448 topics. In 2018 the data was wikified.
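The topic selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the crawler that was actually used: the function name select_topics, the slash-separated path format, and the way hierarchy depth is counted are all hypothetical, and the English-language restriction is not modeled.

```python
from collections import defaultdict


def select_topics(urls_by_path, depth=3, min_urls=100):
    """Group URLs under their ancestor at `depth` and keep large topics.

    `urls_by_path` maps a slash-separated category path (an assumed
    format) to the URLs listed directly in that category.
    """
    grouped = defaultdict(set)
    for path, urls in urls_by_path.items():
        parts = path.strip("/").split("/")
        if len(parts) < depth:
            continue  # category is shallower than the selected level
        topic = "/".join(parts[:depth])
        grouped[topic].update(urls)  # subtopic URLs roll up into the topic
    # Keep only topics that reach the minimum size.
    return {t: sorted(u) for t, u in grouped.items() if len(u) >= min_urls}
```

Whether the hierarchy root counts as a level is not stated in the description, so depth is left as a parameter rather than fixed.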


Steps to reproduce

Websites were retrieved using the lists of URLs available in urls.tar.gz, but not every listed page is in the dataset. A page may be missing because the website was not available at crawling time (around December 2006), the retrieved file was not valid HTML (e.g., PDF or Flash), or the parsed HTML contained no readable text, among other reasons. For example, the urls/1.txt file contains 812 URLs, while the texts/1/ directory contains only 592 files. Wikification was done using TagMe (https://tagme.d4science.org/tagme/) in May 2018. To limit the number of concepts, wikified short phrases with scores below 0.1 were discarded.
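As a minimal sketch of how the two checks above might be reproduced, the following helpers compare a URL list with its text directory and apply the 0.1 score cutoff. The urls/N.txt and texts/N/ layout comes from the description; the one-URL-per-line format, the function names, and the annotation dictionaries with a "score" key are assumptions made for illustration and do not reflect the dataset's actual file format or TagMe's response schema.

```python
from pathlib import Path


def topic_coverage(urls_file, texts_dir):
    """Return (listed, retrieved, ratio) for one topic."""
    # Assumption: the URL list holds one URL per non-empty line.
    with open(urls_file, encoding="utf-8", errors="replace") as fh:
        listed = sum(1 for line in fh if line.strip())
    # Every regular file in the topic directory counts as one retrieved page.
    retrieved = sum(1 for p in Path(texts_dir).iterdir() if p.is_file())
    ratio = retrieved / listed if listed else 0.0
    return listed, retrieved, ratio


def filter_annotations(annotations, min_score=0.1, score_key="score"):
    """Keep annotations whose score is at least `min_score`.

    `annotations` is a list of dicts; the "score" key is hypothetical.
    Annotations without a score are treated as 0.0 and dropped.
    """
    return [a for a in annotations if a.get(score_key, 0.0) >= min_score]
```

With the figures quoted for topic 1 (812 URLs listed, 592 files present), the ratio would be about 0.73.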