SM01: Web Crawling with DLC parallel execution pattern - experiment reports

Published: 01-11-2017 | Version 1 | DOI: 10.17632/hzxkbhfw7z.1
Goran Grubić


Research project: SM01 (Parallel Semantic Crawler for manufacturing multilingual web...). In the DLC pattern, multiple pages of the same web site are loaded and processed in parallel threads. Because the DLC parallel execution pattern loads more than one target at a time from the ranked frontier URL list, it deliberately trades Frontier Ranking Algorithm accuracy for execution performance gains. The objective of this experiment is to describe the relationship between crawl quality and performance gains for different Load Take (LT) values: 1, 2, 4, 8, 12, 16, 20, 24, 28 and 30. All crawlers were run over two sample sets, Sc and Sn, with the maximum number of Page Loads (PLmax) set to 30 and the maximum number of parallel DLC threads (TCmax) set to 2. Please refer to the Crawl Report Content Guide to learn what is in the report archives.
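The trade-off described above can be illustrated with a minimal Python sketch. It is not the SM01 crawler: the function `crawl`, the toy link graph `SITE`, and the scores in `SCORE` are hypothetical, and the sketch assumes that the frontier is ranked once per iteration and that up to LT targets are then loaded without re-ranking in between. Only the names LT, PLmax and TCmax come from the description.

```python
from concurrent.futures import ThreadPoolExecutor

PL_MAX = 30  # page-load budget (PLmax in the description)
TC_MAX = 2   # allowed parallel DLC threads (TCmax in the description)


def crawl(frontier, score, load_page, load_take, pl_max=PL_MAX, tc_max=TC_MAX):
    """Return URLs in the order they were taken from the ranked frontier.

    score(url) and load_page(url) are hypothetical stand-ins for the
    frontier ranking algorithm and the page loader.
    """
    frontier = set(frontier)
    seen = set(frontier)
    loaded = []
    with ThreadPoolExecutor(max_workers=tc_max) as pool:
        while frontier and len(loaded) < pl_max:
            # Rank once, then take up to LT targets (assumption: no
            # re-ranking happens until the whole batch has been loaded).
            take = min(load_take, pl_max - len(loaded))
            batch = sorted(frontier, key=lambda u: (-score(u), u))[:take]
            frontier.difference_update(batch)
            # pool.map yields results in the same order as the batch.
            for url, links in zip(batch, pool.map(load_page, batch)):
                loaded.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.add(link)
    return loaded


# Toy data, invented for illustration only.
SITE = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
SCORE = {"a": 1, "b": 2, "c": 1, "d": 5}

for lt in (1, 2):
    print(lt, crawl({"a"}, SCORE.get, SITE.get, load_take=lt))
```

In this toy graph, LT = 1 reaches the high-scoring page `d` before `c`, while LT = 2 loads `c` first because it was taken in the same batch as `b`, before `d` was discovered. That is the kind of ranking distortion the experiment weighs against the gain from parallel loading.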