imbWBI: Classification of Business Entities on Multilingual Web - The Main Results

Published: 26 February 2018 | Version 2 | DOI: 10.17632/8x9n2mn7h4.2
Goran Grubić


In this research, we proposed and developed an open source business stakeholder classification system, capable of multi-class, single-label hard classification of business entities according to the products they manufacture. The output is a single label pointing to the particular industry of the stakeholder.

+ Summary: Spreadsheets with the most relevant findings and research sample data.
+ TF-IDF Evaluation: Contains 16 configurations in total, evaluated in a 10-fold cross-validation schema. The same 8 models were run with page sorting (by text size, descending) at the input of the content processing pipeline, and 8 without. Besides the traditional TF-IDF (2 experiments), another 6 modified versions were evaluated: without IDF, with DFC 1.1 and 2.0, and with and without HTML Tag Factors (TW).
+ Results with CSSRM: Cosine SSRM is our customized method for semantic similarity computation. The reports in this folder come from experiments performed near and at the optimum configuration of the system.
+ System evaluation: Reports and summary spreadsheets on experiments performed for the system's 10-fold cross-validation.
+ Unstable performance: Experiments with a different sample set (several sites), where the system achieved up to F1=0.893 effectiveness while being unstable because of the high number of parallel threads. The morphosyntactic resource interpreter and the content decomposition pipeline produced different results at each run. The results are discarded as not reproducible with a single run.

-------------------------------------------------------------------
Sample set contains: 5 categories, each having 10 companies (web sites).
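To make the 10-fold schema concrete for a sample of this size, the following is a minimal sketch of how 50 sites in 5 categories could be split into folds. It assumes a stratified, round-robin split (one site per category in each test fold); the description does not state how the folds were actually built, and the function name `stratified_folds` and the labels `category_0` to `category_4` are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign sample indices to k folds, round-robin within each class.

    This is an illustrative assumption, not the imbWBI implementation.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical sample set matching the description: 5 categories x 10 web sites.
labels = [f"category_{c}" for c in range(5) for _ in range(10)]
folds = stratified_folds(labels, k=10)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be trained on train_idx and scored on test_idx here.
    print(len(train_idx), "training sites,", len(test_idx), "test sites")
```

Under this assumption every fold trains on 45 sites and tests on 5, so each test fold covers all categories exactly once.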
Specific challenges addressed in this research:
- multilingual web content
- limited availability of domain-specific training data sets
- heterogeneous linguistic resources of variable quality
- absence of production-ready and publicly available general semantic lexicons, like WordNet

Problems that are addressed by this research:
- construction of a semantic cloud (a non-hierarchical lexicon of semantically related terms) from a limited amount of web content
- adaptation of the similarity computation schema, based on the Semantic Similarity Retrieval Model
- development of an efficient and effective Feature Vector Extraction mechanism, used to reduce the number of dimensions in the Feature Vector to the number of categories (5)
- evaluation of a wide range of classification algorithms and configuration parameters: kNN, Naive Bayes, Multiclass SVM and Neural Networks (17 classifier models are evaluated in every experiment)

All software tools (the application and the libraries) developed during this research are published under the GNU GPL3 licence and are thus available to other researchers and professionals.

----
Goran Grubić
Faculty of Organizational Sciences, University of Belgrade, Belgrade, Serbia, +381 62 27 27 55
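The Feature Vector Extraction step described above, which reduces each site to one dimension per category, can be illustrated with a small sketch. It assumes that every category is represented by a weighted term profile and that each feature is the plain cosine similarity between the site's term counts and that profile. Plain cosine is used here as a stand-in: the customized Cosine SSRM method and the semantic cloud are not reproduced, and all names, terms and weights below are hypothetical toy data.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_vector(site_terms, category_profiles):
    """Return one similarity score per category, in profile order."""
    site = Counter(site_terms)
    return [cosine(site, profile) for profile in category_profiles.values()]


# Hypothetical term profiles for 5 categories (toy weights, not research data).
category_profiles = {
    "category_a": {"chair": 1.0, "table": 0.8, "wood": 0.5},
    "category_b": {"steel": 1.0, "welding": 0.7, "pipe": 0.6},
    "category_c": {"fabric": 1.0, "cotton": 0.9},
    "category_d": {"plastic": 1.0, "mould": 0.6},
    "category_e": {"flour": 1.0, "bakery": 0.8},
}

# Hypothetical terms extracted from one company web site.
site_terms = ["chair", "wood", "wood", "delivery"]
vector = feature_vector(site_terms, category_profiles)
print(vector)  # five values, one per category
```

The resulting five-dimensional vector is what a downstream classifier such as kNN or an SVM would receive instead of the full term space.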


Steps to reproduce

The experiments were produced using the [imbWBI Console Tool] application, built on top of our open source Web Business Intelligence (imbWBI) library, which is part of the imbVeles Framework. To download the Windows Installer for the [imbWBI Console Tool] application, visit: To set up the environment for reproducing the results, please follow the detailed tutorial on:


Univerzitet u Beogradu Fakultet organizacionih nauka


World Wide Web, Web Mining, Manufacturing, Classification System, Serbian Language, Serbia, Classifier