imbWBI: Classification of Business Entities on Multilingual Web: configuration optimization, auxiliary experimental reports and other resources

Published: 26 February 2018 | Version 2 | DOI: 10.17632/mg98ypgc8s.2
Goran Grubić


The dataset contains auxiliary experimental reports, configuration files and other material produced during the research "Classification of Business Entities on Multilingual Web using Natural Language Processing...".

# Content of the folders:

## Configuration optimization process
Experiments performed for system optimization.

## General resources
Graphic representations of Semantic Clouds created by the system, and a collection of ACE Scripts executed during the research. Color renders are made with an older cloud construction algorithm.

## Particular aspects of the system
Results of the experiments performed to evaluate particular mechanisms of the system.

## imbWBI_ITM_ProjectFiles
Configuration files with the sample specification and other resources required for reproduction of the results.

------------------------------------------------------------------------------------------------------------------

In this research, we proposed and developed an open source business stakeholder classification system, capable of multi-class, single-label hard classification of business entities according to the products they manufacture. The sole external data source is content retrieved from the web site of the stakeholder, processed with an array of Natural Language Processing, Web Data Mining and statistical techniques. The output is a single label pointing to the particular industry of the stakeholder. The sample set contains 5 categories, each having 10 manufacturing companies (web sites).

Specific challenges addressed:
- multilingual web content
- limited availability of domain-specific training data sets
- heterogeneous linguistic resources of variable quality
- absence of production-ready and publicly available general semantic lexicons, like WordNet

Problems solved in this research:
- construction of a semantic cloud (a non-hierarchical lexicon of semantically related terms) from a limited amount of web content
- adaptation of a similarity computation schema, based on the Semantic Similarity Retrieval Model
- development of an efficient and effective Feature Vector Extraction mechanism, used to reduce the number of dimensions in the Feature Vector to the number of categories (5)
- evaluation of a wide range of classification algorithms and configuration parameters: kNN, NaiveBayes, Multiclass SVM and Neural Networks (17 classifier models are evaluated in every experiment)

All software tools (the application and the libraries) developed during this research are published under the GNU GPL3 licence, and are thus available to other researchers and professionals.

----

Goran Grubić, Faculty of Organizational Sciences, University of Belgrade, Belgrade, Serbia, +381 62 27 27 55
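
The idea of reducing the Feature Vector to one dimension per category can be illustrated with a minimal Python sketch. This is not the imbWBI implementation: it assumes each category is represented by a semantic cloud stored as a term-to-weight dictionary, it substitutes plain cosine similarity for the adapted Semantic Similarity Retrieval Model schema, and it picks the highest-scoring category instead of training one of the evaluated classifiers. The category names, terms, weights and function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical semantic clouds: one weighted term lexicon per category.
# The real clouds are built by the system from web content; these are made up.
CLOUDS = {
    "chemicals": {"solvent": 1.0, "resin": 0.8, "coating": 0.6},
    "food": {"flour": 1.0, "bakery": 0.9, "dairy": 0.7},
    "furniture": {"chair": 1.0, "table": 0.9, "upholstery": 0.7},
    "metal": {"steel": 1.0, "welding": 0.8, "pipes": 0.6},
    "textile": {"fabric": 1.0, "yarn": 0.9, "knitting": 0.7},
}


def cloud_similarity(tokens, cloud):
    """Cosine similarity between document term counts and cloud term weights."""
    tf = Counter(tokens)
    dot = sum(tf[term] * weight for term, weight in cloud.items())
    norm_doc = math.sqrt(sum(count * count for count in tf.values()))
    norm_cloud = math.sqrt(sum(weight * weight for weight in cloud.values()))
    if norm_doc == 0 or norm_cloud == 0:
        return 0.0
    return dot / (norm_doc * norm_cloud)


def extract_feature_vector(tokens, clouds):
    """Return one similarity score per category, in sorted category order."""
    return [cloud_similarity(tokens, clouds[name]) for name in sorted(clouds)]


def classify(tokens, clouds):
    """Single-label decision: the category whose cloud is most similar."""
    names = sorted(clouds)
    vector = extract_feature_vector(tokens, clouds)
    best = max(range(len(vector)), key=vector.__getitem__)
    return names[best]


if __name__ == "__main__":
    site_tokens = ["steel", "welding", "pipes", "steel", "contact"]
    print(extract_feature_vector(site_tokens, CLOUDS))
    print(classify(site_tokens, CLOUDS))
```

Whatever the vocabulary size of the web content, the vector passed to a classifier has only as many dimensions as there are categories (5 in the sample set), which is the property the Feature Vector Extraction mechanism is described as providing.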


Steps to reproduce

The experiments were produced using the [imbWBI Console Tool] application, built on top of our open source Web Business Intelligence (imbWBI) library, which is part of the imbVeles Framework.

To download the Windows Installer for the [imbWBI Console Tool] application, visit:

To set up the environment for reproduction of the results, please follow the detailed tutorial on:

Alternatives to the imbWBI Console Tool Windows Installer:
- download the imbWBI repository from GitHub and build the source code with Visual Studio 2017 (with C# Build Tools installed)
- create your own Console Application with Visual Studio and use the imbWBI NuGet package (which pulls in the complete imbVeles stack of packages: imbNLP, imbWEM, imbACE and imbSCI)


Univerzitet u Beogradu Fakultet organizacionih nauka


Data Mining, World Wide Web, Business Intelligence, Manufacturing, Classification System, Serbian Language