imbWBI: Classification of Business Entities on Multilingual Web - The Main Results

Published: 22 Feb 2018 | Version 1 | DOI: 10.17632/8x9n2mn7h4.1

Description of this data

In this research, we proposed and developed an open-source business stakeholder classification system capable of multi-class, single-label hard classification of business entities according to the products they manufacture. The output is a single label that identifies the stakeholder's industry.

# Folders

## Summary
Spreadsheets with the most relevant findings and research sample data.

## TF-IDF Evaluation
Contains 16 configurations in total, each evaluated with 10-fold cross-validation: the same 8 models were run once with page sorting (by text size, descending) at the input of the content processing pipeline and once without. Besides the traditional TF-IDF (2 experiments), 6 modified versions were evaluated: without IDF, with DFC 1.1 and 2.0, and with and without HTML Tag Factors (TW).
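To illustrate the difference between the traditional TF-IDF weighting and the "without IDF" variant mentioned above, here is a minimal Python sketch. The function name, the toy documents, and the textbook `log(N / df)` formula are assumptions made for illustration; the DFC and TW factors are not defined in this description and are not modelled, and the imbWBI implementation may differ.

```python
import math
from collections import Counter


def term_weights(docs, use_idf=True):
    """Return one {term: weight} dict per tokenized document.

    use_idf=True  -> textbook TF-IDF: relative term frequency * log(N / df)
    use_idf=False -> relative term frequency only (the "without IDF" variant)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        weights = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / df[term]) if use_idf else 1.0
            weights[term] = tf * idf
        weighted.append(weights)
    return weighted


# Hypothetical toy corpus: three tokenized pages.
docs = [
    ["steel", "pipe", "steel"],
    ["steel", "furniture"],
    ["wood", "furniture", "wood", "chair"],
]
with_idf = term_weights(docs, use_idf=True)
without_idf = term_weights(docs, use_idf=False)
```

With IDF enabled, a term that occurs in fewer documents ("pipe") is weighted up relative to a term shared by several documents ("steel"); without IDF only the in-document frequency matters.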

## The final results with CSSRM
Cosine SSRM (CSSRM) is our customized method for semantic similarity computation. The reports in this folder come from runs at and near the optimum configuration of the system. The stable F1 achieved is 0.840, for LPF=4, STX=3, DFC=2.0, RX=std, TW=std, TC=std.

## Unstable performance
Experiments with a sample set that differs in several sites, on which the system achieved effectiveness of up to F1=0.89333 (macro-averaged, 10-fold cross-validation) but was unstable because of the high number of parallel threads: the morphosyntactic resource interpreter and the content decomposition pipeline produced different results at each run. These results are discarded because they cannot be reproduced with a single run. The sample set used for the final evaluation was adjusted so that the web sites are more even in content size.
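For readers unfamiliar with the metric, the following Python sketch shows one common way to compute a macro-averaged F1: the F1 score is computed per category and the per-category scores are averaged with equal weight. The function name and the toy labels are assumptions for illustration; the description does not state which macro-averaging variant or per-fold aggregation imbWBI uses.

```python
def macro_f1(y_true, y_pred, labels):
    """Average of per-label F1 scores, each label weighted equally."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical predictions for six sites in three categories.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
score = macro_f1(y_true, y_pred, labels=["a", "b", "c"])
```

In this toy case the per-category F1 scores are 0.5, 0.8 and 2/3, so the macro average is about 0.656.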

The sample set contains 5 categories, each with 10 companies (web sites).
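With 5 categories of 10 sites each, a 10-fold split can place exactly one site per category in every test fold. The sketch below shows such a stratified split in Python; whether the original experiments stratified the folds this way is not stated here, and the site and category names are placeholders.

```python
def stratified_folds(sites_by_category, k=10):
    """Distribute the sites of each category round-robin over k folds."""
    folds = [[] for _ in range(k)]
    for category, sites in sites_by_category.items():
        for i, site in enumerate(sites):
            folds[i % k].append((site, category))
    return folds


# Placeholder sample set: 5 categories x 10 sites.
sample = {
    f"category_{c}": [f"site_{c}_{s}" for s in range(10)]
    for c in range(5)
}
folds = stratified_folds(sample, k=10)

for test_fold in folds:
    # The other nine folds (45 sites) form the training set for this fold.
    train = [item for fold in folds if fold is not test_fold for item in fold]
    assert len(test_fold) == 5 and len(train) == 45
```

Each fold then tests on 5 sites (one per category) and trains on the remaining 45.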

Specific challenges addressed in this research:

  • multilingual web content
  • limited availability of domain-specific training datasets
  • heterogeneous linguistic resources of variable quality
  • absence of production-ready, publicly available general semantic lexicons, like WordNet

Problems that are addressed by this research:

  • construction of a semantic cloud (a non-hierarchical lexicon of semantically related terms) from a limited amount of web content
  • adaptation of a similarity computation schema based on the Semantic Similarity Retrieval Model
  • development of an efficient and effective Feature Vector Extraction mechanism, used to reduce the number of dimensions in the Feature Vector to the number of categories (5)
  • evaluation of a wide range of classification algorithms and configuration parameters: kNN, Naive Bayes, multiclass SVM and neural networks (17 classifier models are evaluated in every experiment)
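One way to picture the reduction of the Feature Vector to 5 dimensions is to score a site against one weighted term set per category and use the five similarity scores as the feature vector. The Python sketch below does this with plain cosine similarity; it is not the authors' Cosine SSRM, and the category names, terms, and weights are hypothetical stand-ins for the semantic clouds.

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_vector(site_terms, category_terms):
    """One similarity score per category: as many dimensions as categories."""
    return [cosine(site_terms, terms) for terms in category_terms.values()]


# Hypothetical weighted term sets, one per category (5 categories).
category_terms = {
    "metal": {"steel": 1.0, "pipe": 0.5},
    "furniture": {"chair": 1.0, "table": 0.8},
    "food": {"bread": 1.0, "flour": 0.6},
    "textile": {"cotton": 1.0, "fabric": 0.7},
    "plastics": {"polymer": 1.0, "mould": 0.4},
}
site_terms = {"steel": 2.0, "pipe": 1.0}  # hypothetical term weights of one site

vec = feature_vector(site_terms, category_terms)
predicted = list(category_terms)[vec.index(max(vec))]
```

A downstream classifier (kNN, Naive Bayes, SVM, or a neural network) would then be trained on these 5-dimensional vectors instead of on the full term space; taking the highest score directly, as in the last line, is only the simplest possible decision rule.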

All software tools (the application and the libraries) developed during this research are published under the GNU GPLv3 licence and are thus available to other researchers and professionals.

Goran Grubić
Faculty of Organizational Sciences, University of Belgrade, Belgrade, Serbia, +381 62 27 27 55

Steps to reproduce

The experiments were produced using the [imbWBI Console Tool] application, built on top of our open-source Web Business Intelligence (imbWBI) library, which is part of the imbVeles Framework.

To download the Windows Installer for the [imbWBI Console Tool] application visit:

To set up the environment for reproducing the results, please follow the detailed tutorial on:

Cite this dataset

Grubić, Goran (2018), “imbWBI: Classification of Business Entities on Multilingual Web - The Main Results”, Mendeley Data, v1


University of Belgrade Faculty of Organizational Sciences


CC BY 4.0

The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence.

What does this mean?

You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY licence, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.