imbWBI: Classification of Business Entities on Multilingual Web: configuration optimization, auxiliary experimental reports and other resources

Published: 22 Feb 2018 | Version 1 | DOI: 10.17632/mg98ypgc8s.1

Description of this data

The dataset contains auxiliary experimental reports, configuration files and other material produced during the research "Classification of Business Entities on Multilingual Web using Natural Language Processing...".

# Content of the folders:

## Configuration optimization process
Experiments performed for system optimization

## General resources
Graphic representations of Semantic Clouds created by the system. Collection of ACE scripts executed during the research. Color renders were made with an older cloud construction algorithm.

## Particular aspects of the system
Results of the experiments performed to evaluate particular mechanisms of the system

## imbWBI_ITM_ProjectFiles
Configuration files with the sample specification and other resources required to reproduce the results

------------------------------------------------------------------------------------------------------------------

In this research, we proposed and developed an open source business stakeholder classification system, capable of multi-class, single-label hard classification of business entities according to the products they manufacture. The sole external data source is content retrieved from the stakeholder's web site, processed with an array of Natural Language Processing, Web Data Mining and statistical techniques. The output is a single label identifying the stakeholder's industry.

The sample set contains 5 categories, each with 10 manufacturing companies (web sites).

Specific challenges addressed:

  • multilingual web content
  • limited availability of domain-specific training datasets
  • heterogeneous linguistic resources of variable quality
  • absence of production-ready, publicly available general semantic lexicons, like WordNet

Problems solved in this research:

  • construction of a semantic cloud (a non-hierarchical lexicon of semantically related terms) from a limited amount of web content
  • adaptation of a similarity computation schema based on the Semantic Similarity Retrieval Model
  • development of an efficient and effective Feature Vector Extraction mechanism that reduces the number of dimensions in the Feature Vector to the number of categories (5)
  • evaluation of a wide range of classification algorithms and configuration parameters: kNN, Naive Bayes, multiclass SVM and neural networks (17 classifier models are evaluated in every experiment)
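
The idea of reducing each web site to one similarity score per category, and then assigning a single hard label, can be illustrated with a minimal Python sketch. This is not the imbWBI implementation: plain cosine similarity stands in for the adapted Semantic Similarity Retrieval Model, the clouds and term lists are hypothetical toy data with two categories instead of five, and the names `cosine`, `feature_vector` and `knn_label` are assumptions introduced only for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def feature_vector(site_terms, clouds):
    """One similarity score per category cloud, in a fixed category order."""
    tf = Counter(site_terms)  # raw term frequencies of the site content
    return [cosine(tf, clouds[c]) for c in sorted(clouds)]

def knn_label(vec, training, k=3):
    """Hard single-label decision: majority vote among the k nearest vectors."""
    nearest = sorted(training, key=lambda item: math.dist(vec, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy clouds: term -> weight (assumption, not taken from the dataset).
clouds = {
    "furniture": {"chair": 1.0, "table": 0.8, "wood": 0.5},
    "packaging": {"box": 1.0, "foil": 0.7, "carton": 0.6},
}

# Hypothetical labelled sites, each reduced to a vector of len(clouds) values.
training = [
    (feature_vector(["chair", "wood", "table"], clouds), "furniture"),
    (feature_vector(["table", "chair"], clouds), "furniture"),
    (feature_vector(["box", "carton"], clouds), "packaging"),
    (feature_vector(["foil", "box", "box"], clouds), "packaging"),
]

query = feature_vector(["wood", "chair", "box"], clouds)
print(len(query), knn_label(query, training, k=3))
```

Whatever the vocabulary size, the vector handed to the classifier has exactly as many dimensions as there are categories, which is the property the Feature Vector Extraction bullet describes. Any of the other classifier families listed above could replace the kNN vote in this sketch.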

All software tools (the application and the libraries) developed during this research are published under the GNU GPL3 licence and are thus available to other researchers and professionals.
----
Goran Grubić
Faculty of Organizational Sciences, University of Belgrade, Belgrade, Serbia
goran.grubic@koplas.co.rs, +381 62 27 27 55

Steps to reproduce

The experiments were produced using the [imbWBI Console Tool] application, built on top of our open source Web Business Intelligence (imbWBI) library, which is part of the imbVeles Framework.

To download the Windows Installer for the [imbWBI Console Tool] application visit:
http://blog.veles.rs/imbveles-open-source-libraries/imbwbi-introduction/research-console/

To set up the environment for reproducing the results, please follow the detailed tutorial at:
http://blog.veles.rs/imbveles-open-source-libraries/imbwbi-introduction/guide-for-reproducing-the-web-classification-research/

Alternatives to the imbWBI Console Tool Windows Installer:

  • download the imbWBI repository from GitHub and build the source code with Visual Studio 2017 (with the C# Build Tools installed)

  • create your own Console Application with Visual Studio and use the imbWBI NuGet package (which will pull in the complete imbVeles stack of packages: imbNLP, imbWEM, imbACE and imbSCI)

Cite this dataset

Grubić, Goran (2018), “imbWBI: Classification of Business Entities on Multilingual Web: configuration optimization, auxiliary experimental reports and other resources”, Mendeley Data, v1 http://dx.doi.org/10.17632/mg98ypgc8s.1

Institutions

University of Belgrade Faculty of Organizational Sciences

Licence

CC BY 4.0

The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence. You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY licence, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.
