SM03: Evaluation of Feature Selection and Weighting methods for topical Website Multi-class Classification

Published: 25 September 2018 | Version 1 | DOI: 10.17632/zzmp7t8msn.1
Contributor:
Goran Grubić

Description

The repository accompanies the website classification study "Evaluation of Feature Selection and Weighting methods for topical Website Multi-class Classification". The main focus of the study is a comprehensive evaluation of state-of-the-art term weighting models in the context of business website classification. The models are decomposed into their local and global components and recombined into 32 hybrid models, representing all viable variations, beyond what the original authors initially considered. The results showed that multi-class classification performance can be significantly improved if the recently proposed global weighting components, Inverse Gravity Moment and Inverse Class Space Density Frequency, are combined with less frequently addressed but highly effective local functions, such as square root Term Frequency and Glasgow. In addition, filter-model feature selection functions based on information theory are empirically evaluated together with web page selection functions for website representation construction.

The repository provides:
+ Content analysis and other statistics on the datasets used: WebKB's 7-Sector 1997 and WebKB 7-Sector 2018

Reports generated during the three stages of experiments:
+ Feature selection function evaluation
+ Evaluation of the 32 hybrid term weighting models
+ Web page selection function evaluation

Note: the content snippets were removed from the experiment reports in order to comply with the copyrights of the source websites. Hence, many folders in the reports remain empty.
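
To illustrate how a local and a global component combine into one hybrid weight, the following is a minimal sketch of square root Term Frequency paired with Inverse Gravity Moment, following the commonly published TF-IGM form. The function names, the default coefficient lam=7.0, and the input layout (one term frequency value per class) are assumptions made for this example and are not taken from imbWBI or from the experiment reports.

```python
import math


def igm(class_freqs):
    """Inverse Gravity Moment of a term.

    class_freqs holds the term's frequency in each class (assumed layout).
    Frequencies are ranked in descending order and the highest one is
    divided by the rank-weighted sum of all of them.
    """
    ranked = sorted(class_freqs, reverse=True)
    gravity = sum(f * rank for rank, f in enumerate(ranked, start=1))
    return ranked[0] / gravity if gravity > 0 else 0.0


def sqrt_tf_igm(tf, class_freqs, lam=7.0):
    """Hybrid weight: local sqrt(tf) times global (1 + lam * igm).

    lam is an adjustable coefficient; 7.0 is an assumed default.
    """
    return math.sqrt(tf) * (1.0 + lam * igm(class_freqs))


# A term concentrated in a single class gets the maximum IGM of 1.0,
# while a term spread evenly over two classes gets 1/3.
weight_focused = sqrt_tf_igm(4, [10, 0, 0])
weight_spread = sqrt_tf_igm(4, [5, 5])
```

Swapping math.sqrt for another local function, or igm for another global function, gives the other combinations of the same local-times-global shape.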

An experiment report directory normally contains the following:
+ Subdirectories for each fold of cross validation 5-fold[0-5]
+ directory_readme.txt -- description of the contained files
+ dt_test_results.xlsx -- classification results, aggregated over the k folds
+ log.txt -- log output generated by imbWBI Console Tool
+ note.txt -- notes on the experiment

In fold subdirectories:
+ Corpus -- subdirectory containing reports on the selected features and the processed corpus
+ note.txt -- description of the experiment setup
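
The layout described above can be checked with a short script before working with a downloaded report. The following is a minimal sketch: the function name summarize_report is hypothetical, and because the exact naming of fold subdirectories is not assumed here, every subdirectory of the report is treated as a fold candidate.

```python
from pathlib import Path

# File names taken from the report layout described above.
EXPECTED_FILES = [
    "directory_readme.txt",
    "dt_test_results.xlsx",
    "log.txt",
    "note.txt",
]


def summarize_report(report_dir):
    """Report which expected files are missing and what each subdirectory holds."""
    root = Path(report_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    folds = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        folds[sub.name] = {
            "has_corpus": (sub / "Corpus").is_dir(),
            "has_note": (sub / "note.txt").is_file(),
        }
    return {"missing_files": missing, "folds": folds}
```

Since content snippets were removed from the published reports, a subdirectory without the expected content does not necessarily indicate a damaged download.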

Files

Steps to reproduce

To reproduce exactly the same experiments, use imbWBI Console Tool (version 0.4+). The installation is available on the project's web site.

To reconstruct the 7Sectors2018 dataset, follow the instructions at: http://blog.veles.rs/webkb-industry-7-sectors-2018-dataset-reconstruction-tutorial/

Once you have both datasets ready and imbWBI Console Tool installed, follow the tutorial on results reproduction: http://blog.veles.rs/results-reproduction-tutorial-on-sm03-research/

Each experiment report package includes a snippet of the imbWBI Console Tool script that was actually executed (script.ace). You can use these scripts to repeat the experiments of interest.

Alternatively, you may use the provided dataset category structure to harvest and process the content with your own tools. A good starting point is to download the archives from Research datasets and find the domains?7sectors.txt and domains_7sectors_2018.txt files. These contain the lists of web domains crawled in 1997 and during this research (2018/09).
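
For those processing the content with their own tools, loading a domain list could look like the sketch below. It assumes the file is plain text with one domain per line, which is not confirmed by the description above, and load_domains is a hypothetical helper name.

```python
from pathlib import Path


def load_domains(path):
    """Read a domain list, assuming one domain per line (an assumption).

    Blank lines are skipped, entries are lowercased, and duplicates are
    dropped while the original order is kept.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    seen = set()
    domains = []
    for line in text.splitlines():
        domain = line.strip().lower()
        if domain and domain not in seen:
            seen.add(domain)
            domains.append(domain)
    return domains
```

A call such as load_domains("domains_7sectors_2018.txt") would then return the list of domains to pass to a crawler of your choice.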

Institutions

Univerzitet u Beogradu Fakultet organizacionih nauka

Categories

Natural Language Processing, Machine Learning, Content Analysis, Web Mining, Text Mining

Licence