Classified web pages by bounce rate with SEO features
Description
To build a prediction model of webpage bounce rate and investigate the relationship between SEO features and bounce rate, a set of webpages must be specified and statistical data collected about them. To obtain bounce rates, the webpages must carry a Google Analytics property ID and tracking code, and the data collector must have access to the corresponding Google Analytics account. After the preprocessing stage, the final dataset contains 824 webpages. The attached Excel sheets contain the dataset before and after preprocessing.
Files
Steps to reproduce
A total of 1366 webpages from 7 websites (in Arabic and English) were selected for data collection. The Google tool "Looker Studio" was used to collect the datasets. The 7 output datasets included the attributes: URL, bounce rate, views, average session duration, views per session, views per user, new users, sessions per user, scrolled users, engagement rate, user engagement, engaged sessions, and sessions. Data was extracted for the period Sep 2022 – Sep 2023. Any of these attributes could be used as the dependent variable in the classification process. Bounce rate was selected as the target class attribute of the dataset because it is the attribute that most directly reflects user engagement, and it can show the measurable effect of targeted improvements to a webpage for economic purposes. The ScreamingFrog tool was used to extract the SEO features of each webpage. A RapidMiner process was then used to merge the data collected from Google Looker Studio and ScreamingFrog. Finally, data preprocessing was applied in RapidMiner with three goals:
1- To filter out webpages that have fewer than 10 page views, because so few page views cannot give trustworthy bounce rate values.
2- To select the attributes needed for classification.
3- To discretize the attribute “Bounce Rate”, converting it from numerical values to user-specified classes so that it can be set as the target attribute for classification. Three classes were used: Low, Medium, and High. Low was assigned to values less than 0.4, Medium to values between 0.4 and 0.6, and High to values over 0.6 (the thresholds were selected based on expert recommendations).
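The merge and preprocessing steps above were carried out in RapidMiner. For readers who want to reproduce them elsewhere, the following is a minimal sketch in Python with pandas, not the original process. It makes several assumptions: the column names "URL", "Views", "Bounce Rate", and "Word Count" are hypothetical, the sample rows are invented for illustration only, and the boundary values 0.4 and 0.6 are assigned to the Medium class because the description does not say which class they belong to.

```python
import pandas as pd


def merge_sources(analytics: pd.DataFrame, seo: pd.DataFrame) -> pd.DataFrame:
    """Join analytics rows and SEO feature rows on the page URL."""
    # An inner join keeps only pages present in both sources (assumption).
    return analytics.merge(seo, on="URL", how="inner")


def bounce_class(rate: float) -> str:
    """Map a numeric bounce rate to Low / Medium / High."""
    if rate < 0.4:
        return "Low"
    # Assumption: exactly 0.4 and exactly 0.6 are treated as Medium.
    if rate <= 0.6:
        return "Medium"
    return "High"


def preprocess(df: pd.DataFrame, features: list, min_views: int = 10) -> pd.DataFrame:
    """Apply the three preprocessing goals: filter, select, discretize."""
    # 1- Drop pages with fewer than `min_views` page views.
    kept = df[df["Views"] >= min_views].copy()
    # 3- Discretize the bounce rate into the target class attribute.
    kept["Bounce Class"] = kept["Bounce Rate"].apply(bounce_class)
    # 2- Keep only the attributes needed for classification.
    return kept[["URL"] + features + ["Bounce Class"]].reset_index(drop=True)


# Invented sample data, for illustration only.
analytics = pd.DataFrame({
    "URL": ["/a", "/b", "/c", "/d"],
    "Views": [120, 8, 45, 300],
    "Bounce Rate": [0.35, 0.90, 0.50, 0.72],
})
seo = pd.DataFrame({
    "URL": ["/a", "/b", "/c", "/d"],
    "Word Count": [800, 150, 420, 1300],  # hypothetical SEO feature
})

result = preprocess(merge_sources(analytics, seo), features=["Word Count"])
print(result)
```

In this sketch the page "/b" is dropped because it has only 8 views, and the remaining pages receive a class label in a new "Bounce Class" column. If the boundary values should fall into a different class, only the comparisons in bounce_class need to change.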