Anonymized TPB-Based Feature Matrix from Lengthy Indonesian Facebook Posts for Knowledge-Sharing Classification
Description
This dataset accompanies the IJIES manuscript "Knowledge-Sharing Behavior on Facebook Using Theory of Planned Behavior" (Yuliazmi et al., 2026).

CONTENTS
The package contains an anonymized derived-feature dataset of 1,841 lengthy Facebook posts from 81 Indonesian Facebook users. Each post is described by 21 numeric/categorical features (S1-S21) derived from the Theory of Planned Behavior (TPB) framework and is labelled as either Knowledge-Sharing Content (KSC) or Non-Knowledge-Sharing Content (NKSC). The package also provides 80/20 train/test split indices and StratifiedGroupKFold (k=5) user-grouped fold assignments for cross-user validation. Selected hyperparameters for six classifiers (Logistic Regression, Random Forest, Gradient Boosting, SVM, Neural Network, CatBoost) and the random seed (42) are included. A README maps each manuscript table (Tables 7-14) to the files and pseudocode steps required to reproduce it.

DATA COLLECTION
- Survey (OSMS): online questionnaire distributed between January 27 and April 2, 2021, with 345 valid respondents.
- Facebook posts: retrieved between March and June 2021 from the public timelines of 184 participants who indicated consent, through the online survey, to share their Facebook activity for research. Retrieved posts span Dec 1, 2019 - Apr 8, 2021 (16-month historical window). Of the 184 consenting participants, 81 met the post-length criterion (>= 5 sentences/post), yielding 1,841 lengthy posts.

ANONYMIZATION & ETHICS
Participants provided informed consent through the online survey, including specific agreement to share their public Facebook posts for research. The dataset has been fully anonymized: raw text, usernames, profile links, URLs, and timestamps have been removed, and user IDs have been replaced with codes (vr###). Original posts cannot be reconstructed or re-identified. Anonymization is consistent with Indonesia's Personal Data Protection Law (UU PDP No. 27/2022).

HISTORICAL CONTEXT
The 2021 retrieval reflects the platform access environment of that period; current Facebook policies differ substantially.

INTENDED USE & CITATION
Released under CC BY 4.0. The dataset may be used for the analyses in the accompanying paper as well as for new investigations into knowledge-sharing behavior, social media classification, or TPB feature engineering. Please cite both the dataset and the accompanying paper.

Keywords: Theory of Planned Behavior, knowledge sharing, social media, Facebook, machine learning, Indonesian dataset, classification.

Related publication: International Journal of Intelligent Engineering and Systems, 2026 (DOI to be added upon publication).
Files
Steps to reproduce
Detailed file mapping is provided in README.md, included in the package. The full methodology is described as Pseudocodes 1-4 in the accompanying manuscript (Sections 2.2, 2.6, 2.7).

Brief reproduction summary:

1. Environment: Python 3.9+ with pandas, numpy, scikit-learn, and catboost. Set all random_state / random_seed parameters to 42 (see config/random_seed.txt).
2. Load data/anonymized_features_S1_S21.csv as the main feature matrix (1,841 rows × 24 columns: post_index, anon_user_id, S1-S21, label_KSC).
3. Reproduce within-cohort classification (Tables 7, 8, 10):
   - Apply the 80/20 split using data/train_test_split_indices.csv
   - Train each classifier with the hyperparameters from config/selected_hyperparameters.json
   - Evaluate Accuracy, Precision, Recall, F1-score, AUC-ROC
   - See Pseudocodes 1 and 2 in the manuscript
4. Reproduce cross-user validation (Table 9):
   - Use data/stratified_groupkfold_indices.csv (k=5 folds, user-grouped)
   - For each fold, train on the train indices and evaluate on the test indices
   - Compute the mean ± SD of AUC across folds
   - See Pseudocode 3 in the manuscript
5. Reproduce hypothesis-based ablation (Table 12):
   - Use Random Forest with 5-fold CV on the training set
   - Iteratively drop each hypothesis feature group (H1-H6), retrain, and recompute AUC
   - See Pseudocode 4 in the manuscript
6. Reproduce TPB component analysis (Tables 13 and 14):
   - Compute component scores: Attitude = mean(S1-S9), Subjective Norm = mean(S10-S16), PBC = mean(S17-S21)
   - Pearson correlation matrix for Table 13
   - Independent-samples t-test (KSC vs NKSC) for Table 14

For per-table file requirements and feature definitions (TPB component mapping), see README.md.
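As an illustration of the TPB component analysis (step 6), the following is a minimal Python sketch, not the authors' script. It assumes that the feature columns are literally named S1 to S21, that all features are numerically encoded, that label_KSC uses 1 for KSC and 0 for NKSC, and that the t-test is the equal-variance (Student) form; none of these are confirmed here, so check README.md before relying on them. It also needs scipy, which is not in the environment list above. The helper names (component_scores, ksc_vs_nksc_ttests, make_demo_frame) are hypothetical, and the demo uses random synthetic data rather than the released file.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Feature ranges per TPB component, as listed in step 6.
# Column names "S1".."S21" are an assumption about the CSV header.
COMPONENTS = {
    "Attitude": [f"S{i}" for i in range(1, 10)],          # S1-S9
    "Subjective Norm": [f"S{i}" for i in range(10, 17)],  # S10-S16
    "PBC": [f"S{i}" for i in range(17, 22)],              # S17-S21
}


def component_scores(features: pd.DataFrame) -> pd.DataFrame:
    """Return one column per TPB component: the row-wise mean of its features."""
    return pd.DataFrame(
        {name: features[cols].mean(axis=1) for name, cols in COMPONENTS.items()}
    )


def ksc_vs_nksc_ttests(scores: pd.DataFrame, labels: pd.Series) -> pd.DataFrame:
    """Independent-samples t-test per component (assumes 1 = KSC, 0 = NKSC)."""
    rows = {}
    for name in scores.columns:
        ksc = scores.loc[labels == 1, name]
        nksc = scores.loc[labels == 0, name]
        # scipy's default is the equal-variance (Student) t-test.
        t_stat, p_value = stats.ttest_ind(ksc, nksc)
        rows[name] = {"t": t_stat, "p": p_value}
    return pd.DataFrame.from_dict(rows, orient="index")


def make_demo_frame(n_rows: int = 200, seed: int = 42) -> pd.DataFrame:
    """Synthetic stand-in for the feature matrix (random values, not real data)."""
    rng = np.random.default_rng(seed)
    data = {f"S{i}": rng.random(n_rows) for i in range(1, 22)}
    data["label_KSC"] = rng.integers(0, 2, size=n_rows)
    return pd.DataFrame(data)


if __name__ == "__main__":
    df = make_demo_frame()
    scores = component_scores(df)
    print(scores.corr(method="pearson"))                # Table 13 analogue
    print(ksc_vs_nksc_ttests(scores, df["label_KSC"]))  # Table 14 analogue
```

To run this against the released data, make_demo_frame() would be replaced by pd.read_csv("data/anonymized_features_S1_S21.csv"), provided the header matches the assumed column names.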
Institutions
- Sepuluh Nopember Institute of Technology, East Java, Surabaya
- Universitas Budi Luhur, Jakarta, Jakarta