Pretrained Models and Reproducibility Archive for "From Characters to Syntax: Characterizing the Accuracy–Robustness Trade-off in Cross-Domain Authorship Verification"
Description
This repository contains the pretrained model weights, vectorizer pipelines, and ablation study checkpoints for the paper "From Characters to Syntax: Characterizing the Accuracy–Robustness Trade-off in Cross-Domain Authorship Verification". The models were trained and evaluated on three text domains (fanfiction, personal blogs, and corporate emails) to investigate how vulnerable stylometric systems are to semantics-preserving paraphrase attacks.

Contents of this archive:
- Robust Siamese: The best-performing cross-domain model, which uses character n-gram features (includes .pth weights, vectorizer.pkl, and scaler.pkl).
- Cross-Domain (CD) Siamese: The baseline generalist character 4-gram model trained across all domains.
- Robust DANN (Domain-Adversarial Neural Network): The multi-view syntactic feature model trained for high adversarial robustness.
- BERT Baseline: The contextual baseline model used for comparative evaluation.
- Syntactic Ablation Models: Pretrained checkpoints that isolate Part-of-Speech (POS) trigrams, function word frequencies, and readability metrics to show which stylometric features drive robustness.

Usage: These weights are intended for use with the PyTorch and Scikit-Learn inference pipelines provided in the official GitHub repository. Researchers can use this archive to reproduce the cross-domain accuracy (up to 86.2%) and attack success rate evaluations presented in the manuscript.
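As a rough illustration of how the archived files might be loaded, the sketch below reads the vectorizer.pkl and scaler.pkl files with Python's pickle module and a .pth checkpoint with torch.load. It is not the official inference pipeline, which lives in the GitHub repository. The function names, the directory layout, and the assumption that both pickled objects expose a scikit-learn style transform method are hypothetical, and whether the scaler accepts the vectorizer's output directly is also assumed. Unpickling executes arbitrary code, so only load files obtained from this archive.

```python
import pickle
from pathlib import Path


def load_preprocessors(model_dir):
    """Load the fitted vectorizer and scaler stored alongside the weights.

    Assumes (hypothetically) that both files sit directly in `model_dir`.
    """
    model_dir = Path(model_dir)
    with open(model_dir / "vectorizer.pkl", "rb") as fh:
        vectorizer = pickle.load(fh)
    with open(model_dir / "scaler.pkl", "rb") as fh:
        scaler = pickle.load(fh)
    return vectorizer, scaler


def load_weights(weights_path):
    """Load a .pth checkpoint onto the CPU (requires PyTorch).

    What the checkpoint contains (a state dict or something else) is not
    stated in the archive description, so the object is returned as-is.
    """
    import torch  # imported lazily so the helpers above work without PyTorch

    return torch.load(weights_path, map_location="cpu")


def featurize(text, vectorizer, scaler):
    """Convert one text into a scaled feature row.

    Assumes both objects follow the scikit-learn `transform` convention.
    """
    features = vectorizer.transform([text])
    return scaler.transform(features)
```

A typical call sequence would be load_preprocessors on the unpacked model folder, then featurize on each text of a pair, with the resulting rows passed to the model class defined in the official repository.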
Institutions
- Netaji Subhas University of Technology, Delhi, New Delhi