Pretrained Models and Reproducibility Archive for "From Characters to Syntax: Characterizing the Accuracy–Robustness Trade-off in Cross-Domain Authorship Verification"

Name: Pretrained Models and Reproducibility Archive for "From Characters to Syntax: Characterizing the Accuracy–Robustness Trade-off in Cross-Domain Authorship Verification"
Creator: Aarushi Sinha
Published: 2026-03-27T08:43:05.920Z
Keywords: Natural Language Processing, Machine Learning

doi:10.17632/8nbmbdtwpn.1

Published: 27 March 2026 | Version 1 | DOI: 10.17632/8nbmbdtwpn.1

Contributor:

Aarushi Sinha

Description

This repository contains the pretrained model weights, vectorizer pipelines, and ablation study checkpoints for the paper "From Characters to Syntax: Characterizing the Accuracy–Robustness Trade-off in Cross-Domain Authorship Verification". The models were trained and evaluated across three distinct text domains (fanfiction, personal blogs, and corporate emails) to investigate the vulnerability of stylometric systems to semantic-preserving paraphrase attacks.

Contents of this archive:

- Robust Siamese: The best-performing cross-domain model, using character n-gram features (includes .pth weights, vectorizer.pkl, and scaler.pkl).
- Cross-Domain (CD) Siamese: The baseline generalist character 4-gram model trained across all domains.
- Robust DANN (Domain-Adversarial Neural Network): The multi-view syntactic feature model trained for high adversarial robustness.
- BERT Baseline: The contextual baseline model used for comparative evaluation.
- Syntactic Ablation Models: Pretrained checkpoints isolating Part-of-Speech (POS) trigrams, function word frequencies, and readability metrics to show which stylometric features drive robustness.

Usage: These weights are intended to be used directly with the PyTorch and Scikit-Learn inference pipelines provided in the official GitHub repository. Researchers can use this archive to reproduce the cross-domain accuracy (up to 86.2%) and attack success rate evaluations presented in the manuscript.
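The two evaluation quantities named in the usage note, accuracy and attack success rate, can be computed from model predictions roughly as sketched below. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and the attack success rate definition used here (the fraction of pairs classified correctly before the attack whose prediction becomes wrong after paraphrasing) is an assumption, since this description does not state the exact definition used in the manuscript.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of verification pairs whose predicted label matches the gold label."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


def attack_success_rate(
    labels: Sequence[int],
    clean_preds: Sequence[int],
    attacked_preds: Sequence[int],
) -> float:
    """Assumed definition: among pairs classified correctly on the clean text,
    the fraction that are misclassified after the paraphrase attack."""
    if not (len(labels) == len(clean_preds) == len(attacked_preds)):
        raise ValueError("all inputs must have the same length")
    # Only pairs the model got right before the attack are eligible.
    eligible = [
        (y, a)
        for y, c, a in zip(labels, clean_preds, attacked_preds)
        if c == y
    ]
    if not eligible:
        return 0.0
    flipped = sum(1 for y, a in eligible if a != y)
    return flipped / len(eligible)


# Toy data (invented for illustration): 1 = same author, 0 = different author.
labels = [1, 1, 0, 0, 1]
clean_preds = [1, 1, 0, 1, 1]
attacked_preds = [0, 1, 0, 1, 0]

clean_acc = accuracy(labels, clean_preds)
asr = attack_success_rate(labels, clean_preds, attacked_preds)
print(f"clean accuracy: {clean_acc:.2f}, attack success rate: {asr:.2f}")
```

In practice the prediction lists would come from running one of the archived models on original and paraphrased text pairs through the inference pipelines in the official repository.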

Institutions

Netaji Subhas University of Technology
Delhi, New Delhi
