Query-Free OpenWebText - Part 1: Clean and Dirty Corpus
Description
This dataset is derived from the original OpenWebText corpus and is used to investigate the effect of intrinsic query language complexity by isolating pre-training exposure bias (RQ1, RQ2). This is Part 1 of the Query-Free OpenWebText dataset.

The OpenWebText Base Filtered Corpus consists of two variants of equivalent size:
- OpenWebText-clean: The corpus from which all documents containing explicit SQL, SPARQL, or Cypher syntax or keywords have been removed by a rule-based filtering protocol. It is used to train the base **unbiased T5 (uT5)** model.
- OpenWebText-dirty: The original, unfiltered version of the OpenWebText corpus. It retains natural occurrences of query language examples to represent a typical pre-training distribution, so that the effects of pre-training bias can be benchmarked.

Dataset Format: Arrow
Files
Steps to reproduce
The Clean and Dirty OpenWebText variants are generated by applying the defined query-language filtering protocol, which splits the corpus into clean (documents with no query languages) and dirty (the rest of the corpus, reduced to the same token length as clean).

Source Corpus:
- Plain text: OpenWebText dataset (Gokaslan and Cohen, 2019).

Filtering Protocol: The OpenWebText-clean variant was generated by a sequential, rule-based filtering workflow that applies case-insensitive regular expression filters (matching `SELECT...FROM`, `SELECT...WHERE`, `MATCH...RETURN`, as detailed in Table 2) and a comprehensive list of query keyword filters (e.g., "JOIN", "FILTER", "SQL").

Corpus Splitting and Assignment:
1. The raw OpenWebText dataset was split into two large, equivalent-sized partitions (clean and dirty).
2. To generate the OpenWebText-clean corpus, the full filtering protocol was applied.
3. OpenWebText-dirty was built from the filtered-out samples and reduced to the same size as OpenWebText-clean.

Both resulting corpora (Clean and Dirty) were then used to pre-train the uT5 model in github.com/vejvarm/balanced-plms.
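The filtering and size-matching steps above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: the exact regular expressions are those of Table 2 and are only approximated here, the keyword list contains only the three examples named in the text, matching keywords case-insensitively is an assumption, and the function names (`contains_query_language`, `split_corpus`, `trim_to_budget`) and whitespace token counting are hypothetical stand-ins.

```python
import re

# Approximations of the pattern families named in the text (assumption:
# the exact expressions in Table 2 may differ).
QUERY_PATTERNS = [
    re.compile(r"\bSELECT\b.*?\bFROM\b", re.IGNORECASE | re.DOTALL),
    re.compile(r"\bSELECT\b.*?\bWHERE\b", re.IGNORECASE | re.DOTALL),
    re.compile(r"\bMATCH\b.*?\bRETURN\b", re.IGNORECASE | re.DOTALL),
]

# Only the example keywords quoted in the text; the full list is not given.
# Case-insensitive keyword matching is an assumption of this sketch.
QUERY_KEYWORDS = re.compile(r"\b(?:JOIN|FILTER|SQL)\b", re.IGNORECASE)


def contains_query_language(text):
    """Return True if any pattern or keyword filter matches the document."""
    if any(pattern.search(text) for pattern in QUERY_PATTERNS):
        return True
    return QUERY_KEYWORDS.search(text) is not None


def split_corpus(docs):
    """Separate documents into (clean, flagged) lists, preserving order."""
    clean, flagged = [], []
    for doc in docs:
        (flagged if contains_query_language(doc) else clean).append(doc)
    return clean, flagged


def count_tokens(text):
    """Whitespace token count, a stand-in for the real tokenizer."""
    return len(text.split())


def trim_to_budget(docs, budget):
    """Keep documents in order and stop at the first one that would exceed the budget."""
    kept, used = [], 0
    for doc in docs:
        n = count_tokens(doc)
        if used + n > budget:
            break
        kept.append(doc)
        used += n
    return kept
```

In this sketch, the dirty variant would be obtained by calling `trim_to_budget(flagged, sum(count_tokens(d) for d in clean))`, so that both variants end up with a comparable token count. Note that case-insensitive keyword matching also flags ordinary prose such as "join the club", which makes the filter aggressive: it favors a clean corpus over retaining borderline documents.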
Institutions
- Yokohama Kokuritsu Daigaku