ASQ-PHI: An Adversarial Synthetic Benchmark for Clinical De-Identification and Search Utility
Description
Hospitals are beginning to deploy large language models (LLMs) under HIPAA Business Associate Agreements (BAAs). In the public setting, LLMs with fixed training cutoffs are routinely augmented with tools such as web search, deep research, and Model Context Protocol (MCP) servers so they can reach up-to-date knowledge. BAA LLMs, by contrast, almost never expose live web search or external tools, even though clinicians expect them to surface current guidelines, drug safety updates, and literature. The constraint is that any query leaving a BAA-protected LLM for an external service must be free of Protected Health Information (PHI).

We refer to this boundary as the safe handoff: the moment when a clinician’s PHI-containing query, generated inside a HIPAA-compliant BAA LLM, must be transformed into a HIPAA Safe Harbor–compliant version before being sent to non-BAA tools such as web search APIs, external evidence services, or MCP servers.

Existing de-identification datasets are built from long electronic health record narratives rather than the short, compressed search queries clinicians type into LLM chat interfaces, so they cannot be used to test de-identification performance at the safe handoff. ASQ-PHI (Adversarial Synthetic Queries for Protected Health Information de-identification) is constructed to supply this missing data: a benchmark of 1,051 fully synthetic clinical search queries with ground-truth PHI annotations for stress-testing HIPAA-compliant de-identification software. All queries were generated using Azure OpenAI GPT-4o. No real patient data were used.

Research hypothesis: Current de-identification systems fail at the safe handoff from LLMs running inside HIPAA BAAs to external tools in two ways: (1) they leak PHI, and (2) they over-redact non-identifying clinical information, which reduces query utility.
What the data shows: The dataset contains 832 PHI-positive queries (79.2 percent) and 219 hard negatives (20.8 percent), totaling 2,973 tagged PHI elements across 13 HIPAA Safe Harbor types. Most common: GEOGRAPHIC_LOCATION (27.8 percent), NAME (27.4 percent), DATE (27.1 percent). Hard negatives mimic PHI-containing text but contain no identifiers, enabling over-redaction measurement.

Notable findings: Baseline validation using Amazon Comprehend Medical (DetectPHI) revealed severe recall-utility tradeoffs, as shown in validation_results.

How the data was generated: Queries were generated using GPT-4o. A system prompt specified natural clinical questions containing 0 to 5 HIPAA identifiers. Output format: "===QUERY===" followed by "===PHI_TAGS===" with JSON {"identifier_type": "TYPE", "value": "..."}. Complete code in code/data_generation_pipeline.ipynb.

The dataset serves developers building HIPAA-compliant LLM infrastructure, privacy researchers benchmarking algorithms, and compliance teams evaluating vendor claims.
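The output format described above can be read with a few lines of Python. The sketch below is not taken from the project notebook: the function name parse_records and the sample text are hypothetical, and it assumes that each PHI element sits on its own line as a JSON object and that a hard negative simply has an empty ===PHI_TAGS=== block.

```python
import json

QUERY_MARKER = "===QUERY==="
TAGS_MARKER = "===PHI_TAGS==="

# Hypothetical sample in the documented layout; these are NOT records
# from the dataset, only illustrations of the delimiter structure.
SAMPLE = """===QUERY===
Apixaban dosing for John Doe seen in Galveston on 03/14/2024?
===PHI_TAGS===
{"identifier_type": "NAME", "value": "John Doe"}
{"identifier_type": "GEOGRAPHIC_LOCATION", "value": "Galveston"}
{"identifier_type": "DATE", "value": "03/14/2024"}
===QUERY===
What is first-line therapy for community-acquired pneumonia?
===PHI_TAGS===
"""


def parse_records(text):
    """Split delimiter-formatted text into dicts with 'query' and 'phi_tags'."""
    records = []
    # Anything before the first query marker is ignored.
    for chunk in text.split(QUERY_MARKER)[1:]:
        query_part, _, tags_part = chunk.partition(TAGS_MARKER)
        tags = []
        # Assumption: one JSON object per non-empty line of the tag block.
        for line in tags_part.splitlines():
            line = line.strip()
            if line:
                tags.append(json.loads(line))
        records.append({"query": query_part.strip(), "phi_tags": tags})
    return records


records = parse_records(SAMPLE)
for rec in records:
    kind = "PHI-positive" if rec["phi_tags"] else "hard negative"
    print(f"{kind}: {rec['query']} ({len(rec['phi_tags'])} tags)")
```

If the released file stores the tags differently (for example as a single JSON array), only the inner loop would need to change.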
Files
Steps to reproduce
To recreate the ASQ-PHI dataset with the original setup, you will need Python 3.12, an Azure OpenAI GPT-4o deployment, and the code folder from this record. The same pipeline can also be run with a local LLM or any other remote LLM provider.

1. Set up the environment: Install the Python dependencies from the project root using:
   pip install -r requirements.txt
2. Provide Azure credentials: In the code/ folder create a file called .env, or export the same variables in your shell. Fill in your Azure OpenAI details:
   - AZURE_OPENAI_API_KEY_4o
   - AZURE_OPENAI_ENDPOINT_4o
   - AZURE_OPENAI_DEPLOYMENT_4o
   - OUTPUT_PATH (optional, defaults to synthetic_clinical_queries.txt)
3. Open the notebook: Start Jupyter and open code/data_generation_pipeline.ipynb. The notebook reads the environment variables, builds the AzureOpenAI client, loads the PHI-focused system prompt, and defines the helper functions generate_phi_queries and validate_dataset.
4. Generate synthetic queries: Run the notebook cells from top to bottom. Then call generate_phi_queries(n=...) with your chosen number of records (for example, n=1051). The script writes to OUTPUT_PATH, appending entries in the ASQ-PHI format: a single-line query after ===QUERY=== followed by a ===PHI_TAGS=== block that contains one JSON object per PHI element.
5. Validate the dataset: After generation, call validate_dataset(OUTPUT_PATH). This function reports the number of valid queries, the proportion of PHI-positive and hard-negative queries, the total PHI elements and mean PHI per query, and how many records were malformed. The summary should be close to the values in dataset_statistics.txt.
6. Optional tailoring of outputs: To create a domain-focused variant of ASQ-PHI (for example oncology, cardiology, or pediatrics), edit the three few-shot examples inside the system-prompt cell (you can also add more), then repeat the generation and validation steps.
The result of this tailoring is a new synthetic benchmark with the same file structure and quality checks but a different clinical focus.

How to interpret and use the data: Parse synthetic_clinical_queries.txt by splitting on the ===QUERY=== and ===PHI_TAGS=== delimiters. The query block contains the text to de-identify. The PHI annotations are JSON objects that give the identifier type and value, following the HIPAA Safe Harbor categories.

To evaluate a de-identification system on the data: Run the de-identification software on each query, compare its output to the ground-truth labels, and calculate recall and leakage. For hard negatives, calculate the over-redaction rate.
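The evaluation procedure above can be sketched as follows. This is an illustrative outline, not the code behind validation_results, and the metric definitions are assumptions: recall is taken as the fraction of annotated PHI values that no longer appear verbatim in the redacted output, leakage as the fraction that still do, and over-redaction rate as the fraction of hard negatives whose text was changed at all. The names evaluate and toy_redactor, and the three sample records, are hypothetical; toy_redactor is a deliberately naive stand-in for a real de-identification tool.

```python
import re

# Hypothetical records in the parsed form {"query": ..., "phi_tags": [...]}.
RECORDS = [
    {
        "query": "Apixaban dosing for John Doe seen in Galveston on 03/14/2024?",
        "phi_tags": [
            {"identifier_type": "NAME", "value": "John Doe"},
            {"identifier_type": "GEOGRAPHIC_LOCATION", "value": "Galveston"},
            {"identifier_type": "DATE", "value": "03/14/2024"},
        ],
    },
    {"query": "How is Parkinson Disease staged with the Hoehn and Yahr scale?",
     "phi_tags": []},
    {"query": "What is first-line therapy for community-acquired pneumonia?",
     "phi_tags": []},
]


def toy_redactor(text):
    """Naive stand-in redactor: masks slash dates and capitalized word pairs."""
    text = re.sub(r"\b\d{2}/\d{2}/\d{4}\b", "[DATE]", text)
    return re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "[NAME]", text)


def evaluate(records, redact):
    """Compute recall, leakage, and over-redaction rate (assumed definitions)."""
    total_phi = leaked_phi = 0
    negatives = altered_negatives = 0
    for rec in records:
        output = redact(rec["query"])
        if rec["phi_tags"]:
            for tag in rec["phi_tags"]:
                total_phi += 1
                # A PHI value counts as leaked if it survives verbatim.
                if tag["value"].lower() in output.lower():
                    leaked_phi += 1
        else:
            negatives += 1
            # Any change to a hard negative counts as over-redaction.
            if output != rec["query"]:
                altered_negatives += 1
    return {
        "recall": (total_phi - leaked_phi) / total_phi if total_phi else None,
        "leakage": leaked_phi / total_phi if total_phi else None,
        "over_redaction_rate": altered_negatives / negatives if negatives else None,
    }


metrics = evaluate(RECORDS, toy_redactor)
print(metrics)
```

On these sample records the toy redactor misses the location and masks a disease name in a hard negative, so it shows both failure modes named in the research hypothesis. Verbatim substring matching is a simplification; a span-level comparison against the tool's reported offsets would be stricter.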
Institutions
- University of Texas Medical Branch at Galveston