Discovery of human gut phage-encoded Anti-CRISPR proteins unveils diverse mechanisms for phages to evade host immunity
Description
This collection of Supplementary Tables provides the datasets and analyses supporting the systematic identification and characterization of CRISPR-Cas systems and anti-CRISPR (Acr) proteins in the human gut microbiome. Tables S1–S7 and S10 detail the detection, classification, and phylogenetic analysis of Class 1 and Class 2 CRISPR-Cas systems, including Cas9 orthologs, within the UHGG database. Tables S9 and S11 document spacer–virus connections between UHGG and GVD, enabling the prediction of phage-encoded Acrs. Tables S12–S14 summarize Acr candidate selection, codon optimization, and library construction. Functional validation data for Acrs targeting six Type II Cas9 systems (Spy-, Sa-, St1-, St3-, Fn-, and NmCas9) are presented in Tables S16–S21, with the non-redundant Acr set provided in Table S22. Finally, structural analyses, including fold similarity and the GutAcraca family, are summarized in Tables S24–S26.
Table S1. Class 1 CRISPR-Cas systems detected in UHGG, related to Figure 1A
Table S2. Class 2 CRISPR-Cas systems detected in UHGG, related to Figure 1A, B
Table S3. Type I CRISPR-Cas systems with Cas3 detected in UHGG, related to Figure 1A
Table S4. Type III CRISPR-Cas systems with Cas10 detected in UHGG, related to Figure 1A
Table S5. Distribution of type II, V, and VI CRISPR-Cas systems from Class 2 across microbial classes, related to Figure 1B
Table S6. Cas9 CDSs detected in UHGG, related to Figure 1C. Cas9_subfamily values were obtained from UniProt according to the UniProtKB_Entry annotated by UHGG.
Table S7. Non-redundant Cas9 CDSs used to construct the phylogenetic tree, related to Figure 1C
Table S9. Connections between CRISPR spacers from UHGG and viral contigs from GVD through CRISPR-spacer blastn matches, related to Figure 2A and Figure S1
Table S10. Non-redundant Cas9 CDSs used to construct the phylogenetic tree, related to Figure S1
Table S11. Viral contigs in GVD with CRISPR spacer matches to microbial genomes in UHGG carrying Cas9, related to Figure 2A
Table S12. Acr candidates with amino acid sequences, related to Figure 2A
Table S13. Acr candidates selected for DNA sequence codon optimization, related to Figure 2B
Table S14. Oligo design of the Acr candidate library, related to Figure 2B
Table S16. Acrs of SpyCas9, related to Figure S3A, B
Table S17. Acrs of SaCas9, related to Figure S3A, B
Table S18. Acrs of St1Cas9, related to Figure S3A, B
Table S19. Acrs of St3Cas9, related to Figure S3A, B
Table S20. Acrs of FnCas9, related to Figure S3A, B
Table S21. Acrs of NmCas9, related to Figure S3A, B
Table S22. 651 non-redundant Acrs in total, related to Figure 4A, B
Table S24. Structural similarity matrix of Acrs, related to Figure 4A, B and Figure S5B
Table S25. Members of GutAcraca, related to Figure 4B
Table S26. GutAcracas structural similarity in the AlphaFold database, related to Figure 5N
Files
Steps to reproduce
Bioinformatics pipeline for Acr candidate identification
To identify Acr candidates, we linked 286,997 UHGG (Gregory et al., 2020) microbial genomes to 33,242 viral contigs from the Gut Virome Database (GVD) (Gregory et al., 2020) via CRISPR spacer matching. Using BLASTn (v2.15.0), we queried 1,846,441 UHGG CRISPR spacers (https://portal.nersc.gov/MGV/MGV_v1.0_2021_07_08/) (Nayfach et al., 2021) against GVD viral contigs under stringent parameters (word_size: 18, dust: no, qcov_hsp_perc: 95, pident > 96%, max_target_seqs: 1). This yielded 417,807 non-redundant spacer–viral contig links spanning 56,758 UHGG genomes and 10,208 viral contigs. Among the linked genomes, 19,264 UHGG genomes encoding 21,275 Cas9 proteins (per UHGG annotation) targeted 5,850 viral contigs.
We then screened these contigs for anti-CRISPR-associated (Aca) genes by querying them against a curated Aca database (Yi et al., 2020) using BLASTx (v2.15.0; e-value < 1e−2, pident ≥ 40%, alignment length ≥ 80% of subject). This identified 1,820 Aca-harboring viral contigs, including 11 with subthreshold coverage that were checked manually. Open reading frames (ORFs) within these contigs were predicted using Prodigal (v2.6.3) (Hyatt et al., 2010), yielding 119,107 ORFs in total, of which 97,336 encoded proteins of 50–350 aa. Subsequent BLASTp analysis (e-value ≤ 1e−5) against the Aca database (Yi et al., 2020) identified 6,093 Aca homologs. Acr candidates were defined as Aca homologs and their adjacent ORFs (both 50–350 aa), resulting in 13,493 proteins for downstream screening.
Acr screening and metagenomic sequencing
A randomly selected subset of 4,497 Acr candidate DNA sequences was codon-optimized for expression in E. coli. The designed oligonucleotides were synthesized on a 12,000-format chip (201–250 nt, Twist). Bacterial cells containing the Acr candidate libraries were revived from 300-μL aliquots stored at −80 °C and cultured in 15 mL of LB medium supplemented with kanamycin (Km+) and chloramphenicol (Cm+) for 12 h at 37 °C.
Two milliliters of the culture was inoculated into 8 mL of fresh 2xLB medium supplemented with 20 μg/mL chloramphenicol, 50 μg/mL kanamycin, and the inducers L-arabinose (L-ara) and rhamnose (rha), and cultured at 37 °C for 6 h. Subsequently, 2 mL of the resulting culture was diluted into 8 mL of fresh 2xLB medium containing antibiotics and inducers. This serial transfer was repeated for 6–13 cycles. At each transfer, the remaining 8 mL of culture was split into two aliquots: 3 mL was stored at −80 °C in 25% glycerol for preservation, and 5 mL was used for plasmid extraction and sequencing, as previously described, to determine the abundance of each Acr candidate at each transfer.
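As an illustration of the final step of the screening protocol above, the per-transfer abundance of each Acr candidate can be expressed as its share of sequencing reads in that transfer. The sketch below is not the authors' analysis code: the candidate names, read counts, function names, and the log2 enrichment measure with its pseudocount are all assumptions introduced for illustration.

```python
import math


def relative_abundance(read_counts):
    """Convert raw read counts per candidate into fractions of the total."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads in this transfer")
    return {name: count / total for name, count in read_counts.items()}


def log2_enrichment(early, late, pseudocount=1e-6):
    """log2 ratio of late to early relative abundance for each candidate.

    The pseudocount (an assumed value) avoids division by zero for
    candidates that are absent from one of the two transfers.
    """
    names = set(early) | set(late)
    return {
        name: math.log2(
            (late.get(name, 0.0) + pseudocount)
            / (early.get(name, 0.0) + pseudocount)
        )
        for name in names
    }


# Hypothetical read counts for three candidates at two transfers.
transfer_1 = {"acr_001": 100, "acr_002": 100, "acr_003": 200}
transfer_6 = {"acr_001": 700, "acr_002": 100, "acr_003": 200}

abundance_1 = relative_abundance(transfer_1)
abundance_6 = relative_abundance(transfer_6)
enrichment = log2_enrichment(abundance_1, abundance_6)

for name in sorted(enrichment):
    print(name, round(abundance_6[name], 3), round(enrichment[name], 2))
```

Normalizing to the read total in each transfer keeps transfers with different sequencing depths comparable. A candidate whose share rises across transfers has a positive log2 value in this sketch.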
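Two steps of the bioinformatics pipeline described earlier lend themselves to a short sketch: collapsing BLASTn hits into non-redundant spacer–viral contig links, and defining Acr candidates as Aca homologs plus their adjacent ORFs within the 50–350 aa range. The code below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes BLAST tabular output (-outfmt 6) with default columns, assumes that "adjacent" means the immediate upstream and downstream ORF in contig order, and uses hypothetical identifiers and function names throughout.

```python
import csv
import io


def spacer_contig_links(blast_tsv, min_pident=96.0):
    """Collect unique (spacer_id, contig_id) pairs above the identity cutoff.

    Assumes BLAST -outfmt 6 default columns: column 1 is the query (spacer),
    column 2 the subject (viral contig), column 3 the percent identity.
    """
    links = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        spacer_id, contig_id, pident = row[0], row[1], float(row[2])
        if pident > min_pident:  # strict inequality, as in "pident > 96%"
            links.add((spacer_id, contig_id))
    return links


def acr_candidates(orfs_by_contig, aca_homologs, min_len=50, max_len=350):
    """Return Aca homologs and their neighbouring ORFs within the length range.

    orfs_by_contig maps a contig ID to a list of (orf_id, protein_length)
    tuples in genomic order. Adjacency as "previous and next ORF" is an
    assumption of this sketch.
    """
    def in_range(length):
        return min_len <= length <= max_len

    candidates = set()
    for orfs in orfs_by_contig.values():
        for i, (orf_id, length) in enumerate(orfs):
            if orf_id not in aca_homologs or not in_range(length):
                continue
            for j in (i - 1, i, i + 1):
                if 0 <= j < len(orfs) and in_range(orfs[j][1]):
                    candidates.add(orfs[j][0])
    return candidates


# Hypothetical BLASTn hits: two hits of spacer_1 on contig_A collapse to one
# link, and the 96.000% hit is dropped by the strict cutoff.
hits = io.StringIO(
    "spacer_1\tcontig_A\t100.000\t30\t0\t0\t1\t30\t500\t529\t1e-10\t56.5\n"
    "spacer_1\tcontig_A\t100.000\t30\t0\t0\t1\t30\t900\t929\t1e-10\t56.5\n"
    "spacer_2\tcontig_B\t96.000\t25\t1\t0\t1\t25\t10\t34\t1e-6\t40.1\n"
    "spacer_3\tcontig_B\t96.875\t32\t1\t0\t1\t32\t77\t108\t1e-9\t52.0\n"
)
links = spacer_contig_links(hits)

# Hypothetical ORFs on one contig; A_2 is an Aca homolog, A_3 is too long.
orfs = {"contig_A": [("A_1", 120), ("A_2", 90), ("A_3", 400), ("A_4", 60)]}
candidates = acr_candidates(orfs, aca_homologs={"A_2"})

print(sorted(links))
print(sorted(candidates))
```

Using a set for the links makes them non-redundant by construction, so a spacer that hits the same contig at several positions is counted once.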