A high-throughput mutational scan of an acidic transcriptional activation domain
Description
Raw data for Staller et al. 2018, "A high-throughput mutational scan of an acidic transcriptional activation domain"

Max V. Staller (1,2), Alex S. Holehouse (3,4), Devjanee Swain-Lenz (1,2,5), Rahul K. Das (3,4,6), Rohit V. Pappu (3,4), and Barak A. Cohen (1,2,*)

1. Edison Family Center for Genome Sciences and Systems Biology, Washington University in St. Louis School of Medicine, Saint Louis, MO, 63110
2. Department of Genetics, Washington University in St. Louis School of Medicine, Saint Louis, MO, 63110
3. Department of Biomedical Engineering, Washington University in St. Louis, Saint Louis, MO, 63130
4. Center for Biological Systems Engineering, Washington University in St. Louis, Saint Louis, MO, 63130
5. Present address: Department of Biology, Duke University, Durham, NC, 27708
6. Present address: GNS Healthcare Inc., Cambridge, MA, 02139
* Corresponding author and lead contact: cohen@wustl.edu

Files include:
raw sequencing reads for replicate 1 (Sort 11)
raw sequencing reads for replicate 2 (Sort 12)
raw sequencing reads for replicate 1 of the amino acid starvation condition (Sort 14), used for calculating induction
raw sequencing reads for mCherry-only sorting (Sort 15A)
file of designed mutants, as DNA sequences (DNA_Seqs_GCN4_Array.txt)
raw sequencing reads for paired-end sequencing to look for mutations in the library (Sort11AD-BC R1 and R2)

Key to barcodes:

Sort_Bin   5' inline barcode   3' adaptor barcode
S11_1      GCTCGAT             IND71
S11_2      TAGACTAT            IND71
S11_3      CGCTACCCT           IND71
S11_4      ATAGTGGACA          IND71
S11_5      GCTCGAT             IND72
S11_6      TAGACTAT            IND72
S11_7      CGCTACCCT           IND72
S11_8      ATAGTGGACA          IND72
S12_1      GCTCGAT             IND69
S12_2      TAGACTAT            IND69
S12_3      CGCTACCCT           IND69
S12_4      ATAGTGGACA          IND69
S12_5      GCTCGAT             IND70
S12_6      TAGACTAT            IND70
S12_7      CGCTACCCT           IND70
S12_8      ATAGTGGACA          IND70
S14_1      GCTCGAT             IND69
S14_2      TAGACTAT            IND69
S14_3      CGCTACCCT           IND69
S14_4      ATAGTGGACA          IND69
S14_5      GCTCGAT             IND70
S14_6      TAGACTAT            IND70
S14_7      CGCTACCCT           IND70
S14_8      ATAGTGGACA          IND70
S15_1      GCTCGAT             IND69
S15_2      TAGACTAT            IND69
S15_3      CGCTACCCT           IND69
S15_4      ATAGTGGACA          IND69
S15_5      GCTCGAT             IND70
S15_6      TAGACTAT            IND70
S15_7      CGCTACCCT           IND70
S15_8      ATAGTGGACA          IND70
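
The barcode key follows a regular pattern: within each sort, the four inline barcodes repeat in the same order for each of the two adaptor indexes. A minimal Python sketch of that lookup is given here for illustration. It is not the authors' demultiplexing script; the function names `build_key` and `assign_bin` are hypothetical, and the assumption that a read begins with its 5' inline barcode is ours.

```python
# Inline barcodes and adaptor index labels are copied from the barcode key.
INLINE_BARCODES = ["GCTCGAT", "TAGACTAT", "CGCTACCCT", "ATAGTGGACA"]

# For each sort: the adaptor index used for bins 1-4, then for bins 5-8.
SORT_INDEXES = {
    "S11": ("IND71", "IND72"),
    "S12": ("IND69", "IND70"),
    "S14": ("IND69", "IND70"),
    "S15": ("IND69", "IND70"),
}


def build_key():
    """Return {(sort, adaptor_index, inline_barcode): sort_bin_name}."""
    key = {}
    for sort, indexes in SORT_INDEXES.items():
        for half, index in enumerate(indexes):
            for pos, barcode in enumerate(INLINE_BARCODES, start=1):
                key[(sort, index, barcode)] = f"{sort}_{half * 4 + pos}"
    return key


def assign_bin(read, sort, index, key):
    """Return the bin whose inline barcode starts the read, or None.

    Assumption (not from the source): the inline barcode sits at the very
    start of the read. No barcode is a prefix of another, so at most one
    barcode can match.
    """
    for barcode in INLINE_BARCODES:
        if read.startswith(barcode):
            return key.get((sort, index, barcode))
    return None
```

For example, `assign_bin("TAGACTATGGCCCGAAAA", "S14", "IND70", build_key())` resolves to `S14_6`, which agrees with the key. The read in this example is made up.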
Files
Steps to reproduce
Command line preprocessing

Unzip all the sequencing files:
    gunzip *.fastq.gz

Concatenate all files from the 4 lanes for each index, e.g.:
    cat *69* > ../IND69.fastq

Copy both Brett’s python script and Ashley’s sbatch script to the current folder.

Run Brett’s script to split samples within an index:
    python fastqconvert_ApaI.py IND69.fastq S14-1.txt S14-2.txt S14-3.txt S14-4.txt ind69unmapped
    python fastqconvert_ApaI.py IND70.fastq S14-5.txt S14-6.txt S14-7.txt S14-8.txt ind70unmapped
    python fastqconvert_ApaI.py IND71.fastq S14-9.txt S14-10.txt S14-11.txt S14-12.txt ind71unmapped
    python fastqconvert_ApaI.py IND72.fastq S14-13.txt S14-14.txt S14-15.txt S14-16.txt ind72unmapped

Grep reads with the ApaI site and the AscI site, then sort and count unique barcodes:
    sbatch grepANDsort.sh

Download the results with Cyberduck and proceed to the python SCRIPT.

grepANDsort.sh

    #!/usr/bin/env bash
    #Run command on HTCF: sbatch /home/arwolf/scripts/16S_processing.sbatch
    #Author: awolf@wustl.edu
    #SBATCH --mem=16000

    # Outputs named "<file>sorted.txt" also match S*.txt, so run this in a
    # folder that does not contain output from an earlier run.
    for i in S*.txt; do
        # Keep only the part of each read between the ApaI-side flank and the AscI site
        grep -o 'GGCCCG.*.GGCGCGCC' "$i" > "${i}grepped"
        # Count how many times each extracted sequence occurs
        sort "${i}grepped" | uniq -c > "${i}sorted.txt"
    done

    #BINNAME= $(echo $i | cut -d '.' -f 1)
    ## consider updating this script to account for BC length at this stage, something like:
    ## grep -o GGCCCG………GGCGCGCC $i > "$i"grepped
    ## not necessary because the preprocessing python script checks for correct length; the only
    ## advantages are that it would be faster at this stage, result in a smaller output file,
    ## and remove a slower check step later in the pipeline.
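
For readers without access to the cluster, the grep, sort and count step can be approximated in Python. The sketch below is an illustration under stated assumptions, not the authors' downstream script. The pattern string is copied from grepANDsort.sh; the function names and the `barcode_length` parameter are hypothetical. The source does not state the barcode length, so the fixed-length variant takes it as an argument rather than hard-coding a value.

```python
import re
from collections import Counter

# Same pattern as the grep call in grepANDsort.sh.
DEFAULT_PATTERN = re.compile(r"GGCCCG.*.GGCGCGCC")


def make_pattern(barcode_length=None):
    """Return the default pattern, or a fixed-length variant.

    barcode_length is a hypothetical parameter: the number of characters
    required between the two flanking sequences.
    """
    if barcode_length is None:
        return DEFAULT_PATTERN
    return re.compile(f"GGCCCG.{{{barcode_length}}}GGCGCGCC")


def count_barcodes(lines, pattern=DEFAULT_PATTERN):
    """Count each matched sequence, like `grep -o ... | sort | uniq -c`."""
    counts = Counter()
    for line in lines:
        for match in pattern.finditer(line.rstrip("\n")):
            counts[match.group(0)] += 1
    return counts


def format_counts(counts):
    """Return 'count sequence' lines ordered by sequence.

    uniq -c left-pads the count with spaces; this sketch does not.
    """
    return [f"{n} {seq}" for seq, n in sorted(counts.items())]
```

A usage example on one of the split files from the previous step would be `with open("S14-1.txt") as handle: counts = count_barcodes(handle)`, followed by writing `format_counts(counts)` to disk. Passing `make_pattern(n)` as the second argument applies the length filter suggested in the script's closing comments, with `n` supplied by the user.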