Barcode counts for heterozygous yeast for genes related to the chaperone system

Published: 19-01-2021| Version 2 | DOI: 10.17632/6pkcy497v7.2
Contributor:
Tatyana Tavella

Description

The Haploinsufficiency Chemical Genomic Profiling (HIP) approach is based on the premise that a yeast strain heterozygous for a given gene becomes hypersensitive to an inhibitor targeting the product of that gene. Here we treated a pool of ~6000 heterozygous yeast strains with sublethal doses of violacein or vehicle control and then compared barcode counts between the two conditions. The sequencing data generated in this work are available in the SRA database under BioProject accession number PRJNA689872. To ensure the reproducibility of our results, all the code used to process the data is available in a public repository (https://zenodo.org/record/4443837).

Steps to reproduce

Pre-processing of the barcode sequencing (barcode-seq) data started with an evaluation of read quality using FastQC (version 1.6), written by Simon Andrews at the Babraham Institute (for more information see www.bioinformatics.babraham.ac.uk/projects/fastqc), and MultiQC (version 1.6). Adapter and primer sequences were then removed with Cutadapt (version 1.16). At this step, insertions and deletions were not allowed in primer matches, and read pairs in which no primer was found were discarded (options --no-indels, --discard-untrimmed). The resulting sequences were analyzed again with FastQC and MultiQC to evaluate how efficiently the primers and adapters had been removed. No sample showed residual adapter content after trimming, and most reads were 20 base pairs (bp) long, which is the expected length of the barcode sequence.
Reads were then clustered into amplicon sequence variants (ASVs) with the DADA2 denoising algorithm (version 1.9.1). Reads with more than one expected error (maxEE = 1), a quality score below 2 (minQ = 2), or a length shorter than 16 bp or longer than 21 bp (minLen = 16, maxLen = 21) were discarded. Next, the parameters of the error models were obtained by alternating sample inference with parameter estimation until convergence was reached. The dereplicated reads and the error models from all samples were then used as input for the DADA2 function dada (options OMEGA_A = 1e-40, BAND_SIZE = 10, USE_KMERS = TRUE, VECTORIZED_ALIGNMENT = TRUE, GREEDY = FALSE, GAPLESS = FALSE). Read pairs with a minimum overlap of 10 bp and no mismatches were then merged to obtain the ASVs.
In total, DADA2 identified 8395 ASVs, of which 4112 matched exactly one of the 6337 known barcode sequences. ASVs within a Levenshtein distance of 2 of a known barcode were assigned to the most similar barcode. In this way, 5447 ASVs, corresponding to 5405 barcodes, were retained for subsequent analyses.
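The assignment of ASVs to known barcodes by Levenshtein distance can be illustrated with a short Python sketch. This is not the code used in this work (that code is in the Zenodo record cited in the Description); the function names, the example sequences, and the ORF labels are hypothetical. The handling of ties, where an ASV is equally close to barcodes of two different ORFs and is left unassigned, is an assumption, since the text above does not say how such cases were resolved.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def assign_barcode(asv: str, barcodes: Dict[str, str],
                   max_dist: int = 2) -> Optional[str]:
    """Return the ORF label of the closest known barcode within max_dist.

    Returns None when no barcode is close enough, or when barcodes of
    different ORFs tie for the best distance (assumed behaviour).
    """
    if asv in barcodes:  # exact match, no distance computation needed
        return barcodes[asv]
    best_dist = max_dist + 1
    best_orfs = set()
    for sequence, orf in barcodes.items():
        dist = levenshtein(asv, sequence)
        if dist < best_dist:
            best_dist, best_orfs = dist, {orf}
        elif dist == best_dist and dist <= max_dist:
            best_orfs.add(orf)
    if len(best_orfs) == 1:
        return next(iter(best_orfs))
    return None


if __name__ == "__main__":
    # Hypothetical 20 bp barcodes and ORF labels, for illustration only.
    known = {
        "ACGTACGTACGTACGTACGT": "ORF_A",
        "TTTTGGGGCCCCAAAATTTT": "ORF_B",
    }
    for asv in ("ACGTACGTACGTACGTACGT",   # exact match
                "ACGTACGTACCTACGTACGT",   # one substitution
                "ACGTACGTACGTACGTACG",    # one deletion
                "GGGGGGGGGGGGGGGGGGGG"):  # far from every barcode
        print(asv, assign_barcode(asv, known))
```

Comparing every ASV against every barcode is quadratic in the worst case, which is acceptable at the scale reported here (thousands of short sequences); the exact-match lookup avoids the distance computation for the ASVs that match a barcode perfectly.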
Finally, the array containing the number of reads per barcode was updated so that each barcode sequence was replaced by the code of the mutated ORF it represents. The DESeq2 package (version 1.20.0) was then used to normalize the barcode counts and to estimate the differential abundance between treated samples and their respective controls. Barcodes with low counts across the samples (fewer than 10 observations in triplicate) were filtered out. Differentially depleted barcodes were identified with a likelihood ratio test ("LRT"), based on a generalized linear model in which the number of observations of a given barcode follows a negative binomial distribution whose mean depends on the treatment. Normalization between samples was done by the library size factor method, using only ASVs with a number of observations greater than zero.
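The size factor normalization can be sketched in plain Python as a median-of-ratios calculation. This is only an illustration of the idea, not the DESeq2 implementation (DESeq2 is an R package) and not the code used in this work. It assumes that "library size factor method" refers to the median-of-ratios estimator and that "greater than zero" means a barcode must have a nonzero count in every sample to contribute; the function names and the toy counts are hypothetical.

```python
import math
from statistics import median
from typing import List


def size_factors(counts: List[List[int]]) -> List[float]:
    """Median-of-ratios size factors; counts[i][j] is barcode i in sample j."""
    n_samples = len(counts[0])
    # Assumption: only barcodes observed in every sample are used.
    kept = [row for row in counts if all(c > 0 for c in row)]
    if not kept:
        raise ValueError("no barcode has a nonzero count in every sample")
    # Log of the geometric mean of each kept barcode across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in kept]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - log_mean
                      for row, log_mean in zip(kept, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts: List[List[int]], factors: List[float]) -> List[List[float]]:
    """Divide each count by the size factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


if __name__ == "__main__":
    # Toy data: the second sample was sequenced twice as deeply as the first.
    toy_counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
    factors = size_factors(toy_counts)
    print(factors)
    print(normalize(toy_counts, factors))
```

Taking the median of the ratios rather than the total read count keeps a few very abundant or strongly depleted barcodes from dominating the scaling.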