CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object)

Published: 4 December 2018 | Version 1 | DOI: 10.17632/6wtpgr3kbj.1
Contributors:
Farah Zaib Khan, Stian Soiland-Reyes

Description

The CWL alignment workflow included in this case study was designed by Data Biosphere. It adapts the alignment pipeline originally developed at the Abecasis Lab, University of Michigan. The workflow is part of the NIH Data Commons initiative and comprises four stages. The first step, "Pre-align", accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and the human genome reference sequence as input. Using SAMtools utilities such as view, sort and fixmate, it returns a list of fastq files that serve as input for the next step. The next step, "Align", takes the human reference genome together with the output files from "Pre-align" and uses BWA-mem to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads, and SAMtools view converts the read files from SAM to BAM format. The BAM files generated by "Align" are then sorted with SAMtools sort. Finally, in the "Post-align" step, the sorted alignment files are merged into a single sorted BAM file using SAMtools merge. This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance; see https://w3id.org/cwl/prov/0.6.0 or use https://pypi.org/project/cwlprov/ to explore it.
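As a rough sketch of how the Research Object might be explored with the cwlprov tool mentioned above: the helper name explore_ro is hypothetical, and the sketch assumes the cwlprov command-line tool has been installed from PyPI and provides the --directory option and the validate and info subcommands described in its own documentation. The folder name in the usage comment is taken from the zip command in the reproduction steps.

```shell
# Hypothetical helper: validate a CWLProv Research Object folder, then print
# a summary of it. Assumes the "cwlprov" CLI from PyPI is on the PATH.
explore_ro() {
    ro_dir="$1"
    if ! command -v cwlprov >/dev/null 2>&1; then
        echo "cwlprov not found; try: pip install cwlprov" >&2
        return 127
    fi
    # Check the folder first; only print the summary if validation succeeds.
    cwlprov --directory "$ro_dir" validate &&
        cwlprov --directory "$ro_dir" info
}

# Usage, after unpacking the archive from this dataset:
#   explore_ro alignment_0.6.0_linux
```

Running validation before anything else is a deliberate choice here: it stops early if the unpacked folder is incomplete or has been modified.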

Files

Steps to reproduce

This analysis was run on a 16-core Linux cloud instance with 64GB RAM and Docker pre-installed.

1. Install gsutil:

export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install google-cloud-sdk

2. Get the data and make the analysis environment ready:

git clone https://github.com/FarahZKhan/topmed-workflows.git
cd topmed-workflows
git checkout cwlprov_testing
cd aligner/sbg-alignment-cwl
# This custom script downloads the Google bucket files listed in a JSON file and creates a local JSON file.
# It needs gsutil to be installed.
git clone https://github.com/DailyDreaming/fetch_gs_frm_json.git
python2.7 fetch_gs_frm_json/dl_gsfiles_frm_json.py topmed-alignment.sample.json
# Wait... this should download ~18Gb.

3. Run the following commands to create the CWLProv Research Object:

time cwltool --no-match-user --provenance alignmnentwf0.6.0 --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-alignment.cwl topmed-alignment.sample.json.new
zip -r alignment_0.6.0_linux.zip alignment_0.6.0_linux
sha256sum alignment_0.6.0_linux.zip > alignment_0.6.0_linux.zip.sha256
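The checksum file written in the last step can be used later to confirm that a copy of the archive is intact. The following is a minimal sketch, not part of the original procedure: the helper name verify_archive is hypothetical, and it assumes GNU sha256sum and that the .sha256 file sits next to the archive, as it does when created by the command above.

```shell
# Hypothetical helper: check an archive against its ".sha256" sidecar file.
# The body runs in a subshell, so the directory change does not leak out.
verify_archive() (
    archive="$1"
    # The sidecar records a relative file name, so check from that directory.
    cd "$(dirname "$archive")" || exit 1
    # sha256sum -c re-hashes each listed file and fails on any mismatch.
    sha256sum -c "$(basename "$archive").sha256"
)

# Verify the archive from the steps above, if it is present here.
if [ -f alignment_0.6.0_linux.zip ]; then
    verify_archive alignment_0.6.0_linux.zip
fi
```

A non-zero exit status means the archive differs from the one that was hashed, or that the checksum file could not be read.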

Institutions

The University of Manchester School of Computer Science, The University of Melbourne Department of Computing and Information Systems

Categories

Bioinformatics, Workflow Management, Provenance

Licence