Potential bulged G-quadruplex forming sequences (pG4-BS) in the human genome
Description
This dataset contains the genomic coordinates of potential bulged G-quadruplex forming sequences (pG4-BS) in the human genome (Assembly: GRCh38). G-quadruplexes (G4s) are non-canonical DNA structures that are commonly found in single-stranded DNA (e.g., during transcription and replication). "Canonical" G4-finding algorithms have existed for a number of years. They use a generalized sequence model (G3+N1-7G3+N1-7G3+N1-7G3+, where "G3+" denotes a cluster of at least three contiguous guanines and "N1-7" denotes any combination of one to seven nucleotides) to capture potential G4-forming DNA regions. Bulged G4s are a novel subset of G4-like structures that incorporate one or more non-guanine nucleotides into one or more of their guanine clusters. Because the DNA sequences underlying these structures are highly diverse, no genome-wide maps of pG4-BS are available for any organism. Here, we provide a computationally derived dataset containing the genomic coordinates of pG4-BS in the human genome.
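For illustration, the canonical sequence model described above can be written as a regular expression. The following is a minimal sketch and is not the code used to produce this dataset: it covers only the canonical model (no bulges), and the names CANONICAL_PLUS, CANONICAL_MINUS and find_canonical_g4 are hypothetical. It assumes that a G4 on the minus strand appears as the equivalent cytosine-rich motif on the plus strand, and it reports non-overlapping matches only.

```python
import re

# Canonical model: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+
# (hypothetical names; canonical G4s only, bulges are NOT modelled here)
CANONICAL_PLUS = re.compile(r"G{3,}(?:[ACGTN]{1,7}?G{3,}){3}", re.IGNORECASE)
# A minus-strand G4 shows up as C-clusters when reading the plus strand.
CANONICAL_MINUS = re.compile(r"C{3,}(?:[ACGTN]{1,7}?C{3,}){3}", re.IGNORECASE)


def find_canonical_g4(seq):
    """Return (start, end, strand, sequence) tuples for canonical G4 motifs.

    Coordinates are 0-based, half-open, and refer to the plus strand.
    Matches on a given strand do not overlap each other.
    """
    hits = []
    for strand, pattern in (("+", CANONICAL_PLUS), ("-", CANONICAL_MINUS)):
        for m in pattern.finditer(seq):
            hits.append((m.start(), m.end(), strand, m.group()))
    return sorted(hits)


if __name__ == "__main__":
    for hit in find_canonical_g4("TTGGGAGGGTGGGAGGGTT"):
        print(hit)
```

A sequence in which one guanine cluster is interrupted by a non-guanine nucleotide (for example GGAG in place of GGG) is not matched by this pattern, which is why bulged G4s require a dedicated search.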
Steps to reproduce
The dataset can be reproduced by running the script titled "g4_bulge_finder.v4.py" (available at https://github.com/pappc/pG4-BS_2021) on the GRCh38 human genome assembly (GenBank Assembly Accession: GCA_000001405.15; RefSeq Assembly Accession: GCF_000001405.26). The script requires Python 2.7. The command used was:

    python g4_bulge_finder.v4.py -g 2 -b 2 hg38.fa > g4bs.Hg38.raw.txt

The options "-g 2" and "-b 2" set, respectively, the number of uninterrupted guanine clusters and the number of bulges in the captured sequences.