Potential bulged G-quadruplex forming sequences (pG4-BS) in the human genome

Published: 27 February 2023 | Version 1 | DOI: 10.17632/w37rx9hpb7.1
Csaba Papp,


This dataset contains the genomic coordinates of potential bulged G-quadruplex forming sequences (pG4-BS) in the human genome (Assembly: GRCh38). G-quadruplexes (G4s) are non-canonical DNA structures that are commonly found in single-stranded DNA (e.g., during transcription and replication). "Canonical" G4 finding algorithms have existed for a number of years. They use a generalized sequence model (G3+N1-7G3+N1-7G3+N1-7G3+, where "G3+" denotes a cluster of at least 3 consecutive guanines and "N1-7" denotes any combination of one to seven nucleotides) to capture potential G4 forming DNA regions. Bulged G4s represent a novel subset of G4-like structures that incorporate one or more non-guanine nucleotides into one or more of their guanine clusters. Because the DNA sequences underlying these structures are highly diverse, no genome-wide maps of pG4-BS are available for any organism. Here, we provide a computationally derived dataset containing the genomic coordinates of pG4-BS in the human genome.
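
The canonical sequence model described above maps naturally onto a regular expression. The following is a minimal Python sketch of that model only; it is not taken from the dataset's script, the names CANONICAL_G4 and find_canonical_g4 are hypothetical, and it scans a single strand (a C-rich variant or a reverse-complemented sequence would be needed for the opposite strand).

```python
import re

# Canonical model: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+
# "N" is any nucleotide, so loops may themselves contain guanines.
CANONICAL_G4 = re.compile(r"G{3,}(?:[ACGTN]{1,7}G{3,}){3}", re.IGNORECASE)


def find_canonical_g4(seq):
    """Return (start, end, sequence) for non-overlapping matches.

    Coordinates are 0-based and half-open, relative to the given strand.
    """
    return [(m.start(), m.end(), m.group(0)) for m in CANONICAL_G4.finditer(seq)]


if __name__ == "__main__":
    for hit in find_canonical_g4("TTGGGAGGGAGGGAGGGTT"):
        print(hit)
```

Because finditer reports non-overlapping matches, nested or overlapping candidates within a long G-rich region are collapsed; a genome-wide scan would have to decide how to treat those cases.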


Steps to reproduce

The dataset can be reproduced by executing the script titled "g4_bulge_finder.v4.py" (available at https://github.com/pappc/pG4-BS_2021) on the GRCh38 human genome assembly (GenBank Assembly Accession: GCA_000001405.15; RefSeq Assembly Accession: GCF_000001405.26). The script requires Python 2.7 to run. The options used were:

python g4_bulge_finder.v4.py -g 2 -b 2 hg38.fa > g4bs.Hg38.raw.txt

The options "-g 2" and "-b 2" set the number of uninterrupted guanine clusters and the number of bulges, respectively, in the captured sequences.
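
To make the idea of combining uninterrupted and bulged guanine clusters concrete, here is a hedged Python sketch that enumerates tract layouts and scans with one regular expression per layout. It is an illustration only and does not reproduce the algorithm in g4_bulge_finder.v4.py, whose internals are not described here. The parameters min_intact and max_bulged are hypothetical and only loosely modeled on the description of "-g" and "-b". The definition of a bulged cluster (a single non-guanine nucleotide inside a run with at least three guanines) and the reuse of the canonical 1-7 nucleotide loops are assumptions.

```python
import re
from itertools import combinations

LOOP = "[ACGT]{1,7}"   # assumption: same loop length as the canonical model
INTACT = "G{3,}"       # uninterrupted guanine cluster
# Assumption: exactly one non-G nucleotide inside a run with >= 3 guanines.
BULGED = "(?:G[ACT]GG+|GG+[ACT]G+)"


def bulged_g4_patterns(min_intact=2, max_bulged=2):
    """Build one regex per choice of which of the four clusters are bulged."""
    patterns = {}
    for n_bulged in range(1, max_bulged + 1):
        if 4 - n_bulged < min_intact:
            break  # too few uninterrupted clusters would remain
        for positions in combinations(range(4), n_bulged):
            tracts = [BULGED if i in positions else INTACT for i in range(4)]
            patterns[positions] = re.compile(LOOP.join(tracts))
    return patterns


def find_bulged_g4(seq, min_intact=2, max_bulged=2):
    """Return sorted (start, end, bulged_positions) hits, 0-based half-open."""
    seq = seq.upper()
    hits = []
    for positions, pattern in bulged_g4_patterns(min_intact, max_bulged).items():
        for m in pattern.finditer(seq):
            hits.append((m.start(), m.end(), positions))
    return sorted(hits)


if __name__ == "__main__":
    # Fourth cluster "GTGG" carries a one-nucleotide bulge.
    print(find_bulged_g4("GGGAGGGAGGGAGTGG"))
```

Different layouts can report overlapping hits for the same region, so the output of a sketch like this would need merging or de-duplication before being compared with any real coordinate set.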


State University of New York Upstate Medical University


Genomics, Bioinformatics, Computational Biology