A set of 48 Target-capture probes for Silene

Published: 28 June 2024 | Version 1 | DOI: 10.17632/xjt4hwd66c.1
Contributor:
Patrik Cangren

Description

A set of 48 target-capture probes for Silene and related genera. The probes were annotated using BLASTx for gene identification, and the identified proteins were aligned against the probes using GeneWise to locate exon/intron boundaries.

Files

Steps to reproduce

Initial probe design

An initial set of 142 markers was created by collecting 106 putatively single-copy genes from the transcriptomes published in "Cangren, Patrik (2024), “40 Transcriptomes from Sileneae”, Mendeley Data, V2, doi: 10.17632/vykf3g4z5g.2". To this set, 36 sequences containing both exons and introns were added. Eight of these had previously been used for Sileneae phylogenetics; the remaining 28 were found on GenBank and determined to be putatively single-copy by BLAST searches against the set of 40 Sileneae transcriptomes. This probe set was used to sequence 96 samples from five Silene species using Illumina MiSeq.

To improve the probe set, we assembled new sequences from the sequenced data. For exon-only markers, new reference sequences containing both exons and introns were inferred from contigs assembled using CLC-assembler. Assembled contigs were BLAST searched against the probe set to retrieve homologues, and matching contigs were aligned using MAFFT. All alignments were inspected manually before merging contigs into final consensus sequences. If not all exons could be connected by introns, the longest continuous sequence was selected and unconnected exons were removed. Through this process the exon-only markers were complemented with introns to create full-length sequences.

To further improve and extend the probes, an iterative mapping approach was applied using the software CLC Mapper. The consensus sequences generated by mapping reads against the probe references were used as references for the next round of mapping, and this was repeated until the number of mapped reads stabilized. The resulting sequences were used to create an improved set of 142 markers based on complete nuclear sequences. This set was used to sequence additional Silene species to test the performance of the markers.

Marker selection

The 142 markers were filtered based on sequencing success and absence of mapping issues. Alignments were manually inspected to check for the presence of SNPs and the absence of large indels or signs of paralogy. Gene trees were estimated from phased sequences of 26 Silene species using the software PAUP. The gene trees were manually inspected to identify potentially useful candidate genes and to reveal issues such as poor resolution, gene transfer or paralogy. The results from the manual inspection of gene trees and alignments, together with the sequencing and mapping results, were used to select 48 probes.

Annotation

All markers were annotated by searching the NCBI nr database using BLASTx. We saved the top 10 hits for each query and selected the highest-ranked hit that was not tagged as hypothetical/predicted/uncharacterized/unknown/unnamed. If there were no hits without the specified tags, the highest-scoring hit was selected. To identify exon/intron boundaries, the amino acid sequence of each selected hit was downloaded from GenBank and aligned against the probe using GeneWise. In cases where the top hits matched partial sequences, the best hit with the full protein sequence was used.
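
The hit-selection rule described under Annotation can be sketched in a few lines of Python. This is a minimal illustration, not the script used to produce the dataset. The `Hit` record, the function name `select_annotation_hit`, ranking by bit score, and case-insensitive substring matching of the tags are all assumptions made for the example, and the accessions in the usage lines are placeholders. The additional rule about preferring a full-length protein over partial sequences is not covered.

```python
from typing import NamedTuple, Optional, Sequence

# Tags named in the description; matching them as case-insensitive
# substrings of the hit description is an assumption of this sketch.
UNINFORMATIVE_TAGS = ("hypothetical", "predicted", "uncharacterized",
                      "unknown", "unnamed")


class Hit(NamedTuple):
    """Hypothetical container for one BLASTx hit."""
    accession: str
    description: str
    bitscore: float


def select_annotation_hit(hits: Sequence[Hit], max_hits: int = 10) -> Optional[Hit]:
    """Pick the hit used for annotation from the saved top hits.

    Hits are ranked by bit score (assumed ranking criterion) and only the
    best `max_hits` are considered. The best-ranked hit without an
    uninformative tag is returned; if every hit carries such a tag, the
    highest-scoring hit is returned instead.
    """
    ranked = sorted(hits, key=lambda h: h.bitscore, reverse=True)[:max_hits]
    if not ranked:
        return None
    for hit in ranked:
        description = hit.description.lower()
        if not any(tag in description for tag in UNINFORMATIVE_TAGS):
            return hit
    return ranked[0]


# Usage with made-up placeholder records (not real database entries).
example_hits = [
    Hit("PLACEHOLDER_1", "hypothetical protein", 300.0),
    Hit("PLACEHOLDER_2", "PREDICTED: uncharacterized protein LOC1", 280.0),
    Hit("PLACEHOLDER_3", "chalcone synthase", 250.0),
]
chosen = select_annotation_hit(example_hits)
print(chosen.accession if chosen else "no hits")
```

In this example the two higher-scoring hits carry uninformative tags, so the third hit is chosen. If all hits were tagged, the function would fall back to the highest-scoring one.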

Institutions

Goteborgs universitet Institutionen for Biologi och Miljovetenskap

Categories

Molecular Biology, Genomics, DNA, Nucleotide, Phylogenetics, DNA Isolation

Licence