The published scripts are part of a pipeline that was developed at the IPK in the course of a master's thesis. The pipeline performs gene
prediction on genomic data using additional evidence from RNA-seq data. The process is divided into four main steps that integrate several external tools: 1) RNA-seq scaffolding
(L_RNA_Scaffolder), 2) repeat masking (Kmasker), 3) RNA-seq alignment (STAR or HISAT2) and 4) gene prediction (GeMoMa). The pipeline was developed to run on the computing infrastructure of the IPK.
The file contains the nucleotide sequences of the set of synthetic DNA constructs used for calibrating the siFi software.
siFi (siRNA Finder) is software for optimizing the design of long double-stranded RNAi targets and for predicting RNAi off-targets. It provides an intuitive
graphical user interface, runs in a Microsoft Windows environment and can use custom sequence databases in standard FASTA format.
The query suggestion API enables real-time, semantic query suggestion for keyword-based query systems. It has been implemented as a RESTful service and can
easily be integrated into third-party applications. In contrast to popular search frameworks such as Apache Lucene, the suggested queries take into account biological
background knowledge that was extracted from biomedical literature. The service was successfully applied in the LAILAPS plant science search engine. In particular, we were
able to discover inferred associations between traits and genes and use them to automatically reformulate search phrases. The training dataset consists of 13,930,050 documents
from PubMed articles, and gene and protein function descriptions from UniProt, Plant Ontology and
Tool for the removal of paired-end or mate-pair reads from a set of two Illumina FASTQ read files based on read length. Users can specify the minimum length
in base pairs that both reads of a pair must have for the read pair to be kept; otherwise, the read pair is removed from the output.
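
The filtering rule described above can be illustrated with a minimal Python sketch. It is not the published tool; the function names `read_fastq` and `filter_pairs_by_length` are hypothetical, and the sketch assumes uncompressed four-line FASTQ records with both files listing the pairs in the same order.

```python
import io
from typing import Iterator, TextIO, Tuple

# One FASTQ record: header, sequence, separator line, quality string.
Record = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield four-line FASTQ records (assumes no wrapped sequence lines)."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def filter_pairs_by_length(r1: TextIO, r2: TextIO,
                           out1: TextIO, out2: TextIO,
                           min_length: int) -> Tuple[int, int]:
    """Keep a pair only if BOTH reads have at least min_length bases.

    Returns the number of kept and removed pairs.
    Assumption: r1 and r2 contain the same pairs in the same order.
    """
    kept = removed = 0
    for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
        if len(rec1[1]) >= min_length and len(rec2[1]) >= min_length:
            out1.write("\n".join(rec1) + "\n")
            out2.write("\n".join(rec2) + "\n")
            kept += 1
        else:
            removed += 1
    return kept, removed
```

With real data, the four handles would typically be opened with `open(...)` on the two input and two output FASTQ files; in-memory `io.StringIO` objects work equally well for testing.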
This folder contains the Unix shell (ZSH) and R source code that was used for reading in datasets providing positional information,
constructing a BAC overlap graph and a HiC map, and combining this information to derive pseudomolecule sequences.
See comments in the source code files for further explanations. Helper functions are provided in the subfolder "functions".
Input data sets and processed data are found in the subfolders "data" and "processed_data", respectively.
Software package for assembling paired-end BAC read data and scaffolding the generated paired-end assemblies with mate-pair read data.
The scripts (FASTQ_splitter.pl and VCF_SNP_matrix_construction.pl) were developed to assist data processing in the PreBreed Yield project.
FASTQ_splitter.pl: This Perl script synchronizes read pairs after quality trimming. Synchronization is required when the left or the right partner of a paired-end
or mate-pair read was removed completely during trimming. Furthermore, it can be applied to split merged FASTQ files containing both the first (R1) and the second (R2) read of each pair. VCF_SNP_matrix_construction.pl: This
Perl script reads multiple VCF files and constructs a joint SNP matrix. It requires a list of VCF files, a PLIST file with the selected variant positions and an MPILEUP file with
coverage information for each read alignment used.
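
To make the idea of read-pair synchronization concrete, here is a minimal Python sketch. It does not reproduce FASTQ_splitter.pl; the names `read_fastq`, `pair_id` and `synchronize` are hypothetical. The sketch assumes four-line FASTQ records, read names that are identical between R1 and R2 apart from an optional `/1` or `/2` suffix, and an R2 file small enough to be held in memory.

```python
import io
from typing import Dict, Iterator, List, TextIO, Tuple

# One FASTQ record: header, sequence, separator line, quality string.
Record = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield four-line FASTQ records (assumes no wrapped sequence lines)."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               plus.rstrip("\n"), qual.rstrip("\n"))


def pair_id(header: str) -> str:
    """Derive a pair identifier from a FASTQ header line.

    Assumption: the name is the first whitespace-separated token after '@',
    optionally followed by a '/1' or '/2' mate suffix.
    """
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def synchronize(r1: TextIO, r2: TextIO
                ) -> Tuple[List[Record], List[Record], List[Record]]:
    """Return (paired R1 reads, paired R2 reads, orphaned reads).

    Paired reads are reported in R1 order; reads whose mate was removed
    by trimming end up in the orphan list.
    """
    r2_by_id: Dict[str, Record] = {pair_id(rec[0]): rec
                                   for rec in read_fastq(r2)}
    paired1: List[Record] = []
    paired2: List[Record] = []
    orphans: List[Record] = []
    for rec in read_fastq(r1):
        mate = r2_by_id.pop(pair_id(rec[0]), None)
        if mate is None:
            orphans.append(rec)       # R1 read without a surviving mate
        else:
            paired1.append(rec)
            paired2.append(mate)
    orphans.extend(r2_by_id.values())  # R2 reads without a surviving mate
    return paired1, paired2, orphans
```

Holding one file in a dictionary keeps the sketch short; for large sequencing runs a streaming approach that exploits the shared read order would need far less memory.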
Hadoop pipeline source code for a semantics-based recommendation system in the life sciences
Abstract: Knowledge found in biomedical databases, in particular in Web information systems, is a major bioinformatics resource. In general, this biological
knowledge is represented worldwide in a network of databases. These data are spread among thousands of databases, which overlap in content but differ substantially with respect to content detail,
interfaces, formats and data structures. To support the functional annotation of lab data, such as protein sequences, metabolites or DNA sequences, as well as semi-automated data exploration in
information retrieval environments, an integrated view of the databases is essential. Search engines have the potential to assist in data retrieval from these structured sources, but fall short of
providing a comprehensive knowledge excerpt out of the interlinked databases. A prerequisite for supporting the concept of an integrated data view is to acquire insights into cross-references
among database entities. However, only a fraction of all possible cross-references are explicitly tagged in the particular biomedical information systems. In this work, we investigate to what extent
an automated construction of an integrated data network is possible. We propose a method that predicts and extracts cross-references from multiple life science databases and their possible
referenced data targets. We study the retrieval quality of our method and the relationship between manually crafted relevance ranking and relevance ranking based on cross-references, and report on
first, promising results.