11 results
The published scripts are part of a pipeline that was developed at the IPK as part of a master's thesis. The pipeline performs gene prediction on genomic data using additional information from RNA-seq data. The process is divided into four main steps that integrate several external tools: 1) RNA-seq scaffolding (L_RNA_Scaffolder), 2) repeat masking (Kmasker), 3) RNA-seq alignment (STAR or HISAT2) and 4) gene prediction (GeMoMa). The pipeline was developed to run on the computing infrastructure of the IPK.
Data Types:
  • Software/Code
The file contains the nucleotide sequences of the set of synthetic DNA constructs used for calibrating the siFi software.
Data Types:
  • Software/Code
siFi (siRNA Finder) is a tool for optimizing the design of long double-stranded RNAi targets and for predicting RNAi off-targets. It provides an intuitive graphical user interface, runs under Microsoft Windows and can use custom sequence databases in standard FASTA format.
Data Types:
  • Software/Code
The query suggestion API provides real-time, semantic query suggestions for keyword-based query systems. It is implemented as a RESTful service and can easily be integrated into third-party applications. In contrast to popular search frameworks such as Apache Lucene, the suggested queries take into account biological background knowledge extracted from the biomedical literature. The service was successfully applied in the LAILAPS plant science search engine. In particular, we were able to discover inferred associations between traits and genes and use them to automatically reformulate search phrases. The training dataset consists of 13,930,050 documents from PubMed articles, and gene and protein function descriptions from UniProt, Plant Ontology and Gene Ontology.
Data Types:
  • Software/Code
Tool for removing paired-end or mate-pair reads from a pair of Illumina FASTQ read files based on read length. Users specify the minimum length, in base pairs, that both reads of a pair must have for the pair to be kept; otherwise, the read pair is removed from the output.
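The length filter described above can be illustrated with a minimal Python sketch. This is not the tool's own code: the function names `read_fastq` and `filter_pairs_by_length` are hypothetical, and the sketch assumes four-line FASTQ records and that both files list the read pairs in the same order.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) tuples from a FASTQ stream.

    Assumes the common four-lines-per-record layout (no wrapped sequences).
    """
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def filter_pairs_by_length(r1_handle, r2_handle, min_length):
    """Yield (read1, read2) pairs in which both reads have at least min_length bases."""
    # zip() walks both files in lockstep, so the inputs must list pairs in the same order.
    for read1, read2 in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        if len(read1[1]) >= min_length and len(read2[1]) >= min_length:
            yield read1, read2


# Tiny in-memory example: pair p2 is dropped because its first read has only 3 bases.
r1 = io.StringIO("@p1/1\nACGTACGT\n+\nIIIIIIII\n@p2/1\nACG\n+\nIII\n")
r2 = io.StringIO("@p1/2\nTTTTGGGG\n+\nIIIIIIII\n@p2/2\nACGTACGT\n+\nIIIIIIII\n")
kept = list(filter_pairs_by_length(r1, r2, min_length=5))
print([read1[0] for read1, _ in kept])
```

Because the filter is a generator, it never holds more than one pair in memory, which matters for large read files; real input would be opened with `open()` (or `gzip.open()` in text mode) instead of `io.StringIO`.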
Data Types:
  • Software/Code
This folder contains the Unix shell (ZSH) and R source code that was used for reading in datasets providing positional information, constructing a BAC overlap graph and a Hi-C map, and combining this information to derive pseudomolecule sequences. See the comments in the source code files for further explanations. Helper functions are provided in the subfolder "functions". Input datasets and processed data are found in the subfolders "data" and "processed_data", respectively.
Data Types:
  • Software/Code
Software package for assembling paired-end BAC read data and scaffolding the generated paired-end assemblies with mate-pair read data.
Data Types:
  • Software/Code
The scripts FASTQ_splitter.pl and VCF_SNP_matrix_construction.pl were developed to assist data processing in the PreBreed Yield project. FASTQ_splitter.pl: This Perl script synchronizes read pairs after quality trimming. Synchronization is required when the left or the right partner of a paired-end or mate-pair read was removed completely during trimming. Furthermore, the script can be used to split merged FASTQ files that contain both the first (R1) and the second (R2) reads of each pair. VCF_SNP_matrix_construction.pl: This Perl script reads multiple VCF files and constructs a joint SNP matrix. It requires a list of VCF files, a PLIST file with the selected variant positions and an MPILEUP file with coverage information for each read alignment used.
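The synchronization step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of FASTQ_splitter.pl itself: the helper names `pair_key` and `synchronize` are hypothetical, reads are assumed to be already parsed into (header, sequence, separator, quality) tuples, and mates are assumed to share a read name that differs at most by a trailing "/1" or "/2".

```python
def pair_key(header):
    """Return the read name shared by both mates (assumed naming convention)."""
    name = header[1:].split()[0]          # drop the leading '@' and any description
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]                  # strip the mate suffix
    return name


def synchronize(r1_records, r2_records):
    """Split trimmed reads into intact pairs and orphans whose mate was removed."""
    r2_by_key = {pair_key(rec[0]): rec for rec in r2_records}
    paired, orphans = [], []
    for rec in r1_records:
        mate = r2_by_key.pop(pair_key(rec[0]), None)
        if mate is None:
            orphans.append(rec)           # R1 read whose R2 partner is missing
        else:
            paired.append((rec, mate))
    orphans.extend(r2_by_key.values())    # R2 reads whose R1 partner is missing
    return paired, orphans


# Example: read 'a' lost its R2 mate and read 'c' lost its R1 mate during trimming.
r1_reads = [("@a/1", "ACGT", "+", "IIII"), ("@b/1", "ACGT", "+", "IIII")]
r2_reads = [("@b/2", "TTTT", "+", "IIII"), ("@c/2", "GGGG", "+", "IIII")]
paired, orphans = synchronize(r1_reads, r2_reads)
print(len(paired), [rec[0] for rec in orphans])
```

The sketch indexes all R2 reads in a dictionary, which is simple but holds one file in memory; for very large files, a streaming approach over name-sorted inputs would be a reasonable alternative.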
Data Types:
  • Software/Code
Hadoop pipeline source code for a semantics-based recommendation system in the life sciences.
Data Types:
  • Software/Code
Abstract: Knowledge found in biomedical databases, in particular in Web information systems, is a major bioinformatics resource. In general, this biological knowledge is represented worldwide in a network of databases. These data are spread among thousands of databases, which overlap in content but differ substantially with respect to content detail, interface, formats and data structure. To support the functional annotation of lab data, such as protein sequences, metabolites or DNA sequences, as well as semi-automated data exploration in information retrieval environments, an integrated view of databases is essential. Search engines have the potential to assist in data retrieval from these structured sources, but fall short of providing a comprehensive knowledge excerpt from the interlinked databases. A prerequisite for supporting the concept of an integrated data view is to acquire insights into cross-references among database entities. However, only a fraction of all possible cross-references are explicitly tagged in the particular biomedical information systems. In this work, we investigate to what extent an automated construction of an integrated data network is possible. We propose a method that predicts and extracts cross-references from multiple life science databases and their possible referenced data targets. We study the retrieval quality of our method and the relationship between manually crafted relevance ranking and relevance ranking based on cross-references, and report on first, promising results.
Data Types:
  • Software/Code