Bioinformatics workflows for scalable analysis of plant “omics” data in cloud computing environments
Terrestrial plants evolved from aquatic ancestors and had to adapt to completely new environmental conditions, which included adapting their mode of sexual reproduction. Recent advances in high-throughput DNA and RNA sequencing enable cost-effective and detailed research into the evolution of sexual reproduction in plants at the genome and transcriptome level. With modern sequencing methods, reference genomes can be established even for plants with large genomes, such as Picea abies with a genome size of ~20 Gb. In addition, RNA sequencing facilitates the analysis of gene expression across samples from different conditions. The evolution of sexual reproduction in plants can thus be traced by comparing genome and transcriptome data from different evolutionary stages. Within the research unit FOR 5098, large amounts of data will be generated. In this project, we will improve existing automated and standardised analysis pipelines for the bioinformatic processing of these data and develop new pipelines tailored to the specific needs of our cooperation partners. Because of the large data volumes, we will place special focus on the scalability of the pipelines and adapt them for execution in cloud computing infrastructures. Standardised experiments in the various plants will create a data pool that enables us to develop new methods for comparing gene expression patterns and gene interaction networks between different plant species. Our IT infrastructure will be available to all members of the research group. All raw data, processing pipelines and analysis results, together with the associated metadata from the research group, will be stored and made accessible in accordance with the FAIR principles (Findable, Accessible, Interoperable, Reusable). Computationally intensive analyses can be run in our cloud computing environment.
We will develop a web-based user interface to visualise data and explore analysis results. Users will be able to visualise the data interactively and dynamically, with the option of viewing the data in parallel panels in different contexts. The individual panels will be linked so that an adjustment made in one panel is propagated to all other panels, keeping them synchronised. Furthermore, we will offer various training courses for the members of the research group, for example on the use of the IT infrastructure or the analysis pipelines.
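The linked-panel behaviour of the planned user interface can be illustrated with a minimal sketch. It is not the project's implementation: the class names `ViewState` and `LinkedPanels`, the fields `region` and `selected_genes`, and the callback-based design are all assumptions made for illustration. The idea shown is a shared view state that notifies every registered panel except the one that triggered the change.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ViewState:
    """Shared view settings. Field names are illustrative assumptions."""
    region: str = "chr1:1-1000"
    selected_genes: List[str] = field(default_factory=list)


class LinkedPanels:
    """Hypothetical hub that keeps several panels on the same view state."""

    def __init__(self) -> None:
        self.state = ViewState()
        # Maps a panel name to the function that redraws that panel.
        self._panels: Dict[str, Callable[[ViewState], None]] = {}

    def register(self, name: str, redraw: Callable[[ViewState], None]) -> None:
        """Add a panel and draw it once with the current state."""
        self._panels[name] = redraw
        redraw(self.state)

    def update(self, source: str, **changes) -> None:
        """Apply changes made in panel `source` and redraw all other panels."""
        for key, value in changes.items():
            if not hasattr(self.state, key):
                raise AttributeError(f"unknown view setting: {key}")
            setattr(self.state, key, value)
        for name, redraw in self._panels.items():
            if name != source:  # the source panel already shows the change
                redraw(self.state)
```

In a web front end the same pattern would typically live in the browser, with each panel subscribing to a shared state object; the Python version only shows the propagation logic.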