As a data archive, GHGA will receive data from major sequencing centers (NGS Centers) and other institutions in Germany, while exchanging (meta)data with national and international initiatives, like genomDE and EGA. Technical harmonization of data will ensure quality and comparability of data for large-scale analysis. Data access will be regulated via Data Access Committees (DACOs) with a solid ethico-legal framework set up by GHGA providing a safe space for data storage and access. GHGA will build on existing infrastructures for high-performance and cloud computing, such as de.NBI, to establish a cloud-based analysis platform. Keeping the scientific communities at heart, specific portals will ensure secure and easy access to datasets and analysis pipelines.
To ensure smooth organisation, development and maintenance, GHGA is structured into seven workstreams addressing different aspects and functionalities.
The federated nature of GHGA requires localized expertise for production-level operations. These operations have two main themes: (1) Data Stewardship and (2) DevOps. GHGA production work will require close collaboration between these two groups to run day-to-day GHGA activities.
Data Stewardship. One topic the Operations Workstream focuses on is the details of the helpdesk structure and mode of operation. Data stewards at each hub will form the GHGA helpdesk and support users with data submission and data access requests. Close collaboration with the sequencing centers ensures tight connections to GHGA’s major data submitters.
DevOps. In the first phase of GHGA, DevOps operations are carried out together with the Software & Infrastructure Workstream (see below). This way, the deployment and operations strategy will be closely aligned with software development.
All processes within the Data Hub Operations Team are organized using standard operating procedures (SOPs), which are a vital tool for making sure that data hubs work in a reproducible and safe manner individually and jointly.
The ELSI (Ethical, Legal, and Social Implications) team consists of legal scholars and ethics researchers who are working in close collaboration to develop the foundations for the ethical and legal context of GHGA. Together we provide the necessary ethical and legal documents for GHGA (e.g., consent forms, governance papers, and policy documents), aim to ensure GHGA’s legal implementation and interoperability, and explore strategies to involve patients and data subjects in the conception and governance of GHGA. The goal is to establish broad and lasting societal support for the project.
The ethics team has been working on a consent tool to allow data providers to share their data with GHGA in the future and on a guide on how consent modules should be used to extend already existing consent forms for GHGA’s purposes. Moreover, we are working with patient representatives to develop informational resources for patients and gather input on ethical and regulatory issues. The result will be a white paper describing how patients can be involved in GHGA governance and thus help to build and keep trust with stakeholders.
The legal team has been focusing on the legal basis for data processing and legacy consent (i.e., consent obtained in the past). We are also working on risk assessment, de-identification and anonymization methods and a potential Code of Conduct to implement the governance framework for GHGA and enhance its legal interoperability for data processing within the EU and international data spaces.
Achievements & Products
The Metadata Workstream provides the model for the data stored in GHGA. It is a joint effort of the conceptual and technical departments of GHGA. The team is composed of experts with extensive knowledge from various areas that feed into the definition of the GHGA Metadata Concept.
The starting point of the workstream was the evaluation of already existing and well-established metadata models, with a focus on the cancer and rare disease communities. With the knowledge from various portals, a prototype schema was set up and released as the GHGA Metadata Schema V.0.0.1. This prototype was followed by a survey across the GHGA consortium in order to gather feedback on whether the necessary metadata for our main communities is captured.
We make GHGA’s metadata FAIR by utilizing established and widely used ontologies and vocabularies that help our communities to describe their submitted data as well as to retrieve data of interest. All ontologies and vocabularies are evaluated based on their maintenance and richness of content with the help of https://fairsharing.org. The identified metadata, ontologies and vocabularies were structured in our metadata schema, which is technically implemented using the Linked Data Modeling Language (LinkML). LinkML helps us to create and update the metadata schema in one place and provides GHGA’s technical stack with definitions of the schema in different representations, such as JSON and RDF. The GHGA Metadata Schema is stored in a public GitHub repository and accessible to everyone.
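The principle of maintaining a schema in one place and deriving other representations from it can be illustrated with a minimal sketch. It uses neither LinkML nor the actual GHGA Metadata Schema: the class name `Sample`, its slots, and the helpers `to_json_schema` and `to_triples` are hypothetical and only demonstrate the idea in plain Python.

```python
import json

# Hypothetical single-source schema definition (NOT the GHGA Metadata Schema).
SCHEMA = {
    "name": "Sample",
    "slots": {
        "alias": {"range": "string", "required": True},
        "tissue": {"range": "string", "required": False},
        "age_at_sampling": {"range": "integer", "required": False},
    },
}


def to_json_schema(schema: dict) -> dict:
    """Derive a JSON Schema document from the single-source definition."""
    properties = {
        slot: {"type": spec["range"]} for slot, spec in schema["slots"].items()
    }
    required = [slot for slot, spec in schema["slots"].items() if spec["required"]]
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "title": schema["name"],
        "type": "object",
        "properties": properties,
        "required": required,
    }


def to_triples(schema: dict) -> list[tuple[str, str, str]]:
    """Derive simple RDF-like triples (illustrative only, no RDF library used)."""
    triples = [(schema["name"], "rdf:type", "rdfs:Class")]
    for slot, spec in schema["slots"].items():
        triples.append((slot, "rdfs:domain", schema["name"]))
        triples.append((slot, "rdfs:range", f"xsd:{spec['range']}"))
    return triples


if __name__ == "__main__":
    # Both outputs come from the same definition, so they cannot drift apart.
    print(json.dumps(to_json_schema(SCHEMA), indent=2))
    for triple in to_triples(SCHEMA):
        print(triple)
```

Because both outputs are generated from the same definition, a change to a slot only has to be made once. This is the property the text attributes to LinkML, here reproduced in a deliberately simplified form.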
Achievements & Products
At GHGA, data scientists, biomedical researchers and clinicians from more than 20 institutions are working together to bring this ambitious project to life. This results in a very interdisciplinary workforce of over 80 members who are involved in the diverse working areas of GHGA.
To ensure that GHGA is running smoothly, the Project Management Team strives to support the team and the workstreams across the board. This includes administrative tasks in the background like finances and recruiting. Additionally, we are actively supporting the workstreams in their general organisation (including reporting), the development of the legal framework, and the organisation of internal and external meetings. Furthermore, we are involved in the governance of the project, organising regular meetings with the Board of Directors and the Scientific Advisory Board. Being at the interface between our consortium and the NFDI, we also participate at various levels in the different NFDI committees.
Achievements & Products
GHGA’s communication channels are diverse and reach diverse audiences, but they all transmit the same message: data sharing in genome research is safe, provided all necessary safety precautions are taken (which we do!), and important for driving scientific discovery.
Building the infrastructure with research and clinical communities in mind, GHGA is in close contact with the omics data producers and users. By presenting at conferences and holding workshops, we aim not simply to advertise GHGA but also to encourage the principles of FAIR data sharing. Data sharing is a collaborative effort between scientists and clinicians, but also between different initiatives, making sure national and international efforts are aligned and ideally follow similar standards. GHGA aims to connect genome research undertaken in German institutions.
GHGA will also provide training in the form of workshops, lectures and webinars on relevant topics to our communities, including training on data upload to the GHGA portal and on data analysis, once the infrastructure is established.
By establishing a secure ethico-legal framework for data-sharing in Germany, GHGA aims to provide consent tools and educate patients as well as the public on the importance of data sharing.
Last but not least, genome research is interesting, fun, and all around us. We at GHGA aim to make this visible to everyone. Exploring different approaches, we will hold local events such as science slams and science pop-up stores at the GHGA hubs, or online for now. The GHGA podcast “Der Code des Lebens” will be launched in spring 2022!
Achievements & Products
Designing software is half science and half art. At GHGA, this is no different, since we have to find a creative balance between many aspects and requirements:
On the one hand, we would like to have a first version of our genome archive operational very soon. On the other hand, GHGA is a project that shall run for 10 years and beyond, so long-term maintainability and extensibility are key considerations. Our software solutions should be straightforward to deploy to simplify the onboarding of new data centers to our network. At the same time, we have to provide the flexibility needed to adapt to the resources and infrastructures that are locally available. Moreover, the tools that we are developing should not only be used by us; we would also like to serve the broader national and international research and health care community.
Everything starts with the right development culture. Continuously optimizing our agile development processes goes hand in hand with DevOps practices, which make us think of software development and operation as one unit. This is why we closely coordinate with the Data Hub Operations Workstream. Moreover, choosing progressive yet robust architecture patterns is another key aspect of tackling the above challenges. This is why, from day one, we have been implementing a domain-driven microservice architecture. This is not only easy to maintain, but it also facilitates the major refactorings that will be necessary when the scope of our projects changes over time. Furthermore, to be independent of a specific cloud provider and to enable frictionless continuous deployment, we rely on the container orchestration solution Kubernetes and its associated ecosystem. Finally, we put a lot of effort into aligning with national and international software standards, and we aim to actively push their development forward by participating in community efforts of the NFDI, ELIXIR Europe and the GA4GH.
Achievements & Products
Within the GHGA consortium, the Workflow Workstream is working on standardizing and harmonizing next-generation sequencing (NGS) analysis workflows for the German life science research community. The aim is to facilitate uniform processing of raw NGS data into ready-to-use research data (e.g., FASTQ to annotated VCF, figure 1), not by creating yet another workflow but by using and improving existing workflows and aligning with community standards such as GA4GH, nf-core, and BioWDL.
The resulting workflows for DNA- and RNA-sequencing data will be used to uniformly process all the data submitted to GHGA to make the processed data comparable. This enables cross-study comparisons and the joint analysis of multiple rare disease cohorts.
To ensure the highest quality of the developed workflows on the technical as well as on the biological side, GHGA uses the principles of continuous integration and continuous deployment (CI/CD) to test and benchmark the workflows using synthetic and experimental datasets like CHM cell lines and Genome in a Bottle (GiaB).
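One common way to benchmark a variant-calling workflow against a reference set such as GiaB is to compare the called variants with the truth set and report precision, recall, and F1. The sketch below is an assumption about what such a check could look like inside a CI job, not a description of GHGA's actual CI/CD setup. The `Variant` type, the function `benchmark_calls`, and the example variants are hypothetical, and the exact-match comparison is a simplification, since real benchmarking also has to reconcile different representations of the same variant.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record: position plus alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def benchmark_calls(called: set[Variant], truth: set[Variant]) -> dict[str, float]:
    """Compare called variants to a truth set using exact matching."""
    true_pos = len(called & truth)   # called and present in the truth set
    false_pos = len(called - truth)  # called but absent from the truth set
    false_neg = len(truth - called)  # in the truth set but not called
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up example data, not taken from any real truth set.
    truth = {
        Variant("chr1", 100, "A", "G"),
        Variant("chr1", 250, "C", "T"),
        Variant("chr2", 40, "G", "A"),
        Variant("chr2", 900, "T", "C"),
    }
    called = {
        Variant("chr1", 100, "A", "G"),
        Variant("chr1", 250, "C", "T"),
        Variant("chr2", 40, "G", "A"),
        Variant("chr3", 10, "A", "T"),  # false positive
    }
    print(benchmark_calls(called, truth))
```

In a CI setting, a job could fail when one of these metrics drops below an agreed threshold, so that a workflow change which degrades calling quality is caught before release.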
In line with GHGA’s goal to promote FAIR data sharing, we ourselves adhere to the FAIR principles. We follow community standards like the ones set by GA4GH. All GHGA workflows will be open-source and registered with platforms like Dockstore or WorkflowHub to make them Findable and Accessible and, through Interoperability, easily Reusable.
This is a selection of projects that spring from the GHGA team.
The CoGDat initiative aims to conduct research on the sequence data generated by the SARS-CoV-2 viral genome sequencing efforts in Germany, and also to enable other scientists by making this data available to the public.
A particular focus lies on the raw sequencing data, which will be collected and made available in addition to the viral genome assembly data collected and shared by the Robert-Koch-Institut (RKI). The raw data enables reproduction of the assembly calculations run by the individual laboratories, benchmarking and evaluation of assembly pipelines, and the identification of multiple variants in the same sample, i.e., identification of intra-host viral evolution events.
To achieve these goals the major milestones of CoGDat are:
GHGA has developed DataMeta, a generic submission portal for data with associated metadata, to fulfil the technical requirements of CoGDat in the domain of data collection and management. Furthermore, the CoGDat project has established a data privacy and legal concept as well as a data anonymization concept and is in contact with the relevant authorities to ensure that patients’ interests are protected.
Over the last two years, SARS-CoV-2 and COVID-19 have been the dominant topics in health and medical research in general, and in genomics and functional genomics in particular. In functional genomics, which studies the interplay of genes, signaling pathways and gene products, research is focused on answering questions about the immune response to an infection with SARS-CoV-2 and on finding explanations for differing disease courses and severity. During the course of the pandemic, German researchers alone published nearly 25 000 papers on COVID-19 and were among the most active contributors to research on various topics related to the pandemic.
However, research data and findings are not yet bundled or centralized, and the data sharing culture is still under development. This is why we are developing CoFGen, a data portal for research on functional genomics in COVID-19. CoFGen will enable researchers to answer questions about changed biological processes and mechanisms, e.g., the regulation of signaling pathways, after an infection with SARS-CoV-2. It will also further centralize and democratize data storage, as well as the storage of analysis workflows.
Our goal is to collect single-cell and bulk RNA-sequencing datasets and analyses from German research groups, execute basic metadata analyses, and make data more easily available to other researchers focusing on different parts of the immune response to COVID-19. To achieve this, we will work closely with DeCOI and the Lung Biological Network of the Human Cell Atlas, who are our initial data providers. Storage of datasets and corresponding analysis workflows, as well as data access, will be managed by FASTGenomics, a collaborative research effort of Comma Soft AG in Bonn and the LIMES Institute at the University of Bonn.