The GHGA Metadata Model

The goal of the model is to help our communities richly describe their submitted genomic data and retrieve data of interest. To achieve this, we focus on making the metadata model Findable, Accessible, Interoperable, and Reusable (FAIR) by using established and widely used ontologies and vocabularies. All ontologies and vocabularies are evaluated for their maintenance and for content of maximum granularity, with the help of https://fairsharing.org.

Our metadata catalogue is implemented in the Linked Data Modelling Language (LinkML) and is openly accessible on the GHGA GitHub Repository. There, you can track every new release of the schema and access different artefacts, such as a JSON Schema, for programmatic implementation at your site.
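As a rough illustration of what programmatic use of a JSON Schema artefact could look like, the sketch below checks a metadata record against the "required" list of a schema. The schema text, the property names, and the helper missing_required are hypothetical stand-ins and are not taken from the GHGA schema. The check covers only the top-level "required" keyword; for full validation, a dedicated validator such as the third-party jsonschema package would be the more appropriate choice.

```python
import json

# Hypothetical stand-in for a JSON Schema artefact; the schema published
# by GHGA will differ in names and structure.
SCHEMA_TEXT = """
{
  "type": "object",
  "required": ["title", "description"],
  "properties": {
    "title": {"type": "string"},
    "description": {"type": "string"}
  }
}
"""


def missing_required(record: dict, schema: dict) -> list:
    """Return the required property names that are absent from the record.

    Only the top-level "required" keyword is inspected; this is not
    full JSON Schema validation.
    """
    return [name for name in schema.get("required", []) if name not in record]


schema = json.loads(SCHEMA_TEXT)
record = {"title": "Example dataset"}  # illustrative record, not real data

# Report which required properties the record still lacks.
print(missing_required(record, schema))
```

In practice, the schema would be loaded from the released artefact file rather than from an inline string.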

The Core-Spreadsheet captures three categories of data: 

  • Dataset
  • Sample
  • Technical Metadata

Dataset can be seen as the knot that ties all categories together. It references the metadata associated with the dataset, such as the Data Access, as well as the corresponding Technical Metadata and Sample data.

Technical Metadata contains experimental information (library preparation and sequencing protocol), analysis data, and file data. It captures information that is related to technical aspects of a dataset. 

Sample data captures information about sample provenance. The sample spreadsheet is divided into Individual (the donor), Biospecimen, and the sample itself.
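The relationships between the three categories can be sketched as a small set of linked records. This is a minimal illustration, not the GHGA schema: all class names, field names, and values are assumptions, as is the direction of the links (Sample to Biospecimen to Individual).

```python
from dataclasses import dataclass, field
from typing import List

# All names below are illustrative assumptions, not the actual GHGA schema.


@dataclass
class Individual:
    alias: str  # the donor


@dataclass
class Biospecimen:
    alias: str
    individual: Individual  # assumed link to the donor


@dataclass
class Sample:
    alias: str
    biospecimen: Biospecimen  # assumed link to the biospecimen


@dataclass
class TechnicalMetadata:
    library_preparation: str
    sequencing_protocol: str
    files: List[str] = field(default_factory=list)


@dataclass
class Dataset:
    title: str
    data_access: str
    samples: List[Sample] = field(default_factory=list)
    technical: List[TechnicalMetadata] = field(default_factory=list)


def donors(dataset: Dataset) -> List[str]:
    """Follow Dataset -> Sample -> Biospecimen -> Individual and list donors."""
    return sorted({s.biospecimen.individual.alias for s in dataset.samples})


donor = Individual(alias="donor-1")
specimen = Biospecimen(alias="specimen-1", individual=donor)
sample = Sample(alias="sample-1", biospecimen=specimen)
tech = TechnicalMetadata(
    library_preparation="hypothetical library preparation",
    sequencing_protocol="hypothetical sequencing protocol",
    files=["reads_1.fastq.gz"],
)
dataset = Dataset(
    title="Example dataset",
    data_access="controlled",
    samples=[sample],
    technical=[tech],
)

# The Dataset bundles the other categories, so donors are reachable from it.
print(donors(dataset))
```

Because the Dataset holds references to both the Sample data and the Technical Metadata, a single entry point is enough to reach every linked record.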

The data can be submitted using GHGA's Submission Spreadsheet, which reflects the metadata catalogue and enables the GHGA Data Portal to display valuable information about a submitted dataset by linking all categories. Richly described metadata will help to promote data submitters' datasets and encourage the community to reuse the data.

Additionally, GHGA is producing documentation that explains the GHGA Metadata Model, covering the model itself as well as the underlying concepts and standards.