05 Documentation and Metadata

Data Documentation

Documentation is essential for making research data findable and traceable. It makes the data considerably easier to reuse and enables reproducibility. Well-documented data tend to be used and cited more often, which benefits the reputation of their creator. Documentation also keeps the data usable and traceable for your own later work: details are easily forgotten over time, so document the data while you are working on it.

Basic documentation includes:

description of the research project
project goals
hypotheses
detailed information on data collection (methods, units, time periods, locations, technology used)
measures for data cleansing
structure of data and their relationships to each other
explanation of variables, labels and codes
differences between versions
information on access and terms of use
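
The "explanation of variables, labels and codes" item can be made checkable by keeping a small machine-readable codebook next to the data. The following is only a sketch: the codebook layout, the column names and the sample values are hypothetical and not taken from any standard.

```python
import csv
import io

# Hypothetical codebook: one entry per variable, with label, unit and codes.
CODEBOOK = {
    "site_id": {"label": "Sampling site identifier", "unit": None, "codes": None},
    "temp_c": {"label": "Water temperature", "unit": "degC", "codes": None},
    "quality": {"label": "Quality flag", "unit": None,
                "codes": {"0": "good", "1": "suspect", "9": "missing"}},
}


def undocumented_columns(csv_text, codebook):
    """Return the CSV column names that have no codebook entry."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if name not in codebook]


def invalid_codes(csv_text, codebook):
    """Return (row_number, column, value) for coded values missing from the codebook."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            codes = codebook.get(column, {}).get("codes")
            if codes is not None and value not in codes:
                problems.append((row_number, column, value))
    return problems


# Hypothetical sample data: "depth" is undocumented, quality "5" is not a defined code.
SAMPLE = "site_id,temp_c,quality,depth\nA1,12.4,0,3\nA2,11.9,5,4\n"

print(undocumented_columns(SAMPLE, CODEBOOK))
print(invalid_codes(SAMPLE, CODEBOOK))
```

Running such a check whenever the data change is one way to notice gaps in the documentation while the details are still fresh.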

Tools that support documentation

ELN (electronic lab notebooks)
OneNote
GitLab
Jupyter Notebook
Workflow Management Systems (e.g. Galaxy, KNIME, Taverna)
Modelling notebooks (e.g. TRACE)

What are Metadata?

Metadata are structured data that contain information about other data ("data about data"). They are stored either independently of or together with the data they describe. A distinction is made between content-related and technical metadata. Metadata form a specific subset of the documentation and serve primarily to make the data findable, including in library reference systems. To make them machine-readable, for example in Semantic Web applications, they are often stored in XML format.

CC-BY 4.0: Data Stewards, Ghent University

Metadata Standards

Standardizing the metadata vocabulary is necessary to improve the findability of the data and to provide interoperability, because metadata that follow a shared standard can be linked across systems. Standards also allow similar data sets to be described uniformly in terms of content and structure. A metadata standard defines the selection of information needed to find and identify the data; it does not by itself guarantee that the data can be reused. Among the most common interdisciplinary bibliographic metadata standards are Dublin Core, the DataCite Metadata Schema and MARC21.
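
As an illustration of a standardized, machine-readable record, the sketch below writes a few Dublin Core elements as XML using Python's standard library. The element names and the namespace come from the Dublin Core element set; the wrapping `metadata` element and all values are hypothetical choices made for this example, not a prescribed layout.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build an XML element with one dc:* child per (element, value) pair."""
    record = ET.Element("metadata")  # wrapper name is an assumption
    for element, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = value
    return record


# Hypothetical example values; only the element names are Dublin Core.
fields = [
    ("title", "River temperature measurements 2023"),
    ("creator", "Doe, Jane"),
    ("date", "2023-11-30"),
    ("type", "Dataset"),
    ("rights", "CC-BY 4.0"),
]

record = dublin_core_record(fields)
print(ET.tostring(record, encoding="unicode"))
```

Because every element carries the shared namespace, another system can pick out `dc:title` or `dc:creator` without knowing anything else about the file.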

Discipline-Specific Metadata Standards

Since each scientific community has its own requirements, discipline-specific metadata standards are being developed as well. For example, the Data Documentation Initiative (DDI) standard is frequently used in the social and economic sciences, while ISO 19115 (Geographic information - Metadata) and the CF Metadata Conventions are highly relevant in the earth and environmental sciences. Other examples are ABCD (specimen collections), MIxS (genomics) and the Ecological Metadata Language (ecology).

An overview of discipline-specific metadata standards is available, for example, on the pages of the British Digital Curation Centre and in an overview of the Research Data Alliance. The Helmholtz Metadata Collaboration (HMC) offers a similar catalog where you can query metadata schemes based on the subject area, e.g. Earth and Environment.

Thesaurus, Controlled Vocabulary, and Authority Files

Controlled vocabularies are also necessary to enable structured documentation of data. Thesauri and classifications are documentation languages used to describe the content of research data. Classifications are used to assign objects to (mostly hierarchically structured) classes. These classes are characterized by certain attributes. A thesaurus, on the other hand, is a natural-language, structured collection of terms and their relationships to one another.

Thesauri and controlled vocabularies significantly enhance metadata and increase the findability of the data. Many disciplines already provide their own specialised classifications and thesauri.
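
To illustrate how term relationships can increase findability, the sketch below models a tiny thesaurus with broader, narrower and related terms and uses it to expand a search query. The terms, the dictionary layout and the sample records are hypothetical; real thesauri are far larger and are usually published in dedicated formats.

```python
# Hypothetical mini-thesaurus: each term lists broader, narrower and related terms.
THESAURUS = {
    "water body": {"broader": [], "narrower": ["river", "lake"], "related": []},
    "river": {"broader": ["water body"], "narrower": ["stream"], "related": ["catchment"]},
    "lake": {"broader": ["water body"], "narrower": [], "related": []},
    "stream": {"broader": ["river"], "narrower": [], "related": []},
    "catchment": {"broader": [], "narrower": [], "related": ["river"]},
}


def expand_query(term, thesaurus):
    """Return the term plus all of its narrower terms, at any depth."""
    found = []
    pending = [term]
    while pending:
        current = pending.pop()
        if current in found:
            continue
        found.append(current)
        # Terms unknown to the thesaurus simply have no narrower terms.
        pending.extend(thesaurus.get(current, {}).get("narrower", []))
    return sorted(found)


def search(term, records, thesaurus):
    """Return titles of records whose keywords match the expanded query."""
    wanted = set(expand_query(term, thesaurus))
    return [title for title, keywords in records if wanted & set(keywords)]


# Hypothetical records: (title, keywords taken from the controlled vocabulary).
RECORDS = [
    ("Dataset A", ["stream", "temperature"]),
    ("Dataset B", ["lake"]),
    ("Dataset C", ["soil"]),
]

print(expand_query("river", THESAURUS))
print(search("water body", RECORDS, THESAURUS))
```

A search for "water body" finds the data sets tagged with "stream" and "lake" even though neither keyword matches the query literally, which is the practical benefit of describing data with terms from a shared thesaurus.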

Examples of discipline-specific classifications

Examples of discipline-specific thesauri

Discipline-specific classifications and thesauri can be searched via the Basel Register of Thesauri, Ontologies & Classifications (BARTOC).

Authority files provide unique, standardized entries for persons, institutions, research funders, locations and much more, so that each can be attributed unambiguously. This makes it possible, for example, to tell apart persons with identical names and allows search engines to interpret the data unambiguously.

Examples of Authority files

GND: the Integrated Authority File is mainly used for cataloging literature in libraries, but is also increasingly used for other purposes
ISNI: the International Standard Name Identifier serves to uniquely identify the public identity of persons involved in a publication. The ISNI is a standard of the International Organization for Standardization (ISO) and is comparable to ORCID
VIAF: the Virtual International Authority File is an international authority file for personal data and is maintained by the Online Computer Library Center (OCLC). The GND and ISNI authority files are part of VIAF
Open Funder Registry: used to uniquely identify research funding agencies
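
Identifiers such as ISNI and ORCID iDs end in a check character computed with the ISO 7064 MOD 11-2 algorithm, which lets software detect most typing errors before a record is attributed to the wrong person. The sketch below shows the calculation; the function names are hypothetical, and the check only confirms that an identifier is well-formed, not that it is actually registered.

```python
def mod_11_2_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for a string of digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed(identifier):
    """Check length and final check character of a 16-character ISNI or ORCID iD."""
    compact = identifier.replace("-", "").replace(" ", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return mod_11_2_check_character(compact[:15]) == compact[15]


# 0000-0002-1825-0097 is the ORCID iD of the fictitious researcher Josiah Carberry.
print(is_well_formed("0000-0002-1825-0097"))
# Same identifier with the last digit altered.
print(is_well_formed("0000-0002-1825-0098"))
```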

Electronic Lab Notebooks (ELN)

Electronic Lab Notebooks (ELN, or ELB for Electronic Lab Books) are designed to document the conception, execution and evaluation of scientific experiments, observations or other studies, together with the research data generated in the process. They are the digital counterparts of paper lab notebooks, which have so far been an essential part of the scientific work process in the natural and life sciences. With increasing digitization, especially in data collection, ELNs are gaining acceptance and use.

ELN Software

Chemotion (Open Source)
eLabFTW (Open Source)
Labfolder (commercial)
openBIS (Open Source)
RSpace ELN (commercial)
ELN Finder of TU Darmstadt - Tool that helps you choose a fitting ELN for your needs
Full list of ELN software at Wikipedia

In general, different disciplines have very different requirements regarding the features an ELN should provide, so there is no "one-size-fits-all" solution. A manual prepared by ZB MED gives practical advice that may help in selecting and implementing an ELN.

Recommendations for a Jump Start

Jump start

Be aware of field-specific metadata schemes and thesauri right from the beginning
Make your research findable and accessible by attaching appropriate metadata