For others to be able to understand the data, it is also important to clearly document and describe the data, their origin and their organisation.

Documentation makes research data comprehensible and reusable, both for oneself and for others. The aim of documentation is to make stored information or documents findable and comprehensible by describing them. An important element of documentation is metadata.

Metadata are structured, uniform and machine-readable descriptions of data or an object: in other words, data about data. They are therefore an important element in the documentation of research data. The following types of metadata can be distinguished:

  • Technical metadata (e.g. size and resolution of an image file)
  • Descriptive metadata (title, author, keywords)
  • Administrative metadata (rights, licences, publication date)
  • Relational metadata (reference to other datasets or the related publication)
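
The four types above can be illustrated with a small, hypothetical metadata record for an image file. The grouping and all field names below are illustrative only, not a formal standard:

```python
import json

# Hypothetical metadata record for an image file, grouped by the four
# metadata types described above (field names are invented for illustration).
metadata = {
    "technical": {
        "file_size_bytes": 2_457_600,
        "resolution": "3000x2000",
        "format": "TIFF",
    },
    "descriptive": {
        "title": "Leaf samples, site A",
        "author": "Jane Doe",
        "keywords": ["botany", "field study"],
    },
    "administrative": {
        "license": "CC BY 4.0",
        "publication_date": "2023-05-01",
    },
    "relational": {
        "related_publication": "doi:10.1234/example",
        "derived_from": "dataset-raw-001",
    },
}

# Serialising to JSON makes the record machine-readable.
print(json.dumps(metadata, indent=2))
```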

Documentation and metadata should be recorded continuously throughout the research process; this is also part of good scientific practice. Especially in large and/or long-term projects, it is advisable to define in an internal project standard how data are named, filed and annotated, agreeing on a uniform and ideally standardised vocabulary. This also includes meaningful and uniform naming of files and of relevant information in the individual documents. There are various methods and tools for documenting research data: codebooks, electronic lab notebooks, scientific record keeping, readme files, indexing, etc. Which method is most appropriate is discipline-specific, and each project can choose an adequate approach for itself.

When publishing and archiving research data, care must be taken to ensure that the data are adequately documented and that the documentation is also accessible and, in the best case, not only readable by humans but also machine-readable. Only in this way can the data be found and used in a meaningful way. The documentation makes the data identifiable and states by whom, with which methods and in which context the data were generated, how they were processed and under which conditions they can be accessed and reused by others.

To ensure that similar data are described as uniformly as possible in terms of both content and structure, there are standards that govern how data are described with metadata. A metadata standard allows metadata from different sources to be linked and processed together; it also enables machine readability.
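
As a sketch of what a shared standard buys you, the snippet below maps two differently named project records onto common Dublin Core element names (dc:title, dc:creator), so that both can be processed together. Dublin Core is a real, widely used metadata standard; the local field names and the mapping helper are invented for illustration:

```python
# Two project-specific records with incompatible, locally chosen field names.
project_a = {"titel": "Soil samples 2021", "autor": "A. Smith"}
project_b = {"name": "Soil pH survey", "creator": "B. Jones"}

def to_dublin_core(record, mapping):
    """Rename local fields to Dublin Core terms via a per-project mapping."""
    return {dc_term: record[local] for local, dc_term in mapping.items()}

records = [
    to_dublin_core(project_a, {"titel": "dc:title", "autor": "dc:creator"}),
    to_dublin_core(project_b, {"name": "dc:title", "creator": "dc:creator"}),
]

# Both records now share the same keys and can be searched uniformly.
print([r["dc:title"] for r in records])
```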

Depending on the research community, there are individual metadata standards that meet the needs of the respective subjects.

See below to find links to further information and lists of metadata standards for all disciplines and resource types.

Controlled vocabularies prescribe the use of predefined, authorised terms and are used to describe the content of research data by keywords (indexing). Thus, the problems of homographs, synonyms and polysemes are solved by a bijection between concepts and authorised terms. This serves to provide consistency and to reduce ambiguity that occurs in normal human languages, where the same concept can be given different names (or vice versa).

Controlled vocabularies thus increase the precision of a free-text search, since irrelevant results (false positives) in a search list are often caused by the inherent ambiguity of natural language. Recall is also improved because, unlike with natural-language schemas, there is no need to search for all the other terms that may be synonyms of a given term. Furthermore, if the same vocabularies are used in different projects within a research discipline, this enables interoperability.
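
A controlled vocabulary can be sketched as a simple lookup from synonyms to one authorised term. The species names and synonyms below are invented examples, not drawn from any real vocabulary:

```python
# Sketch of a controlled vocabulary: each authorised (preferred) term
# lists the natural-language synonyms it replaces.
vocabulary = {
    "Zea mays": ["maize", "corn", "sweetcorn"],
    "Solanum tuberosum": ["potato", "spud"],
}

# Invert to a lookup table: every synonym resolves to exactly one
# preferred term, which removes the ambiguity of free-text keywords.
lookup = {syn.lower(): term for term, syns in vocabulary.items() for syn in syns}

def index_keyword(free_text_term):
    """Replace a free-text keyword with its authorised term, if known."""
    return lookup.get(free_text_term.lower(), free_text_term)

print(index_keyword("Corn"))   # -> "Zea mays"
print(index_keyword("maize"))  # -> "Zea mays": synonyms collapse to one term
```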

An overview of freely available vocabularies, both interdisciplinary and subject-specific, can be found at bartoc.org.

Codebooks or data dictionaries are essential documentation of the variables, structure, content, and layout of your datasets. A good codebook has enough information about each variable for it to be self-explanatory and interpreted properly by someone outside of your research group. The terms codebooks and data dictionaries are often used interchangeably.
 

Codebooks should include:

Variable name

The name or number assigned to the variable in the data collection.

Variable label

A brief description or the full name of the variable to identify the variable for the user.

Variable meaning

Outline the exact definition of the variable and if possible align it with existing vocabularies to increase interoperability amongst research data.

Question text / Instrument

Where applicable, the exact wording of survey questions.

Where applicable, the technical measurement instrument to which the variable belongs.

Level of Measurement / Scale of measurement

Indicates how precisely variables are recorded. There are four levels of measurement:

  • Nominal: the data can only be categorised
  • Ordinal: the data can be categorised and ranked
  • Interval: the data can be categorised, ranked, and evenly spaced
  • Ratio: the data can be categorised, ranked, evenly spaced, and have a natural zero.

Units of measurement

For interval and ratio variables: The variable's units of measurement (e.g. metres, seconds).

Values / Value labels

The actual coded values in the data for this variable.

For numeric (interval, ratio) variables: Range of valid or expected values (min – max).

For categorical (nominal, ordinal) variables: List of valid value labels. If coded numerically, description of the allocation of the numeric codes to the value labels. In either case, include how missing data is labelled.

Summary statistics

Where appropriate and depending on the type of variable, provide unweighted summary statistics for quick reference. If any kind of imputation took place this should be documented.

Missing data

Where applicable, the values and labels used to specify missing data in the dataset (e.g. NA / 99 / “empty”).

Universe / Skip patterns

Where applicable, information about the population to which the variable refers, as well as the preceding and following variables.

Dates and Times

The dates and times of the data collection.

Notes

Additional notes, remarks, or comments that contextualize the information conveyed in the variable or relay special instructions.

Citation Information

Information on how to cite the dataset, with an example in a style that is commonly used in the relevant discipline.

Codebook Version

In case a project runs over various years or even decades, the version of the codebook should be included.
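
Putting the fields above together, a single codebook entry might look like the following hypothetical record. All variable names, values and statistics are invented for illustration:

```python
import json

# Hypothetical codebook entry for one survey variable, covering the
# fields listed above (field names and values are invented).
codebook_entry = {
    "variable_name": "Q12_AGE",
    "variable_label": "Age of respondent",
    "variable_meaning": "Respondent's age in completed years at time of interview",
    "question_text": "How old are you?",
    "level_of_measurement": "ratio",
    "unit": "years",
    "valid_range": {"min": 18, "max": 99},
    "missing_values": {"-9": "refused", "-8": "don't know"},
    "summary_statistics": {"n": 1042, "mean": 47.3, "median": 46},
    "notes": "Collected 2022-03; no imputation applied.",
    "codebook_version": "1.2",
}

print(json.dumps(codebook_entry, indent=2))
```

Keeping each entry in a uniform, machine-readable structure like this makes it straightforward to validate data against the codebook automatically.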


There are many instructions and templates for creating codebooks on the Internet, which differ slightly from one another. The information on our site is based on https://guides.library.upenn.edu/c.php?g=564157&p=9554907 and has been adapted and expanded by our RDM network. 

A widespread and simple way of documenting data is to add a readme file to a data set. Readme-style documentation is recommended for data types or disciplines for which no appropriate metadata standard exists. A readme file is a simple text file at the top hierarchy level of the data set explaining what is in the folder and what it is for. If you create readme files for your data sets, proceed as uniformly as possible and define one uniform structure for the readme files of different data sets within your research group (or for your own project). This helps to make readme files interoperable (see metadata standards). Markup languages (e.g. XML or Markdown) or JSON can be used to make a readme file readable and interpretable not only by humans but also by machines.

A readme file may contain the following parts:

1. General information
  • title
  • author, institution, contact information (provide at least two contacts)
  • date of collection
  • research funder
  • information on related project
2. Access and relations
  • licenses
  • suggested citation
  • links to publications that cite or use the data set
  • links to related or similar data sets
3. Data organisation
  • naming convention, folder structure
  • relations / dependencies between files
  • list of other files with documentation
4. Data set
  • list of all files and short description
  • methods used to generate data
  • used / required software
  • used standard
5. Code book
  • list and explanation of abbreviations for symbols 
  • used format for dates (e.g. YYYYMMDD)
  • handling of missing data
6. Data processing
  • applied methods
  • used / required software (incl. versions)
  • used file formats in data sets and recommended software
  • quality control procedure applied
  • versioning history and reasons for updates
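
The outline above could be serialised as a machine-readable JSON readme along the following lines. All names, contacts, licences and file names are placeholders:

```python
import json

# Minimal sketch of a machine-readable readme following the outline above.
# Every value here is a placeholder, not real project information.
readme = {
    "general": {
        "title": "Example data set",
        "authors": [
            {"name": "Jane Doe", "institution": "Example University",
             "email": "jane.doe@example.org"},
            {"name": "John Roe", "institution": "Example University",
             "email": "john.roe@example.org"},
        ],
        "date_of_collection": "2022-06",
        "funder": "Example Funding Agency",
    },
    "access_and_relations": {
        "license": "CC BY 4.0",
        "suggested_citation": "Doe & Roe (2023): Example data set, v1.0",
    },
    "data_organisation": {
        "naming_convention": "site_date_measurement.csv",
        "folder_structure": ["raw/", "processed/", "docs/"],
    },
    "code_book": {
        "date_format": "YYYYMMDD",
        "missing_data": "coded as NA",
    },
    "data_processing": {
        "software": ["Python 3.11", "pandas 2.x"],
        "versioning": "v1.0 initial release",
    },
}

readme_json = json.dumps(readme, indent=2)
print(readme_json)
```

Stored as README.json at the top level of the data set, such a file stays readable for humans while also being parseable by harvesting tools.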