Storage and Infrastructure

What are we talking about: storing, publishing or archiving?


When it comes to storing, archiving and publishing research data, misunderstandings often arise because the terms are not clearly defined and service providers use them differently. For example, many repositories speak of "archiving" or even "long-term storage" of data while guaranteeing storage and accessibility for 5 to at most 20 years. However, this does not necessarily mean that the data will be deleted after the specified period. The terms therefore cannot be cleanly separated.

The following is an overview of how we have defined the terms within the network and what solutions are available to researchers at the University of Basel.

Term | Description | Provider | Time
--- | --- | --- | ---
Active storage | Active data management in a project's filing system within a running research project. | IT-Services, sciCORE, switchDrive | 1-5 years
Storage / deposition / deep storage | Finalized data (sets), to be reused for further (internal) projects. | sciCORE: deep storage | 1-10 years
Publication | Curated and disseminated (selected) data (sets), understandable and reusable for others. | Repositories | 5-20 years
Preservation, (long-term) archiving | Highly valuable data and digital objects to be kept forever (as cultural heritage, singular data such as climate data). | University Library | 100+ years

Requirements

Like other research funders, the SNSF supports the principle that research data should be openly accessible to science and society. For most of the SNSF's funding instruments, it is therefore required that the data underlying the research result or generated during the research process be made publicly accessible at the latest after completion of the research project, in compliance with the FAIR principles, as long as there are no legal, ethical or other reasons to the contrary.

The demands of the research funders differ in terms of the recommended duration of data archiving and the definition of "long-term". The University of Basel expects a minimum of 5 years for data retention after publication of the research results (Regulation relating to academic integrity at the University of Basel, Art. 4, 8). However, many research sponsors are demanding longer data retention; for example, the SNSF recommends 10 years (SNF, FAQ Preservation and long-term storage of research data). Charges made by the repository for preparing and ingesting the data can be directly costed into the grant application.
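The interplay between the university minimum and a funder's longer period can be sketched as a small calculation. This is only an illustration: the function name `retention_end` and the constants are hypothetical, and the two periods are simply the 5 and 10 years mentioned above. The applicable regulations, not this sketch, determine the actual period.

```python
from datetime import date

# Retention periods in years, taken from the paragraph above.
UNIVERSITY_MINIMUM_YEARS = 5   # University of Basel minimum after publication
SNSF_RECOMMENDED_YEARS = 10    # SNSF recommendation


def retention_end(publication_date: date,
                  funder_years: int = SNSF_RECOMMENDED_YEARS) -> date:
    """Return the date on which the retention period has elapsed.

    The longer of the university minimum and the funder's period applies.
    """
    years = max(UNIVERSITY_MINIMUM_YEARS, funder_years)
    try:
        return publication_date.replace(year=publication_date.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return publication_date.replace(year=publication_date.year + years,
                                        day=28)
```

For results published on 15 March 2024 with SNSF funding, the sketch yields 15 March 2034; with a funder asking for only 3 years, the university minimum of 5 years still applies.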


University employees can store their research data on the university's servers. Whether the infrastructure of ITS or sciCORE is more appropriate for your purpose depends on the volume of data and the need for high-performance computing. The University's servers are managed, have daily backup routines and are protected from hacker attacks and viruses. However, if you are working with sensitive data, it must be specially protected against unauthorised access. In this case, please contact the University's data protection and data security officers.

If you are working collaboratively, you will need data exchange options. There are solutions for this at the University of Basel as well.

Researchers who do not have access to the university's infrastructure and therefore cannot use the ITS or sciCORE servers often store their data on personal local devices (notebook, hard drive). They should then ensure that they regularly back up their data (for example to switchDrive) to protect it from loss. They should also protect the device and the data on it from unauthorised access. If you work with sensitive data, e.g. personal data, you should always encrypt it.

Although switchDrive creates a backup of the entire system for emergencies, it does not offer the possibility of restoring individual files that have been overwritten or damaged. However, if files are accidentally deleted from the cloud storage, they can be restored within 90 days. If switchDrive is the primary storage location for research data, make sure that a backup of the data exists elsewhere, e.g. on a local hard drive.

We recommend paying attention to good data organisation and documentation already during the research process. This facilitates the work especially in research groups and at the end of the project when the data is to be published and/or archived.

 

If a research project produces a large amount of data and therefore requires a lot of storage space, sciCORE offers the option of transferring data to tape storage. This type of storage should only be used for data that is not currently in active use, because immediate access is not possible: the data can only be retrieved with a few days' notice. This type of data storage does not constitute appropriate archiving of research data, as the data is not regularly checked for validity and outdated file formats are not migrated. Furthermore, there are no backups at different locations, so in the event of a disaster the data would be lost. Deep storage is therefore more of an in-between or fallback solution.

This type of storage is an option for long-running projects in which completed subprojects are filed in the meantime, before an archiving solution is found for the entire project. It is also an option for completed projects with a planned follow-up project in which the data will continue to be used, but where the data cannot be published in a repository, for example for legal reasons.

In both cases, the data should be well organised and documented, and metadata should be added before it is written out to tape storage. This ensures that project staff can still understand their old data at a later date and makes it easier for new project staff to work with the data. Good documentation and organisation of the data also helps with the later selection of data for publication and archiving.
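One way to attach basic documentation to a data folder before it is deposited is a machine-readable inventory. The following is a minimal sketch, not a sciCORE procedure: the file name `INVENTORY.json`, the function names and the chosen fields (description, contact, file list) are all assumptions made for illustration.

```python
import json
from pathlib import Path

INVENTORY_NAME = "INVENTORY.json"  # hypothetical file name


def build_inventory(root: Path, description: str, contact: str) -> dict:
    """Collect a minimal description of a data folder before deposit."""
    files = [
        {
            "path": p.relative_to(root).as_posix(),
            "bytes": p.stat().st_size,
            # File extension without the dot, lower-cased.
            "format": p.suffix.lower().lstrip(".") or "unknown",
        }
        for p in sorted(root.rglob("*"))
        # Skip directories and a previously written inventory file.
        if p.is_file() and p.name != INVENTORY_NAME
    ]
    return {"description": description, "contact": contact, "files": files}


def write_inventory(root: Path, description: str, contact: str) -> Path:
    """Write the inventory next to the data and return its path."""
    inventory = build_inventory(root, description, contact)
    target = root / INVENTORY_NAME
    target.write_text(json.dumps(inventory, indent=2), encoding="utf-8")
    return target
```

An inventory of this kind does not replace a README or discipline-specific metadata; it only records what was in the folder when it was deposited.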

Since the time factor comes into play here, we also recommend making sure that the data is saved in common or open file formats (see table on recommended file formats).

In parallel to the publication of your research findings in a reviewed journal, it is recognized as good practice (and often required) to make the data that support your findings available to the research community. This allows other researchers to reproduce your findings and enables reusability in subsequent research.

For the publication of research data, repositories are available that make it possible to publish the data together with the documentation and metadata and make them findable. Publication of data on a website is not recommended for reasons of poor sustainability and discoverability of the data.

There are generic, discipline-specific and institutional repositories, and a distinction is made between commercial and non-profit repositories. Further information on repositories and examples can be found here.

Not all data can and may be shared publicly. Ethical and legal aspects must be considered here and must definitely be clarified in advance. It does not always make sense to publish all the data from a research project. If only part of the data is suitable for subsequent use, a selection of the data or only certain data sets can be published.

Most research funders require that data be published in compliance with the FAIR principles. This implies that, in addition to documentation and metadata, the data must also be given a persistent identifier so that it can be cited. To make the data interoperable and reusable, the data should be provided in common or open file formats and accompanied by licences for use (CC licences).
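The elements named above (documentation, metadata, persistent identifier, file formats, licence) can be tracked in a simple checklist before submission to a repository. The sketch below is illustrative only: `DatasetRecord` and its field names are hypothetical and do not follow the metadata schema of any particular repository.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical minimal description of a dataset to be published."""
    title: str
    creators: list[str]
    persistent_identifier: str = ""  # e.g. a DOI assigned by the repository
    licence: str = ""                # e.g. "CC BY 4.0"
    documentation: str = ""          # e.g. the name of a README file
    file_formats: list[str] = field(default_factory=list)


def missing_for_publication(record: DatasetRecord) -> list[str]:
    """List the elements from the paragraph above that are still empty."""
    checks = {
        "persistent identifier": record.persistent_identifier,
        "licence": record.licence,
        "documentation": record.documentation,
        "file formats": record.file_formats,
    }
    return [name for name, value in checks.items() if not value]
```

A record that only has a title and creators would be reported as missing all four elements; the persistent identifier is usually assigned by the repository itself at publication.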

For data that cannot be published for legal reasons, at least the metadata should be made publicly available. There are also repositories that allow data to be stored with restricted access or made available after an embargo period.

Data preservation or archiving means protecting your data in a secure environment for the long term in such a way that it remains usable, understandable and accessible. This means more than just making backup copies of the data. If you only make backups, your data …

  • may become unreadable with future software, because the file format is incompatible.
  • may be altered when opened with new software, so that it is no longer reliable for research.
  • may be changed by somebody, because there is no access control.
  • may become unintelligible, because no documentation or metadata has survived.

A digital archive must therefore extract the data from old data carriers and transfer them to the archiving system together with rich documentation and metadata. The data must be checked for viruses in the process. Using checksums, the data in the archive is regularly checked for validity and protected against unauthorised access and modification. The data is stored in the original format as well as a copy in a file format suitable for archiving. This copy is migrated when the file format threatens to become obsolete. This ensures that the data remains readable. If it is not possible to migrate old file formats, the required software including licences is archived in order to make the data readable via software emulation.
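The checksum-based validity check mentioned above can be illustrated with a short sketch. It assumes SHA-256 as the checksum algorithm and uses hypothetical function names; it is not a description of how the UB archive system or any other archive actually works.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(root: Path) -> dict[str, str]:
    """Record a checksum for every file below root (done once at ingest)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_checksums(root: Path, recorded: dict[str, str]) -> dict[str, str]:
    """Return {relative path: problem} for files that are missing or changed."""
    problems = {}
    for rel, expected in recorded.items():
        target = root / rel
        if not target.is_file():
            problems[rel] = "missing"
        elif sha256_of(target) != expected:
            problems[rel] = "changed"
    return problems
```

Running `verify_checksums` at regular intervals against the checksums recorded at ingest reveals files that were lost or silently altered; an empty result means that every recorded file is still present and unchanged.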

A local backup is made of the archived data, and at least two further copies are kept on other storage media at a geographically remote location so that they are protected against loss in the event of a disaster.

The archiving and long-term preservation of digital data is very time-consuming and costly. Therefore, it must be checked well in advance whether the data is worth archiving (see selection criteria).

The University Library (UB) Basel has built up an archive system that meets these requirements for its own digital holdings and has know-how in this area. If you have very valuable or unique data, contact the UB for advice.


Although much research data should be retained, either for use / reuse or for validation of research results, this does not apply to all data. Research funders often require retention of research data for a certain period of time. Whether long-term archiving makes sense and is technically and legally possible must be decided case by case.

Reasons for preserving research data could be:

  • Data must be stored or deleted for contractual, legal or regulatory reasons
  • Data can be reused (this concerns quality of the metadata, completeness of the data, integrity and accessibility of the data, readability of the data, licenses, rights that allow reuse)
  • There is a possible future benefit of the data (contribution to a greater data collection, intergenerational comparison)
  • Data have scientific or historical value (time documentation, uniqueness of the data)
  • Regional reference or provenance
  • Reproduction of the data (if possible) is more expensive than the storage of the data
  • There are no ethical concerns that speak against archiving
  • Technical requirements for data preservation are met

Careful selection of file formats can ensure that your files are easily accessible and interoperable and can still be used after many years. This may be particularly important in long-term research projects involving many people, or where staff could change during the research process. Choosing a suitable format also considerably facilitates later archiving and reuse of the data by third parties.

You should use file formats that ... :

  • have no licenses
  • are readable by many software products (ideally by open-source software)
  • have no encryption or DRM
  • are established in your research community
  • have openly accessible documentation

Data type | Recommended | Also possible
--- | --- | ---
plain text | UTF-8, UTF-16, ASCII, txt |
text | PDF/A-2, DOCX, ODT, OTT | MS-Word, RTF, PDF
table [1] | PDF/A-2, XLSX, XLSM, ODS, OTS | PDF, CSV, XLS (XLC, XLW, XLM)
presentation | PDF/A-2, PPTX, ODP, OTP | PDF, PPT
images / raster graphics | JPEG 2000 Part 1 (lossless compressed) | TIFF, EXIF Image, JPEG, PNG, DNG, BMP, ODG
vector graphics | SVG (without script bindings) |
audio | WAVE LPCM (uncompressed), BWF | MP3, FLAC, AIFF, AIFC, OGG, MPEG-4, WMA, AAC, MIDI, AU, M4A, EXIF Audio
video / film | Matroska/FFV1.3 (video codec: ffv1 v3, GOP size 1; audio codec: pcm_s16le; in MKV Matroska container); MPEG-4 Part 10 (video codec: h.264, 8 bit, 4:2:0 chroma subsampling; audio codec: aac; in MP4 container) | AVI, Quick Time Movie, MPEG 1, MPEG 2, WMV, VOB
database [2] | SIARD | SQL, XML
hypertext / websites [3] | PDF/A-2, HTML, WARC | PDF
e-mail | PDF/A-2 | MBOX, EML

[1] The format selection for tables depends on whether functionality needs to be preserved. Under certain circumstances (e.g. with formulas or macros) it is necessary to deviate from the recommended formats.

[2] As an alternative to converting SQL / XML into SIARD with the SIARD Suite (BAR), DaSCH can be used for more complex and frequently used data objects.

[3] Still open questions about web archiving: How should the data be collected (crawler / harvester, evaluation of the relevant subpages)?
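The table of recommended file formats can be turned into a rough first check of a project folder. The sketch below is an assumption-laden illustration: the mapping from file extensions to table entries is ours, it covers only part of the table, and an extension says nothing about the codec inside a container or about PDF/A conformance (which is why `pdf` is only listed as "also possible" here).

```python
from pathlib import Path

# Extension sets derived from the table above; the mapping from format
# names to file extensions is an assumption made for this sketch.
RECOMMENDED = {
    "txt", "docx", "odt", "ott", "xlsx", "xlsm", "ods", "ots",
    "pptx", "odp", "otp", "svg", "wav", "mkv", "siard", "html", "warc",
}
ALSO_POSSIBLE = {
    "rtf", "pdf", "csv", "xls", "ppt", "tif", "tiff", "jpg", "jpeg",
    "png", "dng", "bmp", "odg", "mp3", "flac", "aiff", "ogg", "wma",
    "aac", "m4a", "avi", "wmv", "vob", "sql", "xml", "mbox", "eml",
}


def classify(filename: str) -> str:
    """Classify a file by its extension according to the sets above."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in RECOMMENDED:
        return "recommended"
    if ext in ALSO_POSSIBLE:
        return "also possible"
    return "not listed"
```

Files reported as "not listed" are not necessarily unsuitable; they are simply candidates for a closer look, for example proprietary instrument formats that may need to be exported to an open format.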


Helpful Links