Storage

Research Data Management-Network University of Basel

When it comes to storing, archiving and publishing research data, misunderstandings often arise because the terms are not clearly defined and service providers use them differently. For example, many repositories talk about "archiving" or even "long-term storage" of data while guaranteeing storage and accessibility for 5 to at most 20 years. However, this does not necessarily mean that the data will be deleted after the specified period. The terms therefore cannot be cleanly separated.

The following is an overview of how we have defined the terms within the network and what solutions are available to researchers at the University of Basel.

term	description	provider	time
Active storage	Active data management in a project's filing system during a running research project.	IT-Services, sciCORE, switchDrive	1-5 years
Mid-term storage/ deposition/ deep storage	Finalized data (sets), to be reused for further (internal) projects.	sciCORE: deep storage	1-10 years
Publication	Curated and disseminated (selected) data (sets), understandable and reusable for others.	Repositories	5-20 years
Preservation, (long-term) archiving	Highly valuable data and digital objects to be kept forever (as cultural heritage, or unique data such as climate data)	University Library	100+ years

Like other research funders, the SNSF supports the principle that research data should be openly accessible to science and society. For most of the SNSF's funding instruments, it is therefore required that the data underlying the research result or generated during the research process be made publicly accessible at the latest after completion of the research project, in compliance with the FAIR principles, as long as there are no legal, ethical or other reasons to the contrary.

Different parts of a dataset from a research project may be subject to varying retention periods depending on their specific characteristics and purposes:

A) Personal Data: The retention period for personal data should be clarified during the preparatory phase of the research project. This means that the consent obtained from study participants must explicitly allow for mid- and/or long-term preservation and reuse of the data for research purposes. Abide by the “Data Minimisation” principle: retain only the data that is directly relevant and necessary for the purposes of the project. The data should be kept only for as long as is necessary to fulfil these purposes and in compliance with applicable regulations. If the preservation of personal data is not specified in the informed consent, such data must be deleted or anonymized as soon as it is no longer required for the research project.

B) All Other Data: Non-personal data, anonymized data, aggregated results, and similar types of data should be retained for at least 5 years, in accordance with the following regulations:

Research funders differ in the data retention period they recommend and in how they define "long-term". The University of Basel expects data to be retained for a minimum of 5 years after publication of the research results (Regulation relating to academic integrity at the University of Basel, Art. 4, 8). However, many funders require longer retention; for example, the SNSF recommends 10 years (SNF, FAQ Preservation and long-term storage of research data). Charges made by the repository for preparing and ingesting the data can be costed directly into the grant application.
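As a rough illustration of how the retention periods for non-personal data combine, the following Python sketch computes the earliest date on which data could be deleted when several policies apply. The policy keys, function names and the leap-day handling are assumptions made for this example; personal data follows the separate rules described under A).

```python
from datetime import date

# Minimum retention periods in years, taken from the text above.
# The dictionary keys are hypothetical labels chosen for this sketch.
RETENTION_YEARS = {
    "unibas": 5,   # University of Basel: minimum after publication of results
    "snsf": 10,    # SNSF recommendation
}


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # Only reached for 29 February when the target year is not a leap year.
        return start.replace(year=start.year + years, day=28)


def earliest_deletion_date(published: date, policies: list[str]) -> date:
    """Apply the longest retention period among the policies that apply."""
    years = max(RETENTION_YEARS[policy] for policy in policies)
    return add_years(published, years)


print(earliest_deletion_date(date(2024, 3, 15), ["unibas", "snsf"]))  # 2034-03-15
```

When more than one policy applies, the longest period wins, which is why an SNSF-funded project at the University of Basel ends up with 10 years in this sketch.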

University employees can store their research data on the university's servers. Whether the infrastructure of ITS or sciCORE is more appropriate for your purpose depends on the volume of data and the need for high-performance computing. The University's servers are managed, have daily backup routines and are protected from hacker attacks and viruses. However, sensitive data must be specially protected against unauthorised access. If you are working with such data, please contact the University's data protection and data security officers.

If you are working collaboratively, you will need data exchange options. There are solutions for this at the University of Basel as well.

Researchers who do not have access to the university's infrastructure and therefore cannot use the ITS or sciCORE servers often store their data on personal local devices (notebook, hard drive). They should then ensure that they regularly back up their data (for example to switchDrive) to protect it from loss. They should also protect the device and the data on it from unauthorised access. If you work with sensitive data, e.g. personal data, you should always encrypt it.

Although switchDrive creates a backup of the entire system for emergencies, it does not offer the possibility of restoring individual files that have been overwritten or damaged. However, files that are accidentally deleted from the cloud storage can be restored within 90 days. If switchDrive is the primary storage location for research data, make sure that a backup of the data exists elsewhere, e.g. on a local hard drive.

We recommend paying attention to good data organisation and documentation already during the research process. This facilitates the work especially in research groups and at the end of the project when the data is to be published and/or archived.


If a research project produces a large amount of data and therefore needs a lot of storage space, sciCORE offers the option of transferring data to tape storage. This type of storage should only be used for data that is not currently in active use, because the data cannot be accessed immediately; retrieval requires a few days' notice. Tape storage is not an appropriate way of archiving research data, as the data is not regularly checked for validity and outdated file formats are not migrated. Furthermore, there are no backups at different locations, so in the event of a disaster the data is lost. Deep storage is therefore more of an interim or fallback solution.

This type of storage is suitable for long-running projects in which completed subprojects are filed in the meantime, before an archiving solution is found for the entire project. It is also suitable for completed projects with a planned follow-up project in which the data will continue to be used but cannot be published in a repository, for example for legal reasons.

In both cases, the data should be well organised and documented, and metadata should be added, before it is written out to tape storage. This ensures that project staff can still understand their old data at a later date and makes it easier for new project staff to work with the data. Good documentation and organisation of the data also helps with the later selection of data for publication and archiving.

Because the data may remain in storage for a long time, we also recommend making sure that it is saved in common or open file formats (see table on recommended file formats).
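One simple way to document a project folder before it goes to deep storage is to write a small machine-readable inventory next to the data. The following Python sketch is only an illustration: the file name `inventory.json`, the field names and the function names are assumptions for this example and not a format prescribed by sciCORE or the University of Basel.

```python
import json
from pathlib import Path

INVENTORY_NAME = "inventory.json"  # hypothetical file name chosen for this sketch


def build_inventory(project_dir: Path, title: str, contact: str) -> dict:
    """Collect basic descriptive metadata and a list of all files in the folder."""
    files = []
    for path in sorted(project_dir.rglob("*")):
        # Skip directories and an inventory left over from an earlier run.
        if not path.is_file() or path.name == INVENTORY_NAME:
            continue
        files.append({
            "path": path.relative_to(project_dir).as_posix(),
            "bytes": path.stat().st_size,
            "format": path.suffix.lower().lstrip(".") or "unknown",
        })
    return {"title": title, "contact": contact, "files": files}


def write_inventory(project_dir: Path, title: str, contact: str) -> Path:
    """Write the inventory as UTF-8 encoded JSON into the project folder."""
    inventory = build_inventory(project_dir, title, contact)
    target = project_dir / INVENTORY_NAME
    target.write_text(json.dumps(inventory, indent=2), encoding="utf-8")
    return target
```

An inventory like this does not replace a README or proper metadata, but it gives later project staff a quick overview of what was stored and in which formats.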

Data preservation or archiving means protecting your data in a secure environment for the long term in such a way that it remains usable, understandable, and accessible. This means more than just making backup copies of the data. If you only back up your data, it …

may become unreadable with future software because the file format is incompatible.
may be altered when opened with new software, so that it is no longer reliable for research.
may be changed by somebody because there is no access control.
may become unintelligible because no documentation or metadata has survived.

A digital archive must therefore extract the data from old data carriers and transfer it to the archiving system together with rich documentation and metadata. The data must be checked for viruses in the process. The data in the archive is regularly checked for validity using checksums and is protected against unauthorised access and modification. The data is stored both in its original format and as a copy in a file format suitable for archiving. This copy is migrated when the file format threatens to become obsolete. This ensures that the data remains readable. If it is not possible to migrate old file formats, the required software including licences is archived in order to make the data readable via software emulation.
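The checksum-based validity check can be sketched in a few lines of Python. This is a minimal illustration, not a description of the system any particular archive runs: the choice of SHA-256 and the function names are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_manifest(archive_dir: Path) -> dict[str, str]:
    """Map every file below archive_dir (by relative path) to its checksum."""
    return {
        path.relative_to(archive_dir).as_posix(): sha256_of(path)
        for path in sorted(archive_dir.rglob("*"))
        if path.is_file()
    }


def verify_manifest(archive_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose checksum has changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = archive_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

The manifest is created once at ingest and stored separately from the data. Running `verify_manifest` at regular intervals then reveals files that have been lost, corrupted or modified since ingest.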

A local backup is made of the archived data, and at least two further copies are kept on other storage media at a geographically remote location so that they are protected against loss in the event of a disaster.

The archiving and long-term preservation of digital data is very time-consuming and costly. Therefore, it must be checked well in advance whether the data is worth archiving (see selection criteria).

The University Library (UB) Basel has built up an archive system that meets these requirements for its own digital holdings and has know-how in this area. If you have very valuable or unique data, contact the UB for advice.

Although much research data should be retained, either for use and reuse or for validation of research results, this does not apply to all data. Research funders often require research data to be retained for a certain period. Whether long-term archiving makes sense and is technically and legally possible must be decided case by case.

Reasons for preserving research data could be:

Data must be stored or deleted for contractual, legal or regulatory reasons
Data can be reused (this concerns quality of the metadata, completeness of the data, integrity and accessibility of the data, readability of the data, licenses, rights that allow reuse)
There is a possible future benefit of the data (contribution to a larger data collection, intergenerational comparison)
Data have scientific or historical value (time documentation, uniqueness of the data)
Regional reference or provenance
Reproduction of the data (if possible) is more expensive than the storage of the data
There are no ethical concerns that speak against archiving
Technical requirements for data preservation are met

Careful selection of file formats can ensure that your files are easily accessible and interoperable and can still be used after many years. This may be particularly important in long-term research projects involving many people, or where staff could change during the research process. Choosing a suitable format also makes later archiving and reuse of the data by third parties considerably easier.

You should use file formats that:

are not subject to licences
are readable by many software products (ideally by open-source software)
have no encryption or digital rights management (DRM)
are established in your research community
have openly accessible documentation

The following is a list of recommended file formats.

Data Type	recommended	also possible
plain text	UTF-8
text	PDF/A-2, DOCX, ODT, OTT	RTF, PDF
table¹	PDF/A-2, XLSX, XLSM, ODS, OTS	PDF, CSV, (legacy: XLC, XLW, XLM)
presentation	PDF/A-2, PPTX, ODP, OTP	PDF
images /raster graphs	JPEG 2000 Part 1 (lossless compressed)	TIFF, EXIF Image, JPEG, PNG, DNG, BMP, ODG
vector graphs	SVG (without script bindings)
audio	WAVE LPCM (uncompressed), BWF	MP3, FLAC, AIFF, AIFC, OGG, MPEG-4, WMA, AAC, MIDI, AU, M4A, EXIF Audio
video / film	Matroska/FFV1.3 (Videocodec: ffv1 v3, GOP-size 1; Audiocodec: pcm_s16le; in MKV Matroska container); MPEG-4 Part 10 (Videocodec: h.264, 8 bit, 4:2:0 Chroma Subsampling; Audiocodec: aac; in MP4 container)	AVI, QuickTime Movie, MPEG 1, MPEG 2, WMV, VOB
data base²	SIARD	SQL-Dump, XML, HDF5
hypertext / websites	PDF/A-2, HTML, WARC	PDF
E-mail	PDF/A-2	MBOX, EML

[1] The choice of format for tables depends on whether functionality needs to be preserved. Under certain circumstances (e.g. with formulas or macros) it is necessary to deviate from the recommended formats.
[2] As an alternative to converting SQL / XML into SIARD via the SIARD Suite (BAR), DaSCH can be used for more complex and frequently used data objects.
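A first rough check of a project folder against the table above can be automated by looking at file extensions. The Python sketch below covers only a subset of the listed formats, and the mapping from format names to extensions is an assumption made for this example. An extension also says nothing about the actual content: a `.pdf` file may or may not be PDF/A-2, and a container format does not confirm the codec, so such files are classified conservatively.

```python
from pathlib import Path

# Subset of the table above; the extension mapping is an assumption of this sketch.
RECOMMENDED = {"docx", "odt", "xlsx", "ods", "pptx", "odp", "svg", "wav", "siard", "warc"}
# "pdf" is listed here because the extension alone cannot prove PDF/A-2 conformance.
ALSO_POSSIBLE = {"rtf", "pdf", "csv", "tiff", "png", "mp3", "flac", "avi", "mbox", "eml"}


def classify(filename: str) -> str:
    """Classify a file by its extension as recommended, also possible or not listed."""
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension in RECOMMENDED:
        return "recommended"
    if extension in ALSO_POSSIBLE:
        return "also possible"
    return "not listed"


for name in ["Report.DOCX", "scan.pdf", "measurements.xyz"]:
    print(name, "->", classify(name))
```

Files reported as "not listed" are not necessarily unsuitable; they simply need a manual look, for example to check whether the format is established in the research community.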

When an employee leaves a project or a project is completed, it must be clarified in good time who still needs access to the data. For example, if an SNSF-funded project ends, the PhD student associated with the project might still need access to the data to finalize their dissertation.

Access regulations for former project members should be clearly defined and an agreement should be established with the Principal Investigator (PI) and/or the head of the institute. Furthermore, IT should be consulted to ensure proper access management.

The access agreement can be set up either for individual researchers or for collaborations, e.g. within a collaborative research grant.

The agreement should address the following questions:

Scope of the Agreement and Researcher Information: Clearly define who the agreement applies to. Provide necessary details about the researchers involved, e.g. their contact details, affiliations, and roles within the project.
Project Information: Include relevant details about the project.
Limitations and Access Regulations: Determine what kind of access former project members require (e.g. access to e-mail, servers for data storage, clusters for data processing, etc.). Specify for what purpose and for how long.
- Information on Personal Account Access: Data access is provided through a personal account (name@unibas.ch). Former project members still associated with the university should therefore contact the responsible department or faculty personnel. If a researcher has a legitimate interest in maintaining access even after leaving the university, clarify if the personal account can be continued as a so-called §3-account.
Type of Data and Data Use: Determine how existing data will be processed further and how newly acquired data will be handled. Include information on the type of data, which file formats are used, and how the data is documented (e.g. in Readme files).
Technical Conditions: Determine where and how data is stored, e.g. which storage solution is being used, if and how data is encrypted, and which backup procedures are in place.
Ensure that all data relevant to this access agreement are stored within the project folder (and not on other partitions or in private storage). Additionally, clarify if researchers can store an additional copy of the data (or parts thereof) at a separate location.
Reporting and Information Duties: Determine how team members and research partners are informed (e.g. about data processing results or potential publications), and how they are included in publications (authorship, acknowledgments).
Clarification of Responsibilities: Clearly outline the responsibilities of former project members and IT staff regarding data management, including curation, readability, and backup. In particular, clarify when data should be deleted and who is responsible for it.
Confidentiality Agreements: The agreement should specify whether former project members are bound by existing confidentiality agreements or whether a separate confidentiality agreement is required. This is particularly important when dealing with personal data, or unpublished research findings.
