When it comes to storing, archiving and publishing research data, misunderstandings often arise because the terms are not clearly defined and service providers use them differently. For example, many repositories speak of "archiving" or even "long-term storage" of data while guaranteeing storage and accessibility for only 5 to at most 20 years. However, this does not necessarily mean that the data will be deleted after the specified period. The terms therefore cannot be clearly separated.
The following is an overview of how we have defined the terms within the network and what solutions are available to researchers at the University of Basel.
|Active data management in a project's filing system within a running research project.
|IT-Services, sciCORE, switchDrive
Storage/ deposition/ deep storage
|Finalized data (sets), to be reused for further (internal) projects.
|sciCORE: deep storage
|Curated and disseminated (selected) data (sets), understandable and reusable for others.
Preservation, (long-term) archiving
|Highly valuable data and digital objects to be kept forever (as cultural heritage, or unique data such as climate data)
|100 + years
Like other research funders, the SNSF supports the principle that research data should be openly accessible to science and society. For most of the SNSF's funding instruments, it is therefore required that the data underlying the research result or generated during the research process be made publicly accessible at the latest after completion of the research project, in compliance with the FAIR principles, as long as there are no legal, ethical or other reasons to the contrary.
Research funders differ in the duration of data archiving they recommend and in how they define "long-term". The University of Basel expects data to be retained for at least 5 years after publication of the research results (Regulation relating to academic integrity at the University of Basel, Art. 4, 8). However, many research funders require longer retention; for example, the SNSF recommends 10 years (SNF, FAQ Preservation and long-term storage of research data). Charges made by the repository for preparing and ingesting the data can be budgeted directly in the grant application.
University employees can store their research data on the university's servers. Whether the infrastructure of ITS or sciCORE is more appropriate for your purpose depends on the volume of data and the need for high-performance computing. The university's servers are managed, have daily backup routines and are protected from hacker attacks and viruses. However, sensitive data must be specially protected against unauthorised access. If you are working with such data, please contact the University's data protection and data security officers.
If you are working collaboratively, you will need data exchange options. There are solutions for this at the University of Basel as well.
Researchers who do not have access to the university's infrastructure and therefore cannot use the ITS or sciCORE servers often store their data on personal local devices (notebook, hard drive). They should then back up their data regularly (for example to switchDrive) to protect it from loss. They should also protect the device and the data on it from unauthorised access. If you work with sensitive data, e.g. personal data, you should always encrypt it.
Although switchDrive creates a backup of the entire system for emergencies, it does not offer a way to restore individual files that have been overwritten or damaged. However, files that are accidentally deleted from the cloud storage can be restored within 90 days. If switchDrive is the primary storage location for research data, make sure that a backup of the data exists elsewhere, e.g. on a local hard drive.
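As an illustration of such an additional backup, the following minimal Python sketch copies a project folder into a new, timestamped folder on another drive, so that earlier versions of individual files remain available. The function name `make_dated_backup` and the folder layout are assumptions made for this example; they are not part of any University of Basel or switchDrive tooling.

```python
import shutil
from datetime import datetime
from pathlib import Path


def make_dated_backup(source_dir: Path, backup_root: Path) -> Path:
    """Copy source_dir into a new timestamped folder below backup_root.

    Hypothetical helper for illustration only.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{source_dir.name}-{stamp}"
    # copytree raises an error if the target already exists,
    # so an earlier backup is never silently overwritten.
    shutil.copytree(source_dir, target)
    return target
```

Calling, for example, `make_dated_backup(Path("my_project"), Path("/mnt/external/backups"))` (both paths are placeholders) would create one complete copy per run. A simple copy like this does not encrypt anything, so sensitive data still needs separate protection.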
We recommend paying attention to good data organisation and documentation from the start of the research process. This makes work easier, especially in research groups and at the end of the project, when the data is to be published and/or archived.
If a research project produces a large amount of data and therefore needs a lot of storage space, sciCORE offers the option of transferring data to tape storage. This type of storage should only be used for data that is not actively needed, as the data cannot be accessed immediately; retrieval requires a few days' notice. It is not an appropriate way of archiving research data, because the data is not regularly checked for validity and outdated file formats are not migrated. Furthermore, there are no backups at different locations, so in the event of a disaster the data is lost. Deep storage is therefore more of an interim or fallback solution.
This type of storage is suitable for long-running projects in which completed subprojects are filed in the meantime, before an archiving solution is found for the entire project. It is also suitable for completed projects where a follow-up project is planned that will continue to use the data, but where the data cannot be published in a repository, for example for legal reasons.
In both cases, the data should be well organised and documented, and metadata should be added, before it is written to tape storage. This ensures that project staff can still understand their old data at a later date and makes it easier for new project staff to work with the data. Good documentation and organisation also help with the later selection of data for publication and archiving.
Since the time factor comes into play here, we also recommend making sure that the data is saved in common or open file formats (see table on recommended file formats).
Data preservation or archiving means protecting your data in a secure environment for the long term in such a way that it remains usable, understandable, and accessible. This means more than just making backup copies of the data. If you only make backups, your data …
- may become unreadable with future software, because the file format is incompatible.
- may be altered when opened with new software, so that it is no longer reliable for research.
- may be changed by somebody, because there is no access control.
- may become unintelligible, because no documentation or metadata has survived.
A digital archive must therefore extract the data from old data carriers and transfer them to the archiving system together with rich documentation and metadata. The data must be checked for viruses in the process. The data in the archive is regularly checked for validity using checksums and is protected against unauthorised access and modification. The data is stored in the original format and additionally as a copy in a file format suitable for archiving. This copy is migrated when the file format threatens to become obsolete. This ensures that the data remains readable. If it is not possible to migrate old file formats, the required software including licences is archived in order to make the data readable via software emulation.
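The checksum-based validity check mentioned above can be sketched in a few lines of Python. This is an illustrative example only, not a description of how any particular archive system works: the function names and the choice of SHA-256 are assumptions. The idea is to record one checksum per file when the data is deposited and to recompute and compare the checksums at regular intervals.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file below root (as a relative POSIX path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose checksum has changed."""
    problems = []
    for rel_path, expected in manifest.items():
        candidate = root / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(rel_path)
    return problems
```

An empty list from `verify_manifest` means that every recorded file is still present and unchanged. Such a check detects damage or modification but cannot repair it, which is why additional copies are needed.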
A local backup is made of the archived data, and at least two further copies are kept on other storage media at a physically remote location so that they are protected against loss in the event of a disaster.
The archiving and long-term preservation of digital data is very time-consuming and costly. Therefore, it must be checked well in advance whether the data is worth archiving (see selection criteria).
The University Library (UB) Basel has built up an archive system that meets these requirements for its own digital holdings and has know-how in this area. If you have very valuable or unique data, contact the UB for advice.
Although much research data should be retained, either for use and reuse or for validating research results, this does not apply to all data. Research funders often require research data to be retained for a certain period. Whether long-term archiving makes sense and is technically and legally possible must be decided case by case.
Reasons for preserving research data could be:
- Data must be stored or deleted for contractual, legal or regulatory reasons
- Data can be reused (this concerns the quality of the metadata, the completeness, integrity, accessibility and readability of the data, and the licences and rights that allow reuse)
- There is a possible future benefit of the data (contribution to a larger data collection, intergenerational comparison)
- Data have scientific or historical value (time documentation, uniqueness of the data)
- Regional reference or provenance
- Reproduction of the data (if possible) is more expensive than the storage of the data
- There are no ethical concerns that speak against archiving
- Technical requirements for data preservation are met
Careful selection of file formats can ensure that your files are easily accessible and interoperable and can still be used after many years. This may be particularly important in long-term research projects involving many people, or where staff could change during the research process. Choosing a suitable format also makes later archiving and reuse of the data by third parties considerably easier.
You should use file formats that ... :
- are free of licence restrictions
- are readable by many software products (ideally open-source software)
- have no encryption or DRM
- are established in your research community
- have openly accessible documentation
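To get a quick overview of which files in a project folder may need converting, a small script can count the files whose extension is not on a list of preferred formats. The sketch below is a rough aid under stated assumptions: the set `PREFERRED_EXTENSIONS` is an illustrative, hand-picked subset of extensions for formats named on this page, and the function name `audit_formats` is made up for this example. An extension alone cannot prove that a file meets the criteria above (for instance, it cannot distinguish PDF/A-2 from ordinary PDF).

```python
from collections import Counter
from pathlib import Path

# Assumption: illustrative subset only; adapt it to your own format policy.
PREFERRED_EXTENSIONS = {
    ".txt", ".odt", ".docx", ".ods", ".xlsx",
    ".svg", ".wav", ".mkv", ".warc",
}


def audit_formats(root: Path) -> Counter:
    """Count files below root, per extension, that are not in PREFERRED_EXTENSIONS."""
    unexpected = Counter()
    for path in root.rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix not in PREFERRED_EXTENSIONS:
            unexpected[suffix or "(no extension)"] += 1
    return unexpected
```

The result maps each unexpected extension to the number of files found, which gives a starting point for deciding what to convert before archiving.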
The following is a list of recommended file formats.
UTF-8, UTF-16, ASCII, txt
PDF/A-2, DOCX, ODT, OTT
MS-Word, RTF, PDF
PDF/A-2, XLSX, XLSM, ODS, OTS
PDF, CSV, XLS (XLC, XLW, XLM)
|PDF/A-2, PPTX, ODP, OTP
images / raster graphics
JPEG 2000 Part 1 (lossless compressed)
TIFF, EXIF Image, JPEG, PNG, DNG, BMP, ODG
SVG (without script bindings)
WAVE LPCM (uncompressed), BWF
MP3, FLAC, AIFF, AIFC, OGG, MPEG-4, WMA, AAC, MIDI, AU, M4A, EXIF Audio
video / film
Matroska/FFV1.3 (Videocodec: ffv1 v3, GOP-size 1; Audiocodec: pcm_s16le; in MKV Matroska container); MPEG-4 Part 10 (Videocodec: h.264, 8 bit, 4:2:0 Chroma Subsampling; Audiocodec: aac; in MP4 container)
AVI, Quick Time Movie, MPEG 1, MPEG 2, WMV, VOB
hypertext / websites (3)
PDF/A-2, HTML, WARC
(1) The choice of format for tables depends on which functionality must be preserved. In some cases (e.g. with formulas or macros), it is necessary to deviate from the recommended formats.
(2) Alternative to converting SQL / XML into SIARD via SIARD Suite (BAR): DaSCH, for more complex and frequently used data objects.
(3) Open questions about web archiving remain: how should the data be collected (crawler / harvester, evaluation of the relevant subpages)?