Preserving


Data preservation means protecting your data in a secure environment over the long term, in such a way that they remain usable, understandable, and accessible. This means more than just making backup copies of the data. If you only back up your data …

  • They may become unreadable with future software because the file format is incompatible.
  • They may be altered when opened with new software, so they are no longer reliable for research.
  • They may be changed by somebody because there is no access control.
  • They may become unintelligible because no documentation or metadata has survived.
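
One common way to notice the silent alterations listed above is to record a checksum for every file and compare the checksums again later. The following is a minimal sketch of that idea, not a prescribed workflow: the function names, the choice of SHA-256, and the in-memory manifest are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List files that were altered, removed, or added between two manifests."""
    names = old.keys() | new.keys()
    return sorted(name for name in names if old.get(name) != new.get(name))
```

A manifest built when the data are deposited can be stored alongside the documentation. Rebuilding it later and passing both to `changed_files` shows which files no longer match.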

Research sponsors differ in the archiving duration they recommend and in how they define "long-term". The University of Basel recommends retaining data for a minimum of 5 years after publication of the research results. However, many research sponsors demand longer retention; at the SNSF, for example, it is 10 years. Charges made by the archive for preparing and ingesting the data can be costed directly into the grant application. The following sections cover some aspects of preservation that you should consider as early as possible in your data management planning.
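
As a small worked example of the retention periods mentioned above, the sketch below computes the earliest date on which a retention period ends. The 5 and 10 year figures come from the text. The function names and the handling of a 29 February publication date are assumptions for illustration only.

```python
from datetime import date

# Retention periods in years, as stated in the text above.
RETENTION_YEARS = {"University of Basel": 5, "SNSF": 10}


def retention_end(published: date, years: int) -> date:
    """Return the date on which the retention period has elapsed."""
    try:
        return published.replace(year=published.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year.
        # Assumption: fall forward to 1 March in that case.
        return date(published.year + years, 3, 1)


def required_retention_end(published: date, sponsors: list[str]) -> date:
    """Apply the longest retention period among the given sponsors."""
    longest = max(RETENTION_YEARS[name] for name in sponsors)
    return retention_end(published, longest)
```

When several requirements apply to the same project, the longest period is the one that matters, which is why `required_retention_end` takes the maximum.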

Although much research data should be retained, either for use and reuse or for validation of research results, this does not apply to all data. Research funders often require retention of research data for a period of time. Whether long-term archiving makes sense and is technically and legally possible must be decided case by case.

Reasons for preserving research data could be:

  • Data must be stored or deleted for contractual, legal, or regulatory reasons
  • Data can be reused (this concerns the quality of the metadata, the completeness, integrity, accessibility, and readability of the data, and the licenses and rights that allow reuse)
  • There is a possible future benefit of the data (contribution to a larger data collection, intergenerational comparison)
  • Data have scientific or historical value (documentation of a period, uniqueness of the data)
  • Regional reference or provenance
  • Reproduction of the data (if possible) is more expensive than storage of the data
  • There are no ethical concerns that speak against archiving
  • The technical requirements for data preservation are met

More information:

http://www.dcc.ac.uk/resources/how-guides/appraise-select-data

https://dans.knaw.nl/en/about/organisation-and-policy/legal-information/DANSselectionofresearchdata.pdf

http://www.open.ac.uk/library-research-support/sites/www.open.ac.uk.library-research-support/files/files/RDM-guidelines-for-selecting-research-data-for-retention-and-preservation(1).pdf

http://www.bath.ac.uk/research/data/archiving-data/appraising-and-selecting/

https://researchdata.nl/en/services/data-management/selecting-research-data/

Careful selection of file formats can ensure that your files are easily accessible and interoperable and can still be used after many years. This may be particularly important in long-term research projects involving many people, or where staff could change during the research process.
Choosing a suitable format also makes later archiving and reuse of the data by third parties considerably easier.

Aspects of a suitable format:

  • No license restrictions (readable by open-source software)
  • Many software products can read the data format
  • No encryption or DRM
  • Established in the community
  • Openly accessible documentation

The following formats are recommended:

| Data type | Recommended | Possible |
|---|---|---|
| plain text | UTF-8, UTF-16, ASCII, txt | |
| text | PDF/A-2, DOCX, ODT, OTT | MS-Word, RTF, PDF |
| table [1] | PDF/A-2, XLSX, XLSM, ODS, OTS | PDF, CSV, XLS (XLC, XLW, XLM) |
| presentation | PDF/A-2, PPTX, ODP, OTP | PDF, PPT |
| images / raster graphs | JPEG 2000 Part 1 (lossless compressed) | TIFF, EXIF Image, JPEG, PNG, DNG, BMP, ODG |
| vector graphs | SVG (without script bindings) | |
| audio | WAVE LPCM (uncompressed), BWF | MP3, FLAC, AIFF, AIFC, OGG, MPEG-4, WMA, AAC, MIDI, AU, M4A, EXIF Audio |
| video / film | Matroska/FFV1.3 (video codec: ffv1 v3, GOP size 1; audio codec: pcm_s16le; in MKV Matroska container); MPEG-4 Part 10 (video codec: h.264, 8 bit, 4:2:0 chroma subsampling; audio codec: aac; in MP4 container) | AVI, Quick Time Movie, MPEG 1, MPEG 2, WMV, VOB |
| database [2] | SIARD | SQL, XML |
| hypertext / websites [3] | PDF/A-2, HTML, WARC | PDF |
| e-mail | PDF/A-2 | MBOX, EML |


[1] The format selection for tables depends on whether functionality has to be preserved. Under certain circumstances (e.g. with formulas or macros) it is necessary to deviate from the recommended formats.

[2] As an alternative to converting SQL / XML into SIARD with the SIARD Suite (BAR), DaSCH can be used for more complex and frequently used data objects.

[3] There are still open questions about web archiving: how should the data be collected (crawler / harvester, evaluation of the relevant subpages)?
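
The format recommendations above can be used for a first rough screening of a project folder. The sketch below sorts file names into "recommended", "possible", or "not listed" by their extension. The mapping from extensions to formats is a partial assumption made for illustration and is not part of the recommendation itself. An extension is only a hint: a `.pdf` file, for instance, may or may not be PDF/A-2, so it is counted as "possible" until a validator confirms otherwise.

```python
from collections import Counter
from pathlib import PurePath

# Partial, assumed mapping from file extensions to the two columns of the
# format recommendations. Extend or correct it for your own project.
RECOMMENDED = {
    ".txt", ".docx", ".odt", ".xlsx", ".ods",
    ".pptx", ".odp", ".svg", ".siard", ".warc",
}
POSSIBLE = {
    ".pdf", ".doc", ".rtf", ".csv", ".xls", ".ppt",
    ".tif", ".tiff", ".jpg", ".jpeg", ".png",
    ".mp3", ".flac", ".avi", ".mov", ".sql", ".xml",
    ".mbox", ".eml",
}


def classify(filename: str) -> str:
    """Classify a file name by its extension (case-insensitive)."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in RECOMMENDED:
        return "recommended"
    if suffix in POSSIBLE:
        return "possible"
    return "not listed"


def summarise(filenames: list[str]) -> Counter:
    """Count how many files fall into each category."""
    return Counter(classify(name) for name in filenames)
```

Files reported as "not listed" are the ones to look at first when planning a conversion before archiving.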

The University of Basel is preparing a long-term preservation solution jointly with other member institutions of the Swiss academic community (partly funded by the Swissuniversities’ DLCM project). These archives will allow fine-grained data access control and hosting of data within Switzerland, which might be a requirement for some data sets.


Helpful Links