Operational Pragmatism in Digital Preservation: a discussion

From Somaya Langley, Policy and Planning Fellow at Cambridge: In September this year, six digital preservation specialists from around the world will be leading a panel and audience discussion. The panel is titled Operational Pragmatism in Digital Preservation: establishing context-aware minimum viable baselines. This will be held at the iPres International Digital Preservation Conference in Kyoto, Japan.


Panellists

Panellists include:

  • Dr. Anthea Seles – The National Archives, UK
  • Andrea K Byrne – Rensselaer Polytechnic Institute, USA
  • Dr. Dinesh Katre – Centre for Development of Advanced Computing (C-DAC), India
  • Dr. Jones Lukose Ongalo – International Criminal Court, The Netherlands
  • Bertrand Caron – Bibliothèque nationale de France
  • Somaya Langley – Cambridge University Library, UK

Panellists have been invited based on their knowledge of a breadth of digital creation, archiving and preservation contexts and practices, including experience of working in non-Western, non-institutional and underprivileged communities.

Operational Pragmatism

What does ‘operational pragmatism’ mean? For the past year or two I’ve been pondering the question: ‘what corners can we cut?’ For over a decade I have witnessed an increasing amount of work in the digital preservation space, yet I haven’t seen a matching increase in the staffing and resources needed to handle it. Meanwhile, deadlines for transferring digital (and analogue audiovisual) content from carriers are just around the corner (e.g. Deadline 2025).

Panel Topic

Outside of the First World and national institutional/top-tier university context, individuals in the developing world struggle to access basic technology and resources to be able to undertake archiving and preservation of digital materials. Privileged First World institutions (who still struggle with deeply ingrained under-resourcing) are considering Trusted Digital Repository certification, while in the developing world meeting these standards is simply not feasible. This is evidenced by work undertaken in the POWRR project, in Anthea Seles’ PhD thesis and elsewhere.

How do we best prioritise our efforts so we can plan effectively (with the current resources we have)? How do we strategically develop these resources in methodical ways while ensuring the critical digital preservation work gets done before it is simply too late?

Approach

This panel discussion will take the form of a series of provocations addressing topics including: fixity, infrastructure and storage, preconditioning, pre-ingest processes, preservation metadata, scalability (including bi-directional scalability), technical policies, tool error reporting and workflows.

Each panellist will present their view on a different topic. Audience involvement in the discussion will be strongly encouraged.
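One of the topics above, fixity, lends itself to a concrete illustration of what a ‘minimum viable baseline’ might look like in practice. The sketch below is illustrative only (the function names and the choice of SHA-256 are mine, not anything proposed by the panel): it records a checksum manifest for a folder of digital materials, and can later report any files that no longer match.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def make_manifest(folder: Path) -> dict:
    """Record a checksum for every file under `folder` (the fixity baseline)."""
    return {str(p.relative_to(folder)): sha256_of(p)
            for p in sorted(folder.rglob("*")) if p.is_file()}

def verify_manifest(folder: Path, manifest: dict) -> list:
    """Return the files whose current checksum no longer matches the baseline."""
    return [name for name, expected in manifest.items()
            if sha256_of(folder / name) != expected]
```

Even something this small, run on a schedule, is a meaningful baseline: it tells you whether your digital materials have silently changed since you last looked.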

Outcomes

The intended outcome is a series of agreed-upon ‘baselines’ tailored to different cultural, organisational and contextual situations, with the hope that these can be used for digital preservation planning and strategy development.

Further Information

The Panel Abstract is included below.

iPres Digital Preservation Conference program information can be found at: https://ipres2017.jp/program/.

We do hope you’ll be able to join us.


Panel Abstract

Undertaking active digital preservation, holistically and thoroughly, requires substantial infrastructure and resources. National archives and libraries across the Western world have established, or are working towards, maturity in digital preservation (often underpinned by legislative requirements). On the other hand, smaller collectives and companies situated outside of memory institution contexts, as well as organisations in non-Western and developing countries, are struggling with the basics of managing their digital materials. This panel continues the debate within the digital preservation community, critiquing the development of digital preservation practices that typically takes place from positions of privilege. Bringing together individuals from diverse backgrounds, the aim is to establish a variety of ‘bare minimum’ baselines for digital preservation efforts, while tailoring these to local contexts.

Email preservation: How hard can it be?

Policy and Planning Fellow Edith summarises some highlights from the Digital Preservation Coalition’s briefing day on email preservation. See the full schedule of speakers on DPC’s website.


Yesterday Sarah and I attended DPC’s briefing day on email preservation at the National Archives (UK) in Kew, London. We were keen to go and hear about the latest findings from the Email Preservation Task Force, as Sarah will be developing a course dedicated to email preservation for the DPOC teaching programme. An internal survey circulated to Bodleian Libraries staff earlier this year showed a real appetite for learning about email preservation. It is an issue which evidently spans several areas of our organisation.

The subheading of the event “How hard can it be?” turned out to be very apt. Before even addressing preservation, we were asked to take a step back and ask ourselves:

“Do I actually know what email is?”

As Kate Murray from the Library of Congress put it: “email is an object, several things and a verb”. In this sense email has much in common with the World Wide Web, as they are heavily linked and complex objects. Retention decisions must be made, not only about text content but also about email attachments and external web links. In addition, supporting features (such as instant messaging and calendars) are increasingly integrated into email services and potential candidates for capture.
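To make the ‘email is several things’ point concrete, here is a minimal sketch (my own illustration using Python’s standard email library; the inspect_eml helper is hypothetical) that pulls out the three kinds of content a retention decision has to weigh: body text, attachments, and external web links.

```python
import re
from email import policy
from email.parser import BytesParser

URL_RE = re.compile(r"https?://[^\s<>\"]+")

def inspect_eml(raw_bytes: bytes) -> dict:
    """Parse one RFC 5322 message and list the parts a retention decision covers."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body else ""
    return {
        "subject": msg["subject"],
        # Attachments and links each need their own appraisal decision.
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
        "links": URL_RE.findall(text),
    }
```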

Thinking about email “as a verb” also highlights that it is a cultural and social practice. Capturing relationships and structures of communication is an additional layer to preserve. Anecdotally, some participants on the Email Preservation day had found that data mining, including the ability to undertake analysis across email archives, is increasingly in demand from historians using big data research techniques.

Anthea Seles, National Archives (UK), talks about visualisation of email archives.

What are people doing?

So what are organisations currently doing to preserve email? A strength of the Email Preservation Task Force’s new draft report is that it draws together samples of workflows currently in use by other organisations (primarily US based). Additional speakers from Preservica, the National Archives and the British Library supplemented these with some local examples from the UK throughout the day.

The talks and report show that migration is by far the most common approach to email preservation in the institutions consulted. EML and Mbox are the most common target formats: EML stores each message as a single file, while Mbox aggregates messages into a single database file. (However, beware that Mbox is a whole family of formats with varying levels of documentation!)

While some archives choose to ingest Mbox and EML files into their repositories without further processing, others choose to unpack content within these files. Unpacking content provides a mode of displaying emails, as well as the ability to normalise content within them.
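As a rough illustration of what ‘unpacking’ might involve, the sketch below (my own, using Python’s standard mailbox module, and not any institution’s actual workflow) splits an Mbox file into individual .eml files.

```python
import mailbox
from pathlib import Path

def unpack_mbox(mbox_path: str, out_dir: str) -> int:
    """Write each message in an Mbox file out as a single .eml file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(mbox_path)
    count = 0
    for i, msg in enumerate(box):
        # Zero-padded names keep the original message order in file listings.
        (out / f"{i:06d}.eml").write_bytes(msg.as_bytes())
        count += 1
    box.close()
    return count
```

Once unpacked, each message can be displayed, normalised or catalogued on its own, which is exactly the trade-off the institutions above are weighing.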

The British Library, for example, have chosen to unpack email files using Aid4Mail, and are attempting to replicate the message hierarchy within a folder structure. Using Aid4Mail, they migrate the text of email messages to PDF/A-2b files, which are displayed alongside folders containing any email attachments. The PDF/A-2b files can then be validated using veraPDF or other tools. A CSV manifest is also generated and entered into the relevant catalogues. Preservica’s out-of-the-box workflow is very similar to the British Library’s, although it migrates text content to HTML or UTF-8 encoded text files.
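A CSV manifest of the kind mentioned above could, in its simplest form, be generated along these lines (an illustrative sketch only; the columns and the write_manifest helper are my own choices, not the British Library’s actual manifest format).

```python
import csv
import mailbox

def write_manifest(mbox_path: str, csv_path: str) -> None:
    """Summarise each message in an Mbox file as one row of a CSV manifest."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "date", "from", "to", "subject"])
        for i, msg in enumerate(mailbox.mbox(mbox_path)):
            # Header lookups return None when a header is absent; csv writes
            # None as an empty cell, which is fine for a manifest.
            writer.writerow([i, msg["date"], msg["from"], msg["to"], msg["subject"]])
```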

Another tantalising example (which I can imagine will gain more traction in the future) came from an institution which has used Emulation as a Service to provide access to one of its collections of email. An emulation approach provides access to content within the original operating environment used by the donor of the email archive. Its particular strength is that email attachments, such as images and word processing files, can be viewed in the software of their time (provided licences can be acquired for the software itself).

Finally, a tool which was considered or already in use by many of the contributors is ePADD. ePADD is an open source tool developed by Stanford University Libraries. It provides functions for processing and appraisal of Mbox files, but also has many interesting features for exploring the social and cultural aspect of email. ePADD can mine emails for subjects such as places, events and people. En masse, these subjects provide researchers with a much richer insight into trends and topics within large email archives. (Tip: why not have a look at the ePADD discovery module to see it in practice?)

What do we still need to explore?

It is encouraging that people are already undertaking preservation of email and that there are workflows out there which other organisations can adopt. However, there are many questions and issues still to explore.

  1. Current processes cannot fully capture the interlinked nature of email archives. Questions were raised during the day about the potential of describing archives using linked open data in order to amalgamate separate collections. Email archives may become more valuable to historians as they acquire critical mass.
  2. Other questions were raised around whether or not archives should also crawl web links within emails. Links to external content may be crucial for understanding the context of a message, but this becomes a very tricky issue if emails are accessioned years after creation. If webpages are crawled and associated with an email message years after it was sent, serious doubt is raised about the reliability of the email as a record.
  3. The issue of web links also raises the question of when email harvesting should occur. Would it be better if emails were continually harvested into the archive/records management system, rather than waiting until a member of staff leaves their position? The good news is that many email providers are increasingly documenting and providing APIs to their services, so continual harvesting may become more feasible in the future.
  4. As seen in many of the sample workflows in the Email Preservation Task Force report, email files are often migrated multiple times. Because ePADD works with Mbox, some organisations add a migration step in order to use the tool before normalising to EML. There is currently very little literature available on the impact of migrations, and indeed multiple migrations, on the information content of emails.
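On point 3, the sort of continual harvesting an API makes possible can be sketched with plain IMAP (an illustration only: the harvest_mailbox helper and server details are hypothetical, and a production workflow would also need authentication, incremental fetching and error handling).

```python
import imaplib
from pathlib import Path

def harvest_mailbox(client, out_dir: str) -> int:
    """Fetch every message in the selected folder and save each as an .eml file.

    `client` is anything with IMAP-style search/fetch methods, e.g. a logged-in
    imaplib.IMAP4_SSL instance after client.select("INBOX").
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    _, data = client.search(None, "ALL")
    saved = 0
    for num in data[0].split():
        _, msg_data = client.fetch(num, "(RFC822)")
        raw = msg_data[0][1]  # the raw RFC 5322 bytes of the message
        (out / f"{num.decode()}.eml").write_bytes(raw)
        saved += 1
    return saved

# A real session might look like this (server name and credentials are placeholders):
# client = imaplib.IMAP4_SSL("imap.example.org")
# client.login("user", "password")
# client.select("Archive")
# harvest_mailbox(client, "harvested_eml")
```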

What can you do now to help?    

So while there are some big technical and philosophical challenges, the good news is that there are things you can do to contribute right now. You can:

  • Become a “Friend of the Email Preservation Task Force” and help them review new reports and outputs
  • Contribute your organisation’s workflows to the Email Preservation Task Force report, so that they can be shared with the community
  • Run trial migrations between different email formats such as PST, Mbox and EML and blog about your findings
  • Support open source tools such as ePADD through either financial aid or (if you are technically savvy) your time. We rely heavily on these tools and need to work together to make them sustainable!
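For the trial migrations suggested above, a first sanity check is that the same set of messages comes out as went in. The sketch below is my own: comparing serialised bytes is crude, since a real migration may legitimately change line endings or header order (in which case some normalisation would be needed before hashing), but it catches dropped or corrupted messages.

```python
import hashlib
import mailbox
from pathlib import Path

def _digest(raw: bytes) -> str:
    return hashlib.sha256(raw).hexdigest()

def compare_mbox_to_eml(mbox_path: str, eml_dir: str) -> bool:
    """Check a trial Mbox-to-EML migration: same set of message bytes on both sides."""
    source = sorted(_digest(msg.as_bytes()) for msg in mailbox.mbox(mbox_path))
    target = sorted(_digest(p.read_bytes()) for p in Path(eml_dir).glob("*.eml"))
    return source == target
```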

Overall the Email Preservation day was very inspiring and informative, and I cannot wait to hear more from the Email Preservation Task Force. Were you also at the event and have some other highlights to add? Please comment below!  

Data reproducibility, provenance capture and preservation

An update from the Cambridge Fellows about their visit to the Cambridge Computer Laboratory to learn about the team’s research on provenance metadata.


In amongst preparing reports for the powers that be and arranging vendor meetings, Dave and Lee took a trip over to the William Gates Building, which houses the University of Cambridge’s Computer Laboratory. The purpose of the visit was to find out about the Digital Technology Group’s projects from one of their Senior Research Associates, Dr. Ripduman Sohan.

The particular project discussed was FRESCO, which stands for Fabric For Reproducible Computing. You can find out more about the strands of this project here: https://www.cl.cam.ac.uk/research/dtg/fresco. The link to the poster is especially useful; it clearly and succinctly captures the key points of the meeting far better than my meeting notes.


FRESCO Poster. Image credit: Cambridge Computer Laboratory.

The discussion on provenance was of particular interest to me: coming from a recordkeeping background, it was fascinating to hear it discussed in computer science terms. What Rip was describing and what archivists do really aren’t a million miles apart – it’s just that the provenance capture on the data happens in nanoseconds, across mind-blowing amounts of data.

Rip’s approach, to my ears at least, was refreshing. He believes that computer scientists should start to listen to, move across into and understand ‘other’ domains such as the humanities. Computer science should be ‘computing for the future of the planet’, not a subject that imposes itself on other disciplines and creates a binary choice of the CompSci way or the highway. That way, computer scientists can use their skills to help both future research and the practitioners working with humanities information and data.

ARA Conference Round up: Day 1

These are some excerpts from Lee’s detailed conference report on the 2016 ARA Conference in London. It ran from 31 August to 2 September and included two full days of sessions devoted to conversations on digital preservation. His full conference report is available for download at the end of this blog post.


It has been three weeks since the last cup of tea was self-served, the last morsel of cake consumed and the sincere goodbyes to fellow colleagues said at the annual ARA Conference, held at Wembley. Many delegates left with minds crammed with new ideas, innovations and practical lessons to use back at work. I left with the strong impression that digital preservation within the recordkeeping community in the UK and Ireland has become part of the ‘mainstream’ in recordkeeping practice across a variety of sectors. The recordkeeping community has moved on from wanting to know what digital preservation is to how to get involved and preserve digital collections for future generations.

Some highlights from the sessions on Day 1 are:

  • Mike Quinn reminded delegates that they needed to remain flexible in relation to digital preservation challenges, nothing is guaranteed: Apple ending support for the .MOV file format demonstrated that.
  • Matthew Addis noted that “apathy is the digital record killer,” so starting from somewhere simple and working from there is the best way to tackle the digital preservation ‘problem’. Addis observed that lots of organisations seem to suffer from a “digital preservation paralysis” and fear getting it wrong. However, he advised those assembled that “doing nothing is the worst choice”.
  • Kristy Lee’s “simple but vital” advice was to understand where your organisation is in terms of digital preservation work and work out what it is you want to do with digital preservation. She found Adrian Brown’s maturity models quite useful for that.
  • The E-ARK project is coming to an end, but has done interesting open source tool development for implementation of specifications that are scalable, modular, robust and adaptable. Find out more about the project and its December conference here.

The afternoon panel session, “Would like to know more” – Digital preservation training and professional development, was a particularly interesting discussion for the Outreach and Training Fellows. It summarised the findings of the ‘Digital Archiving and Preservation Training Needs Survey’ led by the University of London Computer Centre (ULCC) in collaboration with the Digital Preservation Coalition (DPC) and the Digital Curation Centre (DCC). Ed Pinsent neatly presented the findings of the needs survey:

  1. People want to learn about strategy and planning, not exclusively DP theory, not exclusively IT;
  2. People are clear that Digital Preservation training will bring them benefits directly related to their job/organisation/collections;
  3. People want to learn by doing;
  4. Everybody wants to know more; and
  5. Everyone wants to feel confident about digital preservation. ‘Confidence’ was not a word used in the wording of the survey, but looking through the qualitative data it was a recurring theme.

To conclude the session, Stephanie Taylor advised that in digital preservation training there is no ‘magic answer’ or single ‘right path’. You do have to accept that ongoing review and starting anew are part of the practice.

For the Digital Preservation at Oxford and Cambridge project, these conclusions and lessons from the ULCC-led survey will certainly be interesting to compare once the initial Training Needs Survey has been carried out at the two respective institutions.


Lee’s full ARA Conference write-up can be read here.

Introducing Digital Preservation at Oxford and Cambridge

1 August 2016 marks the beginning of a two-year collaborative project between Cambridge University Library (Cambridge) and University of Oxford’s Bodleian Libraries (Oxford). This project has been funded to assess current practices, then design and implement best practice digital preservation programmes at each institution.
