What is holding us back from change?

There are worse spots for a meeting. Oxford. Photo by: S. Mason

Every 3 months the DPOC team gets together in person in Oxford, Cambridge or London (there’s also been talk of holding a meeting at Bletchley Park sometime). As this is a collaborative effort, these meetings offer a rare opportunity to work face-to-face instead of via Skype, with its endless issues around screen sharing and poor connections. Good ideas come when we get to sit down together.

With our next joint board meeting taking place next week, it was important to look over the work of the past year and make sure we are happy with the plan for year two. Most importantly, we wanted to discuss the messages we need to give our institutions as we look towards the sustainability of our digital preservation activities. How do we ensure that the work done before and during this project does not have to be repeated in 2-5 years’ time?

Silos in institutions

This is especially complicated when dealing with institutions like Oxford and Cambridge. We are big and old institutions with teams often working in silos. What does siloing have an effect on? Well, everything. Communication, effort, research—it all suffers. Work done previously is done again. Over and over.

The same problems are being tackled within different silos; this is duplicated and wasted effort if they are not communicating their work to each other. This means that digital preservation efforts can be fractured and imbalanced if institutional collaboration is ignored. We have an opportunity and responsibility in this project to get people together and to get them to talk openly about the digital preservation problems they are each trying to tackle.

Managers need to lead the culture change in the institution

While not always the case, it is important that managers do not just sit back and say “you will never get this to work” or “it has always been this way.” We need them on our side; they are often the gatekeepers of silos. We have to bring them together in order to start opening the silos.

It is within their power to be the agents of change; we have to empower them to believe in changing the habits of our institutions. They have to believe that digital preservation is worth it if their teams are to believe it too.

This might be the ‘carrot and stick’ approach or the ‘carrot’ only, but whatever approach is used, there are a number of points we agreed needed to be made clear:

  • our digital collections are significant and we have made assurances about their preservation and long term access
  • our institutional reputation plays a role in the preservation of our digital assets
  • digital preservation is a moving target and we must be moving with it
  • digital preservation will not be “solved” through this project, but we can make a start; it is important that the work does not stop when the project ends.

Roadmap to sustainable digital preservation

Backing up any messages is the need for a sustainable roadmap. If you want change to succeed and if you want digital preservation to be a core activity, then steps must be actionable and incremental. Find out where you are, where you want to go and then outline the timeline of steps it will take to get there. Consider using maturity models to set goals for your roadmap, such as Kenney and McGovern’s, Brown’s or the NDSA model. Each is slightly different and some might be more suitable for your institution than others, so have a look at all of them.

It’s like climbing a mountain. I don’t look at the peak as I walk; it’s too far away and too unattainable. Instead, I look at my feet and the nearest landmark. Every landmark I pass is a milestone and I turn my attention to the next one. Sometimes I glance up at the peak, still in the distance—over time it starts to grow closer. And eventually, my landmark is the peak.

It’s only when I get to the top that I see all of the other mountains I also have to climb. And so I find my landmarks and continue on. I consider digital preservation a bit of the same thing.

What are your suggestions for breaking down the silos and getting fractured teams to work together? 

Operational Pragmatism in Digital Preservation: a discussion

From Somaya Langley, Policy and Planning Fellow at Cambridge: In September this year, six digital preservation specialists from around the world will be leading a panel and audience discussion. The panel is titled Operational Pragmatism in Digital Preservation: establishing context-aware minimum viable baselines. This will be held at the iPres International Digital Preservation Conference in Kyoto, Japan.


Panellists

Panellists include:

  • Dr. Anthea Seles – The National Archives, UK
  • Andrea K Byrne – Rensselaer Polytechnic Institute, USA
  • Dr. Dinesh Katre – Centre for Development of Advanced Computing (C-DAC), India
  • Dr. Jones Lukose Ongalo – International Criminal Court, The Netherlands
  • Bertrand Caron – Bibliothèque nationale de France
  • Somaya Langley – Cambridge University Library, UK

Panellists have been invited based on their knowledge of a breadth of digital creation, archiving and preservation contexts and practices including having worked in non-Western, non-institutional and underprivileged communities.

Operational Pragmatism

What does ‘operational pragmatism’ mean? For the past year or two I’ve been pondering: ‘What corners can we cut?’ For over a decade I have witnessed an increasing amount of work in the digital preservation space, yet I haven’t seen a matching increase in the staffing and resources to handle this work. Meanwhile, deadlines for transferring digital (and analogue audiovisual) content from carriers are just around the corner (e.g. Deadline 2025).

Panel Topic

Outside of the First World and the national institutional/top-tier university context, individuals in the developing world struggle to access the basic technology and resources needed to undertake archiving and preservation of digital materials. Privileged First World institutions (which still struggle with deeply ingrained under-resourcing) are considering Trusted Digital Repository certification, while in the developing world meeting these standards is simply not feasible. (This is evidenced by work from the POWRR project, Anthea Seles’ PhD thesis and more.)

How do we best prioritise our efforts so we can plan effectively (with the current resources we have)? How do we strategically develop these resources in methodical ways while ensuring the critical digital preservation work gets done before it is simply too late?

Approach

This panel discussion will take the form of a series of provocations addressing topics including: fixity, infrastructure and storage, preconditioning, pre-ingest processes, preservation metadata, scalability (including bi-directional scalability), technical policies, tool error reporting and workflows.

Each panellist will present their view on a different topic. Audience involvement in the discussion will be strongly encouraged.

Outcomes

The intended outcome is a series of agreed-upon ‘baselines’ tailored to different cultural, organisational and contextual situations, with the hope that these can be used for digital preservation planning and strategy development.

Further Information

The Panel Abstract is included below.

iPres Digital Preservation Conference program information can be found at: https://ipres2017.jp/program/.

We do hope you’ll be able to join us.


Panel Abstract

Undertaking active digital preservation, holistically and thoroughly, requires substantial infrastructure and resources. National archives and libraries across the Western world have established, or are working towards maturity in digital preservation (often underpinned by legislative requirements). On the other hand, smaller collectives and companies situated outside of memory institution contexts, as well as organisations in non-Western and developing countries, are struggling with the basics of managing their digital materials. This panel continues the debate within the digital preservation community, critiquing the development of digital preservation practices typically from within positions of privilege. Bringing together individuals from diverse backgrounds, the aim is to establish a variety of ‘bare minimum’ baselines for digital preservation efforts, while tailoring these to local contexts.

Six Priority Digital Preservation Demands

Somaya Langley, Cambridge Policy and Planning Fellow, talks about her top 6 demands for a digital preservation system.


Photo: Blazej Mikula, Cambridge University Library

As a former user of one digital preservation system (Ex Libris’ Rosetta), I have spent a few years frustrated by the gap between what activities need to be done as part of a digital stewardship end-to-end workflow – including packaging and ingesting ‘information objects’ (files and associated metadata) – and the maturity level of digital preservation systems.

Digital Preservation Systems Review

At Cambridge, we are looking at different digital preservation systems and what each one can offer. This has involved talking to both vendors and users of systems.

When I’m asked about my top current (or future) requirements for a digital preservation system, it’s excruciatingly hard to limit myself to a handful of things. However, having previously been involved in a digital preservation system implementation project, there are some high-level takeaways from past experience that have stayed with me.

Shortlist

Here’s the current list of my six top ‘digital preservation demands’ (aka user requirements):

Integration (with various other systems)

A digital preservation ‘system’ is only one cog within a much larger machine; one piece of a much larger puzzle. There is an entire ‘digital ecosystem’ that this ‘system’ should exist within, and end-to-end digital stewardship workflows are of primary importance. The right amount of metadata and/or files should flow from one system to another. We must also know where the ‘source of truth’ is for each bit.

Standards-based

This seems like a no-brainer. We work in Library Land. Libraries rely on standards. We also work with computers and other technologies that require standard ways of communicating (protocols etc.).

For files and metadata to flow from one system to another – whether via import, ingest, export, migration or an exit strategy from a system – we already spend a bunch of time creating mappings and crosswalks from one standard (or implementation of a standard) to another. If we don’t use (or fully implement) existing standards, this means we risk mangling data, context or meaning; potentially losing or not capturing parts of the data; or just wasting a whole lot of time.
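To make this concrete, below is a minimal sketch of the kind of crosswalk work this involves; all the field names are invented for illustration and are not taken from any real schema. The main point is that anything which cannot be mapped should be flagged rather than silently dropped:

```python
# Minimal sketch of a metadata crosswalk between two schemas.
# All field names here are invented for illustration.

DC_TO_TARGET = {
    "dc:title": "displayName",
    "dc:creator": "agentName",
    "dc:date": "dateCreated",
}

def crosswalk(record: dict) -> dict:
    """Map a source record onto the target schema, flagging anything unmapped."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        if field in DC_TO_TARGET:
            mapped[DC_TO_TARGET[field]] = value
        else:
            unmapped[field] = value  # never silently drop data
    return {"mapped": mapped, "unmapped": unmapped}

print(crosswalk({"dc:title": "Annual report", "dc:rights": "CC-BY"}))
```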

Error Handling (automated, prioritised)

There’s more work to be done in managing digital materials than there are people to do it. Content creation is increasing at an exponential rate, while the number of staff (with the right skills) is not. We have to be smart about how we work. This requires prioritisation.

We need to have smarter systems that help us. This includes helping to prioritise where we focus our effort. Digital preservation systems are increasingly incorporating new third-party tools. We need to know which tool reports each error and whether these errors are show-stoppers or not. (For example: is the content no longer renderable versus a small piece of non-critical descriptive metadata that is missing?) We have to accept that, for some errors, we will never get around to addressing them.
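As a rough illustration of what prioritised error handling could look like, here is a sketch in Python; the tool names, error codes and severity rules are all invented for the example:

```python
# Sketch of triaging errors reported by third-party tools.
# Tool names, error codes and severity rules are invented.

SHOW_STOPPERS = {"render_failure", "checksum_mismatch"}

errors = [
    {"tool": "format-validator", "code": "render_failure", "file": "report.doc"},
    {"tool": "metadata-extractor", "code": "missing_description", "file": "img_001.tif"},
]

def severity(error: dict) -> str:
    """Show-stoppers first; everything else can wait (perhaps forever)."""
    return "critical" if error["code"] in SHOW_STOPPERS else "low"

for e in sorted(errors, key=lambda e: severity(e) != "critical"):
    print(f"[{severity(e).upper()}] {e['tool']}: {e['code']} ({e['file']})")
```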

Reporting

We need to be able to report to different audiences. The different reporting classes include (but are not limited to):

  1. High-level reporting – annual reports, monthly reports, reports to managers, projections, costings etc.
  2. Collection and preservation management reporting – reporting on successes and failures, overall system stats, rolling checksum verification etc. (see the fixity sketch after this list)
  3. Reporting for preservation planning purposes – based on preservation plans, we need to be able to identify subsections of our collection (configured around content types, context, file format and/or whatever other parameters we choose to use) and report on potential candidates that require some kind of preservation action.
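To illustrate the second class, here is a minimal sketch of a rolling fixity check; the manifest path and its JSON format are assumptions made for the example:

```python
# Sketch of a rolling fixity check: re-hash a random sample of files each
# run and compare against a stored manifest ({"path": "sha256-hex", ...}).
import hashlib
import json
import random
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = json.loads(Path("manifest.json").read_text())   # assumed format
sample = random.sample(list(manifest), k=min(100, len(manifest)))

failures = [p for p in sample if sha256(Path(p)) != manifest[p]]
print(f"Checked {len(sample)} files; {len(failures)} fixity failure(s).")
```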

Provenance

We need to support – via metadata – a record of where a file has come from. This, for want of a better approach, is currently being handled by the digital preservation community through documenting changes as Provenance Notes. Digital materials acquired into our collections are not just the files; they’re also the metadata. (Hence why I refer to them as ‘information objects’.) When an ‘information object’ has been bundled and is ready to be ingested into a system, I think of it as becoming an ‘information package’.

There’s a lot of metadata (administrative, preservation, structural, technical) that accumulates along the path from an object’s creation to the point at which it becomes an ‘information package’. We need to ensure we’re capturing and retaining the important components of this metadata. Those components we deem essential must travel alongside their associated files into a preservation system. (Not all files will have any, or even the right, metadata embedded within the file itself.) Standardised ways of handling information held in Provenance Notes (whether these come from ‘outside of the system’ or are created by the digital preservation system) and event information, so that they can be interrogated and reported on, are crucial.
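As a rough sketch of what capturing such event information might look like, the snippet below builds a simple PREMIS-style event record; the field names loosely follow PREMIS event semantics but are illustrative rather than drawn from any particular system:

```python
# Sketch of a PREMIS-style event record attached to an information package.
# Field names loosely follow PREMIS but are illustrative only.
from datetime import datetime, timezone

def provenance_event(event_type: str, outcome: str, detail: str, agent: str) -> dict:
    return {
        "eventType": event_type,  # e.g. "migration", "fixity check"
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventOutcome": outcome,
        "eventDetail": detail,
        "linkingAgent": agent,
    }

package_events = [
    provenance_event("message digest calculation", "success",
                     "SHA-256 computed at point of transfer", "ingest-script-v1"),
]
```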

Managing Access Rights

Facilitating access is not black and white. Collections are not simply ‘open’ or ‘closed’. We have a myriad of ways that digital material is created and collected; we need to ensure we can provide access to this content in a variety of ways that support both the content and our users. This can include access from within an institution’s building, via a dedicated computer terminal, online access to anyone in the world, mediated remote access, access to only subsets of a collection, support for embargo periods, ensuring we respect cultural sensitivities or provide access to only the metadata (perhaps as large datasets) and more.
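Purely as an illustration of how much more nuanced this is than ‘open’ or ‘closed’, the sketch below models a few such access rules as data plus a small decision function; every field and rule here is invented:

```python
# Sketch of rule-based access decisions; all fields and rules are invented
# to show that access is more nuanced than simply open/closed.
from datetime import date

def access_decision(item: dict, user_location: str, today: date) -> str:
    if item.get("embargo_until") and today < item["embargo_until"]:
        return "metadata-only"
    if item.get("culturally_sensitive"):
        return "mediated"
    if item.get("reading_room_only") and user_location != "reading-room":
        return "denied"
    return "open"

item = {"embargo_until": date(2030, 1, 1), "culturally_sensitive": False}
print(access_decision(item, "online", date.today()))  # -> "metadata-only"
```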

We must set a goal of working towards providing access to our users in the many different (and new-ish) ways they actually want to use our content.

It’s imperative to keep in mind that the whole purpose of preserving digital materials is to be able to access them (in many varied ways). Provision of content ‘viewers’ and facilitating other modes of access (e.g. to large datasets of metadata) are essential.

Final note: I never said addressing these concerns was going to be easy. We need to factor each in and make iterative improvements, one step at a time.

Email preservation: How hard can it be?

Policy and Planning Fellow Edith summarises some highlights from the Digital Preservation Coalition’s briefing day on email preservation. See the full schedule of speakers on DPC’s website.


Yesterday Sarah and I attended DPC’s briefing day on email preservation at the National Archives (UK) in Kew, London. We were keen to go and hear about the latest findings from the Email Preservation Task Force, as Sarah will be developing a course dedicated to email preservation for the DPOC teaching programme. An internal survey circulated to staff in Bodleian Libraries earlier this year showed a real appetite for learning about email preservation. It is an issue which evidently spans several areas of our organisation.

The subheading of the event “How hard can it be?” turned out to be very apt. Before even addressing preservation, we were asked to take a step back and ask ourselves:

“Do I actually know what email is?”

As Kate Murray from the Library of Congress put it: “email is an object, several things and a verb”. In this sense email has much in common with the World Wide Web, as both are heavily linked, complex objects. Retention decisions must be made, not only about text content but also about email attachments and external web links. In addition, supporting features (such as instant messaging and calendars) are increasingly integrated into email services and are potential candidates for capture.

Thinking about email “as a verb” also highlights that it is a cultural and social practice. Capturing relationships and structures of communication is an additional layer to preserve. Anecdotally, some participants on the Email Preservation day had found that data mining, including the ability to undertake analysis across email archives, is increasingly in demand from historians using big data research techniques.

Anthea Seles, National Archives (UK), talks about visualisation of email archives.

What are people doing?

So what are organisations currently doing to preserve email? A strength of the Email Preservation Task Force’s new draft report is that it draws together sample workflows currently in use at other organisations (primarily US based). Additional speakers from Preservica, the National Archives and the British Library supplemented these with local examples from the UK throughout the day.

The talks and report show that migration is by far the most common approach to email preservation in the institutions consulted, with EML and Mbox the most common formats migrated to. The two take different approaches: EML stores single messages, while Mbox aggregates messages into a single database file. (However, beware that Mbox is a whole family of formats with varying levels of documentation!)
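For anyone wanting to experiment with such a migration, here is a minimal sketch of Mbox-to-EML conversion using only the Python standard library; the input path is a placeholder and, since Mbox is a family of formats, any real workflow should be tested carefully against real data:

```python
# Sketch of an Mbox-to-EML migration using the Python standard library.
# "archive.mbox" is a placeholder; Mbox variants differ, so test first.
import mailbox
from pathlib import Path

out = Path("eml_out")
out.mkdir(exist_ok=True)

for i, msg in enumerate(mailbox.mbox("archive.mbox")):
    # Each message is written out as a single .eml (RFC 5322) file.
    (out / f"{i:06}.eml").write_bytes(msg.as_bytes())
```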

While some archives choose to ingest Mbox and EML files into their repositories without further processing, others choose to unpack content within these files. Unpacking content provides a mode of displaying emails, as well as the ability to normalise content within them.

The British Library, for example, have chosen to unpack email files using Aid4Mail and are attempting to replicate the message hierarchy within a folder structure. Using Aid4Mail, they migrate text from email messages to PDF/A-2b files, which are displayed alongside folders containing any email attachments. The PDF/A-2b files can then be validated using veraPDF or other tools. A CSV manifest is also generated and entered into relevant catalogues. Preservica’s out-of-the-box workflow is very similar to the British Library’s, although they choose to migrate text content to HTML or UTF-8 encoded text files.

Another tantalising example (which I can imagine will gain more traction in the future) came from one institution which has used Emulation as a Service to provide access to one of its collections of email. An emulation approach makes it possible to provide access to content within the original operating environment used by the donor of the email archive. Its particular strength is that email attachments, such as images and word processing files, can be viewed in the software of their time (provided licences can be acquired for the software itself).

Finally, a tool which was considered or already in use by many of the contributors is ePADD. ePADD is an open source tool developed by Stanford University Libraries. It provides functions for processing and appraisal of Mbox files, but also has many interesting features for exploring the social and cultural aspect of email. ePADD can mine emails for subjects such as places, events and people. En masse, these subjects provide researchers with a much richer insight into trends and topics within large email archives. (Tip: why not have a look at the ePADD discovery module to see it in practice?)

What do we still need to explore?

It is encouraging that people are already undertaking preservation of email and that there are workflows out there which other organisations can adopt. However, there are many questions and issues still to explore.

  1. Current processes cannot fully capture the interlinked nature of email archives. Questions were raised during the day about the potential of describing archives using linked open data in order to amalgamate separate collections. Email archives may be more valuable to historians as they acquire critical mass.
  2. Other questions were raised around whether or not archives should also crawl web links within emails. Links to external content may be crucial for understanding the context of a message, but this becomes a very tricky issue if emails are accessioned years after creation. If webpages are crawled and associated with the email message years after it was sent, serious doubt is raised around the reliability of the email as a record.
  3. The issue of web links also brings up the question of when email harvesting should occur. Would it be better if emails were continually harvested into the archive or records management system, rather than waiting until a member of staff leaves their position? The good news is that many email providers increasingly document and provide APIs to their services, meaning continual harvesting may become more feasible in the future (see the IMAP sketch after this list).
  4. As seen in many of the sample workflows from the Email Preservation Task Force report, email files are often migrated multiple times. Especially as ePADD works with Mbox, some organisations end up adding an extra migration step in order to use the tool before normalising to EML. There is currently very little available literature on the impact of migrations, and indeed multiple migrations, on the information content of emails.
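On point 3, here is a minimal sketch of what continual harvesting might look like over IMAP, one widely supported email protocol; the host, credentials and mailbox name are placeholders:

```python
# Sketch of harvesting messages from a mailbox over IMAP.
# Host, credentials and mailbox name are placeholders.
import imaplib

with imaplib.IMAP4_SSL("imap.example.org") as conn:
    conn.login("archivist@example.org", "app-password")
    conn.select("INBOX", readonly=True)        # never modify the source
    _, data = conn.search(None, "ALL")
    for num in data[0].split():
        _, msg_data = conn.fetch(num, "(RFC822)")
        raw = msg_data[0][1]                   # full raw message as bytes
        with open(f"harvested_{num.decode()}.eml", "wb") as f:
            f.write(raw)
```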

What can you do now to help?    

So while there are some big technical and philosophical challenges, the good news is that there are things you can do to contribute right now. You can:

  • Become a “Friend of the Email Preservation Task Force” and help them review new reports and outputs
  • Contribute your organisation’s workflows to the Email Preservation Task Force report, so that they can be shared with the community
  • Run trial migrations between different email formats such as PST, Mbox and EML and blog about your findings
  • Support open source tools such as ePADD through either financial aid or (if you are technically savvy) your time. We rely heavily on these tools and need to work together to make them sustainable!

Overall the Email Preservation day was very inspiring and informative, and I cannot wait to hear more from the Email Preservation Task Force. Were you also at the event and have some other highlights to add? Please comment below!  

C4RR – Containers for Reproducible Research Conference

James shares his thoughts after attending the C4RR Containers for Reproducible Research Conference at the University of Cambridge (27 – 28 June).


At the end of June both Dave and I, the Technical Fellows, attended the C4RR conference/workshop hosted by The Software Sustainability Institute in Cambridge. This event brought together researchers, developers and educators to explore best practices when using containers and the future of research software with containers.

Containers, especially Docker and Singularity, are the ‘in’ thing at the moment, and it was interesting to hear from a variety of research projects that are using them for reproducible research.

Containers are another form of server virtualisation, but they are lighter-weight than virtual machines. Containers and virtual machines have similar resource isolation and allocation benefits, but they function differently: containers virtualise the operating system instead of the hardware, which makes them more portable and efficient.

 
Comparison of VM vs Container (Images from docker.com)

Researchers described how they were using Docker, one of the container implementations, to package the software used in their research so they could easily reproduce their computational environment across several different platforms (desktop, server and cluster). Others were using Singularity, another container technology, when implementing containers on an HPC (High-Performance Computing) cluster, where Docker’s requirement for root access is typically not permitted. It was clear from the talks that these technologies are developing rapidly and that the computing environments involved are ever more complex, which does make me worry about how they might be preserved.

Near the end of the second day, Dave and I gave a 20-minute presentation to encourage the audience to think more about preservation. As the audience were all evangelists for container technology, it made sense to tap into that enthusiasm and promote building preservation into their projects.


Image By Raniere Silva

One aim was to get people to think about their research after the project was over. There is often a lack of motivation to think about how others might reproduce the work, whether that’s six months into the future let alone 15+ years from now.

Another area we briefly covered related to depositing research data. We use DROID to scan our repositories and identify file formats; DROID relies on PRONOM, a technical registry of file format signatures. We put out a plea to the audience, asking for help with creating new file signatures for unknown file formats.
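For anyone curious what this looks like day to day, here is a sketch that summarises a DROID CSV export to surface files without a PRONOM match; it assumes the standard TYPE, PUID and EXT columns of a DROID export:

```python
# Sketch: count file extensions with no PRONOM match in a DROID CSV export.
# Assumes the export's standard TYPE, PUID and EXT columns.
import csv
from collections import Counter

unknown = Counter()
with open("droid_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row.get("TYPE") == "File" and not row.get("PUID"):
            unknown[(row.get("EXT") or "").lower() or "no-extension"] += 1

for ext, count in unknown.most_common():
    print(f"{count:6}  {ext}")
```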

I had some great conversations with others over the two days, and my main takeaway from the event was that we should look to attend more non-preservation-specific conferences, with a view to promoting preservation in other computer-related areas of study.

Our slides from the event have been posted by The Software Sustainability Institute via Google.