Email preservation 2: it is hard, but why?

A post from Sarah (Oxford) with input from Somaya (Cambridge) about the 24 January 2018 DPC event on email archiving from the Task Force on Technical Approaches to Email Archives.

The day’s discussion centred on what the task force had learnt over the year: that personal and public stories are buried in email; that considerable amounts of email have been lost over previous decades; that we should be treating email as data (it allows us to understand other datasets); that current approaches to collecting and preserving email don’t work because they are not scalable; and that artificial intelligence and machine learning, including natural language processing, need to be integrated into work on email archives (this is already taking place in the legal profession with ‘predictive coding’ and clustering technologies).

Back in July, Edith attended the first DPC event on email preservation, presented by the Task Force on Technical Approaches to Email Archives. She blogged about it here. In January this year, Somaya and I attended the second event, again hosted by the DPC.

Under the framework of five working groups, this task force has spent 12 months (2017) focused on five separate areas of the final report, which is due out in around May this year:

  • The Why: Overview / Introduction
  • The When/Who/Where: Email Lifecycles Perspectives
  • The What: The Needs of Researchers
  • The How: Technical Approaches and Solutions
  • The Path Forward: Sustainability & Community Development

The approach being taken is technical rather than policy-focused. Membership of the task force includes the DPC, representatives from universities and national institutions around the world, and technology companies including Google and Microsoft.

Chris Prom (University of Illinois Urbana-Champaign, author of the 2011 DPC Technology Watch Report on Preserving Email) and Kate Murray (Library of Congress and contributor to FADGI) presented the work they have been doing; you can view their slides here. Until the final report is published, I have been reviewing the preliminary draft (of June 2017) and available documents to help develop my email preservation training course for Oxford staff in April.

So, when it comes to email preservation, most of the tools and discussions focus on processing email archives. Very little of the discussion has to do with the preservation of email archives over time. There’s a very good reason for this. Processing email archives is the bottleneck in the process, the point at which most institutions are still stuck. It is hard to make decisions around preservation when there is no means of collecting email archives or processing them in a timely manner.

There were many excellent questions and proposed solutions from the speakers at the January event. Below are some of the major points from the day that have informed my thinking of how to frame training on email preservation:

Why are email archives so hard to process?

  1. They are big. Few people cull their emails and over time they build up. Reply and ‘reply all’ functions lengthen email chains, and attachments are growing in size and diversity. It takes a donor a while to prepare their email archives, let alone for an institution to transfer and process them.
  2. They are full of sensitive information, which is hard to find. Many open source technology assisted review (TAR) tools miss sensitive information. Software used for ‘predictive coding’ and machine learning to review email archives is well out of budget for heritage institutions. Manual review is far too labour intensive.
  3. There is no one tool that can do it all. Email preservation requires ‘tool chaining’ in order to transfer, migrate and process email archives. There is a very wide variety of email software programs, which in turn create many different email file formats. Many of the tools used in email archive processing are not compatible with every email file type, so multiple file format migrations are needed to allow for processing. For a list of some of the currently available tools, see the Task Force’s list here.
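To make ‘file format migration’ a little more concrete, here is a minimal sketch of one link in such a tool chain: splitting an mbox file into one .eml file per message using Python’s standard mailbox module. The function name and the numbered file naming are my own assumptions for illustration, not something taken from the Task Force’s tool list, and real workflows would need to handle many more formats (PST, for example, is not covered here).

```python
import mailbox
from pathlib import Path


def split_mbox(mbox_path: Path, out_dir: Path) -> int:
    """Write each message in an mbox file to its own .eml file and return the count.

    Hypothetical helper for illustration only.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # create=False: fail loudly rather than silently creating an empty mailbox.
    box = mailbox.mbox(str(mbox_path), create=False)
    count = 0
    try:
        for index, key in enumerate(box.keys()):
            # get_bytes() returns the stored message without the mbox "From " separator line.
            # This sketch does not try to undo any ">From " escaping inside message bodies.
            (out_dir / f"{index:06d}.eml").write_bytes(box.get_bytes(key))
            count += 1
    finally:
        box.close()
    return count
```

Calling split_mbox(Path("donor.mbox"), Path("eml_out")) would leave files such as 000000.eml in the output folder (both paths are placeholders). The point is only that each migration step is small on its own; the difficulty described above comes from chaining many of them across many formats.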

What are some of the solutions?

  1. Tool chaining will continue. It appears that, for now, tool chaining is here to stay, often mixing proprietary with open source tools to get workflows running smoothly. This means institutions will need to invest in establishing email processing workflows: the software, people who know how to handle different email formats, and so on.
  2. What about researchers? Access to emails is tightly controlled due to sensitivity restraints, but is there space to get researchers to help with the review? If they use the collection for research, could they also be responsible for flagging anything deemed as sensitive? How could this be done ethically?
  3. More automation. Better tools need to be developed to assist with TAR. Review processes must become more automated if email archives are ever to be processed. The scale of work is increasing, and traditional appraisal approaches (handling one document at a time) and record schedules are no longer suitable.
  4. Focus on bit-level preservation first. Processing of email archives can come later, but preserving it needs to start on transfer. (But we know users want access and our institutions want to provide this access to email archives.)
  5. Perfection is no longer possible. While archivists would like to be precise, in ‘scaling up’ email archive processing we need to think about it as ‘big data’ and take a ‘good enough’ approach.
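As an illustration of what ‘bit-level preservation first’ can mean at the point of transfer, the sketch below records a SHA-256 checksum for every file in a transferred folder and later reports any file whose checksum has changed or that has gone missing. The function names and the choice of SHA-256 are assumptions made for this example; nothing here was presented at the event.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to cope with large files."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(transfer_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to the transfer folder) to its checksum."""
    return {
        path.relative_to(transfer_dir).as_posix(): sha256_of(path)
        for path in sorted(transfer_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(transfer_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List files from the original manifest that are now missing or have different content."""
    current = build_manifest(transfer_dir)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

A manifest made on the day of transfer can be stored alongside the email archive, and changed_files() run against it at intervals. None of this requires the emails to have been processed or reviewed first.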

Towards a common understanding?

Cambridge Outreach and Training Fellow, Lee, describes the rationale behind trialling a recent workshop on archival science for developers, as well as reflecting on the workshop itself. Its aim was to get all those working in digital preservation within the organisation to have a better understanding of each other’s work, to improve co-operation for a sustainable digital preservation effort.

Quite often, there is a perceived language barrier due to the wide range of practitioners that work in digital preservation. We may be using the same words, but there’s not always a shared understanding of what they mean. This became clear when I was sitting next to my colleague, a systems integration manager, at an Archivematica workshop in September. Whilst not a member of the core Cambridge DPOC team, our colleague is a key member of our extended digital preservation network at Cambridge University Library and is key to developing, understanding and retaining digital preservation knowledge in the institution.

For those from a recordkeeping background, the design principles behind the front end of Archivematica should be obvious, as it incorporates both traditional principles of archival practice and features of the OAIS model. However, coming from a systems integration point of view, my colleague needed me to translate words such as ‘accession’, ‘appraisal’ and ‘arrangement’, whose meanings many of us with an archival education take for granted.

I asked my colleague if an introductory workshop on archival science would be useful, and she said, “yes, please!” Thus, the workshop was born. Last week, a two and a half hour workshop was trialled with our developer and systems integration colleagues. The aim of the workshop was to enable them to understand what archivists are taught on postgraduate courses and how this teaching informs their practice. After gathering the attendees’ impressions of an archivist and the things that they do (see image), the workshop explored in practical terms how an archivist would acquire and describe a collection. The workshop was based on an imaginary company, complete with a history and description of the business units and examples of potential records they would deposit. There were practical exercises on making an accession record, appraising a collection, artificial arrangement and subsequent description through ISAD(G).

Sticky notes about archivists from a developer point of view.

Having then seen how an archivist would approach a collection, the workshop moved into explaining physical storage and preservation before moving onto digital preservation, specifically looking at OAIS and then examples of digital preservation software systems. One exercise was to get the attendees to use what they had learned in the workshop to see where archival ideas mapped onto the systems.

The workshop tried to demonstrate how archivists have approached digital preservation armed with the professional skills and knowledge that they have. The idea was to show teams working with archivists on digital preservation how archivists think, and how and why some of the tools and products are designed in the way that they are. My hope was for ‘IT’ to understand the depth of knowledge that archivists have in order to help everyone work together on a collaborative digital preservation solution.

Feedback was positive and it will be run again in the New Year. Similarly, I’m hoping to devise a course from a developer perspective that will help archivists communicate more effectively with developers. Ultimately, both groups will be working from a better understanding of each other’s professional skill sets. Co-operation and collaboration on digital preservation projects will become much easier across disciplines and we’ll have a better informed (and more relaxed) environment in which to share practices and thoughts.

Advocating for digital preservation

Bodleian Libraries and Cambridge University Library are entering into the last phase of the DPOC project, where they are starting to write up business cases for digital preservation. In preparation, the Fellows attended DPC’s “advocacy briefing day” in London.  Policy and Planning Fellow, Edith, blogs about some of the highlights and lessons from the day.

This week I had the pleasure of attending DPC’s advocacy training day. It was run by Catherine Heaney, the founder of DHR Communications and a veteran when it comes to advocating for support of digital heritage. Before the event I thought I had a clear idea of what advocacy means in broad terms. You invite yourself into formal meetings and try to deliver measured facts and figures which will be compelling to the people in front of you – right?

Well… not quite, it turns out. Many of these assumptions were turned on their head during this session. Here are my four favourite pieces of (sometimes surprising) advocacy advice from Catherine.

Tip 1: Advocacy requires tenaciousness

The scenario which was described above is what communications professionals might call “the speech” – but it is only one little part of effective advocacy. “The digital preservation speech” is important, but it is not necessarily where you will get the most buy-in for digital preservation. Research has shown that one-off communications like these are usually not effective.

In fact, all of those informal connections and conversations you have with colleagues also come under advocacy and may reap greater benefits due to their frequency. And if one of these colleagues is talented at influencing others, they can be invaluable in advocating for digital preservation when you are not there in person.

Lesson learnt: you need to keep communicating the message whenever and wherever you can if you want it to seep into people’s consciousness. Since digital preservation issues do not crop up that often in popular culture and the news, it is up to us to deliver, re-deliver… and then re-deliver the message if we want it to stick.

Tip 2: Do your background research

When you know that you will be interacting with colleagues and senior management, it is important to do your background research and find out what argument will most appeal to the person you are meeting. Having a bog-standard ‘speech’ about digital preservation which you pull out on all occasions is not the most effective approach. In order to make your case, the problem you are attempting to solve should also reflect the goals and challenges which the person you are advocating to is facing.

The aspects which appeal about digital preservation will be different depending on the role, concerns and responsibilities of the person you are advocating to. Are they concerned with:

  • Legal or reputational risk?
  • Financial costs and return on investment?
  • Being seen as someone at the forefront of the digital preservation field?
  • Creating reproducible research?
  • Collecting unique collections?
  • Or perhaps about the opportunity to collaborate cross-institutionally?

Tip 3: Ensure that you have material for a “stump speech” ready

Tailoring your message to the audience is important, and this will be easier if you have material ready at hand which you can pick and choose from. Catherine suggested preparing a folder of stories, case studies, data and facts about digital preservation which you can cut and paste from to suit the occasion.

What is interesting though is the order of that list of “things to collect”:

  1. Stories
  2. Case studies
  3. Data and facts

The ranking is intentional. We tend to think that statistics and raw data will convince people, as this appeals to their logic. In fact, your argument will be stronger if your pitch starts with a narrative (a story) about WHY we need digital preservation and case studies to illustrate your point. Catherine advises that it is only then, once the audience is listening, that you bring out the data and facts. This approach is both more memorable and more effective in capturing your audience’s attention.

Tip 4: Personalise your follow up

This connects to tip 2 – about knowing your audience. Catherine advised that, although it may feel strange at first, writing a personalised follow up message is a very effective tool. When you do have the chance to present your case to an important group within your organisation, the follow up message can further solidify that initial pitch (again – see tip 1 about repeated communication).

By taking notes about the concerns or points that have been made during a meeting, you have the opportunity to write personalised messages which capture and refer back to the concerns raised by that particular person. The personalised message also has the additional benefit of opening up a channel for future communication.

This was just a small subsection of all the interesting things we talked about on the advocacy briefing day. For some more information have a look at the hashtag for the day #DPAdvocacy.

Institutional risk and born-digital content: the shutdown of DCist #IDPD17

Another post for today’s International Digital Preservation Day 2017. Outreach and Training Fellow, Sarah, discusses just how real institutional risk is and how it can lead to a loss of born digital archives — a risk that digital-only sites like DCist have recently proven. Read more about the Gothamist’s website shutdowns this November.

In today’s world, so much of what we create and share exists only in digital form. These digital-only creations are referred to as born-digital — they were created digitally and they often continue in that way. And so much of our born-digital content is shared online. We often take for granted content on the Internet, assuming it will always be there. But is it? Likely it will at least be captured by the Internet Archive’s Wayback Machine or a library web archiving equivalent. But is that actually enough? Does it capture a complete, usable record? What happens when a digital-only creation, like a magazine or newspaper, is shut down?

Institutional risk is real. In the commercial world of born-digital content that persists only in digital form, the risk of loss is high.

Unfortunately, there’s recently been a very good example of this kind of risk when the Gothamist shut down its digital-only content sites such as the DCist. This happened in early November this year.

The sites and all the associated content were completely removed from the Internet by the morning of 3 November. Gone. Taken down and replaced with a letter from billionaire CEO, Joe Ricketts, justifying the shutdown because, despite its enormous popularity and readership, it just wasn’t “economically successful.”

Wayback Machine’s capture of the redirect page and Ricketts’ letter

The DCist site and all of its content was gone completely; readers instead were redirected to another page entirely to read Joe Ricketts’ letter. Someone had literally pulled the plug on the whole thing.

Internet Archive’s 3 November 2017 capture, showing a redirect from the page. DCist was gone from the Internet.

The access to content was completely lost, save for what the Internet Archive captured and what content was saved by creators elsewhere. But access to the archives of 13 years of DCist content was taken from the Internet and its millions of readers. At that point all we had were some web captures, incomplete records of the content left to us.

The Internet Archive’s web captures over the past 13 years.

What would happen to the DCist’s archive now? All over Twitter people were being sent to Internet Archive or to check Google’s cache to download the lost content. But as Benjamin Freed pointed out in his recent Washingtonian article:

“Those were noble recommendations, but would have been incomplete. The Wayback Machine requires knowledge about URLs, and versions stored in Google’s memory banks do not last long enough. And, sure, many of the subjects DCist wrote about were covered by others, but not all of them, and certainly not with the attitude with which the site approached the world.”

As Freed reminds us “A newspaper going out of business is tragic, but when it happens, we don’t torch the old issues or yank the microfilms from the local library.” In the world of born-digital content, simply unplugging the servers and leaving the digital archive to rot means that at best, we may only have an incomplete record of the 1,000s of articles and content of a community.

If large organisations are not immune to this kind of institutional risk, what about the small ones? The underfunded ones?

To be clear, I think web archiving is important and I have used it a number of times when a site is no longer available; it’s a valuable resource. But it only goes so far, and sometimes the record of a website is incomplete. So what else can we do? How can we keep the digital archive alive? Ricketts has since put the DCist site back up as an “archive”, but it’s more like a “digital graveyard” that he could pull the plug on again any time he wants. How do you preserve something so fragile, so at risk? The custodians of the digital content care little for it, so how will it survive for the future?

The good news is that the DCist archive may have another home, not just one that survives on the mercy of a CEO.

The born-digital archives of the DCist require more than just a functioning server over time to ensure access. Fortunately, there are places where digital preservation is happening to all kinds of born-digital collections and there are passionate people who are custodians of this content. These custodians care about keeping it accessible and understandable for future generations. Something that Joe Ricketts clearly does not.

What are your thoughts on this type of institutional risk and its impacts on digital preservation? How can we preserve this type of content in the future? Is web archiving enough or do we need a multi-prong approach? Share your thoughts below and on Twitter using the #IDPD17 hashtag.


The vision for a preservation repository

Over the last couple of months, work at Cambridge University Library has begun to look at what a potential digital preservation system will look like, considering technical infrastructure, the key stakeholders and the policies underpinning them. Technical Fellow, Dave, tells us more about the holistic vision…

This post discusses some of the work we’ve been doing to lay foundations beneath the requirements for a ‘preservation system’ here at Cambridge. In particular, we’re looking at the core vision for the system. It comes with the standard ‘work in progress’ caveats – do not be surprised if the actual vision varies slightly (or more) from what’s discussed here. A lot of the below comes from Mastering the Requirements Process by Suzanne and James Robertson.

Also – it’s important to note that what follows is based upon a holistic definition of ‘system’ – a definition that’s more about what people know and do, and less about Information Technology, bits of tin and wiring.

Why does a system change need a vision?

New systems represent changes to the existing status quo. The vision is like the Pole Star for such a change effort – it ensures that people have something fixed to move towards when they’re buried under minute details. When confusion reigns, you can point to the vision for the system to guide you back to sanity.

Plus, as with all digital efforts, none of this is real: there’s no definite, obvious end point to the change. So the vision will help us recognise when we’ve achieved what we set out to.

Establishing scope and context

Defining what the system change isn’t is a particularly good way of working out what it actually represents. This can be achieved by thinking about the systems around the area you’re changing and the information that’s going to flow in and out. This sort of thinking makes for good diagrams: one that shows how a preservation repository system might sit within the broader ecosystem of digitisation, research outputs / data, digital archives and digital published material is shown below.

System goals

Being able to concisely sum up the key goals of the system is another important part of the vision. This is a lot harder than it sounds and there’s something journalistic about it – what you leave out is definitely more important than what you keep in. Fortunately, the vision is about broad brush strokes, not detail, which helps at this stage.

I found some great inspiration in Sustainable Economics for a Digital Planet, which indicated goals such as: “the system should make the value of preserving digital resources clear”, “the system should clearly support stakeholders’ incentives to preserve digital resources” and “the functional aspects of the system should map onto clearly-defined preservation roles and responsibilities”.

Who are we implementing this for?

The final main part of the ‘vision’ puzzle is the stakeholders: who is going to benefit from a preservation system? Who might not benefit directly, but really cares that one exists?

Any significant project is likely to have a LOT of these, so the Robertsons suggest breaking the list down by proximity to the system (using Ian Alexander’s Onion Model), from the core team that uses the system, through the ‘operational work area’ (i.e. those with the need to actually use it) and out to interested parties within the host organisation, and then those in the wider world beyond. An initial attempt at thinking about our stakeholders this way is shown below.

One important thing that we realised was that it’s easy to confuse ‘closeness’ with ‘importance’: there are some very important stakeholders in the ‘wider world’ (e.g. Research Councils or historians) that need to be kept in the loop.

A proposed vision for our preservation repository

After iterating through all the above a couple of times, the current working vision (subject to change!) for a digital preservation repository at Cambridge University Library is as follows:

The repository is the place where the best possible copies of digital resources are stored, kept safe, and have their usefulness maintained. Any future initiatives that need the most perfect copy of those resources will be able to retrieve them from the repository, if authorised to do so. At any given time, it will be clear how the digital resources stored in the repository are being used, how the repository meets the preservation requirements of stakeholders, and who is responsible for the various aspects of maintaining the digital resources stored there.

Hopefully this will give us a clear concept to refer back to as we delve into more detail throughout the months and years to come…

Planning your (digital) funeral: for projects

Cambridge Policy & Planning Fellow, Somaya, writes about her paper and presentation from the Digital Culture Heritage Conference 2017. The conference paper, Planning for the End from the Start: an Argument for Digital Stewardship, Long-Term Thinking and Alternative Capture Approaches, looks at considering digital preservation at the start of a digital humanities project and provides useful advice for digital humanities researchers to use in their current projects.

In August I presented at the Digital Cultural Heritage 2017 international conference in Berlin (incidentally, my favourite city in the whole world).

Berlin – view from the river Spree. Photo: Somaya Langley

I presented the Friday morning Plenary session on Planning for the End from the Start: an Argument for Digital Stewardship, Long-Term Thinking and Alternative Capture Approaches. Otherwise known as: ‘planning for your funeral when you are conceived’. This is a presentation that represents challenges faced by both Oxford and Cambridge and the thinking behind this has been done collaboratively by myself and my Oxford Policy & Planning counterpart, Edith Halvarsson.

We decided it was a good idea to present on this topic to an international digital cultural heritage audience, who are likely to experience challenges similar to those of our own researchers. It is based on some common digital preservation use cases that we are finding in each of our universities.

The Scenario

A Digital Humanities project receives project funding and develops a series of digital materials as part of the research project, and potentially some innovative tools as well. For one reason or another, ongoing funding cannot be secured and so the PIs/project team need to find a new home for the digital outputs of the project.

Example Cases

We have numerous examples of these situations at Cambridge and Oxford. Many projects containing digital content that needs to be ‘rehoused’ are created in the online environment, typically as websites. Some examples include:

Holistic Thinking

We believe that thinking holistically right at the start of a project can provide options further down the line, should an unfavourable funding outcome be received.

So it is important to consider holistic thinking, specifically a Digital Stewardship approach (incorporating Digital Curation & Digital Preservation).

Models for Preservation

Digital materials don’t necessarily exist in a static form and often they don’t exist in isolation. It’s important to think about digital content as being part of a lifecycle and managed by a variety of different workflows. Digital materials are also subject to many risks so these also need to be considered.

Some models to frame thinking about digital materials:


It is incredibly important to document your project. When handing over responsibility for your digital materials and data, also hand over the documentation, because whoever becomes responsible for hosting or preserving your digital project will need to rely on this information. It is also important to ensure that standards, metadata schemas, persistent identifiers and so on have been implemented.

This can include providing associated materials, such as:

Data Management Plans

Some better use of Data Management Plans (DMPs) could be:

  • Submitting DMPs alongside the data
  • Writing DMPs as dot-points rather than prose
  • Including Technical Specifications such as information about code, software, software versions, hardware and other dependencies
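For projects that involve code, some of these technical specifications can be captured automatically rather than written up by hand. The following is a minimal sketch, assuming a Python-based project; the function name and the field names in the output are my own invention and do not follow any particular DMP template.

```python
import json
import platform
from importlib import metadata


def technical_specification() -> dict:
    """Collect basic details of the software environment the project code runs in."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip any installed distribution with broken metadata
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the specification as JSON so it can be saved next to the DMP and the data.
    print(json.dumps(technical_specification(), indent=2))
```

The output only describes the machine it is run on, so it would need to be generated on the system that actually hosts the project. Hardware and non-Python dependencies would still have to be recorded separately.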

An example of a DMP from Cambridge University’s Dr Laurent Gatto: Data Management Plan for a Biotechnology and Biological Sciences Research Council

Borrowing from Other Disciplines

Rather than having to ‘rebuild the wheel’, we should also consider borrowing from other disciplines. For example, borrowing from the performing arts we might provide similar documents and information such as:

  • Technical Rider (a list of requirements for staging a music gig or theatre show)
  • Stage Plots (layout of instruments, performers and other equipment on stage)
  • Input Lists (ordered list of the different audio channels from your instruments/microphones etc. that you’ll need to send to the mixing desk)

For digital humanities projects and other complex digital works, providing simple and straightforward information about data flows (including inputs and outputs) will greatly assist digital preservationists in determining where something has broken in the future.
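One possible way of writing down such an ‘input list’ for a digital project is sketched below. Everything in it is hypothetical: the Component structure, the function name and the example component names are mine, chosen only to show how a simple record of inputs and outputs can point to where a chain is broken.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One part of a project, with the named data it consumes and produces."""
    name: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)


def unsupplied_inputs(components: list[Component], external_sources: list[str]) -> dict[str, list[str]]:
    """Return, per component, the inputs that nothing in the documented flow provides."""
    available = set(external_sources)
    for component in components:
        available.update(component.outputs)
    gaps = {}
    for component in components:
        missing = [name for name in component.inputs if name not in available]
        if missing:
            gaps[component.name] = missing
    return gaps


# Hypothetical example project: scanning, text recognition and a website.
project = [
    Component("scanner", outputs=["tiff_images"]),
    Component("ocr", inputs=["tiff_images", "language_model"], outputs=["text"]),
    Component("website", inputs=["text", "metadata_csv"]),
]
```

With metadata_csv listed as the only external source, unsupplied_inputs(project, ["metadata_csv"]) reports that the ocr component depends on a language_model that nobody has documented, which is the kind of gap that is far easier to find while the project team is still around.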

Several examples of Technical Riders can be found here:


Here are some approaches to consider in regards to interim digital preservation of digital materials:

Bundling & Bitstream Preservation

The simplest and most basic approach may be to just zip up files and undertake bitstream preservation. Bitstream preservation only ensures that the zeroes and ones that went into a ‘system’ come out as the same zeroes and ones. Nothing more.
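A minimal sketch of this ‘zip and checksum’ approach is shown below, using only Python’s standard library. The function names and the choice of SHA-256 are assumptions for the purpose of illustration, and the sketch says nothing about where the bundle and its checksum should then be stored.

```python
import hashlib
import zipfile
from pathlib import Path


def bundle(source_dir: Path, bundle_path: Path) -> str:
    """Zip every file under source_dir and return the SHA-256 checksum of the zip.

    bundle_path should sit outside source_dir, otherwise the zip would try to include itself.
    """
    with zipfile.ZipFile(bundle_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                archive.write(path, arcname=path.relative_to(source_dir).as_posix())
    # read_bytes() loads the whole zip into memory, which is fine for a sketch
    # but not for very large bundles.
    return hashlib.sha256(bundle_path.read_bytes()).hexdigest()


def bundle_is_intact(bundle_path: Path, expected_sha256: str) -> bool:
    """Check that the zip still has the recorded checksum and that its contents pass their CRC checks."""
    if hashlib.sha256(bundle_path.read_bytes()).hexdigest() != expected_sha256:
        return False
    with zipfile.ZipFile(bundle_path) as archive:
        return archive.testzip() is None  # testzip() returns the first bad member, or None
```

This only demonstrates the ‘same zeroes and ones’ guarantee described above. It makes no promise that the files inside the bundle can still be opened or understood in the future.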

Exporting / Migrating

Consider exporting digital materials and/or data plus metadata into recognised standards as a means of migrating into another system.

For databases, the SIARD (Software Independent Archiving of Relational Databases) standard may be of use.
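Producing SIARD itself needs dedicated tooling, which is not shown here. As a much simpler illustration of the same idea (getting the content out of a running database and into a form that does not depend on it), the sketch below exports a SQLite database as plain-text SQL using Python’s built-in sqlite3 module and loads it back in. This is not SIARD, the function names are my own, and it assumes the project database is SQLite.

```python
import sqlite3


def dump_sql(connection: sqlite3.Connection) -> str:
    """Return the whole database (schema and rows) as a plain-text SQL script."""
    return "\n".join(connection.iterdump())


def restore_sql(script: str) -> sqlite3.Connection:
    """Rebuild an in-memory database from a script produced by dump_sql()."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(script)
    return connection
```

The resulting text file can be read by a person as well as by software, which is a large part of the appeal of exporting in the first place.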

Hosting Code

Consider hosting code within your own institutional repository or digital preservation system (if your organisation has access to this option) or somewhere like GitHub or other services.

Packing it Down & ‘Putting on Ice’

You may need to consider ‘packing up’ your digital materials in a way that lets you ‘put them on ice’, so that when funding is secured in the future they can fairly simply be brought back to life.

An example of this is the work that Peter Sefton, from the University of Sydney in Australia, has been trialling. Based on Omeka, he has created a version of code called OzMeka. This is an attempt at a standardised way of being able to handle research project digital outputs that have been presented online. One example of this is Dharmae.

Alternatively, King’s Digital Lab provides infrastructure for eResearch and Digital Humanities projects that ensures the foundations of digital projects are stable from the get-go and mitigates risks regarding the longer-term sustainability of digital content created as part of the projects.

Maintaining Access

This could be done through traditional web archiving approaches, such as using web archiving tools (Heritrix or HTTrack), or by downloading video materials using Video Download Helper. Alternatively, if you are part of an institution, the Internet Archive’s Archive-It service may be something to consider, and you can work with your institution to implement it.

Hosted Infrastructure Arrangements

Another option is to find another organisation to take on the hosting of your service. If you do manage to negotiate this, you will need to put in place either a contract or a Memorandum of Understanding (MOU), as well as handing over the various documentation mentioned earlier.

Video Screen Capture

A simple way of documenting a journey through a complex digital work (not necessarily online; this can apply to other complex interactive digital works as well) may be to make a video screen capture.

Alternatively, you can record a journey through an interactive website using Webrecorder, developed by Rhizome, which produces WARC web archive files.

Documenting in Context

Another means of understanding complex digital objects is to document the work in the context in which it was experienced. One example of this is the work of Robert Sakrowski and Constant Dullaart, netart.database.

An example of this is the work of Dutch and Belgian net.artists JODI (Joan Heemskerk & Dirk Paesmans) shown here.

JODI - netart.database

Borrowing from documenting and archiving in the arts, an approach of ‘documenting around the work’ might be suitable – for example, photographing and videoing interactive audiovisual installations.

Web Archives in Context

Another opportunity to understand websites – if they have been captured by the Internet Archive – is viewing these websites using another tool developed by Rhizome,

An example of the Cambridge University Library website from 1997, shown in a Netscape 3.04 browser.

Cambridge University Library website in 1997 via


While there is no one perfect solution and each has its own pros and cons, using an approach that combines different methods might keep your digital materials available beyond the lifespan of your project. These methods will help ensure that digital material is suitably documented, preserved and potentially accessible – so that both you and others can use the data in an ongoing manner.


  • How do you want to preserve the data?
  • How do you want to provide access to your digital material?
  • Develop a strategy that includes several different methods.

Finally, I think this excerpt is relevant to how we approach digital stewardship and digital preservation:

“No man is an island entire of itself; every man is a piece of the continent, a part of the main” – Meditation XVII, John Donne

We are all in this together and, rather than each having to troubleshoot alone and build our own separate solutions, it would be great if we could work to our strengths in collaborative ways, while sharing our knowledge and skills with others.

Putting ‘stuff’ in ‘context’: deep thoughts triggered by PASIG 2017

Cambridge Technical Fellow, Dave, delves a bit deeper into which PASIG 2017 talks really got him thinking further about digital preservation and its complexity.

After a year of studying digital preservation, my thoughts are starting to coalesce, and the presentations at PASIG 2017 certainly helped that. (I’ve already discussed what I thought were the most important talks, so the ones below are some that stimulated my thinking about preservation in particular)…

The one that matched my current thoughts on digital preservation generally was John Sheridan’s Creating and sustaining a disruptive digital archive. It was similar to another previous blog post, and to chats with fellow Fellow Lee too (some of which he’s captured in a blog post for the Digital Preservation Coalition)… I.e.: computing’s ‘paper paradigm’ makes little sense in relation to preservation, hierarchical / neat information structures don’t hold together as well digitally, we’re going to need to compute across the whole archive, and, well, ‘digital objects’ just aren’t really material ‘objects’, are they?

An issue with thinking about digital ‘stuff’ too much in terms of tangible objects is that opportunities related to the fact the ‘stuff’ is digital can be missed. Matt Zumwalt highlighted one such opportunity in Data together: Communities & institutions using decentralized technologies to make a better web when he introduced ‘content addressing’: using cryptographic hashing and Directed Acyclic Graphs (in this case, information networks that record content changing as time progresses) to manage many copies of ‘stuff’ robustly.
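To make ‘content addressing’ a little more concrete, here is a minimal sketch of the general idea, and not of the system Matt presented. The `ContentStore` class and its method names are hypothetical, invented purely for illustration. Each piece of ‘stuff’ is stored under the SHA-256 hash of its bytes plus the addresses of the versions it was derived from, so identical copies collapse to one address and the parent links form a Directed Acyclic Graph.

```python
import hashlib


def _payload(parents, content):
    # Serialise parent addresses and content into the single byte string we hash.
    header = "\n".join(parents).encode("ascii")
    return header + b"\n\n" + content


class ContentStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self._nodes = {}  # address -> (parent addresses, content bytes)

    def put(self, content, parents=()):
        parents = tuple(sorted(parents))
        for parent in parents:
            if parent not in self._nodes:
                raise KeyError(f"unknown parent address: {parent}")
        # The address is derived from the content itself, not from a location.
        address = hashlib.sha256(_payload(parents, content)).hexdigest()
        self._nodes[address] = (parents, content)
        return address

    def get(self, address):
        return self._nodes[address][1]

    def verify(self, address):
        # Re-hash what is stored and compare with the address: a fixity check.
        parents, content = self._nodes[address]
        return hashlib.sha256(_payload(parents, content)).hexdigest() == address

    def ancestors(self, address):
        # Walk the parent links back through earlier versions.
        seen, stack = [], [address]
        while stack:
            current = stack.pop()
            for parent in self._nodes[current][0]:
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


store = ContentStore()
v1 = store.put(b"first draft")
v2 = store.put(b"second draft", parents=[v1])
duplicate = store.put(b"first draft")  # an identical 'copy' gets the same address
```

Because a node can only name parents that already have an address, a cycle cannot be built, which is what keeps the graph acyclic. Note that nothing is ever changed in place: an edit produces a new node with a new address.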

This addresses some of the complexities of preserving digital ‘stuff’, but perhaps thinking in terms of ‘copies’, and not ‘branches’ or ‘forks’, is an oversimplification? Precisely because digital ‘stuff’ is rarely static, all ‘copies’ have the potential to deviate from the ‘parent’ or ‘master’ copy. What’s the ‘version of true record’ in all this? Perhaps there isn’t one? Matt referred to ‘immutable data structures’, but the concept of ‘immutability’ only really holds if we think it’s possible for data to ever be completely separated from its informational context, because the information does change, constantly. (Hold that thought).

Switching topics, fellow Polonsky Somaya often tries to warn me just how complicated working with technical metadata can get. Well, the pennies dropped further during Managing digital preservation metadata at Sound and Vision: A case on matching OAIS and PREMIS with the DPX file format from Annemieke De Jong and Josefien Schuurman. Space precludes going into the same level of detail they did regarding building a Preservation Metadata Dictionary (PMD) about just one, ‘relatively’ simple file format – but let’s say, well, it’s really complicated. (They’ve blogged about it and the whole PMD is online too). The conclusion: preserving files properly means drilling down deep into their formats, but it also got me thinking – shouldn’t the essence of a ‘preservation file format’ be its simplicity?

The need for greater simplicity in preservation was further emphasised by Mathieu Giannecchini’s The Eclair Archive cinema heritage use case: Rising to the challenges of complex formats at large scale. Again – space precludes me from getting into detail, but the key takeaway was that Mathieu has 2 million reels of film to preserve using the Digital Cinema Distribution Master (DCDM) format, and after lots of good work, he’s optimised the process to preserve 8 TB a day (with a target of 15 TB). Now, we don’t know how much film is on each reel, but assuming a (likely over-) estimate of 10 minutes per reel, that’s roughly 180,000 films of 1 hour 50 mins in length. Based on Mathieu’s own figures, it’s going to take many decades, perhaps even a few hundred years, to get through all 2 million reels… So further, major optimisations are required, and I suspect DCDM (a format with a 155-page spec, which relies on TIFF, a format with a 122-page spec) might be one of the bottlenecks.
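The back-of-envelope sums above can be sketched as follows. The reel count, minutes per reel, film length and daily throughput come from the paragraph above; the terabytes-per-film values are purely hypothetical placeholders, since the talk’s per-film data sizes aren’t quoted here, so the resulting years only show how sensitive the timescale is to that one unknown.

```python
REELS = 2_000_000       # reels to preserve, as quoted above
MINUTES_PER_REEL = 10   # the (likely over-) estimate used above
FILM_MINUTES = 110      # one film of 1 hour 50 minutes


def equivalent_films(reels=REELS, minutes_per_reel=MINUTES_PER_REEL,
                     film_minutes=FILM_MINUTES):
    # Total minutes of footage divided by the length of one film.
    return reels * minutes_per_reel / film_minutes


def years_to_finish(tb_per_film, tb_per_day, films=None):
    # Days needed at a constant daily throughput, converted to years.
    if films is None:
        films = equivalent_films()
    return films * tb_per_film / tb_per_day / 365


print(f"Equivalent films: {equivalent_films():,.0f}")
for tb_per_film in (1, 5):        # HYPOTHETICAL sizes, not from the talk
    for tb_per_day in (8, 15):    # current and target throughput quoted above
        years = years_to_finish(tb_per_film, tb_per_day)
        print(f"{tb_per_film} TB per film at {tb_per_day} TB a day: {years:.0f} years")
```

Under those assumed sizes the answer ranges from a few decades to a few centuries, which is why the per-film data volume (and hence the format) matters so much.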

Of course, the trade-off with simplifying formats is that data will likely be ‘decontextualised’, so there must be a robust method for linking data back to context… Thoughts on this were triggered by Developing and applying principles for discovery and access for the UK Data Service by Katherine McNeill from the UK Data Archive, as Katherine discussed production of a next-generation access system based on a linked-data model with which, theoretically, single cells’ worth of data could be retrieved from research datasets.

Again – space precludes entering into the whole debate around the process of re-using data stripped of original context… Mauthner and Parry illustrate the two contrary sides well, and furthermore argue that merely entertaining the possibility of decontextualising data indicates a certain ‘foundational’ way of thinking that might be invalid from the start? This is where I link to William Kilbride’s excellent DPC blog post from a few months ago.

William’s PASIG talk Sustainable digital futures was also one of two that got closer to what we know is the root of the preservation problem: economics. The other was Aging of digital: Managed services for digital continuity by Natasa Milic-Frayling, which flagged up the current “imbalance in control and empowerment” between tech providers and content producers / owners / curators, an imbalance that means tech firms can effectively doom our digital ‘stuff’ into obsolescence, and we have to suck it up.

I think this imbalance in part exists because there’s too much technical context related to data, because it’s generally in the tech providers’ interests to bloat data formats to match the USPs of their software. So, is a pure ‘preservation format’ one in which the technical context of the data is generalised to the point where all that’s left is commonly-understood mathematics? Is that even possible? Do we really need 122-page specs to explain how raster image data is stored? (It’s just an N-dimensional array of pixel values…, isn’t it…?) I think perhaps we don’t need all the complexity – at the data storage level at least. Though I’m only guessing at this stage: much more research required.
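As a rough illustration of the ‘it’s just an array of pixel values’ intuition, the sketch below writes and reads a binary PPM image (the Netpbm ‘P6’ layout: a short text header giving width, height and maximum value, followed by raw RGB bytes). This is only an example of how little is needed to store raster data, not a proposal for a preservation format: it carries no colour-space information or other metadata, and the reader assumes the exact header written here (no comment lines, 8 bits per channel). The function names are my own.

```python
def encode_ppm(pixels):
    """Encode rows of (r, g, b) tuples (values 0-255) as binary PPM bytes."""
    height = len(pixels)
    width = len(pixels[0])
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    # The body is nothing more than the array flattened row by row.
    body = bytes(channel for row in pixels for pixel in row for channel in pixel)
    return header + body


def decode_ppm(data):
    """Decode bytes produced by encode_ppm back into rows of (r, g, b) tuples."""
    # Split off the three header lines only; the body may itself contain b"\n".
    magic, dims, maxval, body = data.split(b"\n", 3)
    if magic != b"P6" or maxval != b"255":
        raise ValueError("only 8-bit binary PPM written by encode_ppm is handled")
    width, height = (int(value) for value in dims.split())
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            start = (y * width + x) * 3
            row.append(tuple(body[start:start + 3]))
        rows.append(row)
    return rows


image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 10, 10)],
]
encoded = encode_ppm(image)
restored = decode_ppm(encoded)
```

Everything a 122-page spec adds on top of this (compression, tiling, colour management, embedded metadata) is exactly the ‘technical context’ in question, so the open issue is how much of it needs to live inside the file.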

PASIG 2017: honest reflections from a trainee digital archivist

A guest blog post by Kelly, one of the Bodleian Libraries’ graduate digital archivist trainees, on what she learned as a volunteer and attendee of PASIG 2017 Oxford.

Amongst the digital preservation professionals from almost every continent and 130 institutions, my five traineeship colleagues and I could be found in the lecture theatre seats, at the annexe demos and around the awesome artefacts at the Museum of Natural History for PASIG 2017, Oxford. At just 6 months into our traineeship, it was a brilliant opportunity not only to apply some of our new knowledge to our work at Special Collections, Bodleian Libraries, but also to gain a really current and relevant insight into the theories we have been studying as part of our long-distance MSc in Digital Curation at Aberystwyth University. The first ‘Bootcamp’ day was exactly what I needed to throw myself in, and it really consolidated my confidence in my understanding of some aspects of the shared language that is used amongst the profession (fixity checks, maturity models… as well as getting to grips with submission information packages, dissemination information packages and everything that occurs in between!).

My pen didn’t stop scribbling all three days, except maybe for tea breaks. Saying that, the demo presentations were also a great time for me and the other trainees to ask questions specifically about workflows and the benefits of certain software such as LibNova, Preservica and ResourceSpace.

For want of a better word (and because it really is the truth), PASIG 2017 was genuinely inspiring, and there were messages delivered so powerfully that I hope to stay grounded in them for my entire career. Here is what I was taught:

The Community is invaluable. Many of the speakers were quick to assert that sharing practice amongst the digital preservation community is key. This is a value I was familiar with, yet it was striking to witness it happening throughout the conference in such a sincere manner. I can assure you the gratitude and affirmation that followed Eduardo del Valle, University of the Balearic Islands, and his presentation “Sharing my loss to protect your data: A story of unexpected data loss and how to do real preservation” was as encouraging to witness for someone new to the profession as it was for all of the experienced delegates present. As well as sharing practice, it was clear that the community need to be advocating on behalf of each other. It is time- and resource-consuming but oh-so important.

Digital archives are preserving historical truths. Yes, the majority of the workflow is technological but the objectives and functions are so much more than technology; to just reduce digital preservation down to this is an oversimplification. It was so clear that the range of use cases presented at PASIG were all driven towards documenting social, political and historical information (and preserving that documentation) that will be of absolute necessity for society and infrastructure in future. Right now, for example, Angeline Takawira and her colleagues at UN MICT are working on a digital preservation programme to ensure absolute accountability and usability of the records of the International Criminal Tribunals for both Rwanda and the former Yugoslavia. I have written a more specific post on Angeline’s presentation here.

Due to the nature of technology and the digital world, the goalposts will always be moving. For example, the mysteries of extracting data from smart devices, covered in Somaya Langley’s talk on the future of digital preservation, will soon become (and maybe already are) a reality for those working with accessions of archives or information management. We should, then, embrace change, embrace the unsure and ultimately ‘get over the need for tidiness’, as pointed out by John Sheridan from The National Archives during his presentation “Creating and sustaining a disruptive digital archive”. This is usually counter-intuitive, but as the saying goes, one of the most dangerous phrases to use is ‘we’ve always done it that way’.

The value of digital material outlives the software, so the enabling of prolonged use of software is a real and current issue. Admittedly, this was a factor I had genuinely not even considered before. In my brain I linked obsolescence with hardware and hardware only. Therefore, Dr. Natasa Milic-Frayling’s presentation on “Aging of Digital: Managed Services for digital continuity” shed much light on the changing computing ecosystem and the gradual aging of software. What I found especially interesting about the proposed software-continuity plan was the transparency of it; the fact that the client can ask to see the software at any time whilst it is being stabilised and maintained.

Thank you so much PASIG 2017 and everybody involved!

One last thing…in closing, Cliff Lynch, CNI, brought up that there was comparatively less Web Archiving content this year. If anybody fancies taking a trainee to Mexico next year to do a (lightning) talk on Bodleian Libraries’ Web Archive I am keen…



Computers are the apogee of profligacy: a response to THE most important PASIG 2017 presentations

Following the PASIG conference, Cambridge Technical Fellow Dave Gerrard simply couldn’t wait to fire off his thoughts on the global context of digital preservation and how we need to better consider the world around us to work on a global solution and not just one that suits a capitalist agenda. We usually preface these blogs with “enjoy” but in this instance, please, find a quiet moment, make yourself comfortable, read on and contemplate the global issues passionately presented here.

I’m going to work on a more technical blog about PASIG later, but first I want to get this one off my chest. It’s about the two most important presentations: Angeline Takawira’s Digital preservation at the United Nations Mechanism for International Criminal Tribunals and Keep your eyes on the information, Patricia Sleeman’s discussion of preservation work at the UN Refugee Agency (UNHCR).

Angeline Takawira described, in a very precise and formal manner, how the current best practice in Digital Preservation is being meticulously applied to preserving information from UN war crimes tribunals in The Hague (covering the Balkan conflict) and Arusha, Tanzania (covering the Rwandan genocide). As befitted her work, it was striking how calm Angeline was; how well the facts were stuck to, despite the emotive context. Of course, this has to be the case for work underpinning legal processes: intrusion of emotion into the capture of facts could let those trying to avoid justice escape it.

And the importance of maintaining a dispassionate outlook was echoed in the title of the other talk. “Keep your eyes on the information” was what Patricia Sleeman was told when learning to work with the UNHCR, as to engage too emotionally with the refugee crisis could make vital work impossible to perform. However, Patricia provided some context, in part by playing Head Over Heels, (Emi Mahmoud’s poem about the conflict and refugee crisis in Darfur), and by describing the brave, inspirational people she had met in Syria and Kurdistan. An emotionless response was impossible: the talk resulted in the conference’s longest and loudest applause.

Indeed, I think the audience was so stunned by Patricia’s words that questions were hard to formulate. However, my colleague Somaya at least asked the $64,000 one: how can we help? I’d like to tie this question back to one that Patricia raised in her talk, namely (and I paraphrase here): how do you justify expenditure on tasks like preservation when doing so takes food from the mouths of refugees?

So, now I’m less stunned, here’s my take: feeding refugees solves a symptom of the problem. Telling their stories helps to solve the problem, by making us engage our emotions, and think about how our lives are related to theirs, and about how the way we behave impacts upon them. And how can we help? Sure, we can help Patricia with her data management and preservation problems. But how can we really contribute to a solution? How can we stop refugee crises occurring in the first place?

We have a responsibility to recognise the connections between our own behaviour and the circumstances refugees find themselves in, and it all comes down, of course, to resources, and the profligate waste of them in the developed world. Indeed, Angeline and Patricia’s talks illustrated the borderline absurdity of a bunch of (mostly) privileged ‘Westerners’ / ‘Northerners’ (take your pick) talking about the ‘preservation’ of anything, when we’re products of a society that’s based upon throwing everything away.

And computers / all things ‘digital’ are at the apogee of this profligacy: Natasa Milic-Frayling highlighted this when she (diplomatically) referred to the way in which the ‘innovators’ hold all the cards, currently, in the relationship with ‘content producers’, and can hence render the technologies upon which we depend obsolete across ever-shorter cycles. Though, after Patricia’s talk, I’m inclined to frame this more in terms of ‘capitalist industrialists generating unnecessary markets at the expense of consumers’; particularly given that, while we were listening to Patricia, the latest iPhone was being launched in the US.

Though, of course, it’s not really the ‘poor consumers’ who genuinely suffer due to planned obsolescence… That would be the people in Africa and the Middle East whose countries are war zones due to grabs for oil or droughts caused by global warming. As the world’s most advanced tech companies, Apple, Google, Facebook, Amazon, Microsoft et al are the biggest players in a society that – at best indirectly, at worst carelessly – causes the suffering of the people Patricia and Angeline are helping and providing justice for. And, as someone typing a blog post using a Macbook Pro that doesn’t even let me add a new battery – I’m clearly part of the problem, not the solution.

So – in answer to Somaya’s question: how can we help? Well, for a start, we can stop fetishising the iPhone and start bigging up Fairphone and Phonebloks. However, keeping the focus on Digital Preservation, we’ve got to be really careful that our efforts aren’t used to support an IT industry that’s currently profligate way beyond moral acceptability. So rather than assuming (as I did above) that all the ‘best-practice’ of digital preservation flows from the ‘developed’ (ahem) world to the ‘developing’, we ought to seek some lessons in how to preserve technology from those who have fewer opportunities to waste it.

Somaya’s already on the case with her upcoming panel at iPres on the 28th September: Then we ought to continue down the road of holding PASIG in Mexico City next year by holding one in Africa as soon as possible. As long as, when we’re there, we make sure we shut up and listen.

PASIG 2017 Twitter round-up

After many months of planning it feels quite strange to us that PASIG 2017 is over. Hosting the PASIG conference in Oxford has been a valuable experience for the DPOC fellows and a great chance for Bodleian Libraries’ staff to meet with and listen to presentations by digital preservation experts from around the world.

In the end 244 conference delegates made their way to Oxford and the Museum of Natural History. The delegates came from 130 different institutions and every continent of the world was represented (…well, apart from Antarctica).

What was especially exciting though were all the new faces. In fact 2/3 of the delegates this year had not been to a PASIG conference before! Is this perhaps a sign that interest in digital preservation is on the rise?

As always at PASIG, Twitter was ablaze with discussion in spite of an at times flaky Wifi connection. Over three days #PASIG17 was mentioned a whopping 5300 times on Twitter and had a “reach” of 1.7 million. Well done everyone on some stellar outreach! Most active Twittering came from the UK, USA and Austria.

Twitter activity by country using #PASIG17 (Talkwalker statistics)

Although it is hard to choose favourites among all the Tweets, a few of the DPOC project’s personal highlights included:

Cambridge Fellow Lee Pretlove lists “digital preservation skills” and why we cannot be an expert in all areas. Tweet by Julian M. Morley

Bodleian Fellow James makes some insightful observations about the incompatibility between tar pits and digital preservation.

Cambridge Fellow Somaya Langley presents in the last PASIG session on the topic of “The Future of Digital Preservation”.  

What were some of your favourite talks and Twitter conversations? What would you like to see more of at PASIG 2018? #futurePASIG