PASIG 2017 Twitter round-up

After many months of planning it feels quite strange to us that PASIG 2017 is over. Hosting the PASIG conference in Oxford has been a valuable experience for the DPOC fellows and a great chance for Bodleian Libraries’ staff to meet with and listen to presentations by digital preservation experts from around the world.

In the end 244 conference delegates made their way to Oxford and the Museum of Natural History. The delegates came from 130 different institutions and every continent of the world was represented (…well, apart from Antarctica).

What was especially exciting though were all the new faces. In fact 2/3 of the delegates this year had not been to a PASIG conference before! Is this perhaps a sign that interest in digital preservation is on the rise?

As always at PASIG, Twitter was ablaze with discussion in spite of an at times flaky Wi-Fi connection. Over three days #PASIG17 was mentioned a whopping 5300 times on Twitter and had a “reach” of 1.7 million. Well done everyone on some stellar outreach! The most active tweeting came from the UK, USA and Austria.

Twitter activity by country using #PASIG17 (Talkwalker statistics)

Although it is hard to choose favourites among all the Tweets, a few of the DPOC project’s personal highlights included:

Cambridge Fellow Lee Pretlove lists “digital preservation skills” and why we cannot be an expert in all areas. Tweet by Julian M. Morley

Bodleian Fellow James makes some insightful observations about the incompatibility between tar pits and digital preservation.

Cambridge Fellow Somaya Langley presents in the last PASIG session on the topic of “The Future of Digital Preservation”.  

What were some of your favourite talks and Twitter conversations? What would you like to see more of at PASIG 2018? #futurePASIG

Visit to the Parliamentary Archives: Training and business cases

Edith Halvarsson, Policy and Planning Fellow at Bodleian Libraries, writes about the DPOC project’s recent visit to the Parliamentary Archives.

This week the DPOC fellows visited the Parliamentary Archives in London. Thank you very much to Catherine Hardman (Head of Preservation and Access), Chris Fryer (Digital Archivist) and Grace Bell (Digital Preservation Trainee) for having us. Shamefully I have to admit that we have been very slow to make this trip; Chris first invited us to visit all the way back in September last year! However, our tardiness in making our way to Westminster was in the end aptly timed with the completion of year one of the DPOC project and planning for year two.

Like CUL and Bodleian Libraries, the Parliamentary Archives also first began their own Digital Preservation Project back in 2010. Since 2015 their project has transitioned into digital preservation in a more programmatic capacity. As CUL and Bodleian Libraries will begin drafting business cases for moving from project to programme in year two, meeting with Chris and Catherine was a good opportunity to talk about how you start making that tricky transition.

Of course, every institution has its own drivers and risks which influence business cases for digital preservation, but there are certain things which will sound familiar to a lot of organisations. For example, what the Parliamentary Archives have found over the past seven years is that advocacy for digital collections and training staff in digital preservation skills are ongoing activities. Implementing solutions is one thing, whereas maintaining them is another. This, together with the fact that staff who have received digital preservation training eventually move on to new institutions, means that you constantly need to stay on top of advocacy and training. Making “the business case” is therefore not a one-off task.

Another central challenge in building business cases is how you frame digital preservation as a service rather than as “an added burden”. The idea of “seamless preservation” with no human intervention is a very appealing one to already burdened staff, but in reality workflows need to be supervised and maintained. To sell digital preservation, that extra work must therefore be perceived as something which adds value to collection material and the organisation. It is clear that physical preservation adds value to collections, but the argument for digital preservation can be a harder sell.

Catherine did, however, have some encouraging comments on how we can attempt to turn advice about digital preservation into something which is perceived as adding value. Being involved with and talking to staff early on in the design of new project proposals – rather than as an add-on after processes are already in place – is an example of this.

Image by James Mooney

All in all, it has been a valuable and encouraging visit to the Parliamentary Archives. The DPOC fellows look forward to keeping in touch – particularly to hear more about the great work the Parliamentary Archives have been doing to provide digital preservation training to staff!

Email preservation: How hard can it be?

Policy and Planning Fellow Edith summarises some highlights from the Digital Preservation Coalition’s briefing day on email preservation. See the full schedule of speakers on DPC’s website.

Yesterday Sarah and I attended DPC’s briefing day on email preservation at the National Archives (UK) in Kew, London. We were keen to go and hear about the latest findings from the Email Preservation Task Force, as Sarah will be developing a course dedicated to email preservation for the DPOC teaching programme. An internal survey circulated to staff in Bodleian Libraries earlier this year showed a real appetite for learning about email preservation. It is an issue which evidently spans several areas of our organisation.

The subheading of the event “How hard can it be?” turned out to be very apt. Before even addressing preservation, we were asked to take a step back and ask ourselves:

“Do I actually know what email is?”

As Kate Murray from the Library of Congress put it: “email is an object, several things and a verb”. In this sense email has much in common with the World Wide Web, as both are heavily linked and complex objects. Retention decisions must be made, not only about text content but also about email attachments and external web links. In addition, supporting features (such as instant messaging and calendars) are increasingly integrated into email services and are potential candidates for capture.

Thinking about email “as a verb” also highlights that it is a cultural and social practice. Capturing relationships and structures of communication is an additional layer to preserve. Anecdotally, some participants on the Email Preservation day had found that data mining, including the ability to undertake analysis across email archives, is increasingly in demand from historians using big data research techniques.

Anthea Seles, National Archives (UK), talks about visualisation of email archives.

What are people doing?

So what are organisations currently doing to preserve email? A strength of the Email Preservation Task Force’s new draft report is that it draws together samples of workflows currently in use by other organisations (primarily US based). Additional speakers from Preservica, National Archives and the British Library supplemented these with some local examples from the UK throughout the day.

The talks and report show that migration is by far the most common approach to email preservation in the institutions consulted. EML and Mbox are the most common formats migrated to. Each takes a different approach to storage: EML stores each message as a single file, while Mbox aggregates many messages in a single database file. (However, beware that Mbox is a whole family of formats which have varying documentation levels!)

While some archives choose to ingest Mbox and EML files into their repositories without further processing, others choose to unpack content within these files. Unpacking content provides a mode of displaying emails, as well as the ability to normalise content within them.
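To make the difference between the two formats concrete, here is a minimal sketch of unpacking one Mbox file into individual EML files using Python's standard `mailbox` module. This is purely illustrative and not the workflow of any institution mentioned at the briefing day; the function name `split_mbox_to_eml` and the numbered file-naming scheme are assumptions made for the example.

```python
import mailbox
from pathlib import Path


def split_mbox_to_eml(mbox_path, out_dir):
    """Write every message in an Mbox file to its own numbered .eml file.

    Hypothetical helper for illustration only; returns the paths written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # create=False makes the call fail if the Mbox file does not exist,
    # instead of silently creating an empty one.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, key in enumerate(box.iterkeys(), start=1):
            target = out / f"{index:06d}.eml"
            # get_bytes() returns the raw message without the Mbox
            # "From " separator line that precedes it in the source file.
            target.write_bytes(box.get_bytes(key))
            written.append(target)
    finally:
        box.close()
    return written
```

Calling `split_mbox_to_eml("donor.mbox", "donor_eml")` (both names hypothetical) would leave one `.eml` file per message in the output folder. Because Mbox is a family of formats, a real workflow would need to check which variant it has been given before trusting the result.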

The British Library, for example, have chosen to unpack email files using Aid4Mail, and are attempting to replicate the message hierarchy within a folder structure. Using Aid4Mail, they migrate text from email messages to PDF/A-2b files, which are displayed alongside folders containing any email attachments. The PDF/A-2b files can then be validated using veraPDF or other tools. A CSV manifest is also generated and entered into relevant catalogues. Preservica’s out-of-the-box workflow is very similar to the British Library’s, although they choose to migrate text content to HTML or UTF-8 encoded text files.

Another tantalising example (which I can imagine will gain more traction in the future) came from one institution which has used Emulation As A Service to provide access to one of its collections of email. By using an emulation approach it is able to provide access to content within the original operating environment used by the donor of the email archive. This has particular strength in that email attachments, such as images and word processing files, can be viewed on contemporary software (providing licenses can be acquired for the software itself).

Finally, a tool which was considered or already in use by many of the contributors is ePADD. ePADD is an open source tool developed by Stanford University Libraries. It provides functions for processing and appraisal of Mbox files, but also has many interesting features for exploring the social and cultural aspect of email. ePADD can mine emails for subjects such as places, events and people. En masse, these subjects provide researchers with a much richer insight into trends and topics within large email archives. (Tip: why not have a look at the ePADD discovery module to see it in practice?)

What do we still need to explore?

It is encouraging that people are already undertaking preservation of email and that there are workflows out there which other organisations can adopt. However, there are many questions and issues still to explore.

  1. Current processes cannot fully capture the interlinked nature of email archives. Questions were raised during the day about the potential of describing archives using linked open data in order to amalgamate separate collections. Email archives may be more valuable to historians as they acquire critical mass.
  2. Other questions were raised around whether or not archives should also crawl web links within emails. Links to external content may be crucial for understanding the context of a message, but this becomes a very tricky issue if emails are accessioned years after creation. If webpages are crawled and associated with the email message years after it was sent, serious doubt is raised around the reliability of the email as a record.
  3. The issue of web links also brings up the question of when email harvesting should occur. Would it be better if emails were continually harvested to the archive/record management system, rather than waiting until a member of staff leaves their position? The good news is that many email providers are increasingly documenting and providing APIs to their services, meaning that doing so may become more feasible in the future.
  4. As seen in many of the sample workflows from the Email Preservation Task Force report, email files are often migrated multiple times. Especially as ePADD works with Mbox, some organisations end up adding an additional migration step in order to use the tool before normalising to EML. There is currently very little available literature on the impact of migrations, and indeed multiple migrations, on the information content of emails.
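On the last point, one simple way to start gathering evidence is to compare each message before and after a migration. The sketch below is a minimal illustration using Python's standard `email` library, not an established method from the Task Force report: the function names and the choice of headers to compare are assumptions made for the example.

```python
import hashlib
from email import message_from_bytes, policy

# Illustrative choice of headers; a real study would pick its own set.
HEADERS_TO_COMPARE = ("From", "To", "Date", "Subject", "Message-ID")


def fingerprint(raw_message: bytes) -> dict:
    """Summarise a message as selected header values plus a hash per part."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no content of their own
        # decode=True undoes base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True) or b""
        parts.append((part.get_content_type(),
                      hashlib.sha256(payload).hexdigest()))
    headers = {}
    for name in HEADERS_TO_COMPARE:
        value = msg[name]
        headers[name] = None if value is None else str(value)
    return {"headers": headers, "parts": parts}


def differences(before: bytes, after: bytes) -> list:
    """List what changed between two versions of the same message."""
    a, b = fingerprint(before), fingerprint(after)
    changed = [name for name in HEADERS_TO_COMPARE
               if a["headers"][name] != b["headers"][name]]
    if a["parts"] != b["parts"]:
        changed.append("body/attachments")
    return changed
```

Running `differences()` over every message in a trial migration gives a rough count of how often headers or content change. Note that hashing the decoded bytes is deliberately strict, so even a change in line endings will be reported as a body difference; whether that counts as information loss is a judgement for the archive.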

What can you do now to help?    

So while there are some big technical and philosophical challenges, the good news is that there are things you can do to contribute right now. You can:

  • Become a “Friend of the Email Preservation Task Force” and help them review new reports and outputs
  • Contribute your organisation’s workflows to the Email Preservation Task Force report, so that they can be shared with the community
  • Run trial migrations between different email formats such as PST, Mbox and EML and blog about your findings
  • Support open source tools such as ePADD through either financial aid or (if you are technically savvy) your time. We rely heavily on these tools and need to work together to make them sustainable!

Overall the Email Preservation day was very inspiring and informative, and I cannot wait to hear more from the Email Preservation Task Force. Were you also at the event and have some other highlights to add? Please comment below!  

Policy ramblings

For the second stage of the DPOC project Oxford and Cambridge have started looking at policy and strategy development. As part of the DPOC deliverables, the Policy and Planning Fellows will be collaborating with colleagues to produce a digital preservation policy and strategy for their local institutions. Edith (Policy and Planning Fellow at Oxford) blogs about what DPOC has been up to so far.

Last Friday I met with Somaya (Policy and Planning Fellow) and Sarah (Training and Outreach Fellow) at the British Library in London. We spent the day discussing review work which DPOC has done of digital preservation policies so far. The meeting also gave us a chance to outline an action plan for consulting stakeholders at CUL and Bodleian Libraries on future digital preservation policy development.

Step 1: Policy review work
Much work has already gone into researching digital preservation policy development [see for example the SCAPE project and OSUL’s policy case study]. As considerable effort has been exerted in this area, we want to make sure we are not reinventing the wheel while developing our own digital preservation policies. We therefore started by reading as many digital preservation policies from other organisations as we could possibly get our hands on. (Once we ran out of policies in English, I started feeding promising-looking documents into Google Translate with a mixed bag of results.) The policy review drew attention to aspects of policies which we felt were particularly successful, and which could potentially be re-purposed for the local CUL and Bodleian Libraries contexts.

My colleague Sarah helped me with the initial policy review work. Between the two of us we read 48 policies dating from 2008 to 2017. However, determining which documents were actual policies was trickier than we had first anticipated. We found that documents named ‘strategy’ sometimes read as policy, and documents named ‘policy’ sometimes read more like low-level procedures. For this reason, we decided to add to the review another 12 strategy documents which had strong elements of policy in them. This brought us up to a round 60 documents in total.

So we began reading…. But we soon found that once you are on your 10th policy of the day, you start to get them muddled up. To better organise our review work, we decided to put them into a classification system developed by Kirsten Snawder (2011) and adapted by Madeline Sheldon (2013). Snawder and Sheldon identified nineteen common topics from digital preservation policies. The topics range from ‘access and use’ to ‘preservation planning’ [for the full list of topics, see Sheldon’s article on The Signal from 2013]. I was interested in seeing how many policies would make direct reference to the Open Archival Information System (OAIS) reference model, so I added this in as an additional topic to the original nineteen identified by Snawder and Sheldon.

Reviewing digital preservation policies written between 2008 and 2017

Step 2: Looking at findings
Interestingly, after we finished annotating the policy documents we did not find a correlation between covering all of Snawder and Sheldon’s nineteen topics and having what we perceived as an effective policy. Effective in this context was defined as the ability of the policy to clearly guide and inform preservation decisions within an organisation. In fact, the opposite was more common as we judged several policies which had good coverage of topics from the classification system to be too lengthy, unclear, and sometimes inaccessible due to heavy use of digital preservation terminology.

In terms of OAIS, another interesting finding was that 33 out of 60 policies made direct reference to the OAIS. In addition to these 33, several of the ones which did not make an overt reference to the model still used language and terminology derived from it.

So while we found that the taxonomy was not able to guide us on which policy topics were absolutely essential in all circumstances, using it was a good way of arranging and documenting our thoughts.

Step 3: Thinking about guiding principles for policy writing
What this foray into digital preservation policies has shown us is that there is no ‘one size fits all’ approach or magic formula of topics which makes a policy successful. What works in the context of one institution will not necessarily work in another. What ultimately makes a successful policy also comes down to communication of the policy and organisational uptake. However, there are a number of high-level principles which the three of us all felt strongly about and which we would like to guide future digital preservation policy development at our local institutions.

Principle 1: Policy should be accessible to a broad audience. Contrary to findings from the policy review, we believe that digital preservation specific language (including OAIS) should be avoided at policy level if possible. While reviewing policy statements we regularly asked ourselves:

“Would my mother understand this?”

If the answer is yes, the statement gets to stay. If it is no, maybe consider re-writing it. (Of course, this does not apply if your mother works in digital preservation.)

Principle 2: Policy also needs to be high-level enough that it does not require constant re-writing in order to make minor procedural changes. In general, including individuals’ names or prescribing specific file formats can make a policy go out of date quickly. It is easier to change these if they are included in lower level procedures and guidelines.

Principle 3: Digital preservation requires resources. Getting financial commitment to invest in staff at policy level is important. It takes time to build organisational expertise in digital preservation, but losing it can happen a lot quicker. Even if you choose to outsource several aspects of digital preservation, it is important that staff have skills which enable them to understand and critically assess the work of external digital preservation service providers.

What are your thoughts? Do you have other principles guiding digital preservation policy development in your organisations? Do you agree or disagree with our high-level principles?

Over 20 years of digitization at the Bodleian Libraries

Policy and Planning Fellow Edith writes an update on some of her findings from the DPOC project’s survey of digitized images at the Bodleian Libraries.

Between August and December 2016 I collated information about Bodleian Libraries’ digitized collections. As an early adopter of digitization technology, Bodleian Libraries have made digital surrogates of their collections available online since the early 1990s. A particular favourite of mine, and a landmark among the Bodleian Libraries’ early digital projects, is the Toyota Transport Digitization Project (1996). [Still up and running here]

At the time of the Toyota Project, digitization was still highly specialised and the Bodleian Libraries opted to outsource the digital part to Laser Bureau London. Laser Bureau ‘digitilised’ 35mm image negatives supplied by Bodleian Libraries’ imaging studio and sent the files over on a big bundle of CDs. 1244 images all in all – which was a massive achievement at the time. It is staggering to think that we could now produce the same many times over in just a day!

Since the Toyota Project’s completion twenty years ago, Bodleian Libraries have continued large scale digitization activities in-house via their commercial digitization studio, through outsourcing to third party suppliers, and in project partnerships. With generous funding from the Polonsky Foundation the Bodleian Libraries are now set to add over half a million image surrogates of Special Collection manuscripts to their image portal – Digital.Bodleian.

What happens to 20 years’ worth of digitized material? Since 1996 both Bodleian Libraries and digitization standards have changed massively. Early challenges around storage alone have meant that content inevitably has been squirreled away in odd locations and created to the varied standards of the time. Profiling our old digitized collections is the first step to figuring out how these can be brought into line with current practice and be made more visible to library users.

“So what is the extent of your content?”, librarians from other organisations have asked me several times over the past few months. In the hope that it will be useful for other organisations trying to profile their legacy digitized collections, I thought I would present some figures here on the DPOC blog.

When tallying up our survey data, I came to a total of approximately 134 million master images in primarily TIFF and JP2 format. For very early digitization projects, however, the idea of ‘master files’ had not yet developed, and master and access files will, in these cases, often be one and the same.

The largest proportion of content, some 127,000,000 compressed JP2s, was created as part of the Google Books project up to 2009 and is available via Search Oxford Libraries Online. These add up to 45 TB of data. The library further holds three archives of digitized image content (5.8 million images, 99.4 TB) primarily created by the Bodleian Libraries’ in-house digitization studio in TIFF. These figures do not include back-ups – with which we start getting into quite big numbers.

Of the remaining 7 million digitized images which are not from the Google Books project, 2,395,000 are currently made available on a Bodleian Libraries website. In total the survey examined content from 40 website applications and 24 exhibition pages. 44% of the images which are made available online were, at the time of the survey, hosted on Digital.Bodleian, 4% on ODL Greenstone and 1% on Luna. The latter two are currently in the process of being moved onto Digital.Bodleian. At least 6% of content from the sample was duplicated across multiple website applications and is a candidate for deduplication. Another interesting fact from the survey is that JPEG, JP2 (transformed to JPEG on delivery) and GIF are by far the most common access/derivative formats on Bodleian Libraries’ website applications.
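For anyone profiling their own legacy collections, a few derived numbers can be handy for comparison. The short sketch below only re-uses the rounded figures quoted above; the averages it produces are back-of-the-envelope derivations, not results reported by the survey, and it assumes decimal terabytes (1 TB = 10^12 bytes). The function and constant names are made up for the example.

```python
# Figures as quoted in this post (rounded by the author).
TOTAL_MASTER_IMAGES = 134_000_000   # "approximately 134 million"
GOOGLE_BOOKS_JP2 = 127_000_000
GOOGLE_BOOKS_TB = 45
IN_HOUSE_IMAGES = 5_800_000
IN_HOUSE_TB = 99.4
ONLINE_IMAGES = 2_395_000

BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def survey_summary():
    """Derive a few comparison figures from the quoted survey totals."""
    remaining = TOTAL_MASTER_IMAGES - GOOGLE_BOOKS_JP2
    return {
        # images that are not part of the Google Books project
        "remaining_images": remaining,
        # fraction of those images currently available on a website
        "online_share_of_remaining": ONLINE_IMAGES / remaining,
        # storage for the two groups combined, excluding back-ups
        "total_tb": GOOGLE_BOOKS_TB + IN_HOUSE_TB,
        # average size of a Google Books JP2, in kilobytes
        "avg_jp2_kb": GOOGLE_BOOKS_TB * BYTES_PER_TB / GOOGLE_BOOKS_JP2 / 1000,
        # average size of an in-house studio image, in megabytes
        "avg_in_house_mb": IN_HOUSE_TB * BYTES_PER_TB / IN_HOUSE_IMAGES / 10**6,
    }


if __name__ == "__main__":
    for name, value in survey_summary().items():
        print(f"{name}: {value:,.2f}")
```

On these rounded inputs, roughly a third of the non-Google images are online, and the in-house images are on average tens of megabytes each against a few hundred kilobytes for the compressed JP2s, which is why the smaller archive takes up more than twice the storage.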

The final digitized image survey report has now been reviewed by the Digital Preservation Coalition and is being looked at internally. Stay tuned to hear more in future blog posts!

DPOC visits the Wellcome Library in London

A brief summary by Edith Halvarsson, Policy and Planning Fellow at the Bodleian Libraries, of the DPOC project’s recent visit to the Wellcome Library.

Last Friday the Polonsky Fellows had the pleasure of spending a day with Rioghnach Ahern and David Thompson at the Wellcome Library. With a collection of over 28.6 million digitized images, the Wellcome is a great source of knowledge and experience in working with digitisation at a large scale. Themes of the day centred around pragmatic choices, achieving consistency across time and scale, and horizon scanning for emerging trends.

The morning started with an induction from Christy Henshaw, the Wellcome’s Digital Production Manager. We discussed digitisation collection development and JPEG 2000 profiles, but also future directions for the library’s digitised collection. One point which particularly stood out to me was the change in user requirements around delivery of digitised collections. The Wellcome has found that researchers are increasingly requesting delivery of material for “use as data”. (As a side note: this is something which the Bodleian Libraries have previously explored in their Blockbooks project, which used facial recognition algorithms traditionally associated with security systems to trace the provenance of dispersed manuscripts.) As the possibilities for large scale analysis using these types of algorithms multiply, the Wellcome is considering how delivery will need to change to accommodate new scholarly research methods.


Brain teaser: Spot the Odd One Out (or is it a trick question?). Image credit: Somaya Langley

Following Christy’s talk we were given a tour of the digitization studios by Laurie Auchterlonie. Laurie was in the process of digitising recipe books for the Wellcome Library’s Recipe Book Project. He told us about some less appetising recipes from the collection (such as three-headed pig soup, and puppy dishes) and about the practical issues of photographing content in a studio located on top of one of the busiest underground lines in London!

After lunch with David and Rioghnach at the staff café, we spent the rest of the afternoon looking at Goobi plug-ins, Preservica and the Wellcome’s hybrid-cloud storage model. Despite the focus on digitisation, metadata was a recurring topic in several of the presentations. Descriptive metadata is particularly challenging to manage as it tends to be a work in progress – always possible to improve and correct. This creates a tension between curators and cataloguers doing their work, and the inclination to store metadata together with digital objects in preservation systems to avoid orphaning files. Wellcome’s solution has been to articulate their three core cataloguing systems as the canonical bibliographic source, while allowing potentially out-of-date metadata to travel with objects in both Goobi and Preservica for in-house use only. As long as there is clarity around which is the canonical metadata record, these inconsistencies are not problematic to the library. ‘You would be surprised how many institutions have not made a decision around which their definitive bibliographic record is’, says David.


Presentation on the Wellcome Library’s digitisation infrastructure. Image credit: Somaya Langley

The last hour was spent pondering the future of digital preservation and I found the conversations very inspiring and uplifting. As we work with the long-term in mind, it is invaluable to have these chances to get out of our local context and discuss wider trends with other professionals. Themes included: digital preservation as part of archival masters courses, cloud storage and virtualisation, and the move from repository software to dispersed micro-services.

The fellows’ field trip to the Wellcome is one of a number of visits that DPOC will make during 2017 to talk to institutions around the UK about their work on digital preservation. Watch for more updates.