Using ePADD with Josh Schneider

Edith, Policy and Planning Fellow at Bodleian Libraries, writes about her favourite features in ePADD (open source software for email archives) and about how the tool aligns with digital preservation workflows.

At iPres a few weeks ago I had the pleasure of attending an ePADD workshop run by Josh Schneider from Stanford University Libraries. The workshop was for me one of the major highlights of the conference, as I have been keen to try out ePADD since first hearing about it at DPC’s Email Preservation Day. I wrote a blog post about the event back in July, and have now finally taken the time to review ePADD using my own email archive.

ePADD is primarily for appraisal and delivery, rather than a digital preservation tool. However, as a potential component in ingest workflows to an institutional repository, ensuring that email content retains integrity during processing in ePADD is paramount. The creators behind ePADD are therefore thinking about how to enhance current features to make the tool fit better into digital preservation workflows. I will discuss these features later in the blog, but first I wanted to show some of the capabilities of ePADD. I can definitely recommend having a play with this tool yourself as it is very addictive!

ePADD: Appraisal module dashboard

Josh, our lovely workshop leader, recommends that new ePADD users go home and try it on their own email collections. As you know your own material fairly well, it is a good way of learning about both what ePADD does well and its limits. So I decided to feed my work emails from the past year into ePADD – and found some interesting trends about my own working patterns.

ePADD consists of four modules, although I will only be showing features from the first two in this blog:

Module 1: Appraisal (Module used by donors for annotation and sensitivity review of emails before delivering them to the archive)

Module 2: Processing (A module with some enhanced appraisal features used by archivists to find additional sensitive information which may have been missed in the first round of appraisal)

Module 3: Discovery (A module which provides users with limited key word searching for entities in the email archive)

Module 4: Delivery (This module provides more enhanced viewing of the content of the email archive – including a gallery for viewing images and other document attachments)

Note that ePADD only supports MBOX files, so if you are an Outlook user like myself you will need to first convert from PST to MBOX. After you have created an MBOX file, setting up ePADD is fairly simple and quick. Once the first ePADD module (“Appraisal”) was up and running, processing my 1,500 emails and 450 attachments took around four minutes. This time includes time for natural language processing. ePADD recognises and indexes various “entities” – including persons, places and events – and presents these in a digestible way.
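If you want to sanity-check a converted MBOX file before loading it into ePADD, something like the sketch below may help. It is not part of ePADD: it uses Python's standard mailbox module, the function names and sample messages are made up for illustration, and it simply counts the messages and the MIME parts marked as attachments.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def count_messages_and_attachments(mbox_path):
    """Return (messages, attachments) for the MBOX file at mbox_path."""
    box = mailbox.mbox(mbox_path)
    try:
        messages = 0
        attachments = 0
        for message in box:
            messages += 1
            # walk() visits every MIME part, including nested ones.
            for part in message.walk():
                if part.get_content_disposition() == "attachment":
                    attachments += 1
        return messages, attachments
    finally:
        box.close()


def build_sample_mbox(mbox_path):
    """Write two made-up messages (one with an attachment) to a new MBOX file."""
    plain = EmailMessage()
    plain["From"] = "alice@example.org"
    plain["To"] = "bob@example.org"
    plain["Subject"] = "Agenda"
    plain.set_content("See you at the workshop.")

    with_file = EmailMessage()
    with_file["From"] = "bob@example.org"
    with_file["To"] = "alice@example.org"
    with_file["Subject"] = "Report"
    with_file.set_content("Report attached.")
    with_file.add_attachment(
        b"draft", maintype="application", subtype="octet-stream",
        filename="report.bin",
    )

    box = mailbox.mbox(mbox_path)
    try:
        box.add(plain)
        box.add(with_file)
        box.flush()
    finally:
        box.close()


def demo():
    """Build the sample file in a temporary folder and count its contents."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.mbox")
        build_sample_mbox(path)
        return count_messages_and_attachments(path)
```

Comparing these counts with what your mail client reports is a cheap way to spot a conversion that silently dropped messages.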

ePADD: Appraisal module processing MBOX file

Looking at the entities recognised by ePADD, I was able to see who I have been speaking with/about during the past year. There were some not so surprising figures that popped up (such as my DPOC colleagues James Mooney and Dave Gerrard). However, curiously I seem to also have received a lot of messages about the “black spider” this year (turns out they were emails from the Libraries’ Dungeons and Dragons group).

ePADD entity type: Person (some details removed)

An example of why you need to look deeper at the results of natural language processing was evident when I looked under the “place entities” list in ePADD:

ePADD entity type: Place

San Francisco comes highest up on the list of mentioned places in my inbox. I was initially quite surprised by this result. Looking a bit closer, all 126 emails containing a mention of San Francisco turned out to be from “Slack”. Slack, an instant messaging service used by the DPOC team, has its headquarters in San Francisco. All email digests from Slack contain the head office address!

Another one of my favourite things about ePADD is its ability to track frequency of messages between email accounts. Below is a graph showing correspondence between myself and Sarah Mason (outreach and training fellow on the DPOC project). The graph shows that our peak period of emailing each other was during the PASIG conference, which DPOC hosted in Oxford at the start of September this year. It is easy to imagine how this feature could be useful to academics using email archives to research correspondence between particular individuals.
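The idea behind such a graph can be sketched in a few lines. This is only an illustration of counting messages per month between two addresses, not how ePADD itself computes the graph; the addresses and dates are invented, and the sketch only looks at the From and To headers.

```python
from collections import Counter
from email.message import EmailMessage
from email.utils import getaddresses, parseaddr, parsedate_to_datetime


def monthly_counts(messages, person_a, person_b):
    """Count messages per month sent from one of the two people to the other."""
    pair = {person_a.lower(), person_b.lower()}
    counts = Counter()
    for msg in messages:
        sender = parseaddr(str(msg.get("From", "")))[1].lower()
        to_headers = [str(value) for value in msg.get_all("To", [])]
        recipients = {addr.lower() for _, addr in getaddresses(to_headers)}
        date_header = msg.get("Date")
        if sender not in pair or date_header is None:
            continue
        # Keep the message only if the other person is among the recipients.
        if (pair - {sender}) & recipients:
            sent = parsedate_to_datetime(str(date_header))
            counts[sent.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


def make_message(sender, recipient, date):
    """Build a made-up message with just the headers the sketch needs."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Date"] = date
    msg.set_content("placeholder")
    return msg


SAMPLE_MESSAGES = [
    make_message("alice@example.org", "bob@example.org", "30 Aug 2017 10:00:00 +0100"),
    make_message("bob@example.org", "alice@example.org", "12 Sep 2017 09:30:00 +0100"),
    make_message("alice@example.org", "bob@example.org", "13 Sep 2017 16:45:00 +0100"),
    make_message("alice@example.org", "carol@example.org", "13 Sep 2017 17:00:00 +0100"),
]

print(monthly_counts(SAMPLE_MESSAGES, "alice@example.org", "bob@example.org"))
```

A real archive would also need to consider Cc and Bcc recipients and people who use several addresses, which is exactly the kind of tidying ePADD's address book is meant to help with.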

ePADD displaying correspondence frequency over time between two users

The last feature I wanted to talk about is “sensitivity review” in ePADD. Although I annotate personal data I receive, I thought that the one year mark of the DPOC project would also be a good time to run a second sensitivity review of my own email archive. Using ePADD’s “lexicon hits search” I was able to sift through a number of potentially sensitive emails. See image below for categories identified which cover everything from employment to health. These were all false positives in the end, but it is a feature I believe I will make use of again.
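To give a feel for why lexicon searches throw up false positives, here is a minimal sketch of a lexicon match. The categories and terms are invented for illustration and are not ePADD's actual lexicons; the matching is a plain case-insensitive whole-word search.

```python
import re

# Hypothetical lexicon: category name -> terms that might signal sensitivity.
SAMPLE_LEXICON = {
    "employment": ["salary", "contract", "resignation"],
    "health": ["doctor", "diagnosis", "sick leave"],
}


def lexicon_hits(text, lexicon):
    """Return {category: [matched terms]} for terms found as whole words."""
    hits = {}
    for category, terms in lexicon.items():
        matched = [
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        ]
        if matched:
            hits[category] = matched
    return hits


# A harmless message still triggers the "health" category.
print(lexicon_hits("The Doctor Who fan club meets on Friday", SAMPLE_LEXICON))
```

The match on "doctor" here says nothing about anyone's health, which is why a human still needs to review the hits.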

ePADD processing module: Lexicon hits for sensitive data

So now on to the Digital Preservation bit. There are currently three risks of using ePADD in terms of preservation which stand out to me.

1) For practical reasons, MBOX is currently the only email format option supported by ePADD. If MBOX is not the preferred preservation format of an archive, the archive may end up running multiple migrations between email formats, resulting in progressive loss of data.

2) There are no checksums being generated by ePADD when you download content from an ePADD module in order to copy it onto the next one. This could be an integrity and authenticity issue as emails are copied multiple times.

3) There is currently limited support for assigning multiple identifiers to archives in ePADD. This could potentially become an issue when trying to aggregate email archives from different institutions. Local identifiers could in this scenario clash, and additional unique identifiers would then also be required.

Note however that these concerns are already on the ePADD roadmap, so they are likely to improve or even be solved within the next year.
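Until checksums are built in, the second risk could be managed outside ePADD. The sketch below is one possible approach, assuming you can treat a module's export as an ordinary folder of files: record a SHA-256 checksum for each file before copying, then check the copy against that record. The folder and file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_mismatches(manifest, root):
    """List manifest entries that are missing or altered under root."""
    root = Path(root)
    problems = []
    for relative_path, expected in manifest.items():
        candidate = root / relative_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(relative_path)
    return sorted(problems)


def demo():
    """Copy a made-up export folder, then damage the copy to show detection."""
    with tempfile.TemporaryDirectory() as folder:
        source = Path(folder) / "appraisal_export"
        source.mkdir()
        (source / "inbox.mbox").write_bytes(b"From alice\n\nhello\n")
        manifest = build_manifest(source)

        copy = Path(folder) / "processing_import"
        shutil.copytree(source, copy)
        clean = find_mismatches(manifest, copy)

        (copy / "inbox.mbox").write_bytes(b"tampered")
        damaged = find_mismatches(manifest, copy)
        return clean, damaged
```

Storing the manifest alongside the export means the same check can be repeated at every hand-over between modules.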

To watch out for ePADD updates, or just have a play with your own email archive (it is loads of fun!), check out their:

Guest post: The 6-million-image gap

Bodleian Digital Library Systems and Services’ Digital Curator, Emma Stanford, guest blogs for the DPOC project this week. Emma writes about what she is doing to close some of the 6-million-image gap between what’s in our tape archive and what’s available online at Digital.Bodleian. It’s no small task, but sometimes Emma finds some real gems just waiting to be made available to researchers. She also raises some good questions about what metadata we should make available to researchers to interpret our digitized images. Read more from Emma below.

Thanks to Edith’s hard work, we now know that the Bodleian Imaging Services image archive contains about 5.8 million unique images. This is in addition to various images held on hard drives and other locations around the Bodleian, which bring the total up to almost 7 million. Digital.Bodleian, however, our flagship digital image platform, contains only about 710,000 unique images–a mere tenth of our total image archive. What gives?

That 6-million-image gap consists of two main categories:

Images that are online elsewhere (aka the migration backlog). In the decades before Digital.Bodleian, we tried a number of other image delivery platforms that remain with us today: Early Manuscripts at Oxford University, the Toyota City Imaging Project, the Oxford Digital Library, Luna, etc., etc. Edith has estimated that the non-Digital.Bodleian content comprises about 1.4 million images. Some of these images don’t belong in Digital.Bodleian, either because we don’t have rights to the images (for example, Queen Victoria’s Journals) or because they are incomplete selections rather than full image sets (for example, the images in the Bodleian Treasures exhibition). Our goal is to migrate all the content we can to Digital.Bodleian and eventually shut down most of the old sites. We’ve been chipping away at this task very slowly, but there is a lot left to do.

Images that have never been online. Much of Imaging Services’ work is commercial orders: shooting images for researchers, publishers, journalists, etc. We currently store all these images on tape, and we have a database that records the shelfmark, number of images, and list of captured pages, along with information about when and how the images were captured. Searching through this archive for Digital.Bodleian-appropriate images is a difficult task, though. Shelfmark notation isn’t standardized at all, so there are lots of duplicate records. Also, in many cases, just a few pages from a book or manuscript were captured, or the images were captured in black-and-white or grayscale; either way, not suitable for Digital.Bodleian, where we aim to publish fully-digitized works in full colour.

I’m working on extracting a list of complete, full-colour image sets from this database. In the meantime, we’ve started approaching the problem from the other direction: creating a list of items that we’d like to have on Digital.Bodleian, and then searching the archive for images of them. To do this, we asked the Bodleian’s manuscript and rare book curators to share with us their lists of “greatest hits”: the Bodleian’s most valuable, interesting, and/or fragile holdings, which would benefit most from online surrogates. We then began going through this list searching for the shelfmarks in the image archive. Mostly, we’ve found only a few images for each shelfmark, but occasionally we hit the jackpot: a complete, full-colour image set of a 13th-century bestiary or a first edition of a Shakespeare play.

Going through the archives in this way has underlined for me just how much the Bodleian’s imaging standards have changed in the last two decades. File size has increased, of course, as higher-resolution digital scanning backs have become available; but changes in lighting equipment, book cradles, processing software, rulers and colour charts have all made their mark on our images too. For me, this has raised the question of whether the technical metadata we’re preserving in our archives, about when and how the images were captured, should also be made available to researchers in some way, so that they can make an informed choice about how to interpret the images they encounter on sites like Digital.Bodleian.

In the meantime, here are some of the image sets we’ve pulled out of the archive and digitized so far:

Jane Austen’s juvenilia
a 13th-century bestiary
the Oxford Catullus

MS. Bodl. 764, fol. 2r (detail)


Putting ‘stuff’ in ‘context’: deep thoughts triggered by PASIG 2017

Cambridge Technical Fellow, Dave, delves a bit deeper into which PASIG 2017 talks really got him thinking further about digital preservation and its complexity.

After a year of studying digital preservation, my thoughts are starting to coalesce, and the presentations at PASIG 2017 certainly helped that. (I’ve already discussed what I thought were the most important talks, so the ones below are some that stimulated my thinking about preservation in particular)…

The one that matched my current thoughts on digital preservation generally was John Sheridan’s Creating and sustaining a disruptive digital archive. It was similar to another previous blog post, and to chats with fellow Fellow Lee too (some of which he’s captured in a blog post for the Digital Preservation Coalition)… I.e.: computing’s ‘paper paradigm’ makes little sense in relation to preservation, hierarchical / neat information structures don’t hold together as well digitally, we’re going to need to compute across the whole archive, and, well, ‘digital objects’ just aren’t really material ‘objects’, are they?

An issue with thinking about digital ‘stuff’ too much in terms of tangible objects is that opportunities related to the fact the ‘stuff’ is digital can be missed. Matt Zumwalt highlighted one such opportunity in Data together: Communities & institutions using decentralized technologies to make a better web when he introduced ‘content addressing’: using cryptographic hashing and Directed Acyclic Graphs (in this case, information networks that record content changing as time progresses) to manage many copies of ‘stuff’ robustly.
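As a rough illustration of content addressing (my own sketch, not the system Matt presented), the code below stores every piece of content under the SHA-256 hash of its bytes, and records each version as a small node that points to its content and to its parent versions by hash. The class and function names are hypothetical.

```python
import hashlib
import json


class ContentStore:
    """A toy in-memory store where the key is the hash of the content."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        address = hashlib.sha256(data).hexdigest()
        # Identical content always lands at the same address, so copies collapse.
        self._blobs[address] = data
        return address

    def get(self, address):
        data = self._blobs[address]
        # The address doubles as a fixity check on retrieval.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored content does not match its address")
        return data

    def __len__(self):
        return len(self._blobs)


def put_version(store, content, parents=()):
    """Store content plus a node linking it to its parent versions by hash."""
    node = {"content": store.put(content), "parents": sorted(parents)}
    return store.put(json.dumps(node, sort_keys=True).encode("utf-8"))


store = ContentStore()
first = put_version(store, b"draft one")
second = put_version(store, b"draft two", parents=[first])
node = json.loads(store.get(second))
print(node["parents"] == [first])
```

Because each node names its parents by hash, the nodes form a Directed Acyclic Graph: a node cannot refer to a version that did not already exist when it was written.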

This addresses some of the complexities of preserving digital ‘stuff’, but perhaps thinking in terms of ‘copies’, and not ‘branches’ or ‘forks’ is an oversimplification? Precisely because digital ‘stuff’ is rarely static, all ‘copies’ have the potential to deviate from the ‘parent’ or ‘master’ copy. What’s the ‘version of true record’ in all this? Perhaps there isn’t one? Matt referred to ‘immutable data structures’, but the concept of ‘immutability’ only really holds if we think it’s possible for data to ever be completely separated from its informational context, because the information does change, constantly. (Hold that thought).

Switching topics, fellow Polonsky Fellow Somaya often tries to warn me just how complicated working with technical metadata can get. Well, the pennies dropped further during Managing digital preservation metadata at Sound and Vision: A case on matching OAIS and PREMIS with the DPX file format from Annemieke De Jong and Josefien Schuurman. Space precludes going into the same level of detail they did regarding building a Preservation Metadata Dictionary (PMD) about just one, ‘relatively’ simple file format – but let’s say, well, it’s really complicated. (They’ve blogged about it and the whole PMD is online too). The conclusion: preserving files properly means drilling down deep into their formats, but it also got me thinking – shouldn’t the essence of a ‘preservation file format’ be its simplicity?

The need for greater simplicity in preservation was further emphasised by Mathieu Giannecchini’s The Eclair Archive cinema heritage use case: Rising to the challenges of complex formats at large scale. Again – space precludes me from getting into detail, but the key takeaway was that Mathieu has 2 million reels of film to preserve using the Digital Cinema Distribution Master (DCDM) format, and after lots of good work, he’s optimised the process to preserve 8 TB a day (with a target of 15 TB). Now, we don’t know how much film is on each reel, but assuming a (likely over-) estimate of 10 minutes per reel, that’s roughly 180,000 films of 1 hour 50 mins in length. Based on Mathieu’s own figures, it’s going to take many decades, perhaps even a few hundred years, to get through all 2 million reels… So further, major optimisations are required, and I suspect DCDM (a format with a 155-page spec, which relies on TIFF, a format with a 122-page spec) might be one of the bottlenecks.
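For anyone wanting to check the back-of-envelope sum, here it is spelled out. The 10 minutes per reel is the guess stated above, not a figure from the talk; and since the size of a reel in terabytes isn't known, the second calculation only shows how much the 15 TB target would shorten the job relative to 8 TB a day, whatever the total turns out to be.

```python
REELS = 2_000_000
MINUTES_PER_REEL = 10   # assumption from the text, likely an over-estimate
FILM_MINUTES = 110      # 1 hour 50 minutes

total_minutes = REELS * MINUTES_PER_REEL
equivalent_films = total_minutes / FILM_MINUTES

CURRENT_TB_PER_DAY = 8
TARGET_TB_PER_DAY = 15

# Time needed is total size divided by daily throughput, so the ratio of
# the two durations does not depend on the (unknown) total size.
speed_up = TARGET_TB_PER_DAY / CURRENT_TB_PER_DAY

print(f"{equivalent_films:,.0f} feature-length films")
print(f"Hitting the target would make the job {speed_up:.3f} times faster")
```

So even the target rate roughly halves the time rather than changing its order of magnitude, which is why the format itself looks like a candidate for optimisation.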

Of course, the trade-off with simplifying formats is that data will likely be ‘decontextualised’, so there must be a robust method for linking data back to context… Thoughts on this were triggered by Developing and applying principles for discovery and access for the UK Data Service by Katherine McNeill from the UK Data Archive, as Katherine discussed production of a next-generation access system based on a linked-data model with which, theoretically, single cells’ worth of data could be retrieved from research datasets.

Again – space precludes entering into the whole debate around the process of re-using data stripped of original context… Mauthner and Parry illustrate the two contrary sides well, and furthermore argue that merely entertaining the possibility of decontextualising data indicates a certain ‘foundational’ way of thinking that might be invalid from the start? This is where I link to William Kilbride’s excellent DPC blog post from a few months ago.

William’s PASIG talk Sustainable digital futures was also one of two that got closer to what we know is the root of the preservation problem: economics. The other was Aging of digital: Managed services for digital continuity by Natasa Milic-Frayling, which flagged up the current “imbalance in control and empowerment” between tech providers and content producers / owners / curators, an imbalance that means tech firms can effectively doom our digital ‘stuff’ into obsolescence, and we have to suck it up.

I think this imbalance in part exists because there’s too much technical context related to data, because it’s generally in the tech providers’ interests to bloat data formats to match the USPs of their software. So, is a pure ‘preservation format’ one in which the technical context of the data is generalised to the point where all that’s left is commonly-understood mathematics? Is that even possible? Do we really need 122-page specs to explain how raster image data is stored? (It’s just an N-dimensional array of pixel values…, isn’t it…?) I think perhaps we don’t need all the complexity – at the data storage level at least. Though I’m only guessing at this stage: much more research required.
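To make the ‘it’s just an array of pixel values’ point concrete, here is a sketch that writes and reads an image in the plain-text PPM (‘P3’) layout from the Netpbm family: a magic number, the width and height, the maximum channel value, then the red, green and blue numbers in reading order. This is an illustration of how little is strictly needed to store pixel values, not a claim that such a format would meet real preservation requirements; the function names are mine, and the reader ignores PPM comment lines for brevity.

```python
def to_plain_ppm(pixels, max_value=255):
    """Serialise rows of (r, g, b) tuples as plain-text PPM (P3)."""
    height = len(pixels)
    width = len(pixels[0])
    lines = ["P3", f"{width} {height}", str(max_value)]
    for row in pixels:
        lines.append(" ".join(str(channel) for pixel in row for channel in pixel))
    return "\n".join(lines) + "\n"


def from_plain_ppm(text):
    """Parse plain-text PPM (without comment lines) back into rows of tuples."""
    tokens = text.split()
    if not tokens or tokens[0] != "P3":
        raise ValueError("not a plain PPM file")
    width, height = int(tokens[1]), int(tokens[2])
    values = [int(token) for token in tokens[4:]]  # tokens[3] is the max value
    if len(values) != width * height * 3:
        raise ValueError("pixel data does not match the stated dimensions")
    pixels = []
    index = 0
    for _ in range(height):
        row = []
        for _ in range(width):
            row.append(tuple(values[index:index + 3]))
            index += 3
        pixels.append(row)
    return pixels


# A 2 x 2 image: red, green / blue, white.
IMAGE = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

print(to_plain_ppm(IMAGE))
```

Everything a 122-page spec adds on top of this (compression, colour profiles, tiling, embedded metadata) is the ‘technical context’ in question.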

PASIG 2017: honest reflections from a trainee digital archivist

A guest blog post by Kelly, one of the Bodleian Libraries’ graduate digital archivist trainees, on what she learned as a volunteer and attendee of PASIG 2017 Oxford.

Amongst the digital preservation professionals from almost every continent and 130 institutions, my five traineeship colleagues and I took our places amongst the lecture theatre seats, annexe demos and awesome artefacts at the Museum of Natural History for PASIG 2017, Oxford. Just 6 months into our traineeship, it was a brilliant opportunity not only to apply some of our new knowledge to our work at Special Collections, Bodleian Libraries, but also to gain a really current and relevant insight into the theories we have been studying as part of our distance-learning MSc in Digital Curation at Aberystwyth University. The first ‘Bootcamp’ day was exactly what I needed to throw myself in, and it really consolidated my confidence in my understanding of some aspects of the shared language that is used amongst the profession (fixity checks, maturity models… as well as getting to grips with submission information packages, dissemination information packages and everything that occurs in between!).

My pen didn’t stop scribbling all three days, except maybe for tea breaks. Saying that, the demo presentations were also a great time for me and the other trainees to ask questions specifically about workflows and benefits of certain software such as LibNova, Preservica and ResourceSpace.

For want of a better word (and because it really is the truth) PASIG 2017 was genuinely inspiring and there were messages delivered so powerfully I hope that I stay grounded in these for my entire career. Here is what I was taught:

The Community is invaluable. Many of the speakers were quick to assert that sharing practice amongst the digital preservation community is key. This is a value I was familiar with, yet it was striking to witness it happening throughout the conference in such a sincere manner. I can assure you the gratitude and affirmation that followed Eduardo del Valle (University of the Balearic Islands) and his presentation “Sharing my loss to protect your data: A story of unexpected data loss and how to do real preservation” was as encouraging to witness for someone new to the profession as it was for all of the experienced delegates present. As well as sharing practice, it was clear that the community need to be advocating on behalf of each other. It is time and resource consuming but oh-so important.

Digital archives are preserving historical truths. Yes, the majority of the workflow is technological but the objectives and functions are so much more than technology; to just reduce digital preservation down to this is an oversimplification. It was so clear that the range of use cases presented at PASIG were all driven towards documenting social, political, historical information (and preserving that documentation) that will be of absolute necessity for society and infrastructure in future. Right now, for example, Angeline Takawira and her colleagues at UN MICT are working on a digital preservation programme to ensure absolute accountability and usability of the records of the International Criminal Tribunals of both Rwanda and Yugoslavia. I have written a more specific post on Angeline’s presentation here.

Due to the nature of technology and the digital world, the goalposts will always be moving. For example, Somaya Langley’s talk on the future of digital preservation and the mysteries of extracting data from smart devices will soon become (and maybe already is) a reality for those working with accessions of archives or information management. We should, then, embrace change and embrace the unsure and ultimately ‘get over the need for tidiness’ as pointed out by John Sheridan from The National Archives during his presentation “Creating and sustaining a disruptive digital archive”. This is usually counter-intuitive, but as the saying goes, one of the most dangerous phrases to use is ‘we’ve always done it that way’.

The value of digital material outlives the software, so the enabling of prolonged use of software is a real and current issue. Admittedly, this was a factor I had genuinely not even considered before. In my brain I linked obsolescence with hardware and hardware only. Therefore, Dr. Natasa Milic-Frayling’s presentation on “Aging of Digital: Managed Services for digital continuity” shed much light on the changing computing ecosystem and the gradual aging of software. What I found especially interesting about the proposed software-continuity plan was the transparency of it; the fact that the client can ask to see the software at any time whilst it is being stabilised and maintained.

Thank you so much PASIG 2017 and everybody involved!

One last thing…in closing, Cliff Lynch, CNI, brought up that there was comparatively less Web Archiving content this year. If anybody fancies taking a trainee to Mexico next year to do a (lightning) talk on Bodleian Libraries’ Web Archive I am keen…



Computers are the apogee of profligacy: a response to THE most important PASIG 2017 presentations

Following the PASIG conference, Cambridge Technical Fellow Dave Gerrard simply couldn’t wait to fire off his thoughts on the global context of digital preservation and how we need to better consider the world around us to work on a global solution and not just one that suits a capitalist agenda. We usually preface these blogs with “enjoy” but in this instance, please, find a quiet moment, make yourself comfortable, read on and contemplate the global issues passionately presented here.

I’m going to work on a more technical blog about PASIG later, but first I want to get this one off my chest. It’s about the two most important presentations: Angeline Takawira’s Digital preservation at the United Nations Mechanism for International Criminal Tribunals and Keep your eyes on the information, Patricia Sleeman’s discussion of preservation work at the UN Refugee Agency (UNHCR).

Angeline Takawira described, in a very precise and formal manner, how the current best practice in Digital Preservation is being meticulously applied to preserving information from UN war crimes tribunals in The Hague (covering the Balkan conflict) and Arusha, Tanzania (covering the Rwandan genocide). As befitted her work, it was striking how calm Angeline was; how well the facts were stuck to, despite the emotive context. Of course, this has to be the case for work underpinning legal processes: intrusion of emotion into the capture of facts could let those trying to avoid justice escape it.

And the importance of maintaining a dispassionate outlook was echoed in the title of the other talk. “Keep your eyes on the information” was what Patricia Sleeman was told when learning to work with the UNHCR, as to engage too emotionally with the refugee crisis could make vital work impossible to perform. However, Patricia provided some context, in part by playing Head Over Heels (Emi Mahmoud’s poem about the conflict and refugee crisis in Darfur), and by describing the brave, inspirational people she had met in Syria and Kurdistan. An emotionless response was impossible: the talk resulted in the conference’s longest and loudest applause.

Indeed, I think the audience was so stunned by Patricia’s words that questions were hard to formulate. However, my colleague Somaya at least asked the $64,000 one: how can we help? I’d like to tie this question back to one that Patricia raised in her talk, namely (and I paraphrase here): how do you justify expenditure on tasks like preservation when doing so takes food from the mouths of refugees?

So, now I’m less stunned, here’s my take: feeding refugees solves a symptom of the problem. Telling their stories helps to solve the problem, by making us engage our emotions, and think about how our lives are related to theirs, and about how the way we behave impacts upon them. And how can we help? Sure, we can help Patricia with her data management and preservation problems. But how can we really contribute to a solution? How can we stop refugee crises occurring in the first place?

We have a responsibility to recognise the connections between our own behaviour and the circumstances refugees find themselves in, and it all comes down, of course, to resources, and the profligate waste of them in the developed world. Indeed, Angeline and Patricia’s talks illustrated the borderline absurdity of a bunch of (mostly) privileged ‘Westerners’ / ‘Northerners’ (take your pick) talking about the ‘preservation’ of anything, when we’re products of a society that’s based upon throwing everything away.

And computers / all things ‘digital’ are at the apogee of this profligacy: Natasa Milic-Frayling highlighted this when she (diplomatically) referred to the way in which the ‘innovators’ hold all the cards, currently, in the relationship with ‘content producers’, and can hence render the technologies upon which we depend obsolete across ever-shorter cycles. Though, after Patricia’s talk, I’m inclined to frame this more in terms of ‘capitalist industrialists generating unnecessary markets at the expense of consumers’; particularly given that, while we were listening to Patricia, the latest iPhone was being launched in the US.

Though, of course, it’s not really the ‘poor consumers’ who genuinely suffer due to planned obsolescence… That would be the people in Africa and the Middle East whose countries are war zones due to grabs for oil or droughts caused by global warming. As the world’s most advanced tech companies, Apple, Google, Facebook, Amazon, Microsoft et al are the biggest players in a society that – at best indirectly, at worst carelessly – causes the suffering of the people Patricia and Angeline are helping and providing justice for. And, as someone typing a blog post using a MacBook Pro that doesn’t even let me add a new battery – I’m clearly part of the problem, not the solution.

So – in answer to Somaya’s question: how can we help? Well, for a start, we can stop fetishising the iPhone and start bigging up Fairphone and Phonebloks. However, keeping the focus on Digital Preservation, we’ve got to be really careful that our efforts aren’t used to support an IT industry that’s currently profligate way beyond moral acceptability. So rather than assuming (as I did above) that all the ‘best-practice’ of digital preservation flows from the ‘developed’ (ahem) world to the ‘developing’, we ought to seek some lessons in how to preserve technology from those who have fewer opportunities to waste it.

Somaya’s already on the case with her upcoming panel at iPres on the 28th September. Then we ought to continue down the road of holding PASIG in Mexico City next year by holding one in Africa as soon as possible. As long as, when we’re there, we make sure we shut up and listen.

PASIG 2017 Twitter round-up

After many months of planning it feels quite strange to us that PASIG 2017 is over. Hosting the PASIG conference in Oxford has been a valuable experience for the DPOC fellows and a great chance for Bodleian Libraries’ staff to meet with and listen to presentations by digital preservation experts from around the world.

In the end 244 conference delegates made their way to Oxford and the Museum of Natural History. The delegates came from 130 different institutions and every continent of the world was represented (…well, apart from Antarctica).

What was especially exciting though were all the new faces. In fact, two-thirds of the delegates this year had not been to a PASIG conference before! Is this perhaps a sign that interest in digital preservation is on the rise?

As always at PASIG, Twitter was ablaze with discussion in spite of an at times flaky Wi-Fi connection. Over three days #PASIG17 was mentioned a whopping 5,300 times on Twitter and had a “reach” of 1.7 million. Well done everyone on some stellar outreach! The most active tweeting came from the UK, USA and Austria.

Twitter activity by country using #PASIG17 (Talkwalker statistics)

Although it is hard to choose favourites among all the Tweets, a few of the DPOC project’s personal highlights included:

Cambridge Fellow Lee Pretlove lists “digital preservation skills” and why we cannot be an expert in all areas. Tweet by Julian M. Morley

Bodleian Fellow James makes some insightful observations about the incompatibility between tar pits and digital preservation.

Cambridge Fellow Somaya Langley presents in the last PASIG session on the topic of “The Future of Digital Preservation”.  

What were some of your favourite talks and Twitter conversations? What would you like to see more of at PASIG 2018? #futurePASIG

Digital Preservation futurology

I fancy attempting futurology, so here’s a list of things I believe could happen to ‘digital preservation systems’ over the next decade. I’ve mostly pinched these ideas from folks like Dave Thompson, Neil Jefferies, and my fellow Fellows. But if you see one of your ideas, please claim it using the handy commenting mechanism. And because it’s futurology, it doesn’t have to be accurate, so kindly contradict me!

Ingest becomes a relationship, not a one-off event

Many of the core concepts underpinning how computers are perceived to work are crude, paper-based metaphors – e.g. ‘files’, ‘folders’, ‘desktops’, ‘wastebaskets’ etc – that don’t relate to what your computer’s actually doing. (The early players in office computing were typewriter and photocopier manufacturers, after all…) These metaphors have succeeded at getting everyone to use computers, but they’ve also suppressed various opportunities to work smarter, too.

The concept of ingesting (oxymoronic) ‘digital papers’ is obviously heavily influenced by this paper paradigm. Maybe the ‘paper paradigm’ has misled the archival community about computers a bit, too, given that they were experts at handling ‘papers’ before computers arrived?

As an example of what I mean: in the olden days (25 whole years ago!), Professor Plum would amass piles of important papers until the day he retired / died, and then, and only then, could these personal papers be donated and archived. Computers, of course, make it possible for the Prof both to keep his ‘papers’ where he needs them, and donate them at the same time, but the ‘ingest event’ at the centre of current digital preservation systems still seems to be underpinned by a core concept of ‘piles of stuff needing to be dealt with as a one-off task’. In future, the ‘ingest’ of a ‘donation’ will actually become a regular, repeated set of occurrences based upon ongoing relationships between donors and collectors, and forged initially when Profs are but lowly postgrads. Personal Digital Archiving and Research Data Management will become key; and ripping digital ephemera from dying hard disks will become less necessary as they become so.

The above depends heavily upon…

Object versioning / dependency management

Of course, if Dr. Damson regularly donates materials from her postgrad days onwards, some of these may be updates to things donated previously. Some of them might have mutated so much since the original donation that they can be considered ‘child’ objects, which may have ‘siblings’ with ‘common ancestors’ already extant in the archive. Hence preservation systems need to manage multiple versions of ‘digital objects’, and the relationships between them.

Some of the preservation systems we’ve looked at claim to ‘do versioning’ but it’s a bit clunky – just side-by-side copies of immutable ‘digital objects’, not records of the changes from one version to the next, and with no concept of branching siblings from a common parent. Complex structures of interdependent objects are generally problematic for current systems. The wider computing world has been pushing at the limits of the ‘paper-paradigm’ immutable object for a while now (think Git, Blockchain, various version control and dependency management platforms, etc). Digital preservation systems will soon catch up.
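To make the ‘branching siblings from a common parent’ idea a bit more concrete, here is a minimal Python sketch. It is an illustration only, not how any particular preservation system works: the `Version` class, the helper functions and the object names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    """One immutable snapshot of a donated object, linked to its parent."""
    id: str
    parent: Optional["Version"] = None


def lineage(version):
    """Return the ids from this version back to the root, newest first."""
    chain = []
    while version is not None:
        chain.append(version.id)
        version = version.parent
    return chain


def common_ancestor(a, b):
    """Return the id of the nearest version that both a and b descend from."""
    seen = set(lineage(a))
    for version_id in lineage(b):
        if version_id in seen:
            return version_id
    return None


# Hypothetical donations: one draft that later branched into two 'siblings'.
thesis_draft = Version("thesis-draft")
journal_article = Version("journal-article", parent=thesis_draft)
book_chapter = Version("book-chapter", parent=thesis_draft)
revised_article = Version("journal-article-v2", parent=journal_article)

print(common_ancestor(revised_article, book_chapter))
```

Each snapshot stays immutable, but the parent links record how one version relates to the next, which is the part that side-by-side copies leave out.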

Further blurring of the object / metadata boundary

What’s more important, the object or the metadata? The ‘paper-paradigm’ has skewed thinking towards the former (the sacrosanct ‘digital object’, comparable to the ‘original bit of paper’), but after you’ve digitised your rare book collection, what are Humanities scholars going to text-mine? It won’t be images of pages – it’ll be the transcripts of those (i.e. the ‘descriptive metadata’)*. Also, when seminal papers about these text mining efforts are published, how is this history of the engagement with your collection going to be recorded? Using a series of PREMIS Events (that future scholars can mine in turn), perhaps?

The above talk of text mining and contextual linking of secondary resources raises two more points…

* While I’m here, can I take issue with the term ‘descriptive metadata’? All metadata is descriptive. It’s tautological; like saying ‘uptight Englishman’. Can we think of a better name?

Ability to analyse metadata at scale

‘Delivery’ no longer just means ‘giving users a viewer to look at things one-by-one with’ – it now also means ‘letting people push their Natural Language or image processing algorithms to where the data sits, and then coping with vast streams of output data’.

Storage / retention informed by well-understood usage patterns

The fact that everything’s digital, and hence easier to disseminate and link together than physical objects, also means we can better understand how people use our material. This doesn’t just mean ‘wiring things up to Google Analytics’ – advances in bibliometrics that add social / mainstream media analysis, and so forth, to everyday citation counts present opportunities to judge the impact of our ‘stuff’ on the world like never before. Smart digital archives will inform their storage management and retention decisions with this sort of usage information, potentially in fully or semi-automated ways.

Ability to get data out, cleanly – all systems are only ever temporary!

Finally – it’s clear that there are no ‘long-term’ preservation system options. The system you procure today will merely be ‘custodian’ of your materials for the next ten or twenty years (if you’re lucky). This may mean moving heaps of content around in future, but perhaps it’s more pragmatic to think of future preservation systems as more like ‘lenses’ that are laid on top of more stable data stores to enable as-yet-undreamt-of functionality for future audiences?

(OK – that’s enough for now…)

DPOC: 1 year on

Oxford’s Outreach & Training Fellow, Sarah, reflects on how the first year of the DPOC project has gone and looks forward to the big year ahead.

A lot can happen in a year.

A project can finally get a name, a website can launch and a year of auditing can finally reach completion. It has been a long year of lessons and finding things for the Oxford DPOC team.

While project DR@CO and PADLOC never got off the ground, we got the DPOC Project. And with it has come a better understanding of our digital preservation practices at Bodleian Libraries. We’re starting year two with plenty of informed ideas that will lead to roadmaps for implementation and a business case to help continue to move Oxford forward with a digital preservation programme.

Auditing our collections

For the past year, Fellows have been auditing the many collections. The Policy and Planning Fellow spent nearly 6 months tracking down the digitized content of Bodleian Libraries across tape storage and many legacy websites. There was more to be found on hard drives under desks, on network drives and CDs. What Edith found was 20 years of digitized images at Bodleian Libraries. From that came a roadmap and recommendations to improve storage, access and workflows. Changes have already been made to the digitization workflow (we use jpylyzer now instead of JHOVE) and more changes are in progress.

James, the Technical Fellow at Oxford, has been looking at validating and characterising the TIFFs we have stored on tape, especially the half a million TIFFs from the Polonsky Foundation Digitization Project. Not only were there some challenges in recovering the files from tape to disk for the characterisation and validation process, but there was also an issue with customising the XML output from JHOVE. James did find a workaround for getting the outputs into a reporting tool for assessment in the end, but not without plenty of trial and error. However, we’re learning more about our digitized collections (and the preservation challenges facing them) and during year 2 we’ll be writing more about that as we continue to roadmap our future digital preservation work.

Auditing our skills

I spoke to a lot of staff and ran an online survey to understand the training needs of Bodleian Libraries. It is clear that we need to develop a strong awareness about digital preservation and its fundamental importance to the long-term accessibility of our digital collections. We also need to create a strong shared language in order to have these important discussions; this is important when we are coming together from several different disciplines, each with a different language. As a result, some training has begun in order to get staff thinking about the risks surrounding the digital content we use every day, in order to later translate it into our collections. The training and skills gaps identified from the surveys done in year 1 will continue to inform the training work coming in year 2.


What is planned for year 2?

Now that we have a clearer picture of where we are and what challenges are facing us, we’ve been putting together roadmaps and risk registers. This is allowing us to look at what implementation work we can do in the next year to set us up for the work of the next 3, 5, 10, and 15 years. There are technical implementations we have placed into a roadmap to address the major risks highlighted in our risk register. This work is hopefully going to include things like implementing PREMIS metadata and file format validation. This work will prepare us for future preservation planning.

We also have a training programme roadmap and implementation timeline. While not all of the training can be completed in year 2 of the DPOC project, a start can be made and materials prepared for a future training programme. This includes developing a training roadmap to support the technical implementations roadmap and the overall digital preservation roadmap.

There is also the first draft of our digital preservation policy to workshop with key stakeholders and develop into a final draft. There are roles and responsibilities to review and key stakeholders to work with if we want to make sustainable changes to our existing workflows.

Ultimately, what we are working towards is an organisational change. We want more people to think about digital preservation in their work. We are putting forward sustainable recommendations to help develop an ongoing digital preservation programme. There is still a lot of work ahead of us, well beyond the final year of this project, but we are hoping that what we have started will keep going even after the project reaches completion.



Audiovisual creation and preservation: part 2

Paul Heslin, Digital Collection Infrastructure Support Officer/Film Preservation Officer at the National Film and Sound Archive of Australia (NFSA) has generously contributed the following blog post. Introduction by Cambridge Policy and Planning Fellow, Somaya.


As Digital Preservation is such a wide-ranging field, no one working in it can be an absolute expert on absolutely everything. It’s important to have areas of expertise and to connect and collaborate with others who can share their knowledge and experience.

While I have a background in audio, broadcast radio, multimedia and some video editing, moving image preservation is not my area of speciality. It is for this reason I invited Paul Heslin to compose a follow-up to my Audiovisual creation and preservation blog post. Paul Heslin is a Digital Archivist at the NFSA, currently preoccupied with migrating the digital collection to a new generation of LTO tapes.

I am incredibly indebted to Paul and the input from his colleagues and managers (some of whom are also my former colleagues, from when I worked at the NFSA).

Background to moving image preservation

A core concern for all archives is the ongoing accessibility of their collections. In this regard film archives have traditionally been spoilt: a film print does not require any intermediate machinery for assessment, and conceptually a projector is not a complicated device (at least with regard to presenting the visual qualities of the film). Film material can be expected to last hundreds of years if kept in appropriate vault conditions; other moving image formats are not so lucky. Many flavours of videotape are predicted to be extinct within a decade, due to loss of machinery or expertise, and born-digital moving image items can arrive at the archive in any possible format. This situation necessitates digitisation and migration to formats which can be trusted to continue to be suitable. But not only suitable!

Optimistically, the digital preservation of these formats carries the promise of these items maintaining their integrity perpetually. Unlike analogue preservation, there is no assumption of degradation over time; however, there are other challenges to consider. The equipment requirements for playing back a digital audiovisual file can be complicated, especially as the vast majority of such files are compressed using encoding/decoding systems called codecs. There can be very interesting results when these systems go wrong!

Example of Bad Compression (in Paris). Copyright Paul Heslin



Codecs can be used in an archival context for much the same reason as in the commercial world: data storage is expensive and money saved can certainly be spent elsewhere. However, a key difference is that archives require truly lossless compression. So it is important here to distinguish between codecs which are mathematically lossless and those which are merely visually lossless. The latter claim to encode in a way which is visually indistinguishable from an original source file, but they still dispense with ‘superfluous’ data. This is not appropriate for archival usage, as this data loss cannot be recovered, and accumulated migration will ultimately result in visual and aural imperfections.
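The difference can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `zlib` stands in for a mathematically lossless codec, and the hypothetical `lossy_pass` function (which simply throws away the low bits of each sample) stands in for a ‘visually lossless’ one. Neither is a real video codec.

```python
import zlib

# Stand-in for the raw samples of one uncompressed frame.
original = bytes(range(256)) * 4

# Mathematically lossless: decompressing returns exactly the original bytes.
restored = zlib.decompress(zlib.compress(original))


def lossy_pass(data, step=4):
    """Hypothetical 'visually lossless' stand-in: round each sample down to a
    multiple of step, discarding the 'superfluous' low bits."""
    return bytes((sample // step) * step for sample in data)


once = lossy_pass(original)

print(restored == original)  # lossless round trip keeps every bit
print(once == original)      # the lossy pass does not
```

Storing the lossy result in a lossless codec afterwards preserves only what is left; the discarded bits are gone for good.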

Another issue for archivists is that many codecs are proprietary or commercially owned: Apple’s ProRes format is a good example. While it is ubiquitously used within the production industry, it is an especially troubling example given signs that Apple will not be providing support into the future, especially for non-Mac platforms. This is not a huge issue for production companies who will have moved on to new projects and codecs, but for archives collecting these materials this presents a real problem. For this reason there is interest in dependable open standards which exist outside the commercial sphere.


One of the more interesting developments in this area has been the emergence of the FFV1 codec. FFV1 started life in the early 2000s as a lossless codec associated with the FFmpeg free software project and has since gained some traction as a potential audiovisual preservation codec for the future. The advantages of the codec are:

  • It is non-proprietary, unlike many other popular codecs currently in use.
  • It makes use of truly lossless compression, so archives can store more material in less space without compromising quality.
  • FFV1 files are ALWAYS losslessly compressed, which avoids accidents that can result from using formats which can either encode losslessly or lossily (like the popular JPEG-2000 archival format).
  • It internally holds checksums for each frame, allowing archivists to check that everything is as it should be. Frame checksums are especially useful in identifying exactly where an error has occurred.
  • Benchmark tests indicate that conversion speeds are quicker than JPEG-2000. This makes a difference for archives dealing with large collections and limited computing resources.
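The value of per-frame checksums, as opposed to one checksum for the whole file, can be shown with a small Python sketch. This illustrates the general idea only: FFV1 keeps its checksums inside its own bitstream, and the function names and the toy ‘frames’ below are assumptions made up for the example.

```python
import zlib


def frame_checksums(frames):
    """Compute one CRC-32 per frame."""
    return [zlib.crc32(frame) for frame in frames]


def damaged_frames(frames, stored):
    """Return the indexes of frames whose checksum no longer matches."""
    return [
        index
        for index, (frame, checksum) in enumerate(zip(frames, stored))
        if zlib.crc32(frame) != checksum
    ]


# Five toy 'frames' of 1 KB each, checksummed when first written.
frames = [bytes([n]) * 1024 for n in range(5)]
stored = frame_checksums(frames)

# Simulate damage to a single byte in frame 3.
frames[3] = b"\x00" + frames[3][1:]

print(damaged_frames(frames, stored))
```

A single whole-file checksum would only say that something is wrong somewhere; the per-frame list points straight at the damaged frame.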

The final, and possibly most exciting, attribute of FFV1 is that it is developing out of the needs of the archival community, rather than relying on specifications designed for industry use. Updates from the original developer, Michael Niedermayer, have introduced beneficial features for archival use and so far the codec has been implemented in different capacities by The National Archives in the UK, the Austrian National Archives, and the Irish Film Institute, as well as being featured in the FIAF Journal Of Film Preservation.

Validating half a million TIFF files. Part Two.

Back in May, I wrote a blog post about preparing the groundwork for the process of validating over 500,000 TIFF files which were created as part of a Polonsky Digitization Project which started in 2013. You can read Part One here on the blog.

Restoring the TIFF files from tape

Stack of backup tapes. Photo: Amazon

For the digitization workflow we used Goobi and within that process, the master TIFF files from the project were written to tape. In order to actually check these files, it was obvious we would need to restore all the content to spinning disk. I duly made a request to our system administration team and waited.

As I mentioned in Part One, we had set up a new virtualised server which had access to a chunk of network storage. The Polonsky TIFF files were restored to this network storage; however, midway through the restoration from tape, the tape server’s operating system crashed…disaster.

After reviewing the failure, it appeared there was a bug within the RedHat operating system which had caused the problem. This issue proved to be a good lesson: a tape backup copy is only useful if you can actually restore it!

Question for you. When was the last time you tried to restore a large quantity of data from tape?

After some head scratching, patching and a review of the related systems, a second attempt at restoring all the TIFF content from tape commenced and this time all went well and the files were restored to the network storage. Hurrah!

JHOVE to validate those TIFFs

I decided that for the initial validation of the TIFF files, checking the files were well-formed and valid, JHOVE would provide a good baseline report.

As I mentioned in another blog post, ‘Customizable JHOVE TIFF output handler anyone?’, JHOVE’s XML output is rather unwieldy. So I planned to transform the XML using xsltproc (a command line XSLT processor) with a custom XSLT stylesheet, allowing us to select any of the attributes from the file which we might want to report on later. This would then produce a simple CSV output.
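The shape of that XML-to-CSV step can be sketched as follows. Note the swap: this uses Python’s standard library rather than xsltproc and an XSLT stylesheet, and the XML below is a deliberately simplified, hypothetical report, not JHOVE’s real output schema. The element names, file paths and the `xml_to_csv` helper are all assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified report. Real JHOVE XML is namespaced and far richer.
SAMPLE_XML = """<report>
  <file>
    <path>/data/example/0001.tif</path>
    <format>TIFF</format>
    <status>Well-Formed and valid</status>
  </file>
  <file>
    <path>/data/example/0002.tif</path>
    <format>TIFF</format>
    <status>Not well-formed</status>
  </file>
</report>"""

# The attributes we have chosen to report on, in column order.
FIELDS = ["path", "format", "status"]


def xml_to_csv(xml_text, fields=FIELDS):
    """Flatten each <file> element into one CSV row of the selected fields."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    for node in root.iter("file"):
        writer.writerow([node.findtext(field, default="") for field in fields])
    return out.getvalue()


csv_text = xml_to_csv(SAMPLE_XML)
print(csv_text)
```

Changing `FIELDS` is the equivalent of editing the stylesheet: it decides which attributes end up as columns.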

On a side note, work on adding a CSV output handler to JHOVE is in progress! This would mean the above process would be much simpler and quicker.

Parallel processing for the win.

What’s better than one JHOVE process validating TIFF content? Two! (well actually for us, sixteen at once works out quite nicely.)

It was clear from some initial testing with a 10,000 sample set of TIFF files that a single JHOVE process was going to take a long time to process 520,000+ images (around two and a half days!)

So I started to look for a simple way to run many JHOVE processes in parallel. Using GNU Parallel seemed like a good way to go.

I created a command line BASH script which would take a list of directories to scan and then utilise GNU Parallel to fire off many JHOVE + XSLT processes to result in a CSV output, one line per TIFF file processed.
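The fan-out pattern can be sketched in Python. To be clear, this is not the BASH script described above: a thread pool stands in for GNU Parallel, and the hypothetical `check_file` function stands in for one JHOVE + XSLT run by merely checking the four-byte TIFF header. The file names are made up for the demonstration.

```python
import csv
import io
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# A TIFF file starts with one of these two byte-order signatures.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def check_file(path):
    """Stand-in for one JHOVE + XSLT run: return one CSV row for one file."""
    with open(path, "rb") as handle:
        header = handle.read(4)
    return [path.name, "ok" if header in TIFF_MAGIC else "failed"]


def validate_tree(root, workers=16):
    """Check every .tif under root in parallel; return CSV, one line per file."""
    paths = sorted(Path(root).rglob("*.tif"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(check_file, paths))  # map keeps the input order
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Tiny demonstration with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "good.tif").write_bytes(b"II*\x00" + b"\x00" * 16)
    (Path(tmp) / "bad.tif").write_bytes(b"not a tiff")
    report = validate_tree(tmp, workers=4)

print(report)
```

In a real run each worker would launch an external validator and wait for it, so threads are enough here; the heavy lifting would happen in the child processes, much as it does with GNU Parallel.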

As our validation server was virtualised, it meant that I could scale the memory and CPU cores in this machine to do some performance testing. Below is a chart showing the number of images that the parallel processing system could handle per minute vs. the number of CPU cores enabled on the virtual server. (For all of the testing the memory in the server remained at 4 GB.)

So with 16 CPU cores, the estimate was that it would take around 6-7 hours to process all the Polonsky TIFF content, so a nice improvement on a single process.

At the start of this week, I ran a full production test, validating all 520,000+ TIFF files. Four and a half hours later the process was complete and a 100 MB+ CSV file had been generated with 520,000+ rows of data. Success!
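As a rough back-of-envelope check on those figures, the arithmetic can be written out in Python. The inputs are the approximate numbers quoted above (520,000 files, about two and a half days for a single process, four and a half hours in parallel), so the results are ballpark values only.

```python
TOTAL_FILES = 520_000        # the post says '520,000+', so a lower bound
single_hours = 2.5 * 24      # 'around two and a half days' for one process
parallel_hours = 4.5         # the full production run


def files_per_minute(total, hours):
    """Average throughput over a whole run."""
    return total / (hours * 60)


single_rate = files_per_minute(TOTAL_FILES, single_hours)
parallel_rate = files_per_minute(TOTAL_FILES, parallel_hours)
speedup = single_hours / parallel_hours

print(f"single process: about {single_rate:.0f} files per minute")
print(f"parallel run:   about {parallel_rate:.0f} files per minute")
print(f"speed-up:       about {speedup:.1f}x")
```

On these numbers the parallel run comes out at roughly thirteen times faster than the single-process estimate.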

For Part Three of this story I will write up how I plan to visualise the CSV data in Qlik Sense and the further analysis of those few files which failed the initial validation.