Cambridge University Libraries inaugural Digital Preservation Policy

The inaugural Cambridge University Libraries Digital Preservation Policy was published last week. Somaya Langley (Cambridge Policy & Planning Fellow) provides some insight into the policy development process and announces a policy event in London, presented in collaboration with Edith (Oxford Policy & Planning Fellow), to be held in early December 2018.


In December 2016, I started the digital preservation policy development process for Cambridge University Library (CUL), which has finally culminated in a published policy.

Step one

Commencing with a ‘quick and dirty’ policy gap analysis at CUL, what I discovered was not so much that there were gaps in the existing policy landscape, but rather that there was a dearth of much-needed policies. The gap analysis at CUL found that a few key policies did exist for different audiences (some intended to guide CUL, some to guide researchers and some meant for all staff and researchers working at the University of Cambridge). While my counterpart at Oxford found there was duplication in their policies across the Bodleian Libraries and the University of Oxford, I mostly found chasms.

Next step

The second step in the policy development process was attempting to meet an immediate need from staff by adding some “placeholder” digital preservation statements into the Collection Care and Conservation Policy, which was then under review. In the longer term it might be ideal to have a single preservation policy (encompassing the conservation and preservation of both physical and digital collection items), but CUL’s digital preservation maturity and skill capabilities are too low at present. Focus really needed to be drawn to how to manage digital content, hence the need for a separate Cambridge University Libraries Digital Preservation Policy.

That said, like everything else I’ve been doing at Cambridge, policy needed to be addressed holistically. Undertaking about two full weeks of work (spread over several months in early 2017) on the review of the Collection Care and Conservation Policy meant I could include some statements in that policy to support better care for digital (and audiovisual) content still remaining on carriers that are yet to be transferred.

Collaborative development

Then in June 2017 we moved on to undertaking policy development collaboratively. Part of this was an international digital preservation policy review – looking at dozens of different policies (and some strategies). Edith wrote about the policy development process back in the middle of last year.

The absolute lion’s share of the work was carried out by my Oxford counterparts, Edith and Sarah. Due to other work priorities, I didn’t have much available time during this stage. This is why it is so important to have a team – whether this is a co-located team or distributed across an organisation or multiple organisations – when working in the digital preservation space. I really can’t thank them enough for carrying the load for this task.

Policy template

My contribution was to develop a generic policy template, for use in both our organisations. Those who know me will know I prefer to ‘borrow and adapt’ rather than reinvent the wheel. So I used the layout of policies from a previous workplace and constructed a template for use by CUL and the Bodleian Libraries. I was particularly keen to ensure what I developed was generic, so that it could be used for any type of policy development in future.

This template has now been provided to the Digital Preservation Coalition, who will make it available alongside other documents in the coming years – so that this groundwork doesn’t have to be repeated by every other organisation still needing to do digital preservation policy (or other policy) development. Our international digital preservation maturity and resourcing survey (another blog post on this is still to follow) found that at least 42% of organisations internationally still do not have a digital preservation policy.

Who has a digital preservation policy?

What next?

Due to other work priorities, drafting the digital preservation policy didn’t properly commence until earlier this year. But by this point I had a good handle on my organisation’s specific:

  • Challenges and issues related to digital content (not just preservation and management concerns)
  • High-level ‘profile’ of digital collections, right across all content ‘classes’
  • Gaps in policy, standards, procedures and guidelines (PSPG) as well as strategy
  • Appreciation of a wide range of digital preservation policies (internationally)
  • Digital preservation maturity (holistic, not just technical) – based on maturity assessments using several digital preservation maturity models
  • Governance (related to policy and strategy)
  • Language relevant to my organisation
  • Responsibilities across the organisation
  • Relevant legislation (UK/EU)

This informed my approach to drafting a digital preservation policy that would meet CUL’s needs.

Approach

I realised that CUL required a comprehensive policy that would fill the many gaps other policies would ideally cover. I should note that there are many ways of producing a policy, and it does have to be tailored to meet the needs of your organisation. (You can compare this with Edith’s digital preservation policy for the Bodleian Libraries, Oxford.)

The next steps involved:

  • Gathering requirements (this had already taken place during 2017)
  • Setting out a high-level structure/list of points to address
  • Defining the stakeholder group membership (and ways of engaging with them)
  • Setting the frame of the task ahead
  • Agreeing on the scope (this changed from ‘Cambridge University Library’ to ‘Cambridge University Libraries’ – encompassing CUL’s affiliate and dependent libraries)

Then came the iterative process of:

  1. Drafting policy statements and principles
  2. Meeting with the stakeholder group and discussing the draft
  3. Gathering feedback on the policy draft (internally and externally)
  4. Incorporating feedback
  5. Circulating a new version of the draft
  6. Developing associated documentation (to support the policy)

Once a final version had been reached, this was followed by the approvals and ratification process.

What do we have?

Last week, the inaugural Cambridge University Libraries Digital Preservation Policy was published (which was not without a few more hurdles).

It has been an ‘on again, off again’ process that has taken 23 months in total. Now we can say, for CUL and the University of Cambridge, that:

“Long-term preservation of digital content is essential to the University’s mission of contributing to society through the pursuit of education, learning, and research.”

This complements some of our other CUL policies.

What now?

This is never the end of a policy process. Policy should be a ‘living and breathing’ process, with the policy document itself being there purely to keep a record of the agreed-upon decisions and principles.

So, of course there is more to do. “But what’s that?”, I hear you say.

Join us

There is so much more that Edith and I would like to share with you about our policy development journey over the past two years of the Digital Preservation at Oxford and Cambridge (DPOC) project.

So much so that we’re running an event in London on Tuesday 4th December 2018 on Devising Your Digital Preservation Policy, hosted by the DPC. (There is one seat left – if you’re quick, that could be you).

We’re also lucky to be joined by two ‘provocateurs’ for the day:

  • Kirsty Lingstadt, Head of Digital Library and Deputy Director of Library and University Collections, University of Edinburgh
  • Jenny Mitcham, Head of Good Practice and Standards, Digital Preservation Coalition (who has just landed in her new role – congrats & welcome to Jenny!)

There is so much more I could say about policy development in relation to digital content, but I’ll leave it there. I do hope you get to hear Edith and me wax lyrical about this.

Thank-yous

Finally, I must thank my Cambridge Polonsky team members, Edith Halvarsson (my Oxford counterpart), plus Paul Wheatley and William Kilbride from the DPC. Policy can’t be developed in a void and their contributions and feedback have been invaluable.

Closing the digitization gap

MS. Canon. Misc. 378, fol. 136r

Bodleian Digital Library’s Digitization Assistant, Tim, guest blogs about the treasures he finds while migrating and preparing complete, high-fidelity digitised items for Digital Bodleian. The Oxford DPOC Fellows feel lucky to sit across the office from the team that manages Digital Bodleian and so many of our amazing digitized collections.


We might spend most of our time on an industrial estate here at BDLSS, but we still get to do a bit of treasure-hunting now and then. Our kind has fewer forgotten ruins or charming wood-panelled reading rooms than we might like, admittedly – it’s more a matter of rickety MySQL databases and arcane PHP scripts. But the rewards can be great. Recent rummages have turned up a Renaissance masterpiece, a metaphysical manuscript, and the legacy of a Polish queen.

Back in October, Emma wrote about our efforts to identify digital images held by the Bodleian which would make good candidates for Digital Bodleian, but for one reason or another haven’t yet made it onto the site. Since that post was published, we have been making good progress migrating images from our legacy websites, including the Oxford Digital Library and – coming soon to Digital Bodleian – our Luna collection of digitized slides. Many of the remaining unpublished images in our archive are unsuitable for the site, as they don’t constitute full image sets: we’re trying to keep Digital Bodleian a reserve for complete, high-fidelity digitized items, rather than a dumping-ground for fragmentary facsimiles. But among the millions of images are a few sets of fully-photographed books and manuscripts still waiting to be showcased to the public on our digital platform.


A recent Digital Bodleian addition: the Notitia Dignitatum, a hugely important Renaissance copy of a late-Roman administrative text (MS. Canon. Misc. 378).

Identifying these full-colour, complete image sets isn’t as easy as we’d like, thanks to some slightly creaky legacy databases and the sheer volume of material versus limited staff time. An approach mentioned by Emma has, however, yielded some successes. Taking suggestions from our curators – and, more recently, our Twitter followers – we’ve been able to draw up a digitization wishlist, which also serves as a list of targets for when we go ferreting around in the archive. Most haven’t been fully photographed, but we’ve turned up a clutch of exciting items from these efforts.

Finding the images is only half the hunt, though. To present the digital facsimiles usefully, we need to give them some descriptive metadata. Digital Bodleian isn’t intended to be a catalogue, but we like to provide some information about an item where we have it, and make our digitized collections discoverable, as well as giving context for non-experts. But as with finding images, locating useful metadata isn’t always simple.

Most of the items on Digital Bodleian sit within the Bodleian’s Special Collections. Each object is unique, requiring the careful attention of an expert to be properly catalogued. For this reason, modern cataloguing efforts focus on subsets of the collections. For items not covered by these, often the only published descriptions (if any) are in 19th-century surveys – which can be excellent, but can also be terse or out of date. Other descriptions and scholarly analyses are spread across a variety of published and unpublished material, some of it available in digital form, most of it not. This all presents a challenge when it comes to finding information to go along with items on Digital Bodleian: much as we’d like to be, Emma and I aren’t yet experts on all the periods, areas and traditions represented in the Bodleian’s holdings.


Another item pulled from the Bodleian’s image archive: a finely decorated 16th-century Book of Hours (MS. Douce 112).

Happily, our colleagues responsible for curating these collections are engaged in constant, dogged efforts to make descriptions more accessible. Especially useful to those of us unable to pop into the Weston to rifle through printed finding aids are a set of TEI-based electronic catalogues*, developed in conjunction with BDLSS. These aim to provide systematically-structured digital catalogue entries for a variety of Western and Oriental Special Collections. They’re fantastic resources, but they represent ongoing cataloguing campaigns, rather than finished products. Nor do they cover all the Special Collections.

Our most valuable resource therefore remains the ever-patient curators themselves. They kindly help us track down information about the items we’re putting on Digital Bodleian from a sometimes-daunting array of potential sources, put us in touch with other experts where required, and are always ready to answer our questions when we need something clarified. This has been enormously helpful in providing descriptions for our new additions to the site.

With this assistance, and the help of our colleagues in the Imaging Studio, who provide similar expertise in tracking down the images, and try hard to squeeze in time to photograph items from the aforementioned wishlist, we’ve managed to get 25 new treasures onto Digital Bodleian since Emma’s post, on top of all the ongoing new photography and migration projects. This totals around 9,300 images altogether, and we have more items on the way (due soon are a couple of Mesoamerican codices and an Old Sundanese text printed on palm leaves from Java). Slowly, we’re closing the gap.

A selection of recent items we’ve dug up from our archives:

MS. Ashmole 304
MS. Ashmole 399
MS. Auct. D. inf. 2. 11
MS. Canon. Bibl. Lat. 61
MS. Canon. Misc. 213
MS. Canon. Misc. 378
MS. Douce 112
MS. Douce 134
MS. Douce 40
MS. Holkham misc. 49
MS. Lat. liturg. e. 17
MS. Lat. liturg. f. 2
MS. Laud Misc. 108
MS. Tanner 307


*Currently live are catalogues of medieval manuscripts, Hebrew manuscripts, Genizah fragments, and union catalogues of Islamicate manuscripts and Shan Buddhist manuscripts in the United Kingdom. Catalogues of Georgian and Armenian manuscripts, to an older TEI standard, are still online and are currently undergoing conversion work. Similar, non-TEI-based resources for incunables and some of our Chinese Special Collections are also available.

The vision for a preservation repository

Over the last couple of months, work at Cambridge University Library has begun on what a potential digital preservation system might look like, considering the technical infrastructure, the key stakeholders and the policies underpinning it. Technical Fellow, Dave, tells us more about the holistic vision…


This post discusses some of the work we’ve been doing to lay foundations beneath the requirements for a ‘preservation system’ here at Cambridge. In particular, we’re looking at the core vision for the system. It comes with the standard ‘work in progress’ caveats – do not be surprised if the actual vision varies slightly (or more) from what’s discussed here. A lot of the below comes from Mastering the Requirements Process by Suzanne and James Robertson.

Also – it’s important to note that what follows is based upon a holistic definition of ‘system’ – a definition that’s more about what people know and do, and less about Information Technology, bits of tin and wiring.

Why does a system change need a vision?

New systems represent changes to the existing status quo. The vision is like the Pole Star for such a change effort – it ensures that people have something fixed to move towards when they’re buried under minute details. When confusion reigns, you can point to the vision for the system to guide you back to sanity.

Plus, as with all digital efforts, none of this is real: there’s no definite, obvious end point to the change. So the vision will help us recognise when we’ve achieved what we set out to do.

Establishing scope and context

Defining what the system change isn’t is a particularly good way of working out what it actually represents. This can be achieved by thinking about the systems around the area you’re changing and the information that’s going to flow in and out. This sort of thinking makes for good diagrams: one that shows how a preservation repository system might sit within the broader ecosystem of digitisation, research outputs / data, digital archives and digital published material is shown below.

System goals

Being able to concisely sum up the key goals of the system is another important part of the vision. This is a lot harder than it sounds and there’s something journalistic about it – what you leave out is definitely more important than what you keep in. Fortunately, the vision is about broad brush strokes, not detail, which helps at this stage.

I found some great inspiration in Sustainable Economics for a Digital Planet, which indicated goals such as: “the system should make the value of preserving digital resources clear”, “the system should clearly support stakeholders’ incentives to preserve digital resources” and “the functional aspects of the system should map onto clearly-defined preservation roles and responsibilities”.

Who are we implementing this for?

The final main part of the ‘vision’ puzzle is the stakeholders: who is going to benefit from a preservation system? Who might not benefit directly, but really cares that one exists?

Any significant project is likely to have a LOT of these, so the Robertsons suggest breaking the list down by proximity to the system (using Ian Alexander’s Onion Model), from the core team that uses the system, through the ‘operational work area’ (i.e. those with the need to actually use it) and out to interested parties within the host organisation, and then those in the wider world beyond. An initial attempt at thinking about our stakeholders this way is shown below.

One important thing that we realised was that it’s easy to confuse ‘closeness’ with ‘importance’: there are some very important stakeholders in the ‘wider world’ (e.g. Research Councils or historians) that need to be kept in the loop.

A proposed vision for our preservation repository

After iterating through all the above a couple of times, the current working vision (subject to change!) for a digital preservation repository at Cambridge University Library is as follows:

The repository is the place where the best possible copies of digital resources are stored, kept safe, and have their usefulness maintained. Any future initiatives that need the most perfect copy of those resources will be able to retrieve them from the repository, if authorised to do so. At any given time, it will be clear how the digital resources stored in the repository are being used, how the repository meets the preservation requirements of stakeholders, and who is responsible for the various aspects of maintaining the digital resources stored there.

Hopefully this will give us a clear concept to refer back to as we delve into more detail throughout the months and years to come…

Guest post: The 6-million-image gap

Bodleian Digital Library Systems and Services’ Digital Curator, Emma Stanford, guest blogs for the DPOC project this week. Emma writes about what she is doing to close some of the 6-million-image gap between what’s in our tape archive and what’s available online at Digital.Bodleian. It’s no small task, but sometimes Emma finds some real gems just waiting to be made available to researchers. She also raises some good questions about what metadata we should make available to researchers to interpret our digitized images. Read more from Emma below.


Thanks to Edith’s hard work, we now know that the Bodleian Imaging Services image archive contains about 5.8 million unique images. This is in addition to various images held on hard drives and other locations around the Bodleian, which bring the total up to almost 7 million. Digital.Bodleian, however, our flagship digital image platform, contains only about 710,000 unique images–a mere tenth of our total image archive. What gives?

That 6-million-image gap consists of two main categories:

Images that are online elsewhere (aka the migration backlog). In the decades before Digital.Bodleian, we tried a number of other image delivery platforms that remain with us today: Early Manuscripts at Oxford University, the Toyota City Imaging Project, the Oxford Digital Library, Luna, etc., etc. Edith has estimated that the non-Digital.Bodleian content comprises about 1.4 million images. Some of these images don’t belong in Digital.Bodleian, either because we don’t have rights to the images (for example, Queen Victoria’s Journals) or because they are incomplete selections rather than full image sets (for example, the images in the Bodleian Treasures exhibition). Our goal is to migrate all the content we can to Digital.Bodleian and eventually shut down most of the old sites. We’ve been chipping away at this task very slowly, but there is a lot left to do.

Images that have never been online. Much of Imaging Services’ work is commercial orders: shooting images for researchers, publishers, journalists, etc. We currently store all these images on tape, and we have a database that records the shelfmark, number of images, and list of captured pages, along with information about when and how the images were captured. Searching through this archive for Digital.Bodleian-appropriate images is a difficult task, though. Shelfmark notation isn’t standardized at all, so there are lots of duplicate records. Also, in many cases, just a few pages from a book or manuscript were captured, or the images were captured in black-and-white or grayscale; either way, not suitable for Digital.Bodleian, where we aim to publish fully-digitized works in full colour.

I’m working on extracting a list of complete, full-colour image sets from this database. In the meantime, we’ve started approaching the problem from the other direction: creating a list of items that we’d like to have on Digital.Bodleian, and then searching the archive for images of them. To do this, we asked the Bodleian’s manuscript and rare book curators to share with us their lists of “greatest hits”: the Bodleian’s most valuable, interesting, and/or fragile holdings, which would benefit most from online surrogates. We then began going through this list searching for the shelfmarks in the image archive. Mostly, we’ve found only a few images for each shelfmark, but occasionally we hit the jackpot: a complete, full-colour image set of a 13th-century bestiary or a first edition of a Shakespeare play.

Going through the archives in this way has underlined for me just how much the Bodleian’s imaging standards have changed in the last two decades. File size has increased, of course, as higher-resolution digital scanning backs have become available; but changes in lighting equipment, book cradles, processing software, rulers and colour charts have all made their mark on our images too. For me, this has raised the question of whether the technical metadata we’re preserving in our archives, about when and how the images were captured, should also be made available to researchers in some way, so that they can make an informed choice about how to interpret the images they encounter on sites like Digital.Bodleian.

In the meantime, here are some of the image sets we’ve pulled out of the archive and digitized so far:

Jane Austen’s juvenilia
a 13th-century bestiary
the Oxford Catullus

MS. Bodl. 764, fol. 2r (detail)


Audiovisual creation and preservation: part 2

Paul Heslin, Digital Collection Infrastructure Support Officer/Film Preservation Officer at the National Film and Sound Archive of Australia (NFSA) has generously contributed the following blog post. Introduction by Cambridge Policy and Planning Fellow, Somaya.

Introduction

As digital preservation is such a wide-ranging field, people working in it can’t be experts in absolutely everything. It’s important to have areas of expertise and to connect and collaborate with others who can share their knowledge and experience.

While I have a background in audio, broadcast radio, multimedia and some video editing, moving image preservation is not my area of speciality. It is for this reason I invited Paul Heslin to compose a follow-up to my Audiovisual creation and preservation blog post. Paul Heslin is a Digital Archivist at the NFSA, currently preoccupied with migrating the digital collection to a new generation of LTO tapes.

I am incredibly indebted to Paul and the input from his colleagues and managers (some of whom are also my former colleagues, from when I worked at the NFSA).


Background to moving image preservation

A core concern for all archives is the ongoing accessibility of their collections. In this regard film archives have traditionally been spoilt: a film print does not require any intermediate machinery for assessment, and conceptually a projector is not a complicated device (at least in regards to presenting the visual qualities of the film). Film material can be expected to last hundreds of years if kept in appropriate vault conditions; other moving image formats are not so lucky. Many flavours of videotape are predicted to be extinct within a decade, due to loss of machinery or expertise, and born-digital moving image items can arrive at the archive in any possible format. This situation necessitates digitisation and migration to formats which can be trusted to continue to be suitable. But not only suitable!

Optimistically, the digital preservation of these formats carries the promise of these items maintaining their integrity perpetually. Unlike analogue preservation, there is no assumption of degradation over time; however, there are other challenges to consider. The equipment requirements for playing back a digital audiovisual file can be complicated, especially as the vast majority of such files are compressed using encoding/decoding systems called codecs. There can be very interesting results when these systems go wrong!

Example of Bad Compression (in Paris). Copyright Paul Heslin


Codecs

Codecs can be used in an archival context for much the same reasons as in the commercial world. Data storage is expensive and money saved can certainly be spent elsewhere. However, a key difference is that archives require truly lossless compression. So it is important here to distinguish between codecs which are mathematically lossless and those which are merely ‘visually lossless’. The latter claim to encode in a way which is visually indistinguishable from the original source file, but they still dispense with ‘superfluous’ data. This is not appropriate for archival usage, as the lost data cannot be recovered, and accumulated migrations will ultimately result in visual and aural imperfections.

Another issue for archivists is that many codecs are proprietary or commercially owned: Apple’s ProRes format is a good example. While it is ubiquitously used within the production industry, it is a particularly troubling example given signs that Apple will not be providing support into the future, especially for non-Mac platforms. This is not a huge issue for production companies, who will have moved on to new projects and codecs, but for archives collecting these materials it presents a real problem. For this reason there is interest in dependable open standards which exist outside the commercial sphere.

FFV1

One of the more interesting developments in this area has been the emergence of the FFV1 codec. FFV1 started life in the early 2000s as a lossless codec associated with the FFMPEG free software project and has since gained some traction as a potential audiovisual preservation codec for the future. The advantages of the codec are:

  • It is non-proprietary, unlike the many other popular codecs currently in use.
  • It makes use of truly lossless compression, so archives can store more material in less space without compromising quality.
  • FFV1 files are ALWAYS losslessly compressed, which avoids accidents that can result from using formats which can either encode losslessly or lossily (like the popular JPEG-2000 archival format).
  • It internally holds checksums for each frame, allowing archivists to check that everything is as it should be. Frame checksums are especially useful in identifying where an error has occurred (see the sketch after this list).
  • Benchmark tests indicate that conversion speeds are quicker than JPEG-2000. This makes a difference for archives dealing with large collections and limited computing resources.
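
To make the ‘truly lossless’ and checksum points above concrete, here is a minimal sketch of how an FFV1 master might be created and verified with FFmpeg. The file names are hypothetical and the exact flags will depend on your source material and local workflow – treat this as an illustration rather than a recipe.

    # Encode a source file to FFV1 version 3 in a Matroska container,
    # with per-slice CRCs switched on for error detection
    ffmpeg -i source.mov -c:v ffv1 -level 3 -slicecrc 1 -c:a copy master_ffv1.mkv

    # Decode both files to per-frame MD5 checksums of the uncompressed video
    ffmpeg -i source.mov      -map 0:v -f framemd5 source.framemd5
    ffmpeg -i master_ffv1.mkv -map 0:v -f framemd5 master.framemd5

    # If the compression really is lossless, the two lists will be identical
    diff source.framemd5 master.framemd5 && echo "Lossless round-trip confirmed"

If the two framemd5 lists match, every decoded frame of the FFV1 master is bit-identical to the source – exactly the mathematically lossless behaviour an archive needs.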

The final, and possibly most exciting, attribute of FFV1 is that it is developing out of the needs of the archival community, rather than relying on specifications designed for industry use. Updates from the original developer, Michael Niedermayer, have introduced beneficial features for archival use, and so far the codec has been implemented in different capacities by The National Archives in the UK, the Austrian National Archives, and the Irish Film Institute, as well as being featured in the FIAF Journal of Film Preservation.

Validating half a million TIFF files. Part Two.

Back in May, I wrote a blog post about preparing the groundwork for the process of validating over 500,000 TIFF files which were created as part of the Polonsky Digitization Project, which started in 2013. You can read Part One here on the blog.

Restoring the TIFF files from tape

Stack of backup tapes. Photo: Amazon

For the digitization workflow we used Goobi and within that process, the master TIFF files from the project were written to tape. In order to actually check these files, it was obvious we would need to restore all the content to spinning disk. I duly made a request to our system administration team and waited.

As I mentioned in Part One, we had set up a new virtualised server which had access to a chunk of network storage. The Polonsky TIFF files were restored to this network storage; however, midway through the restoration from tape, the tape server’s operating system crashed…disaster.

After reviewing the failure, it appeared there was a bug within the Red Hat operating system which had caused the problem. This issue proved to be a good lesson: a tape backup copy is only useful if you can actually restore it!

Question for you. When was the last time you tried to restore a large quantity of data from tape?

After some head scratching, patching and a review of the related systems, a second attempt at restoring all the TIFF content from tape commenced and this time all went well and the files were restored to the network storage. Hurrah!

JHOVE to validate those TIFFs

I decided that for the initial validation of the TIFF files, checking the files were well-formed and valid, JHOVE would provide a good baseline report.

As I mentioned in another blog post, Customizable JHOVE TIFF output handler anyone?, JHOVE’s XML output is rather unwieldy, so I planned to transform the XML using xsltproc (a command line XSLT processor) with a custom XSLT stylesheet. This would allow us to select any of the attributes from the file which we might want to report on later, and would produce a simple CSV output.
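
As a rough illustration, the single-file version of that pipeline looks something like the sketch below. The stylesheet and file names are hypothetical stand-ins, not the project’s actual ones.

    # Validate one TIFF with JHOVE's TIFF module and write the XML report,
    # then flatten that report into a single CSV row with the custom stylesheet
    jhove -m TIFF-hul -h XML -o report.xml image_0001.tif
    xsltproc tiff-to-csv.xsl report.xml >> results.csv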

On a side note, work on adding a CSV output handler to JHOVE is in progress! This would mean the above process would be much simpler and quicker.

Parallel processing for the win.

What’s better than one JHOVE process validating TIFF content? Two! (well actually for us, sixteen at once works out quite nicely.)

It was clear from some initial testing with a 10,000 sample set of TIFF files that a single JHOVE process was going to take a long time to process 520,000+ images (around two and a half days!).

So I started to look for a simple way to run many JHOVE processes in parallel. Using GNU Parallel seemed like a good way to go.

I created a command line BASH script which would take a list of directories to scan and then utilise GNU Parallel to fire off many JHOVE + XSLT processes, resulting in a CSV output with one line per TIFF file processed.
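
The script itself isn’t reproduced here, but a hedged sketch of the approach might look like the following – the directory list, stylesheet name, output file and job count are all placeholders.

    #!/bin/bash
    # Read a list of directories, find every TIFF beneath them, and run
    # sixteen JHOVE + xsltproc jobs at a time, one CSV row per file.

    DIR_LIST="$1"             # text file with one directory per line
    OUT_CSV="jhove_results.csv"
    XSL="tiff-to-csv.xsl"     # custom stylesheet that flattens JHOVE XML to CSV

    validate_one() {
        local tiff="$1"
        # With no -o flag JHOVE writes its XML to stdout, so we can pipe it
        # straight into xsltproc to get one CSV line for this file
        jhove -m TIFF-hul -h XML "$tiff" | xsltproc "$XSL" -
    }
    export -f validate_one
    export XSL

    # Hand every TIFF under the listed directories to GNU Parallel, 16 jobs at once
    cat "$DIR_LIST" \
      | xargs -I{} find {} -type f -iname '*.tif*' \
      | parallel -j 16 validate_one {} >> "$OUT_CSV"

GNU Parallel buffers each job’s output before printing it, so the CSV rows from concurrent processes don’t get interleaved mid-line.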

As our validation server was virtualised, it meant that I could scale the memory and CPU cores in this machine to do some performance testing. Below is a chart showing the number of images that the parallel processing system could handle per minute vs. the number of CPU cores enabled on the virtual server. (For all of the testing the memory in the server remained at 4 GB.)

So with 16 CPU cores, the estimate was that it would take around 6-7 hours to process all the Polonsky TIFF content – a nice improvement on a single process.

At the start of this week, I ran a full production test, validating all 520,000+ TIFF files. Four and a half hours later the process was complete and a 100 MB+ CSV file had been generated with 520,000+ rows of data. Success!
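
To put that in perspective, 520,000 files in four and a half hours works out at roughly 1,900 files per minute across the 16 parallel processes, compared with the roughly 145 files per minute implied by the original single-process estimate of two and a half days.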

For Part Three of this story I will write up how I plan to visualise the CSV data in Qlik Sense and the further analysis of those few files which failed the initial validation.

Over 20 years of digitization at the Bodleian Libraries

Policy and Planning Fellow Edith writes an update on some of her findings from the DPOC project’s survey of digitized images at the Bodleian Libraries.


Between August and December 2016 I collated information about Bodleian Libraries’ digitized collections. As an early adopter of digitization technology, the Bodleian Libraries have made digital surrogates of their collections available online since the early 1990s. A particular favourite of mine, and a landmark among the Bodleian Libraries’ early digital projects, is the Toyota Transport Digitization Project (1996). [Still up and running here]

At the time of the Toyota Project, digitization was still highly specialised, and the Bodleian Libraries opted to outsource the digital part to Laser Bureau London. Laser Bureau ‘digitilised’ 35mm image negatives supplied by the Bodleian Libraries’ imaging studio and sent the files over on a big bundle of CDs – 1,244 images in all, which was a massive achievement at the time. It is staggering to think that we could now produce the same many times over in just a day!

Since the Toyota project’s completion twenty years ago, the Bodleian Libraries have continued large-scale digitization activities in-house via their commercial digitization studio, outsourced to third-party suppliers, and in project partnerships. With generous funding from the Polonsky Foundation, the Bodleian Libraries are now set to add over half a million image surrogates of Special Collections manuscripts to their image portal – Digital.Bodleian.

What happens to 20 years’ worth of digitized material? Since 1996 both Bodleian Libraries and digitization standards have changed massively. Early challenges around storage alone have meant that content inevitably has been squirreled away in odd locations and created to the varied standards of the time. Profiling our old digitized collections is the first step to figuring out how these can be brought into line with current practice and be made more visible to library users.

“So what is the extent of your content?”, librarians from other organisations have asked me several times over the past few months. In the hope that it will be useful for other organisations trying to profile their legacy digitized collections, I thought I would present some figures here on the DPOC blog.

When tallying up our survey data, I came to a total of approximately 134 million master images, primarily in TIFF and JP2 format. In very early digitization projects, however, the idea of ‘master files’ had not yet developed, and in these cases master and access files will often be one and the same.

The largest proportion of content, some 127,000,000 compressed JP2s, was created as part of the Google Books project up to 2009 and is available via Search Oxford Libraries Online. These add up to 45 TB of data. The library further holds three archives totalling 5.8 million images (99.4 TB) of digitized content in TIFF format, primarily created by the Bodleian Libraries’ in-house digitization studio. These figures do not include back-ups – with which we start getting into quite big numbers.
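
Put another way, the compressed Google Books JP2s average roughly 0.35 MB per image, while the in-house TIFF masters average around 17 MB per image – which is why the far smaller TIFF archives dominate the storage footprint.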

Of the remaining 7 million digitized images which are not from the Google Books project, 2,395,000 are currently made available on a Bodleian Libraries website. In total the survey examined content from 40 website applications and 24 exhibition pages. At the time of the survey, 44% of the images made available online were hosted on Digital.Bodleian, 4% on ODL Greenstone and 1% on Luna; the latter two are currently in the process of being moved onto Digital.Bodleian. At least 6% of content from the sample was duplicated across multiple website applications and is a candidate for deduplication. Another interesting fact from the survey is that JPEG, JP2 (transformed to JPEG on delivery) and GIF are by far the most common access/derivative formats on Bodleian Libraries’ website applications.

The final digitized image survey report has now been reviewed by the Digital Preservation Coalition and is being looked at internally. Stay tuned to hear more in future blog posts!

Validating half a million TIFF files. Part One.

Oxford Technical Fellow, James, reports on the validation work he is doing with JHOVE and DPF Manager in Part One of this blog series on validation tools for auditing the Polonsky Digitization Project’s TIFF files.


In 2013, the Bodleian Libraries of the University of Oxford and the Biblioteca Apostolica Vaticana (Vatican Library) joined efforts in a landmark digitization project. The aim was to open up their repositories of ancient texts, including Hebrew manuscripts, Greek manuscripts, and incunabula (15th-century printed books), and to digitize over one and a half million pages. All of this was made possible by funding from the Polonsky Foundation.

As part of our own Polonsky funded project, we have been preparing the ground to validate over half a million TIFF files which have been created from digitization work here at Oxford.

Many in the digital preservation field have already written articles and blog posts on the tools available for validating TIFF files; Yvonne Tunnat (from the ZBW Leibniz Information Centre for Economics) wrote a blog post for the Open Preservation Foundation regarding the tools. I also had the pleasure of hearing Yvonne and Michelle Lindlar (from the TIB Leibniz Information Centre for Science and Technology) discuss JHOVE in more detail at the IDCC 2017 conference in their talk, How Valid Is Your Validation? A Closer Look Behind The Curtain Of JHOVE.

The go-to validator for TIFF files?

Preparation for validation

In order to validate the master TIFF files, we first needed to retrieve them from our tape storage system; fortunately, around two-thirds of the images had already been restored to spinning disk storage as part of another internal project. When the master TIFF files were written to tape, MD5 hashes of the files were stored alongside them, so as part of this validation work we will also confirm the fixity of all the files. Our network storage system had plenty of room to accommodate all the required files, so we began auditing what still needed to be recovered.
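
For the fixity check itself, something as simple as the sketch below should do the job; the manifest name is a hypothetical stand-in for the MD5 listings recorded when the files went to tape.

    # Verify each restored TIFF against its recorded MD5 hash.
    # md5sum -c reads "hash  path" pairs and prints OK or FAILED per file;
    # we keep only the failures for follow-up.
    md5sum -c polonsky_tiff_manifest.md5 | grep -v ': OK$' > fixity_failures.txt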

Whilst the auditing and retrieval were progressing, I set about validating a sample set of master TIFF files using both JHOVE and DPF Manager, to get an estimate of the time it would take to process the approximately 50 TB of files. I was also interested to compare the results of both tools when faced with invalid or corrupted sample sets of files.

We set up a new virtual machine server in order to carry out the validation workload; this allowed us to scale the machine’s performance as required. Both validation tools were going to be run on a Red Hat Linux environment, and both would be run from the command line.

It quickly became clear that JHOVE was going to be able to validate the TIFF files a lot quicker than DPF Manager. If DPF Manager is part of one of your workflows, you may not have noticed any real time penalty when processing small numbers of files; however, with a large batch, the difference between the two tools was noticeable.

Potential alternative for TIFF validation?

During the testing I noticed several issues with DPF Manager, including not being able to specify the number of threads the process could use, which I suspect resulted in the poor initial performance. I dutifully reported the bug to the DPF community GitHub and was pleased to see an almost instant response stating that it would be resolved in the next monthly release. I do love open source projects, and I think this highlights the importance of those using the tools taking responsibility for improving them. Without community engagement, these projects are liable to run out of steam and slowly die.

I’m going to reserve judgement on the tools until the next release of DPF Manager. We will then also be in a position to report back on our findings from this validation case study. So check back with our blog for Part Two.

I would be interested to hear from anyone else who has been faced with validating large batches of files: what tools are you using? What challenges have you faced? Do let me know!

Visit to the National Archives: herons and brutalism

An update from Edith Halvarsson about the DPOC team’s trip to visit the National Archives last week. Prepare yourself for a discussion about digital preservation, PRONOM, dark archives, and wildlife!


Last Thursday DPOC visited the National Archives in London. David Clipsham kindly put much time into organising a day of presentations with the TNA’s developers, digitization experts and digital archivists. Thank you Diana, David & David, Ron, Ian & Ian, Anna and Alex for all your time and interesting thoughts!

After some confusion, we finally arrived at the picturesque Kew Gardens station. The area around Kew is very sleepy, and our first thought on arrival was “is this really the right place?” However, after a bit more circling around Kew, you definitely cannot miss it. The TNA is located in an imposing brutalist building, surrounded by beautiful nature and ponds built as flood protection for the nation’s collections. They even have a tame heron!

After we all made it on site, the day kicked off with an introduction from Diana Newton (Head of Digital Preservation). Diana told us enthusiastically about the history of the TNA and its Digital Records Infrastructure. It was really interesting to hear how much has changed in just six years since DRI was launched – both in terms of file format proliferation and an increase in FOI requests.

We then had a look at TNA’s ingest workflows into Preservica and storage model with Ian Hoyle (Senior Developer) and David Underdown (Senior Digital Archivist). It was particularly interesting to hear about the TNA’s decision to store all master file content on offline tape, in order to bring down the archive’s carbon footprint.

After lunch with Ron Davies (Senior Project Manager), Anna de Sousa and Ian Henderson spoke to us about their work digitizing audiovisual material and 2D images. Much of our discussion focused on standards and formats (particularly around A/V). Alex Green and David Clipsham then finished off the day talking about born-digital archive accession streams and PRONOM/DROID developments. This was the first time we had seen the clever way a file format identifier is created – there is much detective work required on David’s side. David also encouraged us, and anyone else who relies on DROID, to have a go and submit something to PRONOM – he even promised it’s fun! Why not read Jenny Mitcham’s and Andrea Byrne’s articles for some inspiration?

Thanks for a fantastic visit and some brilliant discussions on how digital preservation work and digital collecting are done at the TNA!

The things we find…

Sarah shares some finds from Edith’s Digitized image survey of the Bodleian Libraries’ many digitization projects and initiatives over the years.


We’ve been digitizing our collections for a long time. And that means we have a lot of things, in a lot of places. Part of the Policy & Planning Fellow’s task is to find them, count them, and make sure we’re looking after them. That includes making decisions to combat the obsolescence of the hardware they are stored on, the software they rely on (this includes the websites that have been designed to display them), and the files themselves, so they do not fall victim to bit rot.

At Oxford, Edith has been hard at work searching, counting, emailing, navigating countless servers and tape managers, and writing up the image survey report. But while she has been hard at work, she has been sharing some of her best finds with the team and I thought it was time we share them with you.

Below are some interesting finds from Edith’s image survey work. Some of them are real gems:

What? A large and apparently hungry dragon from Oracula, folio 021v (Shelfmark: Barocci 170). Found? On the ODL (Oxford Digital Library) site here.

What? Toby the Sapient Pig. Found? On the Bodleian Treasures website. Currently on display in the Treasures gallery at the Weston library and open to the public. The digital version is available 24/7.

What? A very popular and beautiful early manuscript: an illustrated guide to Oxford University and its colleges, prepared for Queen Elizabeth I in 1566. This page is of the Bodleian Libraries’ Divinity School. Found? On the ODL (Oxford Digital Library) site here.

What? Corbyn in the early years (POSTER 1987-23). Found? Part of the CPA Poster Collection here.

What? And this brilliant general election poster (POSTER 1963-04). Found? Part of the CPA Poster Collection here.

What? Cosmographia, 1482, a map of the known World (Auct. P 1.4). Found? In Medieval and Renaissance Manuscripts here.

What? Gospels, folio 28v (Auct. D. 2.16). Found? Medieval and Renaissance Manuscripts here.

These are just a few of the wonderful and weird finds in our rich and diverse collections. One thing is certain: digitized collections provide hours of discovery to anyone with a computer and Internet access. It is one of the most exciting things about digitization – access for almost anyone, anywhere.

Of course, providing access means preserving the digital images. Knowing what we have and where we have it is one step towards ensuring that they will be preserved for future access and discovery of the beautiful, the weird, and the wonderful.