Guest post: The 6-million-image gap

Bodleian Digital Library Systems and Services’ Digital Curator, Emma Stanford, guest blogs for the DPOC project this week. Emma writes about what she is doing to close some of the 6-million-image gap between what’s in our tape archive and what’s available online at Digital.Bodleian. It’s no small task, but sometimes Emma finds some real gems just waiting to be made available to researchers. She also raises some good questions about what metadata we should make available to researchers to interpret our digitized image. Read more from Emma below.

Thanks to Edith’s hard work, we now know that the Bodleian Imaging Services image archive contains about 5.8 million unique images. This is in addition to various images held on hard drives and other locations around the Bodleian, which bring the total up to almost 7 million. Digital.Bodleian, however, our flagship digital image platform, contains only about 710,000 unique images–a mere tenth of our total image archive. What gives?

That 6-million-image gap consists of two main categories:

Images that are online elsewhere (aka the migration backlog). In the decades before Digital.Bodleian, we tried a number of other image delivery platforms that remain with us today: Early Manuscripts at Oxford University, the Toyota City Imaging Project, the Oxford Digital Library, Luna, etc., etc. Edith has estimated that the non-Digital.Bodleian content comprises about 1.4 million images. Some of these images don’t belong in Digital.Bodleian, either because we don’t have rights to the images (for example, Queen Victoria’s Journals) or because they are incomplete selections rather than full image sets (for example, the images in the Bodleian Treasures exhibition). Our goal is to migrate all the content we can to Digital.Bodleian and eventually shut down most of the old sites. We’ve been chipping away at this task very slowly, but there is a lot left to do.

Images that have never been online. Much of Imaging Services’ work is commercial orders: shooting images for researchers, publishers, journalists, etc. We currently store all these images on tape, and we have a database that records the shelfmark, number of images, and list of captured pages, along with information about when and how the images were captured. Searching through this archive for Digital.Bodleian-appropriate images is a difficult task, though. Shelfmark notation isn’t standardized at all, so there are lots of duplicate records. Also, in many cases, just a few pages from a book or manuscript were captured, or the images were captured in black-and-white or grayscale; either way, not suitable for Digital.Bodleian, where we aim to publish fully-digitized works in full colour.

I’m working on extracting a list of complete, full-colour image sets from this database. In the meantime, we’ve started approaching the problem from the other direction: creating a list of items that we’d like to have on Digital.Bodleian, and then searching the archive for images of them. To do this, we asked the Bodleian’s manuscript and rare book curators to share with us their lists of “greatest hits”: the Bodleian’s most valuable, interesting, and/or fragile holdings, which would benefit most from online surrogates. We then began going through this list searching for the shelfmarks in the image archive. Mostly, we’ve found only a few images for each shelfmark, but occasionally we hit the jackpot: a complete, full-colour image set of a 13th-century bestiary or a first edition of a Shakespeare play.

Going through the archives in this way has underlined for me just how much the Bodleian’s imaging standards have changed in the last two decades. File size has increased, of course, as higher-resolution digital scanning backs have become available; but changes in lighting equipment, book cradles, processing software, rulers and colour charts have all made their mark on our images too. For me, this has raised the question of whether the technical metadata we’re preserving in our archives, about when and how the images were captured, should also be made available to researchers in some way, so that they can make an informed choice about how to interpret the images they encounter on sites like Digital.Bodleian.

In the meantime, here are some of the image sets we’ve pulled out of the archive and digitized so far:

Jane Austen’s juvenilia
a 13th-century bestiary
the Oxford Catullus

MS. Bodl. 764, fol. 2r (detail)

MS. Bodl. 764, fol. 2r (detail)

Validating half a million TIFF files. Part Two.

Back in May, I wrote a blog post about preparing the groundwork for the process of validating over 500,000 TIFF files which were created as part of a Polonsky Digitization Project which started in 2013. You can read Part One here on the blog.

Restoring the TIFF files from tape

Stack of backup tapes. Photo: Amazon

For the digitization workflow we used Goobi and within that process, the master TIFF files from the project were written to tape. In order to actually check these files, it was obvious we would need to restore all the content to spinning disk. I duly made a request to our system administration team and waited.

As I mentioned in Part One, we had setup a new virtualised server which had access to a chunk of network storage. The Polonsky TIFF files were restored to this network storage, however midway through the restoration from tape, the tape server’s operating system crashed…disaster.

After reviewing the failure, it appeared there was a bug within the RedHat operating system which had caused the problem. This issue proved to be a good lesson, a tape backup copy is only useful if you can actually restore it!

Question for you. When was the last time you tried to restore a large quantity of data from tape?

After some head scratching, patching and a review of the related systems, a second attempt at restoring all the TIFF content from tape commenced and this time all went well and the files were restored to the network storage. Hurrah!

JHOVE to validate those TIFFs

I decided that for the initial validation of the TIFF files, checking the files were well-formed and valid, JHOVE would provide a good baseline report.

As I mentioned in another blog post Customizable JHOVE TIFF output handler anyone? JHOVE’s XML output is rather unwieldy and so I planned to transform the XML using xsltproc (a command line xslt processor) with a custom XSLT stylesheet, allowing us to select any of attributes from the file which we might want to report on later, this would then produce a simple CSV output.

On a side note, work on adding a CSV output handler to JHOVE is in progress! This would mean the above process would be much simpler and quicker.

Parallel processing for the win.

What’s better than one JHOVE process validating TIFF content? Two! (well actually for us, sixteen at once works out quite nicely.)

It was clear from some initial testing with a 10,000 sample set of TIFF files that a single JHOVE process was going to take a long time to process 520,000+ images (around two and half days!)

So I started to look for a simple way to run many JHOVE processes in parallel. Using GNU Parallel seemed like a good way to go.

I created a command line BASH script which would take a list of directories to scan and then utilise GNU Parallel to fire off many JHOVE + XSLT processes to result in a CSV output, one line per TIFF file processed.

As our validation server was virtualised, it meant that I could scale the memory and CPU cores in this machine to do some performance testing. Below is a chart showing the number of images that the parallel processing system could handle per minute vs. the number of CPU cores enabled on the virtual server. (For all of the testing the memory in the server remained at 4 GB.)

So with 16 CPU cores, the estimate was that it would take around 6-7 hours to process all the Polonksy TIFF content, so a nice improvement on a single process.

At the start of this week, I ran a full production test, validating all 520,000+ TIFF files. 4 and half hours later the process was complete and 100 MB+ CSV file was generated with 520,000+ rows of data. Success!

For Part Three of this story I will write up how I plan to visualise the CSV data in Qlik Sense and the further analysis of those few files which failed the initial validation.

Over 20 years of digitization at the Bodleian Libraries

Policy and Planning Fellow Edith writes an update on some of her findings from the DPOC project’s survey of digitized images at the Bodleian Libraries.

During August-December 2016 I have been collating information about Bodleian Libraries’ digitized collections. As an early adopter of digitization technology, Bodleian Libraries have made digital surrogates of its collections available online since the early 1990’s. A particular favourite of mine, and a landmark among the Bodleian Libraries’ early digital projects, is the Toyota Transport Digitization Project (1996). [Still up and running here]

At the time of the Toyota Project, digitization was still highly specialised and the Bodleian Libraries opted to outsource the digital part to Laser Bureau London. Laser Bureau ‘digitilised’ 35mm image negatives supplied by Bodleian Libraries’ imaging studio and sent the files over on a big bundle of CDs. 1244 images all in all – which was a massive achievement at the time. It is staggering to think that we could now produce the same many times over in just a day!

Since the Toyota projects completion twenty years ago, Bodleian Libraries have continued large scale digitization activities in-house via its commercial digitization studio, outsourced to third party suppliers, and in project partnerships. With generous funding from the Polonsky Foundation the Bodleian Libraries are now set to add over half a million image surrogates of Special Collection manuscripts to its image portal – Digital.Bodleian.

What happens to 20 years’ worth of digitized material? Since 1996 both Bodleian Libraries and digitization standards have changed massively. Early challenges around storage alone have meant that content inevitably has been squirreled away in odd locations and created to the varied standards of the time. Profiling our old digitized collections is the first step to figuring out how these can be brought into line with current practice and be made more visible to library users.

“So what is the extent of your content?”, librarians from other organisations have asked me several times over the past few months. In the hope that it will be useful for other organisations trying to profile their legacy digitized collections, I thought I would present some figures here on the DPOC blog.

When tallying up our survey data, I came to a total of approximately 134 million master images in primarily TIFF and JP2 format. From very early digitization projects however, the idea of ‘master files’ was not yet developed and master and access files will, in these cases, often be one and the same.

The largest proportion of content, some 127,000,000 compressed JP2s, were created as part of the Google Books project up to 2009 and are available via Search Oxford Libraries Online. These add up to 45 TB of data. The library further holds three archives of 5.8million/99.4TB digitized image content primarily created by the Bodleian Libraries’ in-house digitization studio in TIFF. These figures does not include back-ups – with which we start getting in to quite big numbers.

Of the remaining 7 million digitized images which are not from the Google Books project, 2,395,000 are currently made available on a Bodleian Libraries website. In total the survey examined content from 40 website applications and 24 exhibition pages. 44% of the images which are made available online were, at the time of the survey, hosted on Digital.Bodleian, 4% on ODL Greenstone and 1% on Luna.The latter two are currently in the processes of being moved onto Digital.Bodleian. At least 6% of  content from the sample was duplicated across multiple website applications and are candidates for deduplication. Another interesting fact from the survey is that JPEG, JP2 (transformed to JPEG on delivery) and GIF are by far the most common access/derivative formats on Bodleian Libraries’ website applications.

The final digitized image survey report has now been reviewed by the Digital Preservation Coalition and is being looked at internally. Stay tuned to hear more in future blog posts!

Validating half a million TIFF files. Part One.

Oxford Technical Fellow, James, reports on the validation work he is doing with JHOVE and DPF Manager in Part One of this blog series on validation tools for auditing the Polonsky Digitization Project’s TIFF files.

In 2013, The Bodleian Libraries of the University of Oxford and the Biblioteca Apostolica Vaticana (Vatican Library) joined efforts in a landmark digitization project. The aim was to open up their repositories of ancient texts including Hebrew manuscripts, Greek manuscripts, and incunabula, or 15th-century printed books. The goal was to digitize over one and half million pages. All of this was made possible by funding from the Polonsky Foundation.

As part of our own Polonsky funded project, we have been preparing the ground to validate over half a million TIFF files which have been created from digitization work here at Oxford.

Many in the Digital Preservation field have already written articles and blogs on the tools available for validating TIFF files, Yvonne Tunnat (from ZBW Leibniz Information Centre for Economics) wrote a blog for the Open Preservation Foundation regarding the tools. I also had the pleasure of hearing from Yvonne and Michelle Lindlar (from TIB Leibniz Information Centre for Science and Technology) talk at IDCC 2017 conference on this very subject in more detail when discussing JHOVE in their talk, How Valid Is Your Validation? A Closer Look Behind The Curtain Of JHOVE

The go-to validator for TIFF files?

Preparation for validation

In order to validate the master TIFF files, firstly we needed to retrieve these from our tape storage system; fortunately around two-thirds of the images had already been restored to spinning disk storage as part of another internal project. When the master TIFF files were written to tape this included MD5 hashes of the files, so as part of this validation work we will confirm the fixity of all the files. Our network storage system had plenty of room to accommodate all the required files, so we began auditing what still needed to be recovered.

Whilst the auditing and retrieval was progressing, I set about investigating validating a sample set of master TIFF files using both JHOVE and DPF Manager to get an estimate on the time it would take to process the approximate 50 TB of files. I was also interested to compare the results of both tools when faced with invalid or corrupted sample sets of files.

We setup a new virtual machine server in order to carry out the validation workload; this allowed us to scale this machine’s performance as required. Both validation tools were going to be run on a RedHat Linux environment and both would be run from the command line.

It quickly became clear that JHOVE was going to be able to validate the TIFF files a lot quicker than DPF Manager. If DPF Manager is being used as part of one of your workflows, you may not have noticed any real-time penalty when processing small numbers of files, however with a large batch, the time difference with the two tools was noticeable.

Potential alternative for TIFF validation?

During the testing I noticed there were several issues with DPF Manager, including the lack of being able to specify the number of threads the process could use, which I suspect resulted in the poor initial performance. I dutifully reported the bug to the DPF community GitHub and was pleased to see an almost instant response stating that it would be resolved in the next monthly release. I do love Open Source projects, and I think this highlights the importance of those using the tools being responsible for improving them. Without community engagement, these projects are liable to run out of steam and slowly die.

I’m going to reserve judgement on the tools until the next release of DPF Manager. We will then also be in a position to report back on our findings from this validation case study. So check back with our blog for Part Two.

I would be interested to hear from anyone else who might have been faced with validating large batches of files, what tools are you using? what challenges have you faced? Do let me know!

Visit to the National Archives: herons and brutalism

An update from Edith Halvarsson about the DPOC team’s trip to visit the National Archives last week. Prepare yourself for a discussion about digital preservation, PRONOM, dark archives, and wildlife!

Last Thursday DPOC visited the National Archives in London. David Clipsham kindly put much time into organising a day of presentations with the TNA’s developers, digitization experts and digital archivists. Thank you Diana, David & David, Ron, Ian & Ian, Anna and Alex for all your time and interesting thoughts!

After some confusion, we finally arrived at the picturesque Kew Gardens station. The area around Kew is very sleepy, and our first thought on arrival was “is this really the right place?” However, after a bit more circling around Kew, you definitely cannot miss it. The TNA is located in an imposing brutalist building, surrounded by beautiful nature and ponds built as flood protection for the nation’s collections. They even have a tame heron!

After we all made it on site, the day the kicked off with an introduction from Diana Newton (Head of Digital Preservation). Diana told us enthusiastically about the history of the TNA and its Digital Records Infrastructure. It was really interesting to hear how much has changed in just six years since DRI was launched – both in terms of file format proliferation and an increase in FOI requests.

We then had a look at TNA’s ingest workflows into Preservica and storage model with Ian Hoyle (Senior Developer) and David Underdown (Senior Digital Archivist). It was particularly interesting to hear about the TNA’s decision to store all master file content on offline tape, in order to bring down the archive’s carbon footprint.

After lunch with Ron Davies (Senior Project Manager), Anna de Sousa and Ian Henderson spoke to us about their work digitizing audiovisual material and 2D images. Much of our discussion focused on standards and formats (particularly around A/V). Alex Green and David Clipsham then finished off the day talking about born-digital archive accession streams and PRONOM/DROID developments. This was the first time we had seen the clever way a file format identifier is created – there is much detective work required on David’s side. David also encouraged us and anyone else who relies on DROID to have a go and submit something to PRONOM – he even promised its fun! Why not read Jenny Mitcham’s and Andrea Byrne’s articles for some inspiration?

Thanks for a fantastic visit and some brilliant discussions on how digital preservation work and digital collecting is done at the TNA!

The things we find…

Sarah shares some finds from Edith’s Digitized image survey of the Bodleian Libraries’ many digitization projects and initiatives over the years.

We’ve been digitizing our collections for a long time. And that means we have a lot of things, in a lot of places. Part of the Policy & Planning Fellow’s task is to find them, count them, and make sure we’re looking after them. That includes making decisions to combat the obsolescence of the hardware they are stored on, the software they rely on (this includes the website that has been designed to display them), and the files themselves so they do not become victim to bit rot.

At Oxford, Edith has been hard at work searching, counting, emailing, navigating countless servers and tape managers, and writing up the image survey report. But while she has been hard at work, she has been sharing some of her best finds with the team and I thought it was time we share them with you.

Below are some interesting finds from Edith’s image survey work. Some of them a real gems:

What? a large and apparently hungry dragon from Oracula, folio 021v (Shelfmark: Barocci 170) Found? On the ODL (Oxford Digital Library) site here.

What? Toby the Sapient Pig. Found? On the Bodleian Treasures website. Currently on display in the Treasures gallery at the Weston library and open to the public. The digital version is available 24/7.

What? A very popular and beautiful early manuscript: an illustrated guide to Oxford University and its colleges, prepared for Queen Elizabeth I in 1566. This page is of the Bodleian Libraries’ Divinity School. Found? On the ODL (Oxford Digital Library) site here.

What? Corbyn in the early years (POSTER 1987-23). Found? Part of the CPA Poster Collection here.

What? And this brilliant general election poster (POSTER 1963-04). Found? Part of the CPA Poster Collection here.

What? Cosmographia, 1482, a map of the known World (Auct. P 1.4). Found? In Medieval and Renaissance Manuscripts here.

What? Gospels, folio 28v (Auct. D. 2.16). Found? Medieval and Renaissance Manuscripts here.

There are just a few of the wonderful and weird finds in our rich and diverse collections. One thing is certain, digitized collections provide hours of discovery to anyone with a computer and Internet access. It is one of the most exciting things about digitization–access to almost anyone, anywhere.

Of course providing access means preserving the digital images. Knowing what we have and where we have it, is one step to ensuring that they will be preserved for future access and discovery of the beautiful, the weird, and the wonderful.

Polonsky Fellows visit Western Bank Library at Sheffield University

Overview of DPOC’s visit to the Western Bank Library at Sheffield University by James Mooney, Technical Fellow at Bodleian Libraries, Oxford.

The Polonsky Fellows were invited to the Western Bank Library at Sheffield University to speak with Laura Peaurt and other members of the Library. The aim of the meeting was to discuss the experiences of using and implementing Ex Libris’ Rosetta product.

After arriving by train, it was just a quick tram ride to Western Bank campus at Sheffield University, then we had the fun of using the paternoster lift in the Western Bank Library to arrive at our meeting, it’s great to see this technology has been preserved and still in use.

Paternoster lifts still in use at the Western Library. Image Credit: James Mooney

We met with Laura Peaurt (Digital Preservation Manager), Chris Jones (Library Systems Manager) and Angus Taggart (Library Systems Manager – Research).

Andy Bussey, Head of Digital Services & Systems was kind enough to give us an hour of his time at the start of the meeting, allowing us to discuss parts of the procurement and implementation process.

When working out the requirements for the system, Sheffield was able to collaborate with the White Rose University Consortium (the Universities of Leeds, Sheffield and York) to work out an initial scope.

When reviewing the options both open source and proprietary products were considered. For the Western Library and the University back in 2014, after a skills audit, the open source options had to be ruled out due to a lack of technical and developmental skills to customise or support them. I’m sure if this was revisited today the outcome may well have been different as the team has grown and gained experience and expertise. Many organisations may find it easier to budget for a software package and support contract with a vendor than to pursue the creation of several new employment positions.

With that said, as part of the implementation of Rosetta, Laura’s role was created as there was an obvious need for a Digital Preservation manager, we then went on to discuss the timeframe of the project and then moved onto the configuration of the product with Laura providing a live demonstration of the product whilst talking about the current setup, the scalability of the instances and the granularity of the sections within Rosetta.

During the demonstrations we discussed what content was held in Rosetta, how people had been trained with Rosetta and what feedback they had received so far. We reviewed the associated metadata which had been stored with the items that had been ingested and went over the options regarding integration with a Catalogue and/or Archival Management System.

After lunch we went on discuss the workflows currently being used with further demonstrations so we could see an end-to-end examples including what ingest rules and polices were in place along with what tools were in use and what processes were carried out. We then looked at how problematic items were dealt with in the Technical Analysis Workbench, covering the common issues and how additional steps in the ingest process can minimise certain issues.

As part of reviewing the sections of Rosetta we also inspected of Rosetta’s metadata model, the DNX (Digital Normalised XML) and discussed ingesting born-digital content and associated METS files.

Western Library. Image Credit: A J Buildings Library.

We visited Sheffield with many questions and during the course of the discussions throughout the day many of these were answered but as the day came to a close we had to wrap up the talks and head back to the train station. We all agreed it had been an invaluable meeting and sparked further areas of discussion. Having met face to face and with an understanding of the environment at Sheffield will make future conversations that much easier.

DPOC visits the Wellcome Library in London

A brief summary by Edith Halvarsson, Policy and Planning Fellow at the Bodleian Libraries, of the DPOC project’s recent visit to the Wellcome Library.

Last Friday the Polonsky Fellows had the pleasure of spending a day with Rioghnach Ahern and David Thompson at the Wellcome Library. With a collection of over 28.6 million digitized images, the Wellcome is a great source of knowledge and experience in working with digitisation at a large scale. Themes of the day centred around pragmatic choices, achieving consistency across time and scale, and horizon scanning for emerging trends.

The morning started with an induction from Christy Henshaw, the Wellcome’s Digital Production Manager. We discussed digitisation collection development and Jpeg2000 profiles, but also future directions for the library’s digitised collection. One point which particularly stood out to me, was changes in user requirements around delivery of digitised collections. The Wellcome has found that researchers are increasingly requesting delivery of material for “use as data”. (As a side note: this is something which the Bodleian Libraries have previously explored in their Blockbooks project, which used facial recognition algorithms traditionally associated with security systems, to trace provenance of dispersed manuscripts). As the possibilities for large scale analysis using these types of algorithms multiply, the Wellcome is considering how delivery will need to change to accommodate new scholarly research methods.


Brain teaser: Spot the Odd One Out (or is it a trick question?). Image credit: Somaya Langley

Following Christy’s talk we were given a tour of the digitization studios by Laurie Auchterlonie. Laurie was in the process of digitising recipe books for the Wellcome Library’s Recipe Book Project. He told us about some less appetising recipes from the collection (such as three-headed pig soup, and puppy dishes) and about the practical issues of photographing content in a studio located on top of one of the busiest underground lines in London!

After lunch with David and Rioghnach at the staff café, we spent the rest of the afternoon looking at Goobi plug-ins, Preservica and the Wellcome’s hybrid-cloud storage model. Despite talking digitisation – metadata was a reoccurring topic in several of the presentations. Descriptive metadata is particularly challenging to manage as it tends to be a work in progress – always possible to improve and correct. This creates a tension between curators and cataloguers doing their work, and the inclination to store metadata together with digital objects in preservation systems to avoid orphaning files. Wellcome’s solution has been to articulate their three core cataloguing systems as the canonical bibliographic source, while allowing potentially out of data metadata to travel with objects in both Goobi and Preservica for in-house use only. As long as there is clarity around which is the canonical metadata record, these inconsistencies are not problematic to the library. ‘You would be surprised how many institutions have not made a decision around which their definitive bibliographic records is’, says David.


Presentation on the Wellcome Library’s digitisation infrastructure. Image credit: Somaya Langley

The last hour was spent pondering the future of digital preservation and I found the conversations very inspiring and uplifting. As we work with the long-term in mind, it is invaluable to have these chances to get out of our local context and discuss wider trends with other professionals. Themes included: digital preservation as part of archival masters courses, cloud storage and virtualisation, and the move from repository software to dispersed micro-services.

The fellow’s field trip to the Wellcome is one of a number of visits that DPOC will make during 2017 talk to institutions around the UK about their work around digital preservation. Watch for more updates.