An approach to selecting case studies

Cambridge Policy & Planning Fellow, Somaya, writes about a case study approach developed by the Cambridge DPOC Fellows for CUL. Somaya’s first blog post about the case studies looks at the selection methodology the Cambridge DPOC fellows used to choose their final case studies.

Physical format digital carriers. Photo: Somaya Langley

Background & approach

Cambridge University Library (CUL) has moved to a ‘case study’ approach to the project. The case studies will provide an evidence-based foundation for writing a policy and strategy, developing a training programme and writing technical requirements within the time constraints of the project. The case studies we choose for the DPOC project will enable us to test the hands-on, day-to-day tasks necessary for working with digital collection materials at CUL. They also need to be representative of our existing collections and future acquisitions, our Collection Development Policy Framework and Strategic Plan, and our current and future audiences, while taking into account the ‘preservation risk’ of the materials.

Classes of material

Based on the digital collections surveying work I’ve been doing, our digital collections fall into seven different ‘classes’:

  1. Unpublished born-digital materials – personal and corporate papers, digital archives of significant individuals or institutions
  2. Born-digital university archives – selected records of the University of Cambridge
  3. Research outputs – research data and publications (including compliance)
  4. Published born-digital materials – physical format carriers (optical media), eBooks, web archives, archival and access copies of electronic subscription services, etc.
  5. Digitised image materials – 2D photography (and 3D imaging)
  6. Digital (and analogue) audiovisual materials – moving image (film and video) and sound recordings
  7. In-house created content – photography and videography of events, lectures, photos of conservation treatments, etc.

Proposed case studies

Approximately 40 potential case studies suggested by CUL and Affiliated Library staff were considered. These proposed case studies were selected from digital materials in our existing collections, current acquisition offers, and requests for assistance with digital collection materials, from across Cambridge University. Each proposed case study would allow us to trial different tools (and digital preservation systems), approaches, workflow stages, and represent different ‘classes’ of material.

Digital lifecycle stages

The selected stages are based on a draft Digital Stewardship End-to-End Workflow I am developing. The workflow includes approximately a dozen different stages. It is based on the Digital Curation Centre’s Curation Lifecycle Model, and is also aligned with the Digital POWRR (Preserving Digital Objects with Restricted Resources) Tool Evaluation Grid.

There are also additional essential concerns, including:

  • data security
  • integration (with CUL systems and processes)
  • preservation risk
  • remove and/or delete
  • reporting
  • resources and resourcing
  • system configuration

Selected stages for Cambridge’s case studies

Dave, Lee and I discussed the stages and cut them down to the bare minimum required to test out various tasks as part of the case studies. These stages include:

  1. Appraise and Select
  2. Acquire / Transfer
  3. Pre-Ingest (including Preconditioning and Quality Assurance)
  4. Ingest (including Generate Submission Information Package)
  5. Preservation Actions (sub-component of Preserve)
  6. Access and Delivery
  7. Integration (with Library systems and processes) and Reporting

Case study selection

To produce a shortlist, I needed to work out which parameter was best suited to ranking the proposed case studies from a digital preservation perspective. The initial parameter we decided on was complexity: did the proposed case study provide enough technical challenges to fully test out what we needed to research?

We also took into account a Streams Matrix (still in development) that outlines the different tasks undertaken at each of the selected digital lifecycle stages. This would ensure different variations of activities were factored in at each stage.

Once the case studies were in ranked order, we revisited and reviewed them, taking into account additional parameters. The additional parameters included:

  • Frequency and/or volume – how much of this type of material do we have/are we likely to acquire (i.e. is this a type of task that would need to be carried out often)?
  • Significance – how significant is the collection in question?
  • Urgency – does this case study fit within strategic priorities such as the current Cambridge University Library Strategic Plan and Collection Development Policy Framework etc.?
  • Uniqueness – is the case study unique and would it be of interest to our users (e.g. the digital preservation field, Cambridge University researchers)?
  • Value to our users and/or stakeholders – is this of value to our current and future users, researchers and/or stakeholders?

This produced a shortlist of eight case studies. We concluded that each presented different long-term digital preservation issues and was experiencing a considerable degree of ‘preservation risk’.
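For anyone wanting to try a similar two-pass ranking, here is a minimal sketch. It is not the scoring we actually used: the 1 to 5 scales, the proposal names and the idea of simply summing the additional parameters are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical parameter names, mirroring the bullet list above.
SECONDARY = ("frequency", "significance", "urgency", "uniqueness", "value")


@dataclass
class CaseStudy:
    name: str
    complexity: int  # assumed scale: 1 (simple) to 5 (very complex)
    scores: dict = field(default_factory=dict)  # assumed 1-5 per parameter

    def secondary_total(self) -> int:
        # Parameters that were not scored count as 0.
        return sum(self.scores.get(p, 0) for p in SECONDARY)


def shortlist(proposals, size):
    """Rank by complexity first, then by the additional parameters."""
    ranked = sorted(
        proposals,
        key=lambda c: (c.complexity, c.secondary_total()),
        reverse=True,
    )
    return ranked[:size]


# Made-up proposals, purely for illustration.
proposals = [
    CaseStudy("Proposal A", 4, {"frequency": 5, "urgency": 4}),
    CaseStudy("Proposal B", 2, {"frequency": 5, "value": 3}),
    CaseStudy("Proposal C", 4, {"significance": 5, "uniqueness": 5}),
]
top_two = [c.name for c in shortlist(proposals, 2)]
print(top_two)
```

Sorting on a tuple keeps complexity as the primary parameter, so the additional parameters only separate proposals that are equally complex. In practice the second pass was a discussion rather than a sum, so treat the numbers as a prompt for conversation, not a verdict.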


This was a challenging and time-consuming approach; however, it ensured fairness in the selection process. The case studies will give us tangible evidence on which to ground the work of the rest of the project. The Cambridge University Library Polonsky Digital Preservation Project Board have agreed that we will undertake three case studies: a digitisation case study, a born-digital case study and one more, the details of which are still being discussed. Stay tuned for more updates.

Customizable JHOVE TIFF output handler anyone?

Technical Fellow, James, talks about the challenges of putting JHOVE’s full XML output into a reporting tool and how he found a workaround. We would love feedback about how you use JHOVE’s TIFF output. What workarounds have you tried to extract the data for use in reporting tools, and what do you think about having a customizable TIFF output handler for JHOVE?

As mentioned in my last blog post, I’ve been looking to validate a reasonably large collection of TIFF master image files from a digitization project. On a side note from that, I would like to talk about the output from JHOVE’s TIFF module.

The JHOVE TIFF module allows you to specify an output handler as either Text, an XML audit, or a full XML output format.
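As a rough illustration of how a handler is chosen, the sketch below builds a JHOVE command line in Python without running it. The `-m`, `-h` and `-o` options and the `TIFF-hul` module name are standard JHOVE command-line usage, but the handler names are written as they commonly appear in jhove.conf and the file names are hypothetical, so check both against your own installation.

```python
import shlex

# Handler names as commonly configured in jhove.conf. Treat these as
# assumptions and confirm them against your own JHOVE installation.
HANDLERS = {"text": "TEXT", "audit": "Audit", "xml": "XML"}


def jhove_command(tiff_path, handler="xml", output_path=None):
    """Build (but do not run) a JHOVE command line for one TIFF file."""
    if handler not in HANDLERS:
        raise ValueError(f"unknown handler: {handler!r}")
    cmd = ["jhove", "-m", "TIFF-hul", "-h", HANDLERS[handler]]
    if output_path is not None:
        cmd += ["-o", output_path]  # write the report to a file
    cmd.append(tiff_path)
    return cmd


# Hypothetical file names, for illustration only.
cmd = jhove_command("master_0001.tif", handler="xml",
                    output_path="master_0001.xml")
print(shlex.join(cmd))
```

Returning the command as a list means it can be passed straight to `subprocess.run` once you are happy with it, without any shell quoting problems.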

Text provides a straightforward line-by-line breakdown of the various characteristics and properties of each TIFF processed. But because it is not a structured document, processing the output is not ideal when many files are characterized.

The XML audit output provides a very minimal XML document which will simply report if the TIFF files were valid and well formed or not; this is great for a quick check, but lacks some of the other metadata properties that I was looking for.

The full XML output provides the same information that was provided in text output format, but with the advantage of being a structured document. However, I’ve found some of the additional metadata structuring in the full XML rather cumbersome to process with further reporting tools.

As a result, I’ve been struggling a bit to extract all of the properties I would like from the full XML output into a reporting tool. I then started to wonder about having a more customizable output handler which would simply report the properties I required in a neat and easier-to-parse XML format.

I had looked at using an XSLT transformation on the XML output but, as mentioned, I found it rather complicated to extract some of the metadata property values I wanted due to the excessive nesting of these and the property naming structure. I think I need to brush up on my XSLT skills perhaps?

In the short term, I’ve converted the XML output to a CSV file, using a little freeware program called XML2CSV from A7Soft. Using the tool, I selected the various fields required (filename, last modified date, size, compression scheme, status, TIFF version, image width & height, etc) for my reporting. Then, the conversion program extracted the selected values, which provided a far simpler and smaller document to process in the reporting tool.
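For those who would rather script the conversion, here is a minimal Python sketch of the same idea: pick a handful of fields out of the nested XML and flatten them into CSV. The sample document is a simplified, hand-written stand-in for JHOVE's full XML output, and the element names and nesting in it are assumptions; real output is deeper and varies between JHOVE versions, so the field list will need adjusting.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for JHOVE's full XML output. Element
# names and nesting are assumptions, not a faithful copy of real output.
SAMPLE = """<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove">
  <repInfo uri="master_0001.tif">
    <lastModified>2017-03-01T10:15:00+00:00</lastModified>
    <size>52428800</size>
    <version>6.0</version>
    <status>Well-Formed and valid</status>
    <properties>
      <property>
        <name>NisoImageMetadata</name>
        <values>
          <value>
            <mix:mix xmlns:mix="http://www.loc.gov/mix/v20">
              <mix:compressionScheme>Uncompressed</mix:compressionScheme>
              <mix:imageWidth>4000</mix:imageWidth>
              <mix:imageHeight>6000</mix:imageHeight>
            </mix:mix>
          </value>
        </values>
      </property>
    </properties>
  </repInfo>
</jhove>"""

FIELDS = ["lastModified", "size", "status", "version",
          "compressionScheme", "imageWidth", "imageHeight"]


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def rows_from_jhove(xml_text):
    """Yield one flat dict per <repInfo>, whatever the nesting depth."""
    root = ET.fromstring(xml_text)
    for rep in root.iter():
        if local_name(rep.tag) != "repInfo":
            continue
        row = {"filename": rep.get("uri", "")}
        for el in rep.iter():
            name = local_name(el.tag)
            # Keep only the first occurrence of each wanted field.
            if name in FIELDS and name not in row:
                row[name] = (el.text or "").strip()
        yield row


def to_csv(xml_text):
    """Return the selected fields as CSV text, one line per file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["filename"] + FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows_from_jhove(xml_text))
    return buf.getvalue()


csv_text = to_csv(SAMPLE)
print(csv_text)
```

Matching on local element names sidesteps both the namespaces and the deep nesting, at the cost of only ever taking the first match per file. That is fine for single-image TIFFs but would need more care for multi-page files.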

I would be interested to know what others have done when confronted with the XML output and wonder if there is any mileage in a more customizable output handler for the TIFF module…


Update 31st May 2017

Thanks to Ross Spencer, Martin Hoppenheit and others from Twitter, I’ve now created a basic JHOVE XML to CSV XSLT stylesheet. A draft version is on my GitHub should anyone want to do something similar.

Skills interviewing using the DPOC skills interview toolkit

Cambridge Outreach & Training Fellow, Lee, shares his experiences in skills auditing.

As I am nearing the end of my fourteenth transcription and am three months into the skills interview process, now is a good time to pause and reflect. This post will look at the experience of the interview process using the DPOC digital preservation skills toolkit. This toolkit is currently under development; we are learning and improving it as we trial it at Cambridge and Oxford.

Step 1: Identify your potential participants

To understand colleagues’ use of technology and training needs, a series of interviews was arranged. We agreed that a maximum sample of 25 participants would give us plenty (perhaps too much?) of material to work with. Before invitations were sent out, a list of potential participants was drawn up. In building the list, a set of criteria ensured that a broad range of colleagues was captured. These criteria consisted of:

  • in what department or library do they work?
  • is there a particular bias of colleagues from a certain department or library and can this be redressed?
  • what do they do?
  • is there a suitable practitioner to manager ratio?

The criteria rely on you having a good grasp of your institution, its organisation and the people within it. If you are unsure, start asking managers and colleagues who do know your institution very well; you will learn a lot! It is also worth having a longer list than your intended maximum in case you do not get responses, or people are not available or do not wish to participate.

Step 2: Inviting your potential participants

Prior to sending out invitations, the intended participants’ managers were consulted to see if they would agree to their staff time being used in this way. This was also a good opportunity to continue raising awareness of the project as well as getting buy-in to the interview process.

The interviews were arranged in blocks of five to make planning around other work easier.

Step 3: Interviewing

The DPOC semi-structured skills interview questions were put to the test at this step. Having developed the questions beforehand ensured I covered the necessary digital preservation skills during the interview.

Here are some tips I gained from the interview process which helped to get some great responses.

  • Offer refreshments before the interview. Advise beforehand that a generous box of chocolate biscuits will be available throughout proceedings. This also gives you an excellent chance to talk informally to your subject and put them at ease, especially if they appear nervous.
  • If you are using recording equipment, make sure it is working. There’s nothing worse than thinking you have fifty minutes of interview gold only to find that you’ve not pressed record or the device has run out of power. Take a second device, or if you don’t want the technological hassle, use pen(cil) and paper.
  • Start with colleagues that you know quite well. This will help you understand the flow of the questions better and they will not shy away from honest feedback.
  • Always have printed copies of interview questions. Technology almost always fails you.

My next post will be about transcribing and analysing interviews.

Over 20 years of digitization at the Bodleian Libraries

Policy and Planning Fellow Edith writes an update on some of her findings from the DPOC project’s survey of digitized images at the Bodleian Libraries.

Between August and December 2016 I collated information about Bodleian Libraries’ digitized collections. As an early adopter of digitization technology, Bodleian Libraries have made digital surrogates of their collections available online since the early 1990s. A particular favourite of mine, and a landmark among the Bodleian Libraries’ early digital projects, is the Toyota Transport Digitization Project (1996). [Still up and running here]

At the time of the Toyota Project, digitization was still highly specialised and the Bodleian Libraries opted to outsource the digital part to Laser Bureau London. Laser Bureau ‘digitilised’ 35mm image negatives supplied by Bodleian Libraries’ imaging studio and sent the files over on a big bundle of CDs. 1244 images all in all – which was a massive achievement at the time. It is staggering to think that we could now produce the same many times over in just a day!

Since the Toyota project’s completion twenty years ago, Bodleian Libraries have continued large-scale digitization activities in-house via their commercial digitization studio, through outsourcing to third-party suppliers, and in project partnerships. With generous funding from the Polonsky Foundation the Bodleian Libraries are now set to add over half a million image surrogates of Special Collection manuscripts to their image portal – Digital.Bodleian.

What happens to 20 years’ worth of digitized material? Since 1996 both Bodleian Libraries and digitization standards have changed massively. Early challenges around storage alone have meant that content inevitably has been squirreled away in odd locations and created to the varied standards of the time. Profiling our old digitized collections is the first step to figuring out how these can be brought into line with current practice and be made more visible to library users.

“So what is the extent of your content?”, librarians from other organisations have asked me several times over the past few months. In the hope that it will be useful for other organisations trying to profile their legacy digitized collections, I thought I would present some figures here on the DPOC blog.

When tallying up our survey data, I came to a total of approximately 134 million master images in primarily TIFF and JP2 format. For very early digitization projects, however, the idea of ‘master files’ was not yet developed, and master and access files will, in these cases, often be one and the same.

The largest proportion of content, some 127,000,000 compressed JP2s, was created as part of the Google Books project up to 2009 and is available via Search Oxford Libraries Online. These add up to 45 TB of data. The library further holds three archives of digitized image content, 5.8 million images (99.4 TB) in total, primarily created in TIFF by the Bodleian Libraries’ in-house digitization studio. These figures do not include back-ups – with which we start getting into quite big numbers.

Of the remaining 7 million digitized images which are not from the Google Books project, 2,395,000 are currently made available on a Bodleian Libraries website. In total the survey examined content from 40 website applications and 24 exhibition pages. 44% of the images which are made available online were, at the time of the survey, hosted on Digital.Bodleian, 4% on ODL Greenstone and 1% on Luna. The latter two are currently in the process of being moved onto Digital.Bodleian. At least 6% of the content from the sample was duplicated across multiple website applications and is a candidate for deduplication. Another interesting fact from the survey is that JPEG, JP2 (transformed to JPEG on delivery) and GIF are by far the most common access/derivative formats on Bodleian Libraries’ website applications.
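If you are profiling your own collections, it can help to derive a few ratios from the headline figures. The short sketch below does this with the numbers quoted in this post. Several of them are rounded, and the post does not say whether terabytes are decimal or binary (decimal is assumed here), so every derived value is only an estimate.

```python
# Figures quoted in this post. The rounded ones ("some", "approximately")
# make every derived value an estimate.
google_jp2_images = 127_000_000
google_jp2_tb = 45
other_images = 7_000_000
archive_tiff_images = 5_800_000
archive_tiff_tb = 99.4
online_images = 2_395_000

TB = 10**12  # assumption: decimal terabytes; the post does not say
MB = 10**6

total_masters = google_jp2_images + other_images
online_share = online_images / other_images
avg_jp2_mb = google_jp2_tb * TB / google_jp2_images / MB
avg_tiff_mb = archive_tiff_tb * TB / archive_tiff_images / MB

print(f"total master images: {total_masters:,}")
print(f"non-Google images online: {online_share:.0%}")
print(f"average JP2 size: {avg_jp2_mb:.2f} MB")
print(f"average TIFF size: {avg_tiff_mb:.1f} MB")
```

The gap between the two average file sizes is a reminder that image counts alone say little about storage needs.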

The final digitized image survey report has now been reviewed by the Digital Preservation Coalition and is being looked at internally. Stay tuned to hear more in future blog posts!

Validating half a million TIFF files. Part One.

Oxford Technical Fellow, James, reports on the validation work he is doing with JHOVE and DPF Manager in Part One of this blog series on validation tools for auditing the Polonsky Digitization Project’s TIFF files.

In 2013, the Bodleian Libraries of the University of Oxford and the Biblioteca Apostolica Vaticana (Vatican Library) joined efforts in a landmark digitization project. The aim was to open up their repositories of ancient texts including Hebrew manuscripts, Greek manuscripts, and incunabula, or 15th-century printed books. The goal was to digitize over one and a half million pages. All of this was made possible by funding from the Polonsky Foundation.

As part of our own Polonsky funded project, we have been preparing the ground to validate over half a million TIFF files which have been created from digitization work here at Oxford.

Many in the digital preservation field have already written articles and blog posts on the tools available for validating TIFF files; Yvonne Tunnat (from ZBW Leibniz Information Centre for Economics), for example, wrote a blog post for the Open Preservation Foundation on the subject. I also had the pleasure of hearing Yvonne and Michelle Lindlar (from TIB Leibniz Information Centre for Science and Technology) discuss JHOVE in more detail at the IDCC 2017 conference in their talk, How Valid Is Your Validation? A Closer Look Behind The Curtain Of JHOVE.

The go-to validator for TIFF files?

Preparation for validation

In order to validate the master TIFF files, we first needed to retrieve them from our tape storage system; fortunately around two-thirds of the images had already been restored to spinning disk storage as part of another internal project. MD5 hashes of the master TIFF files were recorded when they were written to tape, so as part of this validation work we will confirm the fixity of all the files. Our network storage system had plenty of room to accommodate all the required files, so we began auditing what still needed to be recovered.
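A fixity check of this kind can be sketched in a few lines of Python. This is an illustration rather than the script we are running: the manifest layout (file name mapped to a recorded MD5 hex digest) is hypothetical, and the demonstration uses throwaway files instead of real masters.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large TIFFs never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest, root):
    """Compare recorded hashes with freshly computed ones.

    manifest maps a relative file name to its recorded MD5 hex digest
    (a hypothetical layout). Returns file name -> 'ok', 'mismatch'
    or 'missing'.
    """
    results = {}
    for name, recorded in manifest.items():
        path = Path(root) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == recorded.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


# Tiny self-contained demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.tif").write_bytes(b"intact")
    (Path(tmp) / "b.tif").write_bytes(b"changed since hashing")
    manifest = {
        "a.tif": hashlib.md5(b"intact").hexdigest(),
        "b.tif": hashlib.md5(b"original").hexdigest(),
        "c.tif": hashlib.md5(b"never restored").hexdigest(),
    }
    report = check_fixity(manifest, tmp)

print(report)
```

Reporting missing files separately from mismatches is useful here, because it doubles as an audit of what still needs to be recovered from tape.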

Whilst the auditing and retrieval were progressing, I set about validating a sample set of master TIFF files using both JHOVE and DPF Manager to get an estimate of the time it would take to process the approximately 50 TB of files. I was also interested to compare the results of both tools when faced with invalid or corrupted sample sets of files.
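The estimate itself is a simple linear extrapolation from a timed sample, which can be sketched as follows. The sample size and timing in the example are hypothetical placeholders, not our measurements, and the function assumes throughput stays constant and scales perfectly with the number of workers.

```python
def estimate_hours(sample_bytes, sample_seconds, total_bytes, workers=1):
    """Linear extrapolation from a timed sample to the whole collection.

    Assumes throughput stays constant and scales perfectly with workers,
    which real storage and CPU limits will not honour exactly.
    """
    if sample_seconds <= 0 or sample_bytes <= 0 or workers < 1:
        raise ValueError("sample size, time and workers must be positive")
    bytes_per_second = sample_bytes / sample_seconds
    return total_bytes / (bytes_per_second * workers) / 3600


TB = 10**12
GB = 10**9

# Hypothetical timing: 100 GB of sample TIFFs validated in 30 minutes.
hours = estimate_hours(100 * GB, 30 * 60, 50 * TB)
print(f"estimated {hours:.0f} hours for 50 TB on one worker")
```

Running the same calculation with timings from each tool gives a quick like-for-like comparison before committing to a full run.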

We set up a new virtual machine server in order to carry out the validation workload; this allowed us to scale this machine’s performance as required. Both validation tools were going to be run in a Red Hat Linux environment and both would be run from the command line.

It quickly became clear that JHOVE was going to be able to validate the TIFF files a lot quicker than DPF Manager. If DPF Manager is being used as part of one of your workflows, you may not have noticed any real time penalty when processing small numbers of files; however, with a large batch, the time difference between the two tools was noticeable.

Potential alternative for TIFF validation?

During the testing I noticed there were several issues with DPF Manager, including the inability to specify the number of threads the process could use, which I suspect resulted in the poor initial performance. I dutifully reported the bug to the DPF community GitHub and was pleased to see an almost instant response stating that it would be resolved in the next monthly release. I do love Open Source projects, and I think this highlights the importance of those using the tools being responsible for improving them. Without community engagement, these projects are liable to run out of steam and slowly die.

I’m going to reserve judgement on the tools until the next release of DPF Manager. We will then also be in a position to report back on our findings from this validation case study. So check back with our blog for Part Two.

I would be interested to hear from anyone else who might have been faced with validating large batches of files. What tools are you using? What challenges have you faced? Do let me know!