Validating half a million TIFF files. Part Two.

Back in May, I wrote a blog post about laying the groundwork for validating over 500,000 TIFF files created as part of the Polonsky Digitization Project, which started in 2013. You can read Part One here on the blog.

Restoring the TIFF files from tape

Stack of backup tapes. Photo: Amazon

For the digitization workflow we used Goobi, and within that process the master TIFF files from the project were written to tape. To actually check these files, we would clearly need to restore all the content to spinning disk. I duly made a request to our system administration team and waited.

As I mentioned in Part One, we had set up a new virtualised server which had access to a chunk of network storage. The Polonsky TIFF files were restored to this network storage. However, midway through the restoration from tape, the tape server’s operating system crashed. Disaster.

A review of the failure suggested that a bug within the Red Hat operating system had caused the problem. This issue proved to be a good lesson: a tape backup copy is only useful if you can actually restore it!

Question for you. When was the last time you tried to restore a large quantity of data from tape?

After some head scratching, patching and a review of the related systems, a second attempt at restoring all the TIFF content from tape commenced. This time all went well and the files were restored to the network storage. Hurrah!

JHOVE to validate those TIFFs

I decided that for the initial validation of the TIFF files, checking the files were well-formed and valid, JHOVE would provide a good baseline report.

As I mentioned in another blog post, Customizable JHOVE TIFF output handler anyone?, JHOVE’s XML output is rather unwieldy. I therefore planned to transform the XML using xsltproc (a command line XSLT processor) with a custom XSLT stylesheet, allowing us to select any of the attributes from the file which we might want to report on later. This would then produce a simple CSV output.
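To give a feel for the XML-to-CSV step, here is a minimal sketch in Python rather than XSLT, so it is not the stylesheet used for this project. The sample document is hand-written and heavily simplified; the namespace and element names are my assumption of what JHOVE's XML handler emits, so check them against your own output before relying on them.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace assumed for JHOVE's XML output; verify against a real report.
NS = {"j": "http://hul.harvard.edu/ois/xml/ns/jhove"}

# Hand-written, simplified sample modelled on a JHOVE report (illustrative only).
SAMPLE_XML = """<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove">
  <repInfo uri="/data/example/0001.tif">
    <lastModified>2014-03-01T10:15:00+00:00</lastModified>
    <size>52428800</size>
    <format>TIFF</format>
    <version>6.0</version>
    <status>Well-Formed and valid</status>
  </repInfo>
</jhove>"""

# Top-level fields to report on, in CSV column order.
FIELDS = ["lastModified", "size", "format", "version", "status"]


def rep_info_rows(xml_text):
    """Return one list of values per <repInfo> element (one per file)."""
    root = ET.fromstring(xml_text)
    rows = []
    for rep in root.findall("j:repInfo", NS):
        row = [rep.get("uri", "")]
        for field in FIELDS:
            row.append(rep.findtext(f"j:{field}", default="", namespaces=NS))
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["uri"] + FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(rep_info_rows(SAMPLE_XML)), end="")
```

Only the flat, top-level fields are handled here; the nested TIFF properties need more work, which is where the stylesheet earns its keep.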

On a side note, work on adding a CSV output handler to JHOVE is in progress! This would mean the above process would be much simpler and quicker.

Parallel processing for the win.

What’s better than one JHOVE process validating TIFF content? Two! (well actually for us, sixteen at once works out quite nicely.)

It was clear from some initial testing with a sample set of 10,000 TIFF files that a single JHOVE process was going to take a long time to process 520,000+ images (around two and a half days!)

So I started to look for a simple way to run many JHOVE processes in parallel. Using GNU Parallel seemed like a good way to go.

I created a command line Bash script which would take a list of directories to scan and then use GNU Parallel to fire off many JHOVE + XSLT processes, producing a CSV output with one line per TIFF file processed.
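The overall shape of that script can be sketched in Python. To be clear, this is not the Bash script itself: it swaps GNU Parallel for a thread pool, and the worker shown (stand_in_worker, a hypothetical name) only records each file's size. In real use the worker would shell out to JHOVE and the XSLT step and return the resulting CSV fields.

```python
import csv
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

TIFF_SUFFIXES = {".tif", ".tiff"}


def find_tiffs(directories):
    """Yield every TIFF file found beneath the given directories."""
    for directory in directories:
        for path in sorted(Path(directory).rglob("*")):
            if path.is_file() and path.suffix.lower() in TIFF_SUFFIXES:
                yield path


def stand_in_worker(path):
    """Hypothetical placeholder for the JHOVE + XSLT step: one CSV row per file."""
    return [str(path), path.stat().st_size]


def validate_all(paths, worker, jobs=16):
    """Run worker(path) on up to `jobs` threads; rows come back in input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(worker, paths))


if __name__ == "__main__":
    # Usage: python validate.py DIR [DIR ...] > report.csv
    rows = validate_all(find_tiffs(sys.argv[1:]), stand_in_worker)
    csv.writer(sys.stdout).writerows(rows)
```

Threads are enough in this sketch because a real worker would spend its time waiting on external processes rather than running Python code.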

As our validation server was virtualised, it meant that I could scale the memory and CPU cores in this machine to do some performance testing. Below is a chart showing the number of images that the parallel processing system could handle per minute vs. the number of CPU cores enabled on the virtual server. (For all of the testing the memory in the server remained at 4 GB.)

With 16 CPU cores, the estimate was that it would take around 6-7 hours to process all the Polonsky TIFF content, a nice improvement on a single process.

At the start of this week, I ran a full production test, validating all 520,000+ TIFF files. Four and a half hours later the process was complete and a 100 MB+ CSV file had been generated with 520,000+ rows of data. Success!
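For anyone who wants to compare these figures, here is the back-of-the-envelope arithmetic as a small Python sketch. It assumes exactly 520,000 files (the real count was a little higher) and uses the rounded durations quoted in this post, so treat the results as approximate.

```python
TOTAL_FILES = 520_000          # assumption: lower bound of "520,000+"
SINGLE_PROCESS_HOURS = 2.5 * 24  # "around two and a half days"
PARALLEL_HOURS = 4.5             # the full production run


def images_per_minute(total_files, hours):
    """Average throughput over a run of the given length."""
    return total_files / (hours * 60)


single_rate = images_per_minute(TOTAL_FILES, SINGLE_PROCESS_HOURS)
parallel_rate = images_per_minute(TOTAL_FILES, PARALLEL_HOURS)
speedup = SINGLE_PROCESS_HOURS / PARALLEL_HOURS

if __name__ == "__main__":
    print(f"single process: {single_rate:.0f} images/minute")
    print(f"16 cores:       {parallel_rate:.0f} images/minute")
    print(f"speed-up:       {speedup:.1f}x")
```

On those assumptions a single process averages roughly 144 images per minute, the parallel run roughly 1,926, which is a speed-up of about 13 times.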

For Part Three of this story I will write up how I plan to visualise the CSV data in Qlik Sense and the further analysis of those few files which failed the initial validation.

Customizable JHOVE TIFF output handler anyone?

Technical Fellow, James, talks about the challenges with putting JHOVE’s full XML output into a reporting tool and how he found a workaround. We would love feedback about how you use JHOVE’s TIFF output. What workarounds have you tried to extract the data for use in reporting tools and what do you think about having a customizable TIFF output handler for JHOVE?

As mentioned in my last blog post, I’ve been looking to validate a reasonably large collection of TIFF master image files from a digitization project. On a side note from that, I would like to talk about the output from JHOVE’s TIFF module.

The JHOVE TIFF module allows you to specify an output handler as either Text, an XML audit, or a full XML output format.

Text provides a straightforward line by line breakdown of the various characteristics and properties of each TIFF processed. However, because it is not a structured document, processing the output when many files are characterized is not ideal.

The XML audit output provides a very minimal XML document which will simply report if the TIFF files were valid and well formed or not; this is great for a quick check, but lacks some other metadata properties that I was looking for.

The full XML output provides the same information that was provided in text output format, but with the advantage of being a structured document. However, I’ve found some of the additional metadata structuring in the full XML rather cumbersome to process with further reporting tools.

As a result, I’ve been struggling a bit to extract all of the properties I would like from the full XML output into a reporting tool. I then started to wonder about having a more customizable output handler which would simply report the properties I required in a neat and easier to parse XML format.

I had looked at using an XSLT transformation on the XML output but, as mentioned, I found it rather complicated to extract some of the metadata property values I wanted due to the excessive nesting of these and the property naming structure. I think I need to brush up on my XSLT skills perhaps?
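To illustrate the nesting problem, here is a minimal Python sketch that walks a nested property tree and flattens it into path-like names. The sample is hand-written and simplified, and the namespace, element names and property names are my assumption of how JHOVE structures its properties, so treat it as an illustration rather than a description of the real output.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for JHOVE's XML output; verify against a real report.
NS = {"j": "http://hul.harvard.edu/ois/xml/ns/jhove"}

# Hand-written, simplified sample of nested properties (illustrative only).
SAMPLE = """<properties xmlns="http://hul.harvard.edu/ois/xml/ns/jhove">
  <property>
    <name>TIFFMetadata</name>
    <values arity="List" type="Property">
      <property>
        <name>ByteOrder</name>
        <values arity="Scalar" type="String">
          <value>little-endian</value>
        </values>
      </property>
      <property>
        <name>IFDs</name>
        <values arity="List" type="Property">
          <property>
            <name>Number</name>
            <values arity="Scalar" type="Integer">
              <value>1</value>
            </values>
          </property>
        </values>
      </property>
    </values>
  </property>
</properties>"""


def flatten(prop, prefix=""):
    """Recursively map one <property> to {"Parent/Child": "value"} entries."""
    name = prop.findtext("j:name", default="", namespaces=NS)
    path = f"{prefix}/{name}" if prefix else name
    values = prop.find("j:values", NS)
    flat = {}
    if values is None:
        return flat
    # Nested properties: descend, carrying the path so far.
    for child in values.findall("j:property", NS):
        flat.update(flatten(child, path))
    # Leaf values: join multiple values into one cell.
    leaves = [value.text or "" for value in values.findall("j:value", NS)]
    if leaves:
        flat[path] = "; ".join(leaves)
    return flat


def flatten_properties(xml_text):
    """Flatten every top-level <property> in a <properties> element."""
    flat = {}
    for prop in ET.fromstring(xml_text).findall("j:property", NS):
        flat.update(flatten(prop))
    return flat


if __name__ == "__main__":
    for key, value in flatten_properties(SAMPLE).items():
        print(f"{key} = {value}")
```

One limitation to be aware of: properties that repeat under the same path overwrite each other in this sketch, so a real report with several repeated entries would need an index added to the path.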

In the short term, I’ve converted the XML output to a CSV file, using a little freeware program called XML2CSV from A7Soft. Using the tool, I selected the various fields required (filename, last modified date, size, compression scheme, status, TIFF version, image width & height, etc) for my reporting. Then, the conversion program extracted the selected values, which provided a far simpler and smaller document to process in the reporting tool.

I would be interested to know what others have done when confronted with the XML output and wonder if there is any mileage in a more customizable output handler for the TIFF module…


Update 31st May 2017

Thanks to Ross Spencer, Martin Hoppenheit and others from Twitter, I’ve now created a basic JHOVE XML to CSV XSLT stylesheet. A draft version is on my GitHub should anyone want to do something similar.

Validating half a million TIFF files. Part One.

Oxford Technical Fellow, James, reports on the validation work he is doing with JHOVE and DPF Manager in Part One of this blog series on validation tools for auditing the Polonsky Digitization Project’s TIFF files.

In 2013, the Bodleian Libraries of the University of Oxford and the Biblioteca Apostolica Vaticana (Vatican Library) joined efforts in a landmark digitization project. The aim was to open up their repositories of ancient texts including Hebrew manuscripts, Greek manuscripts, and incunabula, or 15th-century printed books. The goal was to digitize over one and a half million pages. All of this was made possible by funding from the Polonsky Foundation.

As part of our own Polonsky funded project, we have been preparing the ground to validate over half a million TIFF files which have been created from digitization work here at Oxford.

Many in the Digital Preservation field have already written articles and blogs on the tools available for validating TIFF files. Yvonne Tunnat (from ZBW Leibniz Information Centre for Economics) wrote a blog for the Open Preservation Foundation regarding the tools. I also had the pleasure of hearing Yvonne and Michelle Lindlar (from TIB Leibniz Information Centre for Science and Technology) cover this very subject in more detail at the IDCC 2017 conference, when discussing JHOVE in their talk, How Valid Is Your Validation? A Closer Look Behind The Curtain Of JHOVE.

The go-to validator for TIFF files?

Preparation for validation

In order to validate the master TIFF files, firstly we needed to retrieve these from our tape storage system; fortunately around two-thirds of the images had already been restored to spinning disk storage as part of another internal project. When the master TIFF files were written to tape this included MD5 hashes of the files, so as part of this validation work we will confirm the fixity of all the files. Our network storage system had plenty of room to accommodate all the required files, so we began auditing what still needed to be recovered.
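A fixity check of this kind can be sketched in a few lines of Python. This is an illustration only, not the tooling used on the project: it assumes the recorded hashes are already available as a mapping from file path to MD5 hex digest, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large TIFFs never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(expected):
    """Compare recorded hashes with the files on disk.

    `expected` maps a file path to the MD5 hex digest recorded earlier
    (assumed format). Returns a dict of path -> "missing" or "mismatch"
    for every file that fails; an empty dict means all files passed.
    """
    failures = {}
    for path, recorded in expected.items():
        if not Path(path).is_file():
            failures[path] = "missing"
        elif md5_of_file(path) != recorded.lower():
            failures[path] = "mismatch"
    return failures
```

Calling check_fixity with the recorded hashes returns only the problem files, which keeps the report short when almost everything passes.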

Whilst the auditing and retrieval was progressing, I set about validating a sample set of master TIFF files using both JHOVE and DPF Manager to get an estimate of the time it would take to process the approximately 50 TB of files. I was also interested to compare the results of both tools when faced with invalid or corrupted sample sets of files.

We set up a new virtual machine server in order to carry out the validation workload; this allowed us to scale this machine’s performance as required. Both validation tools were going to be run in a Red Hat Linux environment and both would be run from the command line.

It quickly became clear that JHOVE was going to be able to validate the TIFF files a lot quicker than DPF Manager. If DPF Manager is being used as part of one of your workflows, you may not have noticed any real time penalty when processing small numbers of files. With a large batch, however, the time difference between the two tools was noticeable.

Potential alternative for TIFF validation?

During the testing I noticed there were several issues with DPF Manager, including the lack of being able to specify the number of threads the process could use, which I suspect resulted in the poor initial performance. I dutifully reported the bug to the DPF community GitHub and was pleased to see an almost instant response stating that it would be resolved in the next monthly release. I do love Open Source projects, and I think this highlights the importance of those using the tools being responsible for improving them. Without community engagement, these projects are liable to run out of steam and slowly die.

I’m going to reserve judgement on the tools until the next release of DPF Manager. We will then also be in a position to report back on our findings from this validation case study. So check back with our blog for Part Two.

I would be interested to hear from anyone else who might have been faced with validating large batches of files. What tools are you using? What challenges have you faced? Do let me know!