How I got JHOVE running in a debugger

Cambridge’s Technical Fellow, Dave, steps through how he got JHOVE running in a debugger, including the various troubleshooting steps. As for what he found when he got under the skin of JHOVE—stay tuned.


Over the years of developing apps, I have come to rely upon the tools of the trade; so rather than read programming documentation, I prefer getting code running under a debugger and stepping through it, to let it show me what an app does. In my defence, Object Oriented code tends to get quite complicated, with various methods of one class calling unexpected methods of another… To avoid this, you can use Design Patterns and write Clean Code, but it’s also very useful to let the debugger show you the path through the code, too.

This was the approach I took when I took a closer look at JHOVE. I wanted to look under the hood of this application to help James with validating a major collection of TIFFs for a digitisation project by Bodleian Libraries and The Vatican Library.

Step 1: Getting the JHOVE code into an IDE

Jargon alert: ‘IDE’ stands for ‘Integrated Development Environment’ – a piece of software for writing, managing, sharing, testing and (in this instance) debugging code.

So I had to pick the correct IDE to use… I already knew that JHOVE was a Java app: the fact it’s compiled as a Java Archive (JAR) was the giveaway, though if I’d needed confirmation, checking the coloured bar on the homepage of its GitHub repository would have told me, too.


Coding language analysis in a GitHub project

My Java IDE of choice is JetBrains’s IntelliJ IDEA, so the easiest way to get the code was to start a new project by Checking Out from Version Control, selecting the GitHub option and adding the URL for the JHOVE project (https://github.com/openpreserve/JHOVE). This copied (or ‘cloned’) all the code to my local machine.
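(If you prefer the command line, the equivalent step is a standard Git clone using the same URL:)

git clone https://github.com/openpreserve/JHOVE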


Loading a project into IntelliJ IDEA directly from GitHub

GitHub makes it quite easy to manage code branches, i.e.: different versions of the codebase that can be developed in parallel with each other – so you can, say, fix a bug and re-release the app quickly in one branch, while taking longer to add a new feature in another.

The Open Preservation Foundation (who manage JHOVE’s codebase now) have (more or less) followed a convention of ‘branching on release’ – so you can easily debug the specific version you’re running in production by switching to the relevant branch… (…though version 1.7 seems to be missing a branch?) It’s usually easy to switch branches within your IDE – doing so simply pulls the code from that branch down and loads it into your IDE and your local Git repository in the background.
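On the command line, the equivalent is a plain Git checkout – the branch name below is just a placeholder, so substitute whichever release branch matches the version you run in production:

git fetch origin
git checkout <release-branch>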


Finding the correct code branch in GitHub. Where’s 1.7 gone?

Step 2: Finding the right starting point for the debugger

Like a lot of apps that have been around for a while, JHOVE’s codebase is quite large, and it’s therefore not immediately obvious where the ‘starting point’ is. At least, it isn’t obvious if you don’t READ the README file in the codebase’s root. Once you finally get around to doing that, there’s a clue buried quite near the bottom in the Project Structure section:

JHOVE-apps: The JHOVE-apps module contains the command-line and GUI application code and builds a fat JAR containing the entire Java application.

… so the app starts from within the jhove-apps folder somewhere. A little extra sniffing about and I found a class file in the src/main/java folder called Jhove.java, which contained the magic Java method:

public static void main(String[] args) { ... }

…which is the standard start point for any Java app (and several other languages too).

However, getting the debugger running successfully wasn’t just a case of finding the right entry point and clicking ‘run’ – I also had to set up the debugger configuration to pass the correct command-line arguments to the application, or it fell at the first hurdle. This is achieved in IntelliJ IDEA by editing the Run / Debug configuration. I set this up initially by right-clicking on the Jhove.java file and selecting Run JHOVE.main().


Running the Jhove class to start the application

The run failed (because I hadn’t added the command-line arguments), but at least IntelliJ was clever enough to set up a new Run / Debug configuration (called Jhove, after the class I’d run) that I could then add the Program Arguments to – in this case, the same command-line arguments you’d run JHOVE with normally (e.g. the module you want to run, the handler you want to output the result with, the file you want to characterise, and so on).
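For illustration, the Program Arguments ended up looking something like the line below – the module, handler, output path and test file are all just examples to substitute with your own:

-m TIFF-hul -h XML -o /tmp/jhove-report.xml /path/to/test-image.tif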


Editing the Run configuration in IntelliJ

I could then add a breakpoint to the code in the Jhove.main() method and off I went… Or did I?

Step 3: Setting up a config file

So this gave me what I needed to start stepping through the code. Unfortunately, my first attempt didn’t get any further than the initial Jhove.main() method… It got all the way through, but then the following error occurred:

Cannot instantiate module: com.mcgath.jhove.module.PngModule

The clue for how to fix this was actually provided by the debugger as it ran, though, and it provides a good example of the kind of insight you get from running code in debug mode in your IDE. Because the initial set of command-line parameters I was passing in from the Run / Debug configuration didn’t contain a “-c” parameter to set a config file, JHOVE was automagically picking up its configuration from a default location: i.e. the JHOVE/config folder in my user directory – which existed, with a config file, because I’d also installed JHOVE on my machine the easy way beforehand.


Debugger points towards the config file mix-up

A quick look at this config showed that JHOVE was expecting all sorts of modules to be available to load, one of which was the ‘external’ module for PNG characterisation mentioned in the error message. This is included in the JHOVE codebase, but in a separate folder (jhove-ext-modules): the build script that pulls JHOVE together for production deployment clearly copes with copying the PNG module from this location to the correct place, but the IDE couldn’t find it when debugging.

So the solution? Put a custom config file in place, and remove the parts that referenced the PNG module. This worked a treat, and allowed me to track the code execution all the way through for a test TIFF file.
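For reference, the module entries in the JHOVE config file look roughly like the snippet below (heavily trimmed – the real file lists every module, plus handlers and the JHOVE home directory); the custom config simply leaves out the PngModule entry that was triggering the error:

<module>
  <class>edu.harvard.hul.ois.jhove.module.TiffModule</class>
</module>
<!-- this entry was removed in the custom config -->
<!--
<module>
  <class>com.mcgath.jhove.module.PngModule</class>
</module>
-->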


Adding an extra -c config file parameter and a custom config file.

Conclusion

Really, all the above, while making it possible to get under the skin of JHOVE, is just the start. Another blog post may follow regarding what I actually found when I ran through its processes and started to get an idea of how it worked (though as a bit of a spoiler, it wasn’t exactly pretty)…

But, given that JHOVE is more or less ubiquitous in digital preservation (i.e. all the major vended solutions wrap it up in their ingest processes in one way or another), hopefully more people will be encouraged to dive into it and learn how it works in more detail. (I guess you could just ‘read the manual’ – but if you’re a developer, doing it this way is more insightful, and more fun, too).

Validating half a million TIFF files. Part Two.

Back in May, I wrote a blog post about preparing the groundwork for the process of validating over 500,000 TIFF files which were created as part of a Polonsky Digitization Project which started in 2013. You can read Part One here on the blog.

Restoring the TIFF files from tape

Stack of backup tapes. Photo: Amazon

For the digitization workflow we used Goobi, and within that process the master TIFF files from the project were written to tape. In order to actually check these files, it was obvious we would need to restore all the content to spinning disk. I duly made a request to our system administration team and waited.

As I mentioned in Part One, we had set up a new virtualised server which had access to a chunk of network storage. The Polonsky TIFF files were restored to this network storage; however, midway through the restoration from tape, the tape server’s operating system crashed… disaster.

After reviewing the failure, it appeared there was a bug within the RedHat operating system which had caused the problem. This issue proved to be a good lesson: a tape backup copy is only useful if you can actually restore it!

Question for you. When was the last time you tried to restore a large quantity of data from tape?

After some head scratching, patching and a review of the related systems, a second attempt at restoring all the TIFF content from tape commenced and this time all went well and the files were restored to the network storage. Hurrah!

JHOVE to validate those TIFFs

I decided that for the initial validation of the TIFF files, checking the files were well-formed and valid, JHOVE would provide a good baseline report.

As I mentioned in another blog post, Customizable JHOVE TIFF output handler anyone?, JHOVE’s XML output is rather unwieldy, so I planned to transform the XML using xsltproc (a command-line XSLT processor) with a custom XSLT stylesheet, allowing us to select any of the attributes from the file which we might want to report on later. This would then produce a simple CSV output.
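In outline, the pipeline for a single file looks something like the line below (the stylesheet and file names are hypothetical): JHOVE writes its XML report to standard output, xsltproc applies the stylesheet, and the resulting line is appended to the CSV:

jhove -m TIFF-hul -h XML /path/to/image.tif | xsltproc jhove-to-csv.xsl - >> report.csv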

On a side note, work on adding a CSV output handler to JHOVE is in progress! This would mean the above process would be much simpler and quicker.

Parallel processing for the win.

What’s better than one JHOVE process validating TIFF content? Two! (well actually for us, sixteen at once works out quite nicely.)

It was clear from some initial testing with a 10,000-file sample set of TIFFs that a single JHOVE process was going to take a long time to process 520,000+ images (around two and a half days!).

So I started to look for a simple way to run many JHOVE processes in parallel. Using GNU Parallel seemed like a good way to go.

I created a command line BASH script which would take a list of directories to scan and then utilise GNU Parallel to fire off many JHOVE + XSLT processes to result in a CSV output, one line per TIFF file processed.
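Stripped back, the script amounted to something like the sketch below – directory arguments, job count and stylesheet name are illustrative, and GNU Parallel groups each job’s output so the CSV lines don’t interleave:

#!/bin/bash
# Validate every TIFF under the given directories, sixteen JHOVE processes at a time,
# turning each XML report into a CSV line via an XSLT stylesheet (hypothetical name).
find "$@" -type f -name '*.tif' | \
  parallel --jobs 16 "jhove -m TIFF-hul -h XML {} | xsltproc jhove-to-csv.xsl -" > report.csv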

As our validation server was virtualised, it meant that I could scale the memory and CPU cores in this machine to do some performance testing. Below is a chart showing the number of images that the parallel processing system could handle per minute vs. the number of CPU cores enabled on the virtual server. (For all of the testing the memory in the server remained at 4 GB.)

So with 16 CPU cores, the estimate was that it would take around 6-7 hours to process all the Polonsky TIFF content – a nice improvement on a single process.

At the start of this week, I ran a full production test, validating all 520,000+ TIFF files. Four and a half hours later the process was complete and a 100 MB+ CSV file had been generated with 520,000+ rows of data. Success!

For Part Three of this story I will write up how I plan to visualise the CSV data in Qlik Sense and the further analysis of those few files which failed the initial validation.

Customizable JHOVE TIFF output handler anyone?

Technical Fellow, James, talks about the challenges with putting JHOVE’s full XML output into a reporting tool and how he found a work around. We would love feedback about how you use JHOVE’s TIFF output. What workarounds have you tried to extract the data for use in reporting tools and what do you think about having a customizable TIFF output handler for JHOVE? 


As mentioned in my last blog post, I’ve been looking to validate a reasonably large collection of TIFF master image files from a digitization project. On a side note from that, I would like to talk about the output from JHOVE’s TIFF module.

The JHOVE TIFF module allows you to specify an output handler as either text, an XML audit, or a full XML output format.

Text provides a straightforward line-by-line breakdown of the various characteristics and properties of each TIFF processed, but because it isn’t a structured document, processing the output when many files are characterized is not ideal.

The XML audit output provides a very minimal XML document which simply reports whether the TIFF files were valid and well-formed or not; this is great for a quick check, but lacks some of the other metadata properties that I was looking for.

The full XML output provides the same information as the text output format, but with the advantage of being a structured document. However, I’ve found some of the additional metadata structuring in the full XML rather cumbersome to process with further reporting tools.

As a result, I’ve been struggling a bit to extract all of the properties I would like from the full XML output into a reporting tool. I then started to wonder about having a more customizable output handler which would simply report the properties I required in a neat and easier-to-parse XML format.

I had looked at using an XSLT transformation on the XML output but, as mentioned, I found it rather complicated to extract some of the metadata property values I wanted due to the excessive nesting of these and the property naming structure. I think I need to brush up on my XSLT skills perhaps?

In the short term, I’ve converted the XML output to a CSV file, using a little freeware program called XML2CSV from A7Soft. Using the tool, I selected the various fields required (filename, last modified date, size, compression scheme, status, TIFF version, image width & height, etc) for my reporting. Then, the conversion program extracted the selected values, which provided a far simpler and smaller document to process in the reporting tool.

I would be interested to know what others have done when confronted with the XML output and wonder if there is any mileage in a more customizable output handler for the TIFF module…

 

Update 31st May 2017

Thanks to Ross Spencer, Martin Hoppenheit and others from Twitter, I’ve now created a basic JHOVE XML to CSV XSLT stylesheet. There’s a draft version on my GitHub should anyone want to do something similar.
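For the curious, a minimal sketch of that kind of transform is below – the JHOVE namespace and element names are from memory and the columns chosen are just examples, so treat it as a starting point rather than the finished stylesheet (the draft on GitHub is the one to use):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:j="http://hul.harvard.edu/ois/xml/ns/jhove">
  <xsl:output method="text"/>
  <!-- Emit one CSV line per repInfo element: file URI, validation status, size in bytes -->
  <xsl:template match="/">
    <xsl:for-each select="//j:repInfo">
      <xsl:value-of select="@uri"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="j:status"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="j:size"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>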

Validating half a million TIFF files. Part One.

Oxford Technical Fellow, James, reports on the validation work he is doing with JHOVE and DPF Manager in Part One of this blog series on validation tools for auditing the Polonsky Digitization Project’s TIFF files.


In 2013, The Bodleian Libraries of the University of Oxford and the Biblioteca Apostolica Vaticana (Vatican Library) joined efforts in a landmark digitization project. The aim was to open up their repositories of ancient texts including Hebrew manuscripts, Greek manuscripts, and incunabula, or 15th-century printed books. The goal was to digitize over one and a half million pages. All of this was made possible by funding from the Polonsky Foundation.

As part of our own Polonsky funded project, we have been preparing the ground to validate over half a million TIFF files which have been created from digitization work here at Oxford.

Many in the Digital Preservation field have already written articles and blogs on the tools available for validating TIFF files; Yvonne Tunnat (from ZBW Leibniz Information Centre for Economics) wrote a blog for the Open Preservation Foundation regarding the tools. I also had the pleasure of hearing Yvonne and Michelle Lindlar (from TIB Leibniz Information Centre for Science and Technology) talk about this very subject in more detail at the IDCC 2017 conference, discussing JHOVE in their talk, How Valid Is Your Validation? A Closer Look Behind The Curtain Of JHOVE.

The go-to validator for TIFF files?

Preparation for validation

In order to validate the master TIFF files, we first needed to retrieve these from our tape storage system; fortunately around two-thirds of the images had already been restored to spinning disk storage as part of another internal project. When the master TIFF files were written to tape, MD5 hashes of the files were stored alongside them, so as part of this validation work we will confirm the fixity of all the files. Our network storage system had plenty of room to accommodate all the required files, so we began auditing what still needed to be recovered.
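(As an aside, verifying those hashes is a one-liner if they are stored in the standard md5sum manifest format – the manifest name here is hypothetical:)

md5sum -c polonsky-tiffs.md5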

Whilst the auditing and retrieval was progressing, I set about investigating the validation of a sample set of master TIFF files using both JHOVE and DPF Manager, to get an estimate of the time it would take to process the approximately 50 TB of files. I was also interested to compare the results of both tools when faced with invalid or corrupted sample sets of files.

We set up a new virtual machine server in order to carry out the validation workload; this allowed us to scale the machine’s performance as required. Both validation tools were going to be run on a RedHat Linux environment, and both would be run from the command line.

It quickly became clear that JHOVE was going to be able to validate the TIFF files a lot quicker than DPF Manager. If DPF Manager is being used as part of one of your workflows, you may not have noticed any real time penalty when processing small numbers of files; however, with a large batch, the time difference between the two tools was noticeable.

Potential alternative for TIFF validation?

During the testing I noticed several issues with DPF Manager, including not being able to specify the number of threads the process could use, which I suspect resulted in the poor initial performance. I dutifully reported the bug to the DPF community GitHub and was pleased to see an almost instant response stating that it would be resolved in the next monthly release. I do love Open Source projects, and I think this highlights the importance of those using the tools taking responsibility for improving them. Without community engagement, these projects are liable to run out of steam and slowly die.

I’m going to reserve judgement on the tools until the next release of DPF Manager. We will then also be in a position to report back on our findings from this validation case study. So check back with our blog for Part Two.

I would be interested to hear from anyone else who might have been faced with validating large batches of files: what tools are you using? What challenges have you faced? Do let me know!