How I got JHOVE running in a debugger

Cambridge’s Technical Fellow, Dave, explains how he got JHOVE running in a debugger, including the various troubleshooting steps along the way. As for what he found when he got under the skin of JHOVE – stay tuned.


Over the years of developing apps, I have come to rely upon the tools of the trade; so rather than read programming documentation, I prefer getting code running under a debugger and stepping through it, to let it show me what an app does. In my defence, Object Oriented code tends to get quite complicated, with various methods of one class calling unexpected methods of another… To avoid this, you can use Design Patterns and write Clean Code, but it’s also very useful to let the debugger show you the path through the code, too.

This was the approach I took when looking more closely at JHOVE. I wanted to look under the hood of this application to help James with validating a major collection of TIFFs for a digitisation project run by the Bodleian Libraries and the Vatican Library.

Step 1: Getting the JHOVE code into an IDE

Jargon alert: an ‘IDE’, or ‘Integrated Development Environment’, is a piece of software for writing, managing, sharing, testing and (in this instance) debugging code.

So I had to pick the correct IDE to use… I already knew that JHOVE was a Java app: the fact it’s distributed as a Java Archive (JAR) was the giveaway, though if I’d needed confirmation, checking the coloured language bar on the homepage of its GitHub repository would have told me, too.

[Image: Coding language analysis in a GitHub project]

My Java IDE of choice is JetBrains’s IntelliJ IDEA, so the easiest way to get the code was to start a new project by Checking Out from Version Control, selecting the GitHub option and adding the URL for the JHOVE project (https://github.com/openpreserve/JHOVE). This copied (or ‘cloned’) all the code to my local machine.
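
If you’d rather not go through the IDE, the same thing can be done from the command line with Git itself – this is essentially what IntelliJ is doing behind the scenes:

git clone https://github.com/openpreserve/JHOVE

Either way, you end up with a full local copy of the repository, including all of its branches.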

[Image: Loading a project into IntelliJ IDEA directly from GitHub]

Git and GitHub make it quite easy to manage code branches, i.e. different versions of the codebase that can be developed in parallel with each other – so you can, say, fix a bug and re-release the app quickly in one branch, while taking longer to add a new feature in another.

The Open Preservation Foundation (who manage JHOVE’s codebase now) have (more or less) followed a convention of ‘branching on release’ – so you can easily debug the specific version you’re running in production by switching to the relevant branch… (…though version 1.7 seems to be missing a branch?) It’s usually easy to switch branches within your IDE – doing so simply pulls the code for that branch down into your local Git repository and loads it into your IDE in the background.
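
Again, the command-line equivalent is straightforward (the branch name below is a placeholder – check the repository’s branch list for the exact names used):

git fetch origin
git checkout <release-branch-name>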

[Image: Finding the correct code branch in GitHub. Where’s 1.7 gone?]

Step 2: Finding the right starting point for the debugger

Like a lot of apps that have been around for a while, JHOVE’s codebase is quite large, and it’s therefore not immediately obvious where the ‘starting point’ is. At least, it isn’t obvious if you don’t READ the README file in the codebase’s root. Once you finally get around to doing that, there’s a clue buried quite near the bottom in the Project Structure section:

JHOVE-apps: The JHOVE-apps module contains the command-line and GUI application code and builds a fat JAR containing the entire Java application.

… so the app starts from within the jhove-apps folder somewhere. A little extra sniffing about and I found a source file in the src/main/java folder called Jhove.java, which contained the magic Java method:

public static void main(String[] args) { ... }

…which is the standard start point for any Java app (and several other languages too).

However, getting the debugger running successfully wasn’t just a case of finding the right entry point and clicking ‘run’ – I also had to set up the debugger configuration to pass the correct command-line arguments to the application, or it fell at the first hurdle. This is achieved in IntelliJ IDEA by editing the Run / Debug configuration. I set this up initially by right-clicking on the Jhove.java file and selecting Run JHOVE.main().

[Image: Running the Jhove class to start the application]

The run failed (because I hadn’t added the command-line arguments), but at least IntelliJ was clever enough to set up a new Run / Debug configuration (called Jhove, after the class I’d run) that I could then add the Program Arguments to – in this case, the same command-line arguments you’d run JHOVE with normally (the module you want to run, the handler you want to output the result with, the file you want to characterise, and so on).
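
For illustration, the Program Arguments might look something like the line below – the module and handler names follow JHOVE’s usual naming, but the output and input paths are just placeholders:

-m TIFF-hul -h XML -o /tmp/jhove-output.xml /path/to/sample.tif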

[Image: Editing the Run configuration in IntelliJ]

I could then add a breakpoint to the code in the Jhove.main() method and off I went… Or did I?

Step 3: Setting up a config file

So this gave me what I needed to start stepping through the code. Unfortunately, my first attempt didn’t get much further than the initial Jhove.main() method… it ran all the way through that method, but then the following error occurred:

Cannot instantiate module: com.mcgath.jhove.module.PngModule

The clue for how to fix this was actually provided by the debugger as it ran, and it’s a good example of the kind of insight you get from running code in debug mode in your IDE. Because the initial set of command-line parameters I was passing in from the Run / Debug configuration didn’t contain a “-c” parameter to set a config file, JHOVE was automagically picking up its configuration from a default location: i.e. the JHOVE/config folder in my user directory – which existed, complete with a config file, because I’d also installed JHOVE on my machine the easy way beforehand.

[Image: Debugger points towards the config file mix-up]

A quick look at this config showed that JHOVE was expecting all sorts of modules to be available to load, one of which was the ‘external’ module for PNG characterisation mentioned in the error message. This is included in the JHOVE codebase, but in a separate folder (jhove-ext-modules): the build script that pulls JHOVE together for production deployment clearly copes with copying the PNG module from this location to the correct place, but the IDE couldn’t find it when debugging.

So the solution? Put a custom config file in place, and remove the parts that referenced the PNG module. This worked a treat, and allowed me to track the code execution all the way through for a test TIFF file.
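
For anyone trying the same thing: the edit amounted to deleting the module entry that points at the PNG class from a copy of the standard config file, then passing that copy in with the -c parameter. Below is a rough sketch of the sort of entry involved – the surrounding structure is simplified, so check your own jhove.conf for the exact layout:

<module>
  <class>com.mcgath.jhove.module.PngModule</class>
</module>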

[Image: Adding an extra -c config file parameter and a custom config file]

Conclusion

Really, all the above, while making it possible to get under the skin of JHOVE, is just the start. Another blog post may follow regarding what I actually found when I ran through its processes and started to get an idea of how it worked (though as a bit of a spoiler, it wasn’t exactly pretty)…

But, given that JHOVE is more or less ubiquitous in digital preservation (i.e. all the major vended solutions wrap it up in their ingest processes in one way or another), hopefully more people will be encouraged to dive into it and learn how it works in more detail. (I guess you could just ‘read the manual’ – but if you’re a developer, doing it this way is more insightful, and more fun, too).

Digital Preservation futurology

I fancy attempting futurology, so here’s a list of things I believe could happen to ‘digital preservation systems’ over the next decade. I’ve mostly pinched these ideas from folks like Dave Thompson, Neil Jefferies, and my fellow Fellows. But if you see one of your ideas, please claim it using the handy commenting mechanism. And because it’s futurology, it doesn’t have to be accurate, so kindly contradict me!

Ingest becomes a relationship, not a one-off event

Many of the core concepts underpinning how computers are perceived to work are crude, paper-based metaphors – e.g. ‘files’, ‘folders’, ‘desktops’, ‘wastebaskets’ etc – that don’t relate to what your computer’s actually doing. (The early players in office computing were typewriter and photocopier manufacturers, after all…) These metaphors have succeeded at getting everyone to use computers, but they’ve also suppressed various opportunities to work smarter, too.

The concept of ingesting (oxymoronic) ‘digital papers’ is obviously heavily influenced by this paper paradigm.  Maybe the ‘paper paradigm’ has misled the archival community about computers a bit, too, given that they were experts at handling ‘papers’ before computers arrived?

As an example of what I mean: in the olden days (25 whole years ago!), Professor Plum would amass piles of important papers until the day he retired / died, and then, and only then, could these personal papers be donated and archived. Computers, of course, make it possible for the Prof both to keep his ‘papers’ where he needs them and to donate them at the same time, but the ‘ingest event’ at the centre of current digital preservation systems still seems to be underpinned by a core concept of ‘piles of stuff needing to be dealt with as a one-off task’. In future, the ‘ingest’ of a ‘donation’ will actually become a regular, repeated set of occurrences based upon ongoing relationships between donors and collectors, forged initially when Profs are but lowly postgrads. Personal Digital Archiving and Research Data Management will become key, and ripping digital ephemera from dying hard disks will become less necessary as those practices become established.

The above depends heavily upon…

Object versioning / dependency management

Of course, if Dr. Damson regularly donates materials from her postgrad days onwards, some of these may be updates to things donated previously. Some of them might have mutated so much since the original donation that they can be considered ‘child’ objects, which may have ‘siblings’ with ‘common ancestors’ already extant in the archive. Hence preservation systems need to manage multiple versions of ‘digital objects’, and the relationships between them.

Some of the preservation systems we’ve looked at claim to ‘do versioning’ but it’s a bit clunky – just side-by-side copies of immutable ‘digital objects’, not records of the changes from one version to the next, and with no concept of branching siblings from a common parent. Complex structures of interdependent objects are generally problematic for current systems. The wider computing world has been pushing at the limits of the ‘paper-paradigm’ immutable object for a while now (think Git, Blockchain, various version control and dependency management platforms, etc). Digital preservation systems will soon catch up.

Further blurring of the object / metadata boundary

What’s more important, the object or the metadata? The ‘paper-paradigm’ has skewed thinking towards the former (the sacrosanct ‘digital object’, comparable to the ‘original bit of paper’), but after you’ve digitised your rare book collection, what are Humanities scholars going to text-mine? It won’t be images of pages – it’ll be the transcripts of those (i.e. the ‘descriptive metadata’)*. Also, when seminal papers about these text mining efforts are published, how is this history of the engagement with your collection going to be recorded? Using a series of PREMIS Events (that future scholars can mine in turn), perhaps?

The above talk of text mining and contextual linking of secondary resources raises two more points…

* While I’m here, can I take issue with the term ‘descriptive metadata’? All metadata is descriptive. It’s tautological; like saying ‘uptight Englishman’. Can we think of a better name?

Ability to analyse metadata at scale

‘Delivery’ no longer just means ‘giving users a viewer to look at things one-by-one with’ – it now also means ‘letting people push their Natural Language or image processing algorithms to where the data sits, and then coping with vast streams of output data’.

Storage / retention informed by well-understood usage patterns

The fact that everything’s digital, and hence easier to disseminate and link together than physical objects, also means we can better understand how people use our material. This doesn’t just mean ‘wiring things up to Google Analytics’ – advances in bibliometrics that add social / mainstream media analysis, and so forth, to everyday citation counts present opportunities to judge the impact of our ‘stuff’ on the world like never before. Smart digital archives will inform their storage management and retention decisions with this sort of usage information, potentially in fully or semi-automated ways.

Ability to get data out, cleanly – all systems are only ever temporary!

Finally – it’s clear that there are no ‘long-term’ preservation system options. The system you procure today will merely be ‘custodian’ of your materials for the next ten or twenty years (if you’re lucky). This may mean moving heaps of content around in future, but perhaps it’s more pragmatic to think of future preservation systems as more like ‘lenses’ that are laid on top of more stable data stores to enable as-yet-undreamt-of functionality for future audiences?

(OK – that’s enough for now…)

Validating half a million TIFF files. Part One.

Oxford Technical Fellow, James, reports on the validation work he is doing with JHOVE and DPF Manager in Part One of this blog series on validation tools for auditing the Polonsky Digitization Project’s TIFF files.


In 2013, the Bodleian Libraries of the University of Oxford and the Biblioteca Apostolica Vaticana (Vatican Library) joined efforts in a landmark digitization project. The aim was to open up their repositories of ancient texts, including Hebrew manuscripts, Greek manuscripts, and incunabula (15th-century printed books), and to digitize over one and a half million pages. All of this was made possible by funding from the Polonsky Foundation.

As part of our own Polonsky funded project, we have been preparing the ground to validate over half a million TIFF files which have been created from digitization work here at Oxford.

Many in the digital preservation field have already written articles and blogs on the tools available for validating TIFF files. Yvonne Tunnat (from the ZBW Leibniz Information Centre for Economics), for example, wrote a blog for the Open Preservation Foundation regarding the tools. I also had the pleasure of hearing Yvonne and Michelle Lindlar (from the TIB Leibniz Information Centre for Science and Technology) talk on this very subject in more detail at the IDCC 2017 conference, in their presentation How Valid Is Your Validation? A Closer Look Behind The Curtain Of JHOVE.

[Image: The go-to validator for TIFF files?]

Preparation for validation

In order to validate the master TIFF files, we first needed to retrieve them from our tape storage system; fortunately, around two-thirds of the images had already been restored to spinning disk storage as part of another internal project. When the master TIFF files were written to tape, MD5 hashes of the files were stored alongside them, so as part of this validation work we will also confirm the fixity of all the files. Our network storage system had plenty of room to accommodate all the required files, so we began auditing what still needed to be recovered.
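
As a rough illustration of that fixity step (the manifest filename here is just a placeholder), a stored list of MD5 hashes can be re-checked on the Linux command line with something like:

md5sum -c tiff-manifest.md5

which recomputes each file’s hash and flags any that no longer match.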

Whilst the auditing and retrieval were progressing, I set about validating a sample set of master TIFF files using both JHOVE and DPF Manager, to get an estimate of the time it would take to process the approximately 50 TB of files. I was also interested to compare the results of both tools when faced with invalid or corrupted sample sets of files.

We set up a new virtual machine server in order to carry out the validation workload; this allowed us to scale the machine’s performance as required. Both validation tools were going to be run in a Red Hat Linux environment, and both would be run from the command line.

It quickly became clear that JHOVE was going to be able to validate the TIFF files a lot quicker than DPF Manager. If DPF Manager is being used as part of one of your workflows, you may not have noticed much of a time penalty when processing small numbers of files; with a large batch, however, the difference between the two tools was very noticeable.
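
For a sense of what the JHOVE side of the testing looked like, a batch run can be kicked off from the command line along these lines – the output path and TIFF directory are placeholders, and the exact flags should be checked against your installed version:

jhove -m TIFF-hul -h XML -o /validation/results.xml /storage/tiff-batch/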

[Image: Potential alternative for TIFF validation?]

During the testing I noticed several issues with DPF Manager, including not being able to specify the number of threads the process could use, which I suspect resulted in the poor initial performance. I dutifully reported the bug on the DPF Manager GitHub and was pleased to see an almost instant response stating that it would be resolved in the next monthly release. I do love Open Source projects, and I think this highlights the importance of those using the tools taking responsibility for improving them. Without community engagement, these projects are liable to run out of steam and slowly die.

I’m going to reserve judgement on the tools until the next release of DPF Manager. We will then also be in a position to report back on our findings from this validation case study. So check back with our blog for Part Two.

I would be interested to hear from anyone else who has been faced with validating large batches of files: what tools are you using? What challenges have you faced? Do let me know!