How I got JHOVE running in a debugger

Cambridge’s Technical Fellow, Dave, steps through how he got JHOVE running in a debugger, including the various troubleshooting steps. As for what he found when he got under the skin of JHOVE—stay tuned.


Over the years of developing apps, I have come to rely upon the tools of the trade; so rather than read programming documentation, I prefer getting code running under a debugger and stepping through it, to let it show me what an app does. In my defence, object-oriented code tends to get quite complicated, with various methods of one class calling unexpected methods of another… You can reduce this complexity by using Design Patterns and writing Clean Code, but it’s also very useful to let the debugger show you the path through the code.

This was the approach I used when I took a closer look at JHOVE. I wanted to look under the hood of this application to help James with validating a major collection of TIFFs for a digitisation project by Bodleian Libraries and The Vatican Library.

Step 1: Getting the JHOVE code into an IDE

Jargon alert: ‘IDE’ – stands for ‘Integrated Development Environment’, which means: “… piece of software for writing, managing, sharing, testing and (in this instance) debugging code”.

So I had to pick the correct IDE to use… I already knew that JHOVE was a Java app: the fact it’s compiled as a Java Archive (JAR) was the giveaway, though if I’d needed confirmation, checking the coloured bar on the homepage of its GitHub repository would have told me, too.

Coding language analysis in a GitHub project

My Java IDE of choice is JetBrains’s IntelliJ IDEA, so the easiest way to get the code was to start a new project by Checking Out from Version Control, selecting the GitHub option and adding the URL for the JHOVE project (https://github.com/openpreserve/JHOVE). This copied (or ‘cloned’) all the code to my local machine.

Loading a project into IntelliJ IDEA directly from GitHub

GitHub makes it quite easy to manage code branches, i.e.: different versions of the codebase that can be developed in parallel with each other – so you can, say, fix a bug and re-release the app quickly in one branch, while taking longer to add a new feature in another.

The Open Preservation Foundation (who manage JHOVE’s codebase now) have (more or less) followed a convention of ‘branching on release’ – so you can easily debug the specific version you’re running in production by switching to the relevant branch… (…though version 1.7 seems to be missing a branch?) It’s usually easy to switch branches within your IDE – doing so simply pulls down the code from the other branch and loads it into your IDE (and, in the background, into your local Git repository).

Finding the correct code branch in GitHub. Where’s 1.7 gone?

Step 2: Finding the right starting point for the debugger

Like a lot of apps that have been around for a while, JHOVE’s codebase is quite large, and it’s therefore not immediately obvious where the ‘starting point’ is. At least, it isn’t obvious if you don’t READ the README file in the codebase’s root. Once you finally get around to doing that, there’s a clue buried quite near the bottom in the Project Structure section:

JHOVE-apps: The JHOVE-apps module contains the command-line and GUI application code and builds a fat JAR containing the entire Java application.

… so the app starts from within the jhove-apps folder somewhere. A little extra sniffing about and I found a class file in the src/main/java folder called Jhove.java, which contained the magic Java method:

public static void main (String [] args) {}

…which is the standard start point for any Java app (and several other languages too).
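To make that entry point concrete, here is a minimal sketch of a command-line Java class. The names EntryPointSketch and summarise are my own, hypothetical ones and have nothing to do with the JHOVE codebase; the point is simply that whatever follows the class name on the command line arrives in main() as the args array.

```java
import java.util.Arrays;

// Hypothetical stand-in for a command-line entry class (not JHOVE's Jhove.java).
class EntryPointSketch {

    // Describes the arguments the JVM handed to main().
    static String summarise(String[] args) {
        if (args.length == 0) {
            return "no arguments supplied";
        }
        return args.length + " argument(s): " + Arrays.toString(args);
    }

    // The JVM calls this method first, so a breakpoint here stops
    // before any of the application's own logic has run.
    public static void main(String[] args) {
        System.out.println(summarise(args));
    }
}
```

Run with no arguments, this prints "no arguments supplied", which is roughly the situation a debugger is in when you launch a command-line app without configuring it first.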

However, getting the debugger running successfully wasn’t just a case of finding the right entry point and clicking ‘run’ – I also had to set up the debugger configuration to pass the correct command-line arguments to the application, or it fell at the first hurdle. This is achieved in IntelliJ IDEA by editing the Run / Debug configuration. I set this up initially by right-clicking on the Jhove.java file and selecting Run Jhove.main().

Running the Jhove class to start the application

The run failed (because I hadn’t added the command line arguments) but at least IntelliJ was clever enough to set up a new Run / Debug configuration (called Jhove after the class I’d run) that I could then add the Program Arguments to – in this case, the same command line arguments you’d run JHOVE with normally (e.g. the module you want to run, the handler you want to output the result with, the file you want to characterise, etc.).
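As a rough illustration of what ends up in the Program Arguments box, the sketch below assembles a typical argument string. It assumes JHOVE’s usual -m (module) and -h (handler) options; the module name, handler name and file path are example values only, and the class itself is hypothetical.

```java
import java.util.List;

// Sketch: assembles the text for IntelliJ's "Program arguments" box.
class ProgramArgumentsSketch {

    // -m selects the module, -h the output handler; the file to characterise goes last.
    static List<String> buildArgs(String module, String handler, String file) {
        return List.of("-m", module, "-h", handler, file);
    }

    public static void main(String[] args) {
        // Example values only: swap in your own module, handler and file path.
        List<String> jhoveArgs = buildArgs("TIFF-hul", "XML", "/path/to/sample.tif");
        System.out.println(String.join(" ", jhoveArgs));
    }
}
```

The printed line is what you would paste into the Run / Debug configuration, exactly as you would type it after the jhove command in a terminal.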

Editing the Run configuration in IntelliJ

I could then add a breakpoint to the code in the Jhove.main() method and off I went… Or did I?

Step 3: setting up a config file

So this gave me what I needed to start stepping through the code. Unfortunately, my first attempt didn’t get any further than the initial Jhove.main() method… It got all the way through, but then the following error occurred:

Cannot instantiate module: com.mcgath.jhove.module.PngModule

The clue for how to fix this was actually provided by the debugger as it ran, which is a good example of the kind of insight you get from running code in debug mode in your IDE. Because the initial set of command-line parameters I was passing in from the Run / Debug configuration didn’t contain a “-c” parameter to set a config file, JHOVE was automagically picking up its configuration from a default location, i.e. the JHOVE/config folder in my user directory (which existed, with a config file, because I’d also installed JHOVE on my machine the easy way beforehand…).

Debugger points towards the config file mix-up
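The fallback behaviour described above can be sketched in a few lines. This is my own illustration and not JHOVE’s actual code: the class and method names are hypothetical, and the default file name jhove.conf is an assumption.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of "use -c if given, otherwise a default in the user directory".
// NOT JHOVE's code; the default folder and file names are assumptions.
class ConfigFallbackSketch {

    static Path resolveConfig(String[] args, Path userHome) {
        // Look for "-c <file>" among the arguments.
        for (int i = 0; i < args.length - 1; i++) {
            if ("-c".equals(args[i])) {
                return Paths.get(args[i + 1]);
            }
        }
        // No -c supplied: silently fall back to the default location.
        return userHome.resolve("JHOVE").resolve("config").resolve("jhove.conf");
    }

    public static void main(String[] args) {
        Path home = Paths.get(System.getProperty("user.home"));
        System.out.println("Config in use: " + resolveConfig(args, home));
    }
}
```

The silent fallback is the part that catches you out: nothing in the program’s output says which config file was chosen, whereas a watch on the relevant variable in the debugger shows it straight away.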

A quick look at this config showed that JHOVE was expecting all sorts of modules to be available to load, one of which was the ‘external’ module for PNG characterisation mentioned in the error message. This is included in the JHOVE codebase, but in a separate folder (jhove-ext-modules): the build script that pulls JHOVE together for production deployment clearly copes with copying the PNG module from this location to the correct place, but the IDE couldn’t find it when debugging.
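To show why a class that isn’t on the classpath produces this sort of error, here is a small sketch of loading a module by name, the way a config-driven app might. It is an assumption about the general technique (Java reflection), not a copy of JHOVE’s loader, and ModuleLoadingSketch is a hypothetical name.

```java
// Sketch of loading a module by class name, as a config-driven app might.
// Not JHOVE's code: it only shows why a class missing from the classpath fails.
class ModuleLoadingSketch {

    static String tryInstantiate(String className) {
        try {
            Object module = Class.forName(className).getDeclaredConstructor().newInstance();
            return "Loaded " + module.getClass().getName();
        } catch (ReflectiveOperationException e) {
            // A ClassNotFoundException lands here when the class is not on the classpath.
            return "Cannot instantiate module: " + className;
        }
    }

    public static void main(String[] args) {
        // Part of the JDK, so always available.
        System.out.println(tryInstantiate("java.util.ArrayList"));
        // Only available if the external modules have been added to the classpath.
        System.out.println(tryInstantiate("com.mcgath.jhove.module.PngModule"));
    }
}
```

The class name in the config is just a string until something tries to load it, so the compiler and the IDE have no reason to complain until the program is actually running.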

So the solution? Put a custom config file in place, and remove the parts that referenced the PNG module. This worked a treat, and allowed me to track the code execution all the way through for a test TIFF file.

Adding an extra -c config file parameter and a custom config file.
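The change to the config amounts to deleting one module entry. As a rough sketch, assuming the config is XML with one module element per module and the class name held in a nested class element (check your own config file before relying on that), the same edit could be scripted like this. ConfigTrimSketch and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch: drop one module entry from a config document (assumed XML layout).
class ConfigTrimSketch {

    // Removes every <module> whose <class> text equals className; returns the number removed.
    static int removeModule(Document config, String className) {
        NodeList modules = config.getElementsByTagName("module");
        int removed = 0;
        // Walk backwards because the NodeList is live and shrinks as nodes are removed.
        for (int i = modules.getLength() - 1; i >= 0; i--) {
            Element module = (Element) modules.item(i);
            NodeList classes = module.getElementsByTagName("class");
            if (classes.getLength() > 0
                    && className.equals(classes.item(0).getTextContent().trim())) {
                module.getParentNode().removeChild(module);
                removed++;
            }
        }
        return removed;
    }

    // Parses an XML string, wrapping checked exceptions for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse config", e);
        }
    }

    public static void main(String[] args) {
        // Cut-down example config, invented for illustration.
        String xml = "<jhoveConfig>"
                + "<module><class>edu.harvard.hul.ois.jhove.module.TiffModule</class></module>"
                + "<module><class>com.mcgath.jhove.module.PngModule</class></module>"
                + "</jhoveConfig>";
        Document config = parse(xml);
        int removed = removeModule(config, "com.mcgath.jhove.module.PngModule");
        System.out.println("Removed " + removed + " module(s); "
                + config.getElementsByTagName("module").getLength() + " left");
    }
}
```

For a one-off debugging session a text editor does the job just as well; a script only earns its keep if you need to regenerate the debug config every time the installed one changes.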

Conclusion

Really, all the above, while making it possible to get under the skin of JHOVE, is just the start. Another blog post may follow regarding what I actually found when I ran through its processes and started to get an idea of how it worked (though as a bit of a spoiler, it wasn’t exactly pretty)…

But, given that JHOVE is more or less ubiquitous in digital preservation (i.e. all the major vended solutions wrap it up in their ingest processes in one way or another), hopefully more people will be encouraged to dive into it and learn how it works in more detail. (I guess you could just ‘read the manual’ – but if you’re a developer, doing it this way is more insightful, and more fun, too).

Digital preservation is a mature concept, but we need to pitch it better

Cambridge Technical Fellow, Dave, presents his thoughts on the OAIS and his own elevator pitch about digital preservation from the Pericles/DPC Acting on Change conference in London, last week.


Some of the best discussions at the Pericles / DPC Acting on Change conference came during the morning panel sessions. In the first, provocatively titled “Beyond the OAIS”, Barbara Sierman, from The KB National Library of the Netherlands, admitted that the OAIS can be confusing for newcomers… and as a newcomer to digital preservation, I agree!

Fellow panellist Barbara Reed, from Recordkeeping Innovation, suggested the OAIS’s Administration function as a potentially-confusing area, and this too struck a chord. I’ve gained some systems analysis and modelling experience over the years, and my first thought looking at the OAIS was that the Admin function looked like a place where much of the hard-to-model, human stuff had been separated from the technical, tool-based parts. (I’ve seen this happen before in other domains…)

There’s actually a hint that this is happening in the standard’s diagram for the Admin function – it’s busier and more information-packed than the other function diagrams, which tends to be a sign that it’s a bit of a ‘bucket’ which needs more modelling. This led me to an immediate concern that Admin doesn’t sit easily within the overall standard, and I think Barbara Reed had picked up on this too, suggesting that two more focused documents – one ‘technical’, one ‘human’ – might make the standard easier to use.

Then Artefactual Systems’ Dan Gillean asked who we should be talking to about the OAIS outside of the community? Barbara Reed answered ‘Enterprise Architects’; and two of the things Enterprise Architects use in their work are domain models and pattern languages. I was glad Barbara made this point, because I had already come to a similar conclusion.

AV Preserve’s Kara Van Malssen replied ‘communications experts’ to Dan’s question, suggesting Marketing in particular, though perhaps skilled science communicators might be even better? (Both Cambridge and Oxford – among others – put a lot of effort into public engagement with research, and there is a healthy body of research literature about it).

And the importance of communication was further emphasised by Nancy McGovern (MIT Libraries) and Neil Beagrie (Charles Beagrie Ltd) during the second day’s panel session (Preparing for Change). Nancy used the phrase ‘Technical Author’ at one stage – and it occurred to me that such input might be a very quick win for the OAIS Reference Implementation? Meanwhile, Neil talked about needing a short, pithy statement that explains what we do to funders…

So here’s an attempt at an Elevator Pitch:

Digital Preservation means sourcing computer-based material that is worthy of preservation, getting that material under control, and then maintaining the usefulness of that material, forever.

This Elevator Pitch is part of the pattern language I’m working on with my fellow Polonsky Fellows, and (I hope, soon) the broader Digital Preservation community. (We’re still thinking about that last ‘forever’, but considering how old some of the things in our libraries are, ‘forever’ seems an easy way of thinking about it).

The key point that Nancy McGovern made, however, was that we’re ready to take Digital Preservation to a wider audience. I think she’s right. The OAIS is confusing – it’s a real head-scrambler for a newcomer like me – but it has reached a level of maturity: it’s clear how much deep thought and expertise underpins it. And, of course, the same goes for the technology it has influenced over the previous decades. This supports what Arkivum’s Matthew Addis said in the second day’s keynote – the digital preservation community is ready to take their ideas to the world: we perhaps just need to pitch them a little better?

A digital preservation pattern language

Technical Fellow, Dave, shares his final update from PASIG NYC in October. It includes his opinions on digital preservation terminology and his development of an interpretation model for mapping processes.


Another of the sessions at the PASIG NYC conference we attended concerned standardisation. It started with Avoiding the 927 Problem: Standards, Digital Preservation, and Communities of Practice by Artefactual Systems’ Dan Gillean, which explained the relationships between De Jure / De Facto, and Open / Proprietary standards, and which introduced the major Digital Preservation standards. Then later in the session, Sibyl Schaefer (@archivelle) from the UCSD Chronopolis Network presented Here we go again down this road: Certification and Recertification, which covered the ISO standardisation terminology (e.g. Certification vs Accreditation) and went deeper into the formal (De Jure) standards, in particular the Open Archival Information System (OAIS) reference model (ISO 14721) and the Audit and Certification of Trustworthy Digital Repositories (ISO 16363).

One aspect of Dan Gillean’s presentation that resonated with me was his discussion of the Communities of Practice that had emerged around the Digital Preservation standards. This reminded me of a software development concept called design patterns, which has its roots in (real) architecture, and in particular a book called A Pattern Language: towns, buildings, construction, by Christopher Alexander (et al). This proposes that planners and architects develop a ‘language’ of architecture so that they can learn from each other and contribute their ideas to a more harmonious, better-planned whole of well-designed cities, towns and countryside. The key concept they propose is that of the ‘pattern’:

The elements of this [architectural] language are entities called patterns. Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice (Alexander et al, 1977:x).

Each pattern has a common structure, including details of the problem it solves, the forces at work, the start and end states of related resources, and relationships to other patterns. (James Coplien has provided a short overview of a typical pattern structure). The idea is to build up a playbook of (de facto) standard approaches to common problems, and the types of behaviour that might solve them, as a way of sharing and reusing knowledge.
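For the programmers in the audience, that common structure maps quite naturally onto a simple data type. The sketch below is only my illustration of the idea: the field names follow the description above, and the example entry is invented rather than taken from any published pattern.

```java
import java.util.List;

// Sketch of one entry in a pattern language; fields mirror the structure described above.
final class PatternSketch {
    final String name;
    final String problem;
    final List<String> forces;
    final String startState;
    final String endState;
    final List<String> relatedPatterns;

    PatternSketch(String name, String problem, List<String> forces,
                  String startState, String endState, List<String> relatedPatterns) {
        this.name = name;
        this.problem = problem;
        this.forces = List.copyOf(forces);
        this.startState = startState;
        this.endState = endState;
        this.relatedPatterns = List.copyOf(relatedPatterns);
    }

    // One-line summary, handy for listing the contents of a playbook.
    String summary() {
        return name + ": " + startState + " -> " + endState
                + " (see also: " + String.join(", ", relatedPatterns) + ")";
    }

    public static void main(String[] args) {
        // Hypothetical example entry, invented purely for illustration.
        PatternSketch p = new PatternSketch(
                "Get material under control",
                "Incoming files arrive with no inventory",
                List.of("limited staff time", "large volumes"),
                "unlisted files on a drive",
                "listed files in managed storage",
                List.of("Source material", "Maintain usefulness"));
        System.out.println(p.summary());
    }
}
```

The relatedPatterns field is the part that turns a list of patterns into a language, because it is what lets a reader move from one pattern to its neighbours.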

I asked around at PASIG to see if anyone had created a reusable set of Digital Preservation Patterns (somebody please tell me if so, it’ll save me heaps of work!), but I drew a blank. So I grabbed the Alexander book (I work in a building containing 18 million books!), and also had a quick look online. The best online resource I found was http://www.hillside.net/ – which contained lots of familiar names related to programming design patterns (e.g. Erich Gamma, Grady Booch, Martin Fowler, Ward Cunningham). But the original Alexander book also gave me an insight into patterns that I’d never heard of before, in particular the very straightforward way that its patterns related to each other from the general / high level (e.g. patterns about regional, city and town planning), via mid-level patterns (for neighbourhoods, streets and building design), to the extremely detailed (e.g. patterns for where to put beds, baths and kitchen equipment).

This helped me consider what I think are two issues with Digital Preservation: firstly, there’s a lot of jargon (e.g. ‘fixity’, ‘technical metadata’ or ‘file format migration’ – none of which are terms fit for normal conversation). Secondly, many of the Digital Preservation models mix concepts at different levels of abstraction and complexity: for example the OAIS places a discrete process labelled Data Management alongside another labelled Ingest, where Ingest is quite a specific, discrete step in the overall picture, but where there’s also a strong case for saying that the whole of Digital Preservation is ‘data management’, including Ingest itself.

Such issues of defining and labelling concepts are common in most computer-technology-related domains, of course, and they’re often harmful (contributing to the common story of failed IT projects and angry developers / customers etc). But the way in which A Pattern Language arranges its patterns at the same levels of abstraction and detail, and in doing so enables drilling-down through region / city / town / neighbourhood / street / building / room, provides an elegant example of how to avoid this trap.

Hence I’ve been working on a model of the Digital Preservation domain that has ‘elevator pitch’ and ‘plain English’ levels of detail before I get to the nitty-gritty of technical details. My intention is to group similarly-sized and equally-complex sets of Digital Preservation processes together in ways that help describe them in clear, jargon-free ways, hence forming a reusable set of patterns that help people work out how to implement Digital Preservation in their own organisational contexts. I will have an opportunity to share this model, and the patterns I derive from it, as it develops. Watch this space.

Alexander, C., Ishikawa, S., Silverstein, M., Jacobson, M., Fiksdahl-King, I. and Angel, S. (1977) A Pattern Language: towns, buildings, construction. 1st edn. New York: Oxford University Press.


Do you know of any work that’s been done to create a Digital Preservation Pattern Language? Would you like to contribute your ideas towards Dave’s idea of creating a playbook of Digital Preservation design patterns? Please let Dave know using the form below…