Customizable JHOVE TIFF output handler anyone?

Technical Fellow, James, talks about the challenges of putting JHOVE’s full XML output into a reporting tool and how he found a workaround. We would love feedback about how you use JHOVE’s TIFF output. What workarounds have you tried to extract the data for use in reporting tools, and what do you think about having a customizable TIFF output handler for JHOVE?


As mentioned in my last blog post, I’ve been looking to validate a reasonably large collection of TIFF master image files from a digitization project. As a side note to that work, I would like to talk about the output from JHOVE’s TIFF module.

The JHOVE TIFF module allows you to specify an output handler as either Text, an XML audit, or a full XML output format.

Text provides a straightforward line-by-line breakdown of the various characteristics and properties of each TIFF processed. But because it is not a structured document, processing the output is far from ideal when many files are characterized.

The XML audit output provides a very minimal XML document which simply reports whether or not the TIFF files were valid and well-formed; this is great for a quick check, but lacks some of the other metadata properties that I was looking for.

The full XML output provides the same information as the text output format, but with the advantage of being a structured document. However, I’ve found some of the additional metadata structuring in the full XML rather cumbersome to process with further reporting tools.

As a result, I’ve been struggling a bit to extract all of the properties I would like from the full XML output into a reporting tool. I then started to wonder about having a more customizable output handler which would simply report the properties I required in a neat and easier-to-parse XML format.

I had looked at using an XSLT transformation on the XML output but, as mentioned, I found it rather complicated to extract some of the metadata property values I wanted, because they are so deeply nested and because of the way the properties are named. Perhaps I need to brush up on my XSLT skills?

In the short term, I’ve converted the XML output to a CSV file using a little freeware program called XML2CSV from A7Soft. Using the tool, I selected the various fields required for my reporting (filename, last modified date, size, compression scheme, status, TIFF version, image width and height, etc.). The conversion program then extracted the selected values, which gave me a far simpler and smaller document to process in the reporting tool.
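For anyone who would rather script this kind of conversion than use a GUI tool, here is a rough Python sketch of the same idea: pick a handful of fields per file and write them out as one CSV row each. It is only an illustration, not how XML2CSV works. The embedded sample is a simplified, hand-written stand-in for JHOVE's full XML output, so the element names and nesting in it are assumptions and should be checked against your own output. The names FIELDS, first_text, jhove_to_rows and rows_to_csv are made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for JHOVE's full XML output.
# Real output is more deeply nested and carries many more properties;
# the element names here are assumptions to verify against your own files.
SAMPLE = """<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove">
  <repInfo uri="master_0001.tif">
    <lastModified>2017-05-01T10:00:00+01:00</lastModified>
    <size>1024</size>
    <version>6.0</version>
    <status>Well-Formed and valid</status>
    <properties>
      <property>
        <name>TIFFMetadata</name>
        <values arity="List" type="Property">
          <property>
            <name>NisoImageMetadata</name>
            <values arity="Scalar" type="NISOImageMetadata">
              <value>
                <mix:mix xmlns:mix="http://www.loc.gov/mix/v20">
                  <mix:imageWidth>600</mix:imageWidth>
                  <mix:imageHeight>400</mix:imageHeight>
                  <mix:compressionScheme>Uncompressed</mix:compressionScheme>
                </mix:mix>
              </value>
            </values>
          </property>
        </values>
      </property>
    </properties>
  </repInfo>
</jhove>"""

# Hypothetical choice of fields, matched on element name only.
FIELDS = ["lastModified", "size", "version", "status",
          "compressionScheme", "imageWidth", "imageHeight"]


def local(tag):
    """Return an element's tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def first_text(parent, name):
    """Text of the first element under parent with this local name, at any depth."""
    for el in parent.iter():
        if local(el.tag) == name and el.text and el.text.strip():
            return el.text.strip()
    return ""


def jhove_to_rows(xml_text):
    """Build one dict per repInfo element, holding only the selected fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for rep in root.iter():
        if local(rep.tag) != "repInfo":
            continue
        row = {"filename": rep.get("uri", "")}
        for field in FIELDS:
            row[field] = first_text(rep, field)
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["filename"] + FIELDS,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = jhove_to_rows(SAMPLE)
print(rows_to_csv(rows))
```

Matching on the element name alone, wherever it sits in the tree, sidesteps the nesting problem but is a blunt instrument: if the same name turns up more than once for a file, only the first occurrence is kept, so it is worth spot-checking a few rows against the original XML.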

I would be interested to know what others have done when confronted with the XML output and wonder if there is any mileage in a more customizable output handler for the TIFF module…

 

Update 31st May 2017

Thanks to Ross Spencer, Martin Hoppenheit and others from Twitter, I’ve now created a basic JHOVE XML to CSV XSLT stylesheet. A draft version is on my GitHub, should anyone want to do something similar.