Audiovisual creation and preservation: part 2

Paul Heslin, Digital Collection Infrastructure Support Officer/Film Preservation Officer at the National Film and Sound Archive of Australia (NFSA) has generously contributed the following blog post. Introduction by Cambridge Policy and Planning Fellow, Somaya.

Introduction

As Digital Preservation is such a wide-ranging field, people working in this field can’t be an absolute expert on absolutely everything. It’s important to have areas of expertise and to connect and collaborate with others who can share their knowledge and experience.

While I have a background in audio, broadcast radio, multimedia and some video editing, moving image preservation is not my area of speciality. It is for this reason I invited Paul Heslin to compose a follow-up to my Audiovisual creation and preservation blog post. Paul Heslin is a Digital Archivist at the NFSA, currently preoccupied with migrating the digital collection to a new generation of LTO tapes.

I am incredibly indebted to Paul and the input from his colleagues and managers (some of whom are also my former colleagues, from when I worked at the NFSA).


Background to moving image preservation

A core concern for all archives is the ongoing accessibility of their collections. In this regard film archives have traditionally been spoilt: a film print does not require any intermediate machinery for assessment, and conceptually a projector is not a complicated device (at least in regards to presenting the visual qualities of the film). Film material can be expected to last hundreds of years if kept in appropriate vault conditions; other moving image formats are not so lucky. Many flavours of videotape are predicted to be extinct within a decade, due to loss of machinery or expertise, and born-digital moving image items can arrive at the archive in any possible format. This situation necessitates digitisation and migration to formats which can be trusted to continue to be suitable. But not only suitable!

Optimistically, the digital preservation of these formats carries the promise of these items maintaining their integrity perpetually. Unlike analogue preservation, there is no assumption of degradation over time, however there are other challenges to consider. The equipment requirements for playing back a digital audiovisual file can be complicated, especially as the vast majority of such files are compressed using encoding/decoding systems called codecs. There can be very interesting results when these systems go wrong!

Example of Bad Compression (in Paris). Copyright Paul Heslin

Example of Bad Compression (in Paris). Copyright Paul Heslin

Codecs

Codecs can be used in an archival context for much the same reason as the commercial world. Data storage is expensive and money saved can certainly be spent elsewhere. However, a key difference is that archives require truly lossless compression. So, it is important here to distinguish between lossless codecs which are mathematically lossless and those which are visually lossless. The later claims to encode in a way which is visually indistinguishable from an original source file, but it still dispenses with ‘superfluous’ data. This is not appropriate for archival usage, as this data loss cannot be recovered, and accumulated migration will ultimately result in visual and aural imperfections.

Another issue for archivists is that many codecs are proprietary or commercially owned: Apple’s ProRes format is a good example. While it is ubiquitously used within the production industry, it is an especially troubling example given signs that Apple will not be providing support into the future, especially for non-Mac platforms. This is not a huge issue for production companies who will have moved on to new projects and codecs, but for archives collecting these materials this presents a real problem. For this reason there is interest in dependable open standards which exist outside the commercial sphere.

FFV1

One of the more interesting developments in this area has been the emergence of the FFV1 codec. FFV1 started life in the early 2000s as a lossless codec associated with the FFMPEG free software project and has since gained some traction as a potential audiovisual preservation codec for the future. The advantages of the codec are:

  • It is non-proprietary, unlike the many other popular codecs currently in use.
  • It makes use of truly lossless compression, so archives can store more material in less space without compromising quality.
  • FFV1 files are ALWAYS losslessly compressed, which avoids accidents that can result from using formats which can either encode losslessly or lossily (like the popular JPEG-2000 archival format).
  • It internally holds checksums for each frame, allowing archivists to check that everything is as it should be. Frame checksums are especially useful in identifying where error has specifically occurred.
  • Benchmark tests indicate that conversion speeds are quicker than JPEG-2000. This makes a difference for archives dealing with large collections and limited computing resources.

The final, and possibly most exciting, attribute of FFV1 is that it is developing out of the needs of the archival community, rather than relying on specifications designed for industry use. Updates from the original developer, Michael Niedermayer, have introduced beneficial features for archival use and so far the codec has been implemented in different capacities by the The National Archives in the UK, the Austrian National Archives, and the Irish Film Institute, as well as being featured in the FIAF Journal Of Film Preservation.

Validating half a million TIFF files. Part Two.

Back in May, I wrote a blog post about preparing the groundwork for the process of validating over 500,000 TIFF files which were created as part of a Polonsky Digitization Project which started in 2013. You can read Part One here on the blog.

Restoring the TIFF files from tape

Stack of backup tapes. Photo: Amazon

For the digitization workflow we used Goobi and within that process, the master TIFF files from the project were written to tape. In order to actually check these files, it was obvious we would need to restore all the content to spinning disk. I duly made a request to our system administration team and waited.

As I mentioned in Part One, we had setup a new virtualised server which had access to a chunk of network storage. The Polonsky TIFF files were restored to this network storage, however midway through the restoration from tape, the tape server’s operating system crashed…disaster.

After reviewing the failure, it appeared there was a bug within the RedHat operating system which had caused the problem. This issue proved to be a good lesson, a tape backup copy is only useful if you can actually restore it!

Question for you. When was the last time you tried to restore a large quantity of data from tape?

After some head scratching, patching and a review of the related systems, a second attempt at restoring all the TIFF content from tape commenced and this time all went well and the files were restored to the network storage. Hurrah!

JHOVE to validate those TIFFs

I decided that for the initial validation of the TIFF files, checking the files were well-formed and valid, JHOVE would provide a good baseline report.

As I mentioned in another blog post Customizable JHOVE TIFF output handler anyone? JHOVE’s XML output is rather unwieldy and so I planned to transform the XML using xsltproc (a command line xslt processor) with a custom XSLT stylesheet, allowing us to select any of attributes from the file which we might want to report on later, this would then produce a simple CSV output.

On a side note, work on adding a CSV output handler to JHOVE is in progress! This would mean the above process would be much simpler and quicker.

Parallel processing for the win.

What’s better than one JHOVE process validating TIFF content? Two! (well actually for us, sixteen at once works out quite nicely.)

It was clear from some initial testing with a 10,000 sample set of TIFF files that a single JHOVE process was going to take a long time to process 520,000+ images (around two and half days!)

So I started to look for a simple way to run many JHOVE processes in parallel. Using GNU Parallel seemed like a good way to go.

I created a command line BASH script which would take a list of directories to scan and then utilise GNU Parallel to fire off many JHOVE + XSLT processes to result in a CSV output, one line per TIFF file processed.

As our validation server was virtualised, it meant that I could scale the memory and CPU cores in this machine to do some performance testing. Below is a chart showing the number of images that the parallel processing system could handle per minute vs. the number of CPU cores enabled on the virtual server. (For all of the testing the memory in the server remained at 4 GB.)

So with 16 CPU cores, the estimate was that it would take around 6-7 hours to process all the Polonksy TIFF content, so a nice improvement on a single process.

At the start of this week, I ran a full production test, validating all 520,000+ TIFF files. 4 and half hours later the process was complete and 100 MB+ CSV file was generated with 520,000+ rows of data. Success!

For Part Three of this story I will write up how I plan to visualise the CSV data in Qlik Sense and the further analysis of those few files which failed the initial validation.

Visit to the Parliamentary Archives: Training and business cases

Edith Halvarsson, Policy and Planning Fellow at Bodleian Libraries, writes about the DPOC project’s recent visit to the Parliamentary Archives.


This week the DPOC fellows visited the Parliamentary Archives in London. Thank you very much to Catherine Hardman (Head of Preservation and Access), Chris Fryer (Digital Archivist) and Grace Bell (Digital Preservation Trainee) for having us. Shamefully I have to admit that we have been very slow to make this trip; Chris first invited us to visit all the way back in September last year! However, our tardiness to make our way to Westminster was in the end aptly timed with the completion of year one of the DPOC project and planning for year 2.

Like CUL and Bodleian Libraries, the Parliamentary Archives also first began their own Digital Preservation Project back in 2010. Their project has since transitioned into digital preservation in a more programmatic capacity as of 2015. As CUL and Bodleian Libraries will be beginning to draft business cases for moving from project to programme in year 2; meeting with Chris and Catherine was a good opportunity to talk about how you start making that tricky transition.

Of course, every institution has its own drivers and risks which influence business cases for digital preservation, but there are certain things which will sound familiar to a lot of organisations. For example, what Parliamentary Archives have found over the past seven years, is that advocacy for digital collections and training staff in digital preservation skills is an ongoing activity. Implementing solutions is one thing, whereas maintaining them is another. This, in addition to staff who have received digital preservation training eventually moving on to new institutions, means that you constantly need to stay on top of advocacy and training. Making “the business case” is therefore not a one-off task.

Another central challenge in terms of building business cases, is how you frame digital preservation as a service rather than as “an added burden”. The idea of “seamless preservation” with no human intervention is a very appealing one to already burdened staff, but in reality workflows need to be supervised and maintained. To sell digital preservation, that extra work must therefore be perceived as something which adds value to collection material and the organisation. It is clear that physical preservation adds value to collections, but the argument for digital preservation can be a harder sell.

Catherine had, however, some encouraging comments on how we can attempt to turn advice about digital preservation into something which is perceived as value adding.  Being involved with and talking to staff early on in the design of new project proposals – rather than as an extra add on after processes are already in place – is an example of this.

Image by James Mooney

All in all, it has been a valuable and encouraging visit to the Parliamentary Archives. The DPOC fellows look forward to keeping in touch – particularly to hear more about the great work Parliamentary Archive have been doing to provide digital preservation training to staff!