How I got JHOVE running in a debugger

Cambridge’s Technical Fellow, Dave, steps through how he got JHOVE running in a debugger, including the various troubleshooting steps. As for what he found when he got under the skin of JHOVE—stay tuned.


Over the years of developing apps, I have come to rely on the tools of the trade; so rather than reading programming documentation, I prefer getting code running under a debugger and stepping through it, letting it show me what an app does. In my defence, Object Oriented code tends to get quite complicated, with various methods of one class calling unexpected methods of another… You can use Design Patterns and write Clean Code to avoid this, but it’s also very useful to let the debugger show you the path through the code.

This was the approach I took when looking more closely at JHOVE. I wanted to look under the hood of the application to help James with validating a major collection of TIFFs for a digitisation project by Bodleian Libraries and the Vatican Library.

Step 1: Getting the JHOVE code into an IDE

Jargon alert: ‘IDE’ stands for ‘Integrated Development Environment’ – a piece of software for writing, managing, sharing, testing and (in this instance) debugging code.

So I had to pick the correct IDE to use… I already knew that JHOVE was a Java app: the fact it’s compiled as a Java Archive (JAR) was the giveaway, though if I’d needed confirmation, checking the coloured bar on the homepage of its GitHub repository would have told me, too.

Coding language analysis in a GitHub project

My Java IDE of choice is JetBrains’s IntelliJ IDEA, so the easiest way to get the code was to start a new project by Checking Out from Version Control, selecting the GitHub option and adding the URL for the JHOVE project (https://github.com/openpreserve/JHOVE). This copied (or ‘cloned’) all the code to my local machine.
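
If you prefer the command line, the equivalent is a plain Git clone of the same URL – the IDE route simply wraps this step up for you:

git clone https://github.com/openpreserve/JHOVE.git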

Loading a project into IntelliJ IDEA directly from GitHub

GitHub makes it quite easy to manage code branches, i.e.: different versions of the codebase that can be developed in parallel with each other – so you can, say, fix a bug and re-release the app quickly in one branch, while taking longer to add a new feature in another.

The Open Preservation Foundation (who now manage JHOVE’s codebase) have more or less followed a convention of ‘branching on release’ – so you can easily debug the specific version you’re running in production by switching to the relevant branch… (though version 1.7 seems to be missing a branch?) Switching branches is usually easy from within your IDE – doing so simply pulls the code for that branch into your local Git repository in the background and loads it into the IDE.
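
For reference, the command-line equivalent looks something like the below – the branch name is only a placeholder, so check the repository’s branch list for the release you actually run in production:

git branch -r                # list the remote (release) branches
git checkout <branch-name>   # switch to the branch matching your version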

Finding the correct code branch in GitHub. Where’s 1.7 gone?

Step 2: Finding the right starting point for the debugger

Like a lot of apps that have been around for a while, JHOVE’s codebase is quite large, and it’s therefore not immediately obvious where the ‘starting point’ is. At least, it isn’t obvious if you don’t READ the README file in the codebase’s root. Once you finally get around to doing that, there’s a clue buried quite near the bottom in the Project Structure section:

JHOVE-apps: The JHOVE-apps module contains the command-line and GUI application code and builds a fat JAR containing the entire Java application.

… so the app starts from within the jhove-apps folder somewhere. A little extra sniffing about and I found a class file in the src/main/java folder called Jhove.java, which contained the magic Java method:

public static void main (String [] args) {}

…which is the standard starting point for any Java app (and for several other languages too).

However, getting the debugger running successfully wasn’t just a case of finding the right entry point and clicking ‘run’ – I also had to set up the debugger configuration to pass the correct command-line arguments to the application, or it fell at the first hurdle. In IntelliJ IDEA, this is done by editing the Run / Debug configuration. I set this up initially by right-clicking on the Jhove.java file and selecting Run JHOVE.main().

Running the Jhove class to start the application

The run failed (because I hadn’t added the command-line arguments), but at least IntelliJ was clever enough to set up a new Run / Debug configuration (called Jhove, after the class I’d run) to which I could then add the Program Arguments – in this case, the same command-line arguments you’d run JHOVE with normally (e.g. the module you want to run, the handler you want to use for output, the file you want to characterise, and so on).
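
To give a rough idea (this is just an illustration – swap in the module, handler and file path for your own test), the Program Arguments field might contain something like:

-m TIFF-hul -h XML -o output.xml path/to/test.tif

…which asks JHOVE to run the TIFF module, format the result with the XML handler, write it to output.xml and characterise path/to/test.tif.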

Editing the Run configuration in IntelliJ

I could then add a breakpoint to the code in the Jhove.main() method and off I went… Or did I?

Step 3: Setting up a config file

So this gave me what I needed to start stepping through the code. Unfortunately, my first attempt didn’t get much further than the initial Jhove.main() method… It ran all the way through that method, but then the following error occurred:

Cannot instantiate module: com.mcgath.jhove.module.PngModule

The clue to fixing this was actually provided by the debugger as it ran, and it’s a good example of the kind of insight you get from running code in debug mode in your IDE. Because the initial set of command-line parameters I was passing in from the Run / Debug configuration didn’t contain a “-c” parameter to set a config file, JHOVE was automagically picking up its configuration from a default location: the JHOVE/config folder in my user directory – which existed, complete with a config file, because I’d also installed JHOVE on my machine the easy way beforehand…

Debugger points towards the config file mix-up

A quick look at this config showed that JHOVE was expecting all sorts of modules to be available to load, one of which was the ‘external’ module for PNG characterisation mentioned in the error message. This is included in the JHOVE codebase, but in a separate folder (jhove-ext-modules): the build script that pulls JHOVE together for production deployment clearly copes with copying the PNG module from this location to the correct place, but the IDE couldn’t find it when debugging.

So the solution? Put a custom config file in place, and remove the parts that referenced the PNG module. This worked a treat, and allowed me to track the code execution all the way through for a test TIFF file.
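
To give a flavour of what that means in practice (this is a simplified fragment rather than a full jhove.conf – the installer-generated file also contains other settings, such as the JHOVE home and temp directories), the module declarations in my custom config ended up looking roughly like this, with the PNG entry taken out:

<module>
  <class>edu.harvard.hul.ois.jhove.module.TiffModule</class>
</module>
<!-- removed for debugging in the IDE:
<module>
  <class>com.mcgath.jhove.module.PngModule</class>
</module>
-->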

Adding an extra -c config file parameter and a custom config file.

Conclusion

Really, all the above, while making it possible to get under the skin of JHOVE, is just the start. Another blog post may follow regarding what I actually found when I ran through its processes and started to get an idea of how it worked (though, as a bit of a spoiler, it wasn’t exactly pretty)…

But, given that JHOVE is more or less ubiquitous in digital preservation (i.e. all the major vended solutions wrap it up in their ingest processes in one way or another), hopefully more people will be encouraged to dive into it and learn how it works in more detail. (I guess you could just ‘read the manual’ – but if you’re a developer, doing it this way is more insightful, and more fun, too).

Electronic lab notebooks and digital preservation: part I

Outreach and Training Fellow, Sarah, writes about a trial of electronic lab notebooks (ELN) at Oxford. She discusses the requirements and purpose of the ELN trial and raises lingering questions around preserving the data from ELNs. This is part I of what will be a 2-part series.


At the end of June, James and I attended a training course on electronic lab notebooks (ELN). IT Services at the University of Oxford is currently running a trial of LabArchives’ ELN offering. This course was intended to introduce departments and researchers to the trial and to encourage them to start their own ELN.

Screenshot of a LabArchives electronic lab notebook

When selecting an ELN for Oxford, IT Services considered a number of requirements. Those that were most interesting from a preservation perspective included:

  • the ability to download the data to store in an institutional repository, like ORA-data
  • the ability to upload and download data in arbitrary formats and to have it bit-preserved
  • the ability to upload and download images without any unrequested lossy compression

Moving from paper-based lab notebooks to an ELN is intended to help a lot with compliance as well as collaboration. For example, the government requires every scientist to keep a record of every chemical used for their lifetime. This has a huge impact on the Chemistry Department; the best way to search for a specific chemical is to be able to do so electronically. There are also costs associated with storing paper lab notebooks, as well as the risk of damage to the notebooks in the lab. An electronic lab notebook can solve some of those issues: storage will likely cost less, and the risk of damage in a lab setting is minimised.

But how do we preserve that electronic record for every scientist for at least the duration of their life? And what about beyond that?

One of the researchers presenting on their experience using LabArchives’ ELN stated, “it’s there forever.” Even today, there’s still an assumption that data online will remain online forever – and, more broadly, that data will simply last. In reality, without proper management this will almost certainly not be the case. IT Services will be exporting the ELNs for back-up purposes, but the management and retention periods for those exports were not detailed.

There’s also a file upload limit of 250MB per individual file, meaning that large datasets will need to be stored somewhere else. There’s no limit to the overall size of the ELN at this point, which is useful, but individual file limits may prove problematic for many researchers over time (this has already been an issue for me when uploading zip files to SharePoint).

After learning how researchers (from PIs to PhD students) are using ELNs for lab work and having a few demos on the many features of LabArchives’ ELN, we were left with a few questions. We’ve decided to create our own ELN (available to us for free during the trial period) in order to investigate these questions further.

The questions around preserving ELNs are:

  1. Authenticity of research – are timestamps and IP addresses retained when the ELN is exported from LabArchives?
  2. Version/revision history – Can users export all previous versions of data? If not users, then can IT Services? Can the information on revision history be exported, even if not the data?
  3. Commenting on the ELN – are comments on the ELN exported? Are they retained if deleted in revision history?
  4. Export – What exactly can be exported by a user? What does it look like? What functionality do you have with the data? What is lost?

There’s potential for ELNs to open up collaboration and curation in lab work by allowing notes and raw data to be kept together, and by facilitating sharing and fast searching. However, the long-term preservation implications are still unclear and many still seem complacent about the associated risks.

We’re starting our LabArchives’ ELN now, with the hope of answering some of those questions. We also hope to make some recommendations for preservation and highlight any concerns we find.


Anyone have experience preserving ELNs? What challenges and issues did you come across? What recommendations would you have for researchers or repository staff to facilitate preservation?

Digital Preservation at Oxford Open Days

Oxford Fellow, Sarah, describes the DPOC team’s pop-up exhibition “Saving Digital,” held at the Radcliffe Science Library during Oxford Open Days #OxOpenDay. The post describes the equipment and games the team showcased over the two days and some of the goals they had in mind for this outreach work.


On 27 June and 28 June, Oxford ran Open Days for prospective students. The city was alive with open doors and plenty of activity. It was the perfect opportunity for us to take our roadshow kit out and meet prospective students with a pop-up exhibition called “Saving Digital”. The Radcliffe Science Library (RSL) on Parks Road kindly hosted the DPOC team and all of our obsolete media for two days in their lounge area.

The pop-up exhibition hosted at the RSL

We set up our table with a few goals in mind:

  • to educate prospective students about the rapid pace of technological change and the concern about how we’re going to read digital data off ageing media in the future (we educated a few parents as well!)
  • to speak with library and university staff about their digital dilemmas and what we in the digital preservation team could do about them
  • to raise awareness about the urgency and need of digital preservation in all of our lives and to inform more people about our project (#DP0C)

To achieve this, we first drew people in with two things: retro gaming and free stuff.

Last minute marketing to get people to the display. It worked!

Our two main games were the handheld game, Galaxy Invader 1000, and Frak! for the BBC Micro.

Frak! on the BBC Micro. The yellow handheld console to the right is Galaxy Invader 1000.

Galaxy Invader 1000 by CGL (1980) is a handheld game, which plays a version of Space Invaders. This game features a large multi-coloured display and 3 levels of skill. The whole game was designed to fit in 2 kilobytes of memory. 

Frak! (1984) was released for the BBC Micro under the Aardvark software label. It was praised for its excellent graphics and gameplay. In this side-scrolling game, you play a caveman named Trogg. The aim is to cross a series of platforms while avoiding dangers that include various monsters named Poglet and Hooter. Trogg is armed with a yo-yo for defence.

Second, we gave them some digestible facts, both in poster form and by talking with them:

Saving Digital poster

Third, we filled the rest of the table with obsolete media and handheld devices from roughly the last forty years – just a small sample of what was available! This let visitors hold some of the media of the past and marvel over how little it could hold, but how much it could do for its time. And then we asked them how they would read the data off it today. That probably concerned parents more than their kids, as several of them admitted to having important digital stuff still on VHS or miniDV tapes, or on 3.5-inch disks! It got everyone thinking, at least.

A lot of obsolete media all in one place.

Lastly, we had an enthusiastic team wearing branded t-shirts made to emulate our most popular 1st generation badge, which was pink with a 3.5-inch disk in the middle. We gave away our last badge during Open Days! But don’t worry, we have some great 2nd generation badges to collect now.

An enthusiastic team always helps. Especially if they are willing to demo the equipment.


A huge thank you to the RSL for hosting us for two days—we’ll be back on the 16th of July if you missed us and want to visit the exhibition! We’ll have a few extra retro games on hand and some more obsolete storage media!

Our poster was found on display in the RSL.

Update on the training programme pilot

Sarah, Oxford’s Outreach and Training Fellow, has been busy since the new year designing and running a digital preservation training programme pilot in Oxford. It consisted of one introductory course on digital preservation and six other workshops. Below is an update on what she did for the pilot and what she has learnt over the past few months.


It’s been a busy few months for me, so I have been quiet on the blog. Most of my time and creative energy has been spent working on this training programme pilot. In total, there were seven courses and over 15 hours of material. In the end, I trialled the courses on over 157 people from Bodleian Libraries and the various Oxford college libraries and archives. Many attendees were repeats, but some were not.

The trial gave me an opportunity to test out different ideas and various topics. Attendees were good at giving feedback, both during the course and after via an online survey. It’s provided me with further ideas and given me the chance to see what works or what doesn’t. I’ve been able to improve the experience each time, but there’s still more work to be done. However, I’ve already learned a lot about digital preservation and teaching.

Below are some of the most important lessons I’ve learned from the training programme pilot.

Time: You always need more

I found that I almost always ran out of time at the end of a course, which left no time for questions or to finish that last demo. Most of my courses would have benefited from either less content, shorter exercises, or just being 30 minutes longer.

Based on feedback from attendees, I’ll be making adjustments to every course. Some will be longer. Some will have shorter exercises with more optional components and some will have slightly less content.

While you might budget 20 minutes for an activity, you will likely use 5-10 minutes more. But it varies every time due to the attendees: some might have a lot of questions, while others will be quieter. It’s almost better to overestimate the time and end early than to rush to cover everything. People need a chance to process the information you give them.

Facilitation: You can’t go it alone

In only one of my courses did I have to facilitate alone. I was run off my feet for the 2 hours because it was just me answering questions during exercises for 15 attendees. It doesn’t sound like a lot, but I had a hoarse voice by the end from speaking for almost 2 hours!

Always get help with facilitation—especially for workshops. Someone to help:

  • answer questions during exercises,
  • get some of the group idea exercises/conversations started,
  • make extra photocopies or print outs, and
  • load programs and files onto computers—and then help delete them after.

It is possible to run training courses alone, but having an extra person makes things run smoother and saves a lot of time. Edith and James have been invaluable support!

Demos: Worth it, but things often go wrong

Demos were vital for illustrating concepts, but they were also sometimes clunky and time-consuming to manage. I wrote up demo sheets to help. The demos relied on software or the Internet – both of which can and will go wrong. Patience is key; so is accepting that sometimes things will not go right. Processes might take a long time to run, or the course might conclude before the demo is over.

The more you practice on the computer you will be using, the more likely things will go right. But that’s not always an option. If it isn’t, always have a back-up plan. Or just apologise, explain what should have happened, and move on. Attendees are generally forgiving, and sometimes it can be turned into a really good teaching moment.

Exercises: Optional is the way to go

Unless you put out a questionnaire beforehand, it is incredibly hard to judge the skill level of your attendees. It’s best to prepare for all levels. Start each exercise slowly and have a lot of optional work built in for people who work faster.

In most of my courses I was too ambitious for the time allowed. I wanted them to learn and try everything. Sometimes I wasn’t asking the right questions on the exercises either. Testing exercises and timing people is the only way to tailor them. Now that I have run the workshops and seen the exercises in action, I have a clearer picture of what I want people to learn and accomplish—now I just have to make the changes.

Future plans

There were courses I would love to run in the future (like data visualisation and digital forensics), but I did not have the time to develop them. I’d like to place them on a roadmap for future training, as well as reaching out more to the Oxford colleges, museums and other departments. I would also like to tailor the introductory course a bit more for different audiences.

I’d like to get involved with developing courses like Digital Preservation Carpentry that the University of Melbourne is working on. The hands-on workshops excited and challenged me the most. Not only did others learn a lot, but so did I. I would like to build on that.

At the end of this pilot, I have seven courses that I will finalise and make available under a Creative Commons licence. What I learned when trying to develop these courses is that there aren’t always good templates available on the Internet to use as a starting point – you have to ask around for people willing to share.

So, I am hoping to take the work that I’ve done and share it with the digital preservation community. I hope they will be useful resources that can be reused and repurposed. Or at the very least, I hope it can be used as a starting point for inspiration (basic speakers notes included).

These will be available via the DPOC website sometime this summer, once I have been able to make the changes necessary to the slides and exercises—along with course guidance material. It has been a rewarding experience (as well as an exhausting one); I look forward to developing and delivering more digital preservation training in the future.

Digital preservation with limited resources

What should my digital preservation strategy be, if I do not have access to repository software or a DAMS system?

At Oxford, we recently received this question from a group of information professionals working for smaller archives. This will be a familiar scenario for many – purchasing and running repository software will require a regular dedicated budget, which many archives in the UK do not currently have available to them.

So what intermediate solutions could an archive put in place to improve its chances of not losing digital collection content until such a time as a repository solution is within reach? This blog post summarises some key points from our meeting with the archivists, and we hope that these may be useful for other organisations who are asking the same question.


Protect yourself against human error

CC-BY KateMangoStar, Freepik

Human error is one of the major risks to digital content. It is not uncommon for users to inadvertently drag files or folders, or to delete content by mistake. It is therefore important to have strict user restrictions in place which limit who can delete, move, and edit digital collections. For this purpose you need to ensure that you have defined an “archives directory” which is separate from any “working directories” where users can still edit and actively work with content.

If you have IT support available to you, then speak to them about setting up new user restrictions.


Monitor fixity

CC-BY Dooder, Freepik

However, even with strong user restrictions in place, human error can occur. In addition to enforcing stronger user restrictions in the “archives directory”, tools like Fixity from AVP can be used to spot whether content has been moved between folders, deleted, or edited. By running regular Fixity reports, an archivist can spot any suspicious-looking changes.

We are aware that time constraints are a major factor which inhibits staff from adding additional tasks to their workload, but luckily Fixity can be set to run automatically on a weekly basis, providing users with an email report at the end of the week.
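
For readers who are comfortable with a little code, the short Java sketch below illustrates the underlying idea (it is emphatically not the Fixity tool itself, and the directory name is just an example): it walks an “archives directory” and prints a SHA-256 checksum for every file, so the output can be saved and compared against the next run.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.stream.Stream;

public class SimpleFixityReport {

    public static void main(String[] args) throws Exception {
        // Default to a folder called "archives" if no path is given on the command line
        Path archiveDir = Paths.get(args.length > 0 ? args[0] : "archives");
        try (Stream<Path> files = Files.walk(archiveDir)) {
            files.filter(Files::isRegularFile)
                 .sorted()
                 .forEach(SimpleFixityReport::printChecksum);
        }
    }

    private static void printChecksum(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            // Comparing this output with a previously saved report will show any
            // files that have changed, appeared or gone missing since the last run.
            System.out.println(hex + "  " + file);
        } catch (Exception e) {
            System.err.println("Could not checksum " + file + ": " + e.getMessage());
        }
    }
}

Any line that differs between two runs, or any file that appears or disappears from the report, is a prompt to investigate.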


Understand how your organisation does back-ups

CC-BY Shayne_ch13, Freepik

A common IT retention period for back-ups of desktop computers is 14 days. The two-week period enables disaster recovery of working environments, to ensure that business can continue as usual. However, a 14-day back-up is not the same as preservation storage, and it is not a suitable solution for archival collections.

In this scenario, where content is stored on a file system with no versioning, the archivist only has 14 days to spot any issues and retrieve an older back-up before it is too late. So please don’t go on holiday or get ill for long! Even with tools like Fixity, fourteen days is an unrealistic turn-around time (if the issue is spotted at all in the first place).

If possible, try to make the case to your organisation that you require more varied types of back-ups for the “archival directory”. These should include back-ups which are retained for at least a year. Using a mix of tape storage and/or cloud service providers can be a less expensive way of storing additional back-ups which do not require ongoing access. It is an investment which is worth making.

As a note of warning though – you are still only dealing in back-ups. This is not archival storage. If there are issues with multiple back-ups (due to, for example, transfer or hardware errors) you can still lose content. The longer-term goal, once better back-ups are in place, should be to monitor the fixity of multiple copies of content from the “archival directory”. (For more information about the difference between back-ups used for regular IT purposes and storage for digital preservation, see the DPC Handbook.)


Check that your back-ups work
Once you have got additional copies of your collection content, remember to check that you can retrieve them again from storage.

Many organisations have been in the position where they think they have backed up their content – only to find out that their back-ups have not been created properly when they need them. By testing retrieval you can protect your collections against this particular risk.


But… what do I do if my organisation does not do back-ups at all?
Although the 14-day back-up retention is common in many businesses, it is far from the reality within which certain types of archives operate. A small community organisation may, for example, do all its business on a laptop or workstation which is shared by all staff (including the archive).

This is a dangerous position to be in, as hardware failure can cause immediate and total loss. There is no magic bullet for solving this issue, but some of the advice which Sarah (Training and Outreach Fellow at Bodleian Libraries) has provided in her Personal Digital Archiving Course could apply.

Considerations from Sarah’s course include:

  • Create back-ups on additional removable hard drive(s) and store them in a different geographical location from the main laptop/workstation
  • Make use of free cloud storage limits (do check the licenses though to see what you are agreeing to – it’s not where you would want to put your HR records!)
  • Again – remember to check your back-ups!
  • For digitized images and video, consider using the Internet Archive’s Gallery as an additional copy (note that this is open to the public, and requires assigning a CC-BY license). If you like the work that the Internet Archive does, you can donate to them here
  • Apply batch-renaming tools to file names to ensure that they contain understandable metadata in case they are separated from their original folders

(Email us if you would like to get a copy of Sarah’s lecture slides with more information)


Document all of the above

CC-BY jcomp, Freepik

Make sure to write down all the decisions you have made regarding back-ups, monitoring, and other activities. This allows for succession planning and ensures that you have a paper trail in place.


Stronger in numbers

CC-BY, Kjpargeter, Freepik

Licenses, contracts and ongoing management are expensive. Another avenue to consider is looking to peer organisations to lower some of these costs. This could include entering into joint contracts with tape storage providers, or consortium models for using repository software. An example of an initiative which has done this is the NEA (Network Electronic Archive) group, an established repository which has been supporting 28 small Danish archives for over ten years.


Summary:
These are some of the considerations which may lower the risk of losing digital collections. Do you have any other ideas (or practical experience) of managing and preserving digital collections with limited resources, and without using a repository or DAMS system?

Closing the digitization gap

MS. Canon. Misc. 378, fol. 136r

Bodleian Digital Library’s Digitization Assistant, Tim, guest blogs about the treasures he finds while migrating and preparing complete, high-fidelity digitised items for Digital Bodleian. The Oxford DPOC Fellows feel lucky to sit across the office from the team that manages Digital Bodleian and so many of our amazing digitized collections.


We might spend most of our time on an industrial estate here at BDLSS, but we still get to do a bit of treasure-hunting now and then. Our kind has fewer forgotten ruins or charming wood-panelled reading rooms than we might like, admittedly – it’s more a case of rickety MySQL databases and arcane PHP scripts. But the rewards can be great. Recent rummages have turned up a Renaissance masterpiece, a metaphysical manuscript, and the legacy of a Polish queen.

Back in October, Emma wrote about our efforts to identify digital images held by the Bodleian which would make good candidates for Digital Bodleian, but for one reason or another haven’t yet made it onto the site. Since that post was published, we have been making good progress migrating images from our legacy websites, including the Oxford Digital Library and – coming soon to Digital Bodleian – our Luna collection of digitized slides. Many of the remaining undigitized images in our archive are unsuitable for the site, as they don’t constitute full image sets: we’re trying to keep Digital Bodleian a reserve for complete, high-fidelity digitized items, rather than a dumping-ground for fragmentary facsimiles. But among the millions of images are a few sets of fully-photographed books and manuscripts still waiting to be showcased to the public on our digital platform.


A recent Digital Bodleian addition: the Notitia Dignitatum, a hugely important Renaissance copy of a late-Roman administrative text (MS. Canon. Misc. 378).

Identifying these full-colour, complete image sets isn’t as easy as we’d like, thanks to some slightly creaky legacy databases, and the sheer volume of material versus limited staff time. An approach mentioned by Emma has, however, yielded some successes. Taking suggestions from our curators – and, more recently, our Twitter followers – we’ve been able to draw up a digitization wishlist, which also serves as a list of targets for when we go ferreting around in the archive. Most haven’t been fully photographed, but we’ve turned up a clutch of exciting items from these efforts.

Finding the images is only half the hunt, though. To present the digital facsimiles usefully, we need to give them some descriptive metadata. Digital Bodleian isn’t intended to be a catalogue, but we like to provide some information about an item where we have it, and make our digitized collections discoverable, as well as giving context for non-experts. But as with finding images, locating useful metadata isn’t always simple.

Most of the items on Digital Bodleian sit within the Bodleian’s Special Collections. Each object is unique, requiring the careful attention of an expert to be properly catalogued. For this reason, modern cataloguing efforts focus on subsets of the collections. For those not covered by these, often the only published descriptions (if any) are in 19th century surveys – which can be excellent, but can be terse, or no longer up-to-date. Other descriptions and scholarly analyses are spread around a variety of published and unpublished material, some of it available in a digital form, most of it not. This all presents a challenge when it comes to finding information to go along with items on Digital Bodleian: much as we’d like to be, Emma and I aren’t yet experts on the entirety of all the periods, areas and traditions represented in the Bodleian’s holdings.


Another item pulled from the Bodleian’s image archive: a finely decorated 16th-century Book of Hours (MS. Douce 112).

Happily, our colleagues responsible for curating these collections are engaged in constant, dogged efforts to make descriptions more accessible. Especially useful to those of us unable to pop into the Weston to rifle through printed finding aids are a set of TEI-based electronic catalogues*, developed in conjunction with BDLSS. These aim to provide systematically-structured digital catalogue entries for a variety of Western and Oriental Special Collections. They’re fantastic resources, but they represent ongoing cataloguing campaigns, rather than finished products. Nor do they cover all the Special Collections.

Our most valuable resource therefore remains the ever-patient curators themselves. They kindly help us track down information about the items we’re putting on Digital Bodleian from a sometimes-daunting array of potential sources, put us in touch with other experts where required, and are always ready to answer our questions when we need something clarified. This has been enormously helpful in providing descriptions for our new additions to the site.

With this assistance, and the help of our colleagues in the Imaging Studio, who provide similar expertise in tracking down the images, and try hard to squeeze in time to photograph items from the aforementioned wishlist, we’ve managed to get 25 new treasures onto Digital Bodleian since Emma’s post, on top of all the ongoing new photography and migration projects. This totals around 9,300 images altogether, and we have more items on the way (due soon are a couple of Mesoamerican codices and an Old Sundanese text printed on palm leaves from Java). Slowly, we’re closing the gap.

A selection of recent items we’ve dug up from our archives:

MS. Ashmole 304
MS. Ashmole 399
MS. Auct. D. inf. 2. 11
MS. Canon. Bibl. Lat. 61
MS. Canon. Misc. 213
MS. Canon. Misc. 378
MS. Douce 112
MS. Douce 134
MS. Douce 40
MS. Holkham misc. 49
MS. Lat. liturg. e. 17
MS. Lat. liturg. f. 2
MS. Laud Misc. 108
MS. Tanner 307

 

*Currently live are catalogues of medieval manuscripts, Hebrew manuscripts, Genizah fragments, and union catalogues of Islamicate manuscripts and Shan Buddhist manuscripts in the United Kingdom. Catalogues of Georgian and Armenian manuscripts, to an older TEI standard, are still online and are currently undergoing conversion work. Similar, non-TEI-based resources for incunables and some of our Chinese Special Collections are also available.

Project update

A project update from Edith Halvarsson, Policy and Planning Fellow at Bodleian Libraries. 


Ms Arm.e.1, Folio 23v

Bodleian Libraries’ new digital preservation policy is now available to view on our website, after having been approved by Bodleian Libraries’ Round Table earlier this year.

The policy articulates Bodleian Libraries’ approach and commitment to digital preservation:

“Bodleian Libraries preserves its digital collections with the same level of commitment as it has preserved its physical collections over many centuries. Digital preservation is recognized as a core organizational function which is essential to Bodleian Libraries’ ability to support current and future research, teaching, and learning activities.”

 

Click here to read more of Bodleian Libraries’ policies and reports.

In other related news, we are currently in the process of ratifying a GLAM (Gardens, Libraries and Museums) digital preservation strategy, which is due for release after the summer. Our new digitization policy is also in the pipeline and will be made publicly available. Follow the DPOC blog for future updates.

Gathering the numbers: a maturity and resourcing survey for digital preservation

The ability to compare ourselves to peer institutions is key when arguing the case for digital preservation within our own organisations. However, finding up-to-date and correct information is not always straight forward.

The Digital Preservation at Oxford and Cambridge (DPOC) project has joined forces with the Digital Preservation Coalition (DPC) to gather some of the basic numbers that can assist staff in seeking to build a business case for digital preservation in their local institution.

We need your input to make this happen!

The DPOC and the DPC have developed a survey aimed at gathering basic data about maturity levels, staff resources, and the policy and strategy landscapes of institutions currently doing or considering digital preservation activities. (The survey intentionally does not include questions about the type or size of the data organisations are required to preserve.)

Completing the survey will only take 10-20 minutes of your time, and will help us better understand the current digital preservation landscape. The survey can be taken at: https://cambridge.eu.qualtrics.com/jfe/form/SV_brWr12R8hMwfIOh

Deadline for survey responses is: Thursday 31 May 2018.

For those wanting to know upfront what questions are asked in the survey – here is the full set of Survey Questions (PDF). Please keep in mind the survey is interactive and you may not see all of the questions when filling this in online (as the questions only appear in relation to your previous responses). Responses must be submitted through the online survey.

Anonymised data gathered as part of this maturity and resourcing survey will be made available via this DPOC website.

For any questions about the survey and its content, please contact: digitalpreservation@lib.cam.ac.uk

The Ethics of Working in Digital Preservation

Since joining the DPOC project in 2016, I have been espousing the need for holistic approaches to digital preservation. This has very much been about how skills development, policy, strategy, workflows and much more need to be included as part of a digital preservation offering. Digital preservation is never just about the tech. There is a concern I must raise: how we play nice together.

Since first drafting this post in October 2017, there have been several events I would be remiss not to mention. Ethics and how we conduct ourselves in professional contexts have been brought into the current social consciousness by the #metoo movement and the recent matter regarding Chris Bourg’s keynote at the Code4Lib conference.

Working Together

We know digital preservation can’t be done alone, and I believe the digital preservation community is well on the way to accepting this. No single person can hold all the information about every type of file, standard, operating system, disk file system, policy, carrier, hardware, peripheral, protocol, copyright and legislation, as well as undertake advocacy, negotiate suitably with donors, and so on.

Dream Team – Library of Congress Digital Preservation Outreach and Education Training Materials

For each digital preservation activity, we need a ‘dream team’. This is a term Emma Jolley (Curator of Digital Archives, National Library of Australia) incorporated into the 2015 Library of Congress Digital Preservation Outreach & Education (DPOE) Train the Trainer education programme I took part in. This understanding of the need for complementary skills, knowledge and approaches very much underpins the Polonsky Digital Preservation Project.

Step by Step, Hand in Hand

If I think back to my time working in digital preservation in the mid-2000s, it was a far more isolating experience than it is now. Remembering the challenges we were discussing back then, it doesn’t feel as if the field has progressed all that much. It may just be slow going. Or perhaps it’s fear of making a wrong decision?

As humans, we know we have the capacity to learn from mistakes. We’ve likely had someone tell us about the time they (temporarily or permanently) lost data. The short-term lifespan of media carriers, inter-dependencies between different components, changes to services where data may be stored ‘in the cloud’ and the limited availability of devices (hardware or software) to read and interpret the data mean that digital content is fragile (for many reasons, not only technical) and is continually at risk.

There are enough lessons of data loss out there in the wider world that it is imperative we acknowledge these situations and learn from them. Nor should we have to face these kinds of stressful situations alone; it should be done step-by-step, hand-in-hand, supporting each other.

Acknowledging Failure

Over recent years, the international arts and cultural sector has begun to share examples of failures. While it is easy to share successes, it’s far harder to openly share information about failures. Failure in current western society is definitely not a desirable outcome. Yet we learn from failure. As a response to ‘ideas’ festivals and TED talks, events such as Failure Lab have been gaining momentum.

The need to share (in considered ways) stories of failure in digital preservation is somewhat new, although it’s not an entirely new concept. (The now infamous story of how parts of Toy Story 2 were deleted has helped illustrate the need for regularly checking backup functions.) More recently, at PASIG 2017, one of the most memorable presentations of the whole conference was Eduardo Del Valle’s Sharing my loss to protect your data: A story of unexpected data loss and how to do real preservation. I believe I speak for many of the PASIG conference attendees when I state how valuable a presentation this was.

In May 2017, the Digital Preservation Coalition ran possibly the most useful event I attended in all of 2017: Digital Preservationists Anonymous (aka Fail Club). We were able to share our war stories within the safety and security of the Chatham House Rule and learn a lot from each other that we will be able to take forward in our work at our respective institutions. Hearing an organisation that is further ahead tell us about the tricky things they’ve encountered helps us progress better and faster.

iPres 2017 and the Operational Pragmatism Panel

Yet there are other problematic issues within the field of digital preservation. It’s not always an easy field to work in; it doesn’t yet have the diversity it needs, nor necessarily respect the diversity of views already present.

Operational Pragmatism in Digital Preservation: Establishing context-aware minimum viable baselines was a panel session I facilitated at iPres 2017, held in September 2017 in Kyoto, Japan. The discussion was set out as a series of ‘provocations’ (developed collaboratively by the panellists) about different aspects of digital preservation. (Future blog posts are yet to be published about the topics and views presented during the panel discussion.) I had five experienced panellists representing a range of different countries they’ve worked in around the world (Canada, China, France, Kenya, the Netherlands, the UK and the USA), plus myself (originally from Australia). Another eight contributors (from Australia, Germany, New Zealand, the UK and the USA) also fed into forming the panel topic, the panel membership and the provocations. Each panellist was allocated a couple of minutes to present their point of view in response to each provocation, and the discussion was then opened up to the wider audience. It was never going to be an easy panel. I was asking a lot of my panellists: they were each having to respond to one challenging question after another, providing a simple answer to each (that could be used to inform decisions about what ‘bare minimum’ work could be done for each digital preservation scenario). This was no small feat.

Rather than the traditional panel presentation, where only a series of experts get to speak, it was intended as a more inclusive discussion. The discussion was widened to include the audience in good faith, so that audience members could share openly throughout, if they wished. However, it became apparent that there were some other dynamics at play.

One Person Alone is Never Enough

Since I first commenced working in digital preservation in 2005, I have witnessed the passion and commitment to viewpoints that individuals within this field hold. I expected a lively discussion and healthy debate, potentially with opposing views between the panellists (who had been selected to represent different GLAM sectors, organisation sizes, nations, cultures, backgrounds and approaches to digital preservation).

As I was facilitating the panellists for this demanding session, I had organised an audience facilitator (someone well-established within the digital preservation community). Unfortunately, due to circumstances out of our control, this person was unable to be present (and an experienced replacement was unable to be found at short notice). This situation left my panellists open to criticism. One panellist was on the receiving end of a disproportionate amount of scrutiny from the audience. Despite attempts, as a lone facilitator, I was unable to defuse the situation. After the panel session finished, several audience members remarked that they didn’t feel comfortable participating in the discussion.

Facilitating a safe environment for both panellists and for the wider audience to debate topics they are passionate about is vitally important, yet this failed to occur in this instance. As a result, the panel were unable to summarise and present conclusions about possible ‘minimum baselines’ for each of the provocations. It’s clear in this instance that a single facilitator was not enough.

Community Responsibility

In this respect, we have failed as a community. While we may have vastly differing viewpoints, it is essential we cultivate environments where people feel safe to express their views and have them received in a professional and respectful manner. The digital preservation community is growing – in both size and diversity. We are aware we need to put in place, improve or refresh our technical infrastructures. Now is also the time to look at how we handle our social infrastructure. It is my opinion that there is a place in the digital preservation field for a wide range of individuals, with a vast variety of backgrounds and skills.

There are people who are already working in digital preservation and who have great skills. They might not all be software developers, but they know how to project manage, speak, write, problem-solve, and are subject matter experts in a wide range of areas. The value of diversity has been proven. If we only have coders, computer scientists or individuals from any one background working in the field of digital preservation, then surely, we will fail.

Moving Forward

In the hours and days following the panel, I reached out to my communities online for pointers to Codes of Ethics, Codes of Conduct and other articles discussing challenging situations in similar industries. Borrowing from other industries and adapting to fit the context at hand has always been important to me. I don’t want to reinvent the wheel and would prefer to learn from others’ experiences. The panel ‘provocations’ presented were not contentious, yet how the discussion evolved throughout the duration of the panel somewhat echoes other events that have occurred within the tech industry.

At the time of publishing this post, neither the digital preservation community nor iPres has a Code of Conduct or Code of Ethics. There have been mentions of the lack of an iPres Code of Conduct in previous years. For iPres 2018, developing a Code of Conduct has become a priority. However, it shouldn’t have taken us this long to put in place some frameworks of this type, given we all know we must work collaboratively if we are to succeed. Back in 1997, UNESCO suggested that if Audiovisual Archiving was a profession, it would also require a Code of Ethics (Audiovisual archives: a practical reader – section 4, pages 15-17).

Codes of Conduct and Codes of Ethics are a starting point, and there are plenty of existing examples to borrow from.

There’s a longer list of Codes of Conduct and Codes of Ethics that has been compiled over the six months since iPres 2017. Even the Loop electronic music makers summit (an initiative of the Ableton software company), which I attended last November in Berlin, had a thorough Code of Conduct in place.

Building Better Communities

Codes are not enough. This is about building better communities.

A 2016 article emerging from the tech community has a list of suggestions for facilitating the development of ‘plumbers’ (and therefore functional infrastructure) rather than ‘rock stars’, under the section titled: “How do we as a community prevent rock stars?”.

Building and maintaining infrastructure is typically neither fun nor sexy – but this is what digital preservation demands. Without working collaboratively and inclusively, we will not be able to acquire, preserve or provide access to the digital content we are the stewards of, because we won’t fully understand the contexts of the individuals producing that content if we don’t have the same kind of diversity within our own field of digital preservation.

Diversity may not be easy, but neither is digital preservation. While it might not be rocket science per se, we’re accustomed to working on hard and complex things. Here are some suggestions to help us take the next step(s):

  • Organisers: encourage, model and – where necessary – enforce codes of ‘good practice’ behaviour
  • Participants: recognise, appreciate and celebrate the privilege of being able to debate digital preservation as part of what we do. Allow and encourage minority, less confident and new voices to hold an equal place in our discussions
  • Everyone: recognise and work towards addressing our own unconscious biases and privileges

Like Kenney and McGovern’s Three-Legged Stool for Digital Preservation (a model our DPOC project is very much based on), where the organisational infrastructure, resources framework and technological infrastructure are of equal importance, recognising that the complexity of the digital preservation challenge is best addressed through multiple perspectives is essential. We must model and welcome the benefits of our diversity. Each of us brings something unique and every skill or bit of knowledge is valuable.

Email preservation 2: it is hard, but why?

A post from Sarah (Oxford) with input from Somaya (Cambridge) about the 24 January 2018 DPC event on email archiving from the Task Force on Technical Approaches to Email Archives.

The discussion of the day circulated around what the task force had learnt during its year of work: that personal and public stories are buried in email; that considerable amounts of email have been lost over previous decades; that we should be treating email as data (it allows us to understand other datasets); that current approaches to collecting and preserving email don’t work, as they’re not scalable; and that integrating artificial intelligence and machine learning – including natural language processing – will be important for addressing email archives (this is already taking place in the legal profession with ‘predictive coding’ and clustering technologies).


Back in July, Edith attended the first DPC event on email preservation, presented by the Task Force on Technical Approaches to Email Archives. She blogged about it here. In January this year, Somaya and I attended the second event, hosted again by the DPC.

Under the framework of five working groups, this task force has spent 12 months (2017) focused on five separate areas of the final report, which is due out in around May this year:

  • The Why: Overview / Introduction
  • The When/Who/Where: Email Lifecycles Perspectives
  • The What: The Needs of Researchers
  • The How: Technical Approaches and Solutions
  • The Path Forward: Sustainability & Community Development

The approach being taken is technical rather than policy-focused. Membership of the task force includes the DPC, representatives from universities and national institutions around the world, and technology companies including Google and Microsoft.

Chris Prom (University of Illinois Urbana-Champaign, author of the 2011 DPC Technology Watch Report on Preserving Email) and Kate Murray (Library of Congress, and a contributor to FADGI) presented on the work they have been doing; you can view their slides here. Until the final report is published, I have been reviewing the preliminary draft (of June 2017) and other available documents to help develop my email preservation training course for Oxford staff in April.

So, when it comes to email preservation, most of the tools and discussions focus on processing email archives. Very little of the discussion has to do with the preservation of email archives over time. There’s a very good reason for this: processing email archives is the bottleneck, the point at which most institutions are still stuck. It is hard to make decisions around preservation when there is no means of collecting email archives or processing them in a timely manner.

There were many excellent questions and proposed solutions from the speakers at the January event. Below are some of the major points from the day that have informed my thinking of how to frame training on email preservation:

Why are email archives so hard to process?

  1. They are big. Few people cull their emails, and over time they build up. Reply and ‘reply all’ functions expand email chains, and attachments are growing in size and diversity. It takes a donor a while to prepare their email archive, and even longer for an institution to transfer and process it.
  2. They are full of sensitive information, which is hard to find. Many open source technology assisted review (TAR) tools miss sensitive information, while the software used for ‘predictive coding’ and machine learning to review email archives is well out of budget for heritage institutions. Manual review is far too labour intensive.
  3. There is no one tool that can do it all. Email preservation requires ‘tool chaining’ in order to transfer, migrate and process email archives. There is a very wide variety of email software programs, which in turn create many different email file formats. Many of the tools used in email archive processing are not compatible with each of the different email file types; this requires multiple file format migrations to allow for processing. For a list of some of the currently available tools, see the Task Force’s list here.

What are some of the solutions?

  1. Tool chaining will continue. It appears that, for now, tool chaining is here to stay, often mixing proprietary with open source tools to get workflows running smoothly. This means institutions will need to invest in establishing email processing workflows: the software, and people who know how to handle different email formats.
  2. What about researchers? Access to emails is tightly controlled due to sensitivity restraints, but is there space to get researchers to help with the review? If they use the collection for research, could they also be responsible for flagging anything deemed as sensitive? How could this be done ethically?
  3. More automation. Better tool development is needed to assist with TAR. Reviewing processes must become more automated if email archives are ever to be processed. The scale of work is increasing, and traditional appraisal approaches (handling one document at a time) and record schedules are no longer suitable.
  4. Focus on bit-level preservation first. Processing of email archives can come later, but preserving it needs to start on transfer. (But we know users want access and our institutions want to provide this access to email archives.)
  5. Perfection is no longer possible. While archivists would like to be precise, in ‘scaling up’ email archive processing we need to think about it as ‘big data’ and take a ‘good enough’ approach.