Devising Your Digital Preservation Policy: Learnings from the DPOC project

On December 4th the DPOC Policy and Planning Fellows ran a joint workshop in London, presenting learnings and experiences from policy writing at CUL and Bodleian Libraries. They were supported by Kirsty Lingstadt (Head of the Digital Library at the University of Edinburgh) and Jenny Mitcham (Head of Good Practice at the Digital Preservation Coalition). Kirsty and Jenny talked about their experience of policy writing in other organisational settings, illustrating how policy writing must be tailored to specific institutional contexts, even though the broad principles remain the same.

In total, 30 attendees took part in the workshop, which mixed presentations with round-table discussions. To make the event as interactive as possible, Mentimeter was used to poll attendees on their own experiences of policy writing. Although the survey only represents a small selection of organisations in the process of writing a digital preservation policy, the Fellows wanted to share some of the results in the hope that they will facilitate further discussion. Feel free to use the comments section below to let the project team know whether the results from the poll seem familiar (or perhaps unfamiliar).


Question: Do you know who to consult on a digital preservation policy (in your organisation)?

Most workshop participants knew who they needed to consult on digital preservation in their organisation and also had a good working relationship with them. This is the first step when starting a new policy – knowing your organisational culture and context.

Being new to their organisations, the DPOC Fellows spent a lot of time early on in the project reaching out to staff across the libraries. If you are also new to your institution, getting to know those who have been there a long time is an important starting point for understanding what type of policy will suit your organisation’s culture before you begin any writing.

Question: What barriers can you see to developing a digital preservation policy (in your organisation)?

Participants identified ‘time’ as by far the largest barrier to writing a new digital preservation policy. And it is true that policy development does take a lot of time if you want the resulting document to be more than ‘just a paper’ which is filed away at the end of the process.

To get staff on board with a new policy, allocating resources for policy consultation is therefore crucial, although the effort involved is not always appreciated by senior management. For example, it took the Fellows between one and two years to develop new digital preservation policies for their organisations, illustrating why it is important to give staff sufficient time to write policy. While policy consultation took a long time, the DPOC Fellows felt that this was a worthwhile investment, as time spent consulting on policy was also a great outreach and learning opportunity for the organisations as a whole.

Question: Does your organisation have a policy template?

Most participants did not have an organisation-wide policy template. However, templates are part of policy best practice. A policy template is a skeleton document which outlines the high-level sections and headings which should be included in every organisational policy regardless of topic – from an HR policy to a digital preservation policy, they should all follow the same structure. The purpose of having these standardised headings is to ensure that staff can easily digest and recognise any policy at a glance. Templates can also enforce good document management practices.

If you are interested in finding out more, a high level policy template which was developed for the DPOC project can be requested through the DPOC blog contact form or by emailing the Digital Preservation Coalition.

Question: Where are institutional policies published (in your organisation)?

Once the policy is signed off, it is time to publicise it more widely. Among the workshop participants the most common places to publish policies were either an institutional website or an intranet (although there are other options listed in the word cloud).

As a word of caution, make sure that your organisation is consistent in where it publishes policies and ensure that documents are versioned. The international digital preservation policy review which the Fellows undertook in 2016 (analysing 50 different policies) found that most digital preservation policies do not use any document versioning. No versioning, in combination with the proliferation of different policy publication routes in an organisation, will soon become a real issue when staff try to locate up-to-date documents. (Again, if your organisation has a good policy template in place you can better enforce versioning!)

One option which was listed several times in the word cloud is to publish policy in an institutional repository; this is primarily useful if you do not have a reliable records management system in your organisation. Using a repository means that you can assign a DOI to the policy for persistent referencing, and the repository copy has the added benefit of becoming the clear canonical copy of the policy.

Question: How long will it take to…?

Participants were asked how long (in multiples of months) they thought it would take their organisations to:

  • Draft a policy
  • Have it approved
  • Begin implementation of the policy
  • See real impact and benefits in the organisation

As seen from the chart, the drafting of a policy document is only one small aspect of policy and planning work. This is important to remember if you want to avoid your policy becoming just another ‘piece of paper’ that is filed away and not looked at again after it’s been written. Advocacy, communication and implementation plans continue for years to come after the original document has been drafted.


Where next…

To find out more about policy writing during the DPOC project have a look at this recent blog post from CUL’s Policy and Planning Fellow Somaya Langley and at the workshop presentation slides available through the DPC. The Fellows are also happy to take questions through the blog and encourage use of the comments section.

Memory Makers: Digital preservation skills and how to get them

The Memory Makers Conference was hosted at the Amsterdam Museum in the Netherlands on 29th-30th November. Bodleian Libraries’ Policy and Planning Fellow, Edith Halvarsson, attended.


The Memory Makers conference in Amsterdam brought together training providers from the private, higher education and continuing education sector to discuss digital preservation skills, how to get them (and how to retain them).

In my experience, research on skills development is often underrepresented at digital preservation conferences, and when such talks are included the attendance tends to be lower than for technology-based strands. However, taking a 1.5-day deep dive into this topic was one of the most interesting and thought-provoking activities I’ve done this year, and I am happy that NDE and the DPC decided to highlight this area by giving it its own conference. So in this blog I wanted to summarise some of the thoughts that have stayed with me since coming back from Amsterdam.

The expectation gap

‘The expectation gap’ is something which we have discussed in a roundabout way among the Fellows over the past years, but it was a presentation by Dr Sarah Higgins which really put words to this phenomenon for me. The notion of an ‘expectation gap’ also nicely frames why we need to think seriously about lifelong learning and competency frameworks.

Sarah has been teaching Information Management to Masters Students at Aberystwyth University (Wales) for almost a decade and has been observing both the development of the programme and the career trajectories of students graduating into the field. In this time there’s been a growing gap between what employers expect of students in terms of digital preservation skills and what certified MA programmes can offer.

The bodies which certify Information Management courses in the UK (CILIP and ARA) still only require minimal digital skills as part of their competency frameworks. This has made it challenging to argue for new and mandatory digital preservation related modules on UK MA programmes. MA programmes have definitely shifted to begin meeting the digital preservation challenge, but they are still at an early stage.

So while UK Information Management courses continue to frame a lot of teaching around physical collections, the expectations of digital skills from organisations hiring recent graduates from these programmes have skyrocketed. This has made the gap between reality and fantasy even larger. There has been a growing trend for organisations to hire new graduates and expect them to be the magic bullet: readymade lone experts in all areas of digital preservation who will never require any further development or support. Many of Sarah’s graduates who began working on digital preservation/curation/archiving projects after graduation were essentially ‘set up to fail’ – not a nice or fair place to be in your first job.

Dr Natalie Harrower: https://twitter.com/natalieharrower/status/1068124988358709254

Developing skills frameworks

To meet the challenge of unclear competency expectations, Sharon McMeekin (Head of Training and Skills at DPC) called for continued development of skills frameworks such as DigCurV. While DigCurV has been immensely valuable (we have, for example, drawn on it continuously in the DPOC project), the digital preservation field has matured a lot over the past couple of years and new learnings could now be incorporated into the model. A useful new addition to DigCurV, Sharon argued, would be to create more practitioner levels which reflect the expected skills progression over 1-10 years for new graduates entering the field.

If such frameworks were taken on by certifying bodies, they could potentially both temper unrealistic job descriptions and help staff argue for professional development opportunities.

Lifelong learning

In her talk, Sarah strongly argued that we should expect recent Information Management graduates to also require more workplace based training after graduation. A two-year MA programme is not the endpoint for learning, especially in a quickly moving and developing field. This means that ongoing learning opportunities must also be considered by hiring organisations.

It was refreshing to hear from the British Library, who strongly subscribe to this idea. The British Library team teach introductory courses on digital preservation and run drop-in lab sessions for all library staff on a yearly basis.

Micky Lindlar: https://twitter.com/MickyLindlar/status/1068155027108306944

But the digital preservation team also engages with a wide range of training opportunities that are perhaps not considered traditional Information Management skills. Maureen Pennock (Head of Digital Preservation at the BL) argued that skills for digital preservation are not necessarily unique to the field, and can be acquired in places which you may not initially have considered. Such skills include project management, social media management, presentation delivery, and statistical analysis. It should be noted, however, that Maureen also strongly stated that no one person should be expected to be an expert in all these areas at the same time.

Learning collaboratively

Another set of presentations which I really enjoyed was focused on “collaborative learning”. Puck Huijtsing (Netwerk Oorlogsbronnen) challenged why we are so attached to the lecture-style learning which we are familiar with from school and higher education. She argued that collaborative learning has been shown to be a successful model when training people to take on a new craft (and she believes that digital preservation is a craft). Puck went on to elaborate on Amsterdam’s strong history of craft guilds and how these taught and shared new skills, arguing that the guild model could potentially be a more accessible and sustainable model for workplace-based training.

A number of successful training models presented by the Netherlands Institute for Sound and Vision then illustrated how collaborative hands-on workshops can be delivered in practice. In one workshop series delivered by the institute, participants were asked to undertake small projects focused on discrete digital collection material with which they had a pre-existing relationship. The institute’s research indicates that this model is successful in aiding retention and uptake of digital preservation and archiving skills. These are workshops which we are also keen to test out at Bodleian Libraries next year to see if they are well received by staff.

Summary

It is clear from the Memory Makers conference that there are a lot of people out there who care about learning and professional development in the digital preservation field. This blog only summarises a small section of all the excellent work that was presented over 1.5 days, and I would encourage others to look at the presentation slides and the Twitter hashtag for the event (#MemoryMakers18) if this is a topic which interests you as well.

Cambridge University Libraries inaugural Digital Preservation Policy

The inaugural Cambridge University Libraries Digital Preservation Policy was published last week. Somaya Langley (Cambridge Policy & Planning Fellow) provides some insight into the policy development process and announces a policy event in London, presented in collaboration with Edith (Oxford Policy & Planning Fellow), to be held in early December 2018.


In December 2016, I started the digital preservation policy development process for Cambridge University Library (CUL), which has finally culminated in a published policy.

Step one

Commencing with a ‘quick and dirty’ policy gap analysis at CUL, I discovered not so much that there were some gaps in the existing policy landscape, but rather that there was a dearth of much-needed policies. The gap analysis at CUL found that a few key policies did exist for different audiences (some intended to guide CUL, some to guide researchers and some meant for all staff and researchers working at the University of Cambridge). While my counterpart at Oxford found there was duplication in their policies across Bodleian Libraries and the University of Oxford, I mostly found chasms.

Next step

The second step in the policy development process was attempting to meet an immediate need from staff by adding some “placeholder” digital preservation statements into the Collection Care and Conservation Policy, which was under review at the time. In the longer term it might be ideal to combine everything into a single preservation policy (encompassing the conservation and preservation of physical and digital collection items), but CUL’s digital preservation maturity and skill capabilities are too low at present. Focus really needed to be drawn to how to manage digital content, hence the need for a separate Cambridge University Libraries Digital Preservation Policy.

That said, like everything else I’ve been doing at Cambridge, policy needed to be addressed holistically. Being able to undertake about two full weeks of work (spanning several months in early 2017) contributing to the review of the Collection Care and Conservation Policy meant that I could include some statements in that policy to support better care for digital (and audiovisual) content still remaining on carriers that are yet to be transferred.

Collaborative development

Then in June 2017 we moved on to undertaking policy development collaboratively. Part of this was to do an international digital preservation policy review – looking at dozens of different policies (and some strategies). Edith wrote about the policy development process back in the middle of last year.

The absolute lion’s share of the work was carried out by my Oxford counterparts, Edith and Sarah. Due to other work priorities, I didn’t have much available time during this stage. This is why it is so important to have a team – whether this is a co-located team or distributed across an organisation or multiple organisations – when working in the digital preservation space. I really can’t thank them enough for carrying the load for this task.

Policy template

My contribution was to develop a generic policy template, for use in both our organisations. For those that know me, you will know I prefer to ‘borrow and adapt’ rather than reinvent the wheel. So I used the layout of policies from a previous workplace and constructed a template for use by CUL and the Bodleian Libraries. I was particularly keen to ensure what I developed was generic, so that it could be used for any type of policy development in future.

This template has now been provided to the Digital Preservation Coalition, who will make it available with other documents in the coming years – so that some of this groundwork doesn’t have to be carried out by every other organisation still needing to do digital preservation policy (or other policy) development. Our international digital preservation maturity and resourcing survey (another blog post on this is still to follow) found that at least 42% of organisations internationally still do not have a digital preservation policy.

Who has a digital preservation policy?

What next?

Due to other work priorities, drafting the digital preservation policy didn’t properly commence until earlier this year. But by this point I had a good handle on my organisation’s specific:

  • Challenges and issues related to digital content (not just preservation and management concerns)
  • High-level ‘profile’ of digital collections, right across all content ‘classes’
  • Gaps in policy, standards, procedures and guidelines (PSPG) as well as strategy
  • Appreciation of a wide-range of digital preservation policies (internationally)
  • Digital preservation maturity (holistic, not just technical) – based on maturity assessments using several digital preservation maturity models
  • Governance (related to policy and strategy)
  • Language relevant to my organisation
  • Responsibilities across the organisation
  • Relevant legislation (UK/EU)

This informed my approach to drafting a digital preservation policy that would meet CUL’s needs.

Approach

I realised that CUL required a comprehensive policy that would fill the many gaps that other policies would ideally cover. I should note that there are many ways of producing a policy, and it does have to be tailored to meet the needs of your organisation. (You can compare with Edith’s digital preservation policy for the Bodleian Libraries, Oxford.)

The next steps involved:

  • Gathering requirements (this had already taken place during 2017)
  • Setting out a high-level structure/list of points to address
  • Defining the stakeholder group membership (and ways of engaging with them)
  • Setting the frame of the task ahead
  • Agreeing on the scope (this changed from ‘Cambridge University Library’ to ‘Cambridge University Libraries’ – encompassing CUL’s affiliate and dependent libraries)

Then came the iterative process of:

  1. Drafting policy statements and principles
  2. Meeting with the stakeholder group and discussing the draft
  3. Gathering feedback on the policy draft (internally and externally)
  4. Incorporating feedback
  5. Circulating a new version of the draft
  6. Developing associated documentation (to support the policy)

Once a final version had been reached, this was followed by the approvals and ratification process.

What do we have?

Last week, the inaugural Cambridge University Libraries Digital Preservation Policy was published (which was not without a few more hurdles).

It has been an ‘on again, off again’ process that has taken 23 months in total. Now we can say, for CUL and the University of Cambridge, that:

“Long-term preservation of digital content is essential to the University’s mission of contributing to society through the pursuit of education, learning, and research.”

This complements some of our other CUL policies.

What now?

Publication is never the end of the policy process. Policy should be a ‘living and breathing’ process, with the policy document itself purely there to keep a record of the agreed-upon decisions and principles.

So, of course there is more to do. “But what’s that?”, I hear you say.

Join us

There is so much more that Edith and I would like to share with you about our policy development journey over the past two years of the Digital Preservation at Oxford and Cambridge (DPOC) project.

So much so that we’re running an event in London on Tuesday 4th December 2018 on Devising Your Digital Preservation Policy, hosted by the DPC. (There is one seat left – if you’re quick, that could be you).

We’re also lucky to be joined by two ‘provocateurs’ for the day:

  • Kirsty Lingstadt, Head of Digital Library and Deputy Director of Library and University Collections, University of Edinburgh
  • Jenny Mitcham, Head of Good Practice and Standards, Digital Preservation Coalition (who has just landed in her new role – congrats & welcome to Jenny!)

There is so much more I could say about policy development in relation to digital content, but I’ll leave it there. I do hope you get to hear Edith and me wax lyrical about this.

Thank-yous

Finally, I must thank my Cambridge Polonsky team members, Edith Halvarsson (my Oxford counterpart), plus Paul Wheatley and William Kilbride from the DPC. Policy can’t be developed in a void and their contributions and feedback have been invaluable.

Reflections on the International Conference on Digital Preservation (iPres) 2018

The iPres conference celebrated its fifteenth birthday in 2018. Bodleian Libraries’ Policy and Planning Fellow, Edith, discusses her take on this year’s conference theme.  


In 2003 a small international meeting, hosted by the Chinese Academy of Science, prompted the creation of what is today iPres (the International Conference on Digital Preservation). The conference has since grown massively; this year almost 500 delegates attended. To celebrate its fifteenth birthday, iPres 2018 had a self-reflecting theme, considering how the theory of digital preservation has today matured into a community of practice.

In the three years that I’ve worked in the digital preservation field, I have often felt that I have the same conversations on repeat. Which is not to say that I do not love having them! However, the opportunity to reflect on significant developments in digital preservation since 2003 is comforting and shows how these conversations eventually do have lasting impact. Knowing how far the community has come in the past fifteen years opens up my imagination around where digital preservation might be by 2033. And despite current world challenges I am very optimistic!


So what did iPres 2018 have to say about developments since 2003?

1) We now have a joint vocabulary

Barbara Sierman, of the Koninklijke Bibliotheek, commented that a development which is particularly striking to her is that digital preservation today has a shared vocabulary. In the early 2000s, even defining the issues around preservation was a barrier when speaking to colleagues. The fact that we now have a shared vocabulary, Sierman commented, means that practitioners are able to present their research and practices at conferences such as iPres.

This is something hugely valuable and does show that digital preservation is emerging as a distinct discipline. Importantly, having established a vocabulary and theories also enables the digital preservation community to challenge and test these very notions and use them as a reference point for new ones.

Twitter – @euanc – https://twitter.com/euanc/status/1044941732155215873


2) More people see the value of digital preservation

“The ability to authenticate and validate turns out to be a superpower in an era where data and truth has become a key economic product.”

This was a comment from William Kilbride (Digital Preservation Coalition) on growing interest in the field. I agree that public awareness of digital collecting and digital preservation is something which appears to have changed rapidly in the last year or so. I think there is a growing consciousness that the internet is not permanent and that your digital life has value. My personal observation has been that recent events, (such as Cambridge Analytica as well as the stricter General Data Protection Regulation in the EU), have prompted more people to see their social media and other data as something they can make decisions about. This is for example the first year when friends have started asking me how to extract and preserve their social media!


3) Digital preservation is becoming more Business-as-Usual (but we are not completely there yet)

Twitter-@karirene69, https://twitter.com/karirene69/status/1045014419045064704

In the panel Taking Stock after 15 Years, Maureen Pennock of the British Library reflected on the role of research in developing digital preservation as a field. Many of the research projects undertaken from the late 1990s to the 2000s profoundly shaped the field, and without them we would not today have sustainable digital collecting programmes in place in some organisations.

Having the space to undertake innovative research will always be important to ensure that digital preservation can address emerging challenges. It is also highly encouraging that BAU digital preservation programmes are now becoming more common and that organisations are collecting at large and automated scales. However, Pennock warns that there is a difference between research and practice and that the latter needs to function outside the remit of discrete research project funding. This is still an ongoing challenge for BAU digital preservation practices.


And what about the future?

It is always hard to predict which topics are “fads” and which ones make a more lasting impact. However, a hot topic this year (which divided opinions) was whether or not digital preservation should develop into a separate profession with its own code of ethics. The development of digital preservation as a profession could be an important advocacy tool. Conversely, it also runs the risk of isolating digital preservation activities by framing them as something separate from the work of other professionals such as archivists, records managers and librarians.

Twitter – @mopennock – https://twitter.com/mopennock/status/1044944038170972161

Now that we have the vocabularies, theories, practices, and attention of the media (as outlined above) – should we instead be making a more concerted effort to integrate with library, archives and other research conferences? This will no doubt be a continued area of discussion for iPres 2019 and beyond!

Electronic lab notebooks and digital preservation: part II

In her previous blog post on electronic lab notebooks (ELNs), Sarah outlined a series of research questions that she wanted to pursue to see what could be preserved from an ELN. Here are some of her results.


In my last post, I had a number of questions that I wanted to answer regarding the use of ELNs at Oxford, since IT Services is currently running a pilot with LabArchives.

Those questions were:

  1. Authenticity of research – are timestamps and IP addresses retained when the ELN is exported from LabArchives?
  2. Version/revision history – Can users export all previous versions of data? If not users, then can IT Services? Can the information on revision history be exported, even if not the data?
  3. Commenting on the ELN – are comments on the ELN exported? Are they retained if deleted in revision history?
  4. Export – What exactly can be exported by a user? What does it look like? What functionality do you have with the data? What is lost?

What did I find out?

I started by looking at IT Services’ webpage on ELNs. It mentions what you can download (HTML or PDF), but it doesn’t say much about long-term retention. There’s a lot of useful advice on getting started with ELNs, though, and on how to use the notebook.

The Professional version that staff and academics can use offers two modes of export:

  • Notebook to PDF
  • Offline Notebook – HTML

When you request one of these exports, LabArchives will email it to your work email address. This should happen within 60 minutes, and you will then have 24 hours to download the file. So, the question is: what do you get with each?

PDF

There are two options when you go to download your PDF: 1) including comments and 2) including empty folders.

So, this means that comments are retained in the PDF and they look something like this:

It also means that, where possible, previews of images and documents show up in the PDF, as do the latest timestamps.

What you lose is:

  • previous versions and revision history
  • the ability to use files – these will have to be downloaded and saved separately (but this was expected from a PDF)

What you get:

  • a tidy, printable version of a lab notebook in its most recent iteration (including information on who generated the PDF and when)

What the PDF cover of a lab notebook looks like.

Offline HTML version

In this version, you are delivered a zip file which contains a number of folders and documents.

All of the attachments are stored under the attachments folder, both as originals and as thumbnails (which are just low-res JPEGs used by LabArchives).

How does the HTML offline version stack up? Overall, the functionality for browsing is pretty good and latest timestamps are retained. You can also directly download the attachments on each page.

In this version, you do not get the comments. You also do not get any previous versions, only the latest files, updates and timestamps. But unlike the PDF, it is easy to navigate, and the uploaded attachments – which have not been compressed or visibly changed – can be opened.

I would recommend taking a copy of both versions, since each one offers some different functions. However, neither offers a comprehensive export. Still, the most recent timestamps are useful for authenticity, though checksums generated for files on upload and given to you in a manifest file within the HTML export would be even better.
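
If you wanted to create that kind of manifest yourself in the meantime, a short script is enough. The sketch below is only an illustration (it is not LabArchives functionality, and the folder names are hypothetical): it walks the attachments folder of an unzipped HTML export and writes a SHA-256 checksum for every file into a manifest.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ExportManifest {

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        // Root of the unzipped HTML export and the manifest file to write (hypothetical paths)
        Path exportRoot = Paths.get("/data/eln-export/attachments");
        Path manifest = Paths.get("/data/eln-export/manifest-sha256.txt");

        // Collect every regular file under the export root, in a stable (sorted) order
        List<Path> files;
        try (Stream<Path> walk = Files.walk(exportRoot)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }

        try (BufferedWriter writer = Files.newBufferedWriter(manifest, StandardCharsets.UTF_8)) {
            for (Path file : files) {
                // One line per attachment: <sha-256 checksum>  <path relative to the export root>
                writer.write(sha256(file) + "  " + exportRoot.relativize(file));
                writer.newLine();
            }
        }
    }

    private static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        // Fine for a sketch; very large files would be better read in a streaming fashion
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

Re-running the same code against the stored copy later, and comparing the two manifests, gives you a basic fixity check.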

Site-wide backup

Neither export option open to academics or staff provides a comprehensive version of the ELN; something is lost in the export. What LabArchives does offer, as part of the Enterprise agreement, is an annual site-wide backup to local IT Services. That includes all timestamps, comments and versions – the copy contains everything. This is promising, and all academics should be aware of it, because they can then request a copy from IT Services and should be able to get a full, comprehensive backup of their ELN. It also means that IT Services, like LabArchives, is preserving a copy of the ELNs.

So, we are going to follow up with IT Services, to talk about how they will preserve and provide access to these ELN backups as part of the pilot. Many of you will have similar conversations with your own IT departments over time, as you will need to work closely with them to ensure good digital preservation practices.

And these are some of the questions you may want to consider asking when talking with your IT department about the preservation of ELNs:

  • How many backups? Where are the backups stored? What mediums are being used? Are backups checked and restored as part of testing and maintenance? How often is the media refreshed?
  • What about fixity?
  • What about the primary storage? Is it checked or refreshed regularly? Is there any redundancy if that primary storage is online? If it is offline, how can it be requested by staff?
  • What metadata is being kept and created about the different notebooks?
  • What file formats are being retained? Is any information being recorded about the different file formats? Presumably, with research data, there would be a large variety of formats.
  • How long are these annual backups being retained?
  • Is your IT department actively going to share the ELNs with staff?
  • If it is currently the PI and department’s responsibility to store physical notebooks, what will be the arrangement with electronic ones?

Got anything else you would ask your IT department when looking into preserving ELNs? Share in the comments below.

Project update: available datasets

Oxford’s Outreach and Training Fellow, Sarah, announces the first available datasets from the DPOC project. This is part of the project’s self-archiving initiative, where they will be making sure project outputs have a permanent home.


As the project begins to come to a close (or in my case, maternity leave starts next week), we’ve begun efforts to self-archive the project. We’ll be employing a variety of methods to clean out SharePoint sites and identify records with enduring value to the project. We’ll be crawling websites and Twitter to make sure we have a record for future digital preservation projects to utilise. Most importantly, we’ll give our project outputs a long-term home so they can be reused as necessary.

That permanent home is of course our institutional repositories. Our conference papers, presentations, posters, monograph chapters and journal articles will rest there. But so will numerous datasets and records of reports and other material that will be embargoed. I’ve started depositing my datasets already, into ORA (Oxford University Research Archive).

There are two new datasets now available for download:

You can also find links to them on the Project Resources page. As more project outputs are made available through institutional repositories, we’ll be making more announcements. And at the end of the project, we’ll do a full blog post on how we self-archived the DPOC project, so that the knowledge gained will not be lost after the project ends.


Any tips for how you self-archive a project? Share them in the comments.

How I got JHOVE running in a debugger

Cambridge’s Technical Fellow, Dave, steps through how he got JHOVE running in a debugger, including the various troubleshooting steps. As for what he found when he got under the skin of JHOVE—stay tuned.


Over the years of developing apps, I have come to rely upon the tools of the trade; so rather than read programming documentation, I prefer getting code running under a debugger and stepping through it, to let it show me what an app does. In my defence, Object Oriented code tends to get quite complicated, with various methods of one class calling unexpected methods of another… To avoid this, you can use Design Patterns and write Clean Code, but it’s also very useful to let the debugger show you the path through the code, too.

This was the approach I took when I took a closer look at JHOVE. I wanted to look under the hood of this application to help James with validating a major collection of TIFFs for a digitisation project by Bodleian Libraries and The Vatican Library.

Step 1: Getting the JHOVE code into an IDE

Jargon alert: ‘IDE’ – stands for ‘Integrated Development Environment’, which means: “… piece of software for writing, managing, sharing, testing and (in this instance) debugging code”.

So I had to pick the correct IDE to use… I already knew that JHOVE was a Java app: the fact it’s compiled as a Java Archive (JAR) was the giveaway, though if I’d needed confirmation, checking the coloured bar on the homepage of its GitHub repository would have told me, too.

Coding language analysis in a GitHub project

My Java IDE of choice is JetBrains’s IntelliJ IDEA, so the easiest way to get the code was to start a new project by Checking Out from Version Control, selecting the GitHub option and adding the URL for the JHOVE project (https://github.com/openpreserve/JHOVE). This copied (or ‘cloned’) all the code to my local machine.

Loading a project into IntelliJ IDEA directly from GitHub

GitHub makes it quite easy to manage code branches, i.e.: different versions of the codebase that can be developed in parallel with each other – so you can, say, fix a bug and re-release the app quickly in one branch, while taking longer to add a new feature in another.

The Open Preservation Foundation (who manage JHOVE’s codebase now) have (more or less) followed a convention of ‘branching on release’ – so you can easily debug the specific version you’re running in production by switching to the relevant branch… (…though version 1.7 seems to be missing a branch?) It’s usually easy to switch branches within your IDE – doing so simply pulls the code from the different branch down and loads it into your IDE, and your local Git repository in the background.
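
If you would rather work outside the IDE, the equivalent Git commands look something like this (the checkout target is a placeholder – list the remote branches first and pick the one matching the release you actually run):

git clone https://github.com/openpreserve/JHOVE.git
cd JHOVE
git branch --remotes
git checkout <name-of-release-branch>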

Finding the correct code branch in GitHub. Where’s 1.7 gone?

Step 2: Finding the right starting point for the debugger

Like a lot of apps that have been around for a while, JHOVE’s codebase is quite large, and it’s therefore not immediately obvious where the ‘starting point’ is. At least, it isn’t obvious if you don’t READ the README file in the codebase’s root. Once you finally get around to doing that, there’s a clue buried quite near the bottom in the Project Structure section:

JHOVE-apps: The JHOVE-apps module contains the command-line and GUI application code and builds a fat JAR containing the entire Java application.

… so the app starts from within the jhove-apps folder somewhere. A little extra sniffing about and I found a class file in the src/main/java folder called Jhove.java, which contained the magic Java method:

public static void main (String [] args) {}

…which is the standard start point for any Java app (and several other languages too).

However, getting the debugger running successfully wasn’t just a case of finding the right entry point and clicking ‘run’ – I also had to set up the debugger configuration to pass the correct command-line arguments to the application, or it fell at the first hurdle. This is achieved in IntelliJ IDEA by editing the Run / Debug configuration. I set this up initially by right-clicking on the Jhove.java file and selecting Run JHOVE.main().

Running the Jhove class to start the application

The run failed (because I hadn’t added the command-line arguments) but at least IntelliJ was clever enough to set up a new Run / Debug configuration (called Jhove after the class I’d run) that I could then add the Program Arguments to – in this case, the same command-line arguments you’d run JHOVE with normally (e.g. the module you want to run, the handler you want to output the result with, the file you want to characterise, etc.).
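
For illustration, the Program Arguments field can hold the same string you would type after the jhove command in a terminal – something along these lines (TIFF-hul and XML are JHOVE’s standard TIFF module and XML output handler; the file paths are purely illustrative):

-m TIFF-hul -h XML -o /tmp/jhove-output.xml /data/tiffs/sample.tif

In other words: run the TIFF module, format the result with the XML handler, write it to /tmp/jhove-output.xml, and characterise /data/tiffs/sample.tif.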

Editing the Run configuration in IntelliJ

I could then add a breakpoint to the code in the Jhove.main() method and off I went… Or did I?

Step 3: setting up a config file

So this gave me what I needed to start stepping through the code. Unfortunately, my first attempt didn’t get any further than the initial Jhove.main() method… It got all the way through, but then the following error occurred:

Cannot instantiate module: com.mcgath.jhove.module.PngModule

The clue for how to fix this was actually provided by the debugger as it ran, and it’s a good example of the kind of insight you get from running code in debug mode in your IDE. Because the initial set of command-line parameters I was passing in from the Run / Debug configuration didn’t contain a “-c” parameter to set a config file, JHOVE was automagically picking up its configuration from a default location: i.e. the JHOVE/config folder in my user directory – which existed, with a config file, because I’d also installed JHOVE on my machine the easy way beforehand.

Debugger points towards the config file mix-up

A quick look at this config showed that JHOVE was expecting all sorts of modules to be available to load, one of which was the ‘external’ module for PNG characterisation mentioned in the error message. This is included in the JHOVE codebase, but in a separate folder (jhove-ext-modules): the build script that pulls JHOVE together for production deployment clearly copes with copying the PNG module from this location to the correct place, but the IDE couldn’t find it when debugging.

So the solution? Put a custom config file in place, and remove the parts that referenced the PNG module. This worked a treat, and allowed me to track the code execution all the way through for a test TIFF file.
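
For anyone trying to reproduce this, the part of the config to edit is the module list. JHOVE’s config file declares each module it should load as a <module> element wrapping the module’s Java class, so the entry to remove is the one referencing the PNG class named in the error above – something like the following (abridged; the exact layout may differ between versions, so compare it against the config file from your own installation):

<module>
  <class>com.mcgath.jhove.module.PngModule</class>
</module>

Deleting just that entry (and leaving the built-in modules such as the TIFF one in place) matches the “remove the parts that referenced the PNG module” step above.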

Adding an extra -c config file parameter and a custom config file.

Conclusion

Really, all the above, while making it possible to get under the skin of JHOVE, is just the start. Another blog post may follow regarding what I actually found when I ran through its processes and started to get an idea of how it worked (though as a bit of a spoiler, it wasn’t exactly pretty)…

But, given that JHOVE is more or less ubiquitous in digital preservation (i.e. all the major vended solutions wrap it up in their ingest processes in one way or another), hopefully more people will be encouraged to dive into it and learn how it works in more detail. (I guess you could just ‘read the manual’ – but if you’re a developer, doing it this way is more insightful, and more fun, too).

Electronic lab notebooks and digital preservation: part I

Outreach and Training Fellow, Sarah, writes about a trial of electronic lab notebooks (ELN) at Oxford. She discusses the requirements and purpose of the ELN trial and raises lingering questions around preserving the data from ELNs. This is part I of what will be a 2-part series.


At the end of June, James and I attended a training course on electronic lab notebooks (ELN). IT Services at the University of Oxford is currently running a trial of LabArchives’ ELN offering. This course was intended to introduce departments and researchers to the trial and to encourage them to start their own ELN.

Screenshot of a LabArchives electronic lab notebook

When selecting an ELN for Oxford, IT Services considered a number of requirements. Those that were most interesting from a preservation perspective included:

  • the ability to download the data to store in an institutional repository, like ORA-data
  • the ability to upload and download data in arbitrary formats and to have it bit-preserved
  • the ability to upload and download images without any unrequested lossy compression

Moving from paper-based lab notebooks to an ELN is intended to help a lot with compliance as well as collaboration. For example, the government requires every scientist to keep a record of every chemical used for their lifetime. This has a huge impact on the Chemistry Department; the best way to search for a specific chemical is to be able to do so electronically. There are also costs associated with storing paper lab notebooks. There’s also the risk of damage to the notebook in the lab. In some ways, an electronic lab notebook can solve some of those issues. Storage will likely cost less and the risk of damage in a lab scenario is minimised.

But how do we preserve that electronic record for every scientist for at least the duration of their life? And what about beyond that?

One of the researchers presenting on their experience using LabArchives’ ELN stated, “it’s there forever.” Even today, there’s still an assumption that data online will remain online forever. Furthermore, there’s an overall assumption that data will last forever. In reality, without proper management this will almost certainly not be the case. IT Services will be exporting the ELNs for backup purposes, but management and retention periods for those exports were not detailed.

There’s also a file upload limit of 250MB per individual file, meaning that large datasets will need to be stored somewhere else. There’s no limit to the overall size of the ELN at this point, which is useful, but individual file limits may prove problematic for many researchers over time (this has already been an issue for me when uploading zip files to SharePoint).

After learning how researchers (from PIs to PhD students) are using ELNs for lab work and having a few demos on the many features of LabArchives’ ELN, we were left with a few questions. We’ve decided to create our own ELN (available to us for free during the trial period) in order to investigate these questions further.

The questions around preserving ELNs are:

  1. Authenticity of research – are timestamps and IP addresses retained when the ELN is exported from LabArchives?
  2. Version/revision history – Can users export all previous versions of data? If not users, then can IT Services? Can the information on revision history be exported, even if not the data?
  3. Commenting on the ELN – are comments on the ELN exported? Are they retained if deleted in revision history?
  4. Export – What exactly can be exported by a user? What does it look like? What functionality do you have with the data? What is lost?

There’s potential for ELNs to open up collaboration and curation in lab work by allowing notes and raw data to be kept together, and by facilitating sharing and fast searching. However, the long-term preservation implications are still unclear and many still seem complacent about the associated risks.

We’re starting our LabArchives’ ELN now, with the hope of answering some of those questions. We also hope to make some recommendations for preservation and highlight any concerns we find.


Anyone have an experience preserving ELNs? What challenges and issues did you come across? What recommendations would you have for researchers or repository staff to facilitate preservation? 

Digital Preservation at Oxford Open Days

Oxford Fellow, Sarah, describes the DPOC team’s pop-up exhibition “Saving Digital,” held at the Radcliffe Science Library during Oxford Open Days #OxOpenDay. The post describes the equipment and games the team showcased over the two days and some of the goals they had in mind for this outreach work.


On 27 June and 28 June, Oxford ran Open Days for prospective students. The city was alive with open doors and plenty of activity. It was the perfect opportunity for us to take our roadshow kit out and meet prospective students with a pop-up exhibition called “Saving Digital”. The Radcliffe Science Library (RSL) on Parks Road kindly hosted the DPOC team and all of our obsolete media for two days in their lounge area.

The pop-up exhibition hosted at the RSL

We set up our table with a few goals in mind:

  • to educate prospective students about the rapid pace of technological change and the concern about how we’re going to read digital data off obsolete media in the future (we educated a few parents as well!)
  • to speak with library and university staff about their digital dilemmas and what we in the digital preservation team could do about them
  • to raise awareness about the urgency of and need for digital preservation in all of our lives and to inform more people about our project (#DP0C)

To achieve this, we first drew people in with two things: retro gaming and free stuff.

Last minute marketing to get people to the display. It worked!

Our two main games were the handheld game, Galaxy Invader 1000, and Frak! for the BBC Micro.

Frak! on the BBC Micro. The yellow handheld console to the right is Galaxy Invader 1000.

Galaxy Invader 1000 by CGL (1980) is a handheld game, which plays a version of Space Invaders. This game features a large multi-coloured display and 3 levels of skill. The whole game was designed to fit in 2 kilobytes of memory. 

Frak! (1984) was a game released for the BBC Micro under the Aardvark software label. It was praised for excellent graphics and game play. In the side-scrolling game, you play a caveman named Trogg. The aim of the game is to cross a series of platforms while avoiding dangers that include various monsters named Poglet and Hooter. Trogg is armed with a yo-yo for defence.

Second, we gave them some digestible facts, both in poster form and by talking with them:

Saving Digital poster

Third, we filled the rest of the table with obsolete media and handheld devices from about the last forty years—just a small sample of what was available! This let them hold some of the media of the past and marvel over how little it could hold, but how much it could do for its time. And then we asked them how they would read the data off it today. That probably concerned parents more than their kids, as several of them admitted to having important digital stuff still on VHS or miniDV tapes, or on 3.5-inch disks! It got everyone thinking at least.

A lot of obsolete media all in one place.

Lastly, we had an enthusiastic team with some branded t-shirts, made to emulate our most popular 1st generation badge, which was pink with a 3.5-inch disk in the middle. We gave away our last one during Open Days! But don’t worry, we have some great 2nd generation badges to collect now.

An enthusiastic team always helps. Especially if they are willing to demo the equipment.


A huge thank you to the RSL for hosting us for two days—we’ll be back on the 16th of July if you missed us and want to visit the exhibition! We’ll have a few extra retro games on hand and some more obsolete storage media!

Our poster was found on display in the RSL.

Update on the training programme pilot

Sarah, Oxford’s Outreach and Training Fellow, has been busy since the new year designing and running a digital preservation training programme pilot in Oxford. It consisted of one introductory course on digital preservation and six other workshops. Below is an update on what she did for the pilot and what she has learnt over the past few months.


It’s been a busy few months for me, so I have been quiet on the blog. Most of my time and creative energy has been spent working on this training programme pilot. In total, there were seven courses and over 15 hours of material. In the end, I trialled the courses on over 157 people from Bodleian Libraries and the various Oxford college libraries and archives. Many attendees were repeats, but some were not.

The trial gave me an opportunity to test out different ideas and various topics. Attendees were good at giving feedback, both during the course and after via an online survey. It’s provided me with further ideas and given me the chance to see what works or what doesn’t. I’ve been able to improve the experience each time, but there’s still more work to be done. However, I’ve already learned a lot about digital preservation and teaching.

Below are some of the most important lessons I’ve learned from the training programme pilot.

Time: You always need more

I found that I almost always ran out of time at the end of a course; it left no time for questions or to finish that last demo. Most of my courses could have either benefited from less content, shorter exercises, or just being 30 minutes longer.

Based on feedback from attendees, I’ll be making adjustments to every course. Some will be longer. Some will have shorter exercises with more optional components and some will have slightly less content.

While you might budget 20 minutes for an activity, you will likely use 5-10 minutes more. But it varies every time depending on the attendees. Some might have a lot of questions, but others will be quieter. It’s almost always better to overestimate the time and end early than to rush to cover everything. People need a chance to process the information you give them.

Facilitation: You can’t go it alone

In only one of my courses did I have to facilitate alone. I was run off my feet for the 2 hours because it was just me answering questions during exercises for 15 attendees. It doesn’t sound like a lot, but I had a hoarse voice by the end from speaking for almost 2 hours!

Always get help with facilitation—especially for workshops. Someone to help:

  • answer questions during exercises,
  • get some of the group idea exercises/conversations started,
  • make extra photocopies or print outs, and
  • load programs and files onto computers—and then help delete them after.

It is possible to run training courses alone, but having an extra person makes things run more smoothly and saves a lot of time. Edith and James have been invaluable support!

Demos: Worth it, but things often go wrong

Demos were vital to illustrate concepts, but they were also sometimes clunky and time-consuming to manage. I wrote up demo sheets to help. The demos relied on software or the Internet—both of which can and will go wrong. Patience is key; so is accepting that sometimes things will not go right. Processes might take a long time to run, or the course might conclude before the demo is over.

The more you practise on the computer you will be using, the more likely things will go right. But that’s not always an option. If it isn’t, always have a backup plan. Or just apologise, explain what should have happened and move on. Attendees are generally forgiving and sometimes it can be turned into a really good teaching moment.

Exercises: Optional is the way to go

Unless you put out a questionnaire beforehand, it is incredibly hard to judge the skill level of your attendees. It’s best to prepare for all levels. Start each exercise slowly and have a lot of optional work built in for people who work faster.

In most of my courses I was too ambitious for the time allowed. I wanted them to learn and try everything. Sometimes I wasn’t asking the right questions on the exercises either. Testing exercises and timing people is the only way to tailor them. Now that I have run the workshops and seen the exercises in action, I have a clearer picture of what I want people to learn and accomplish—now I just have to make the changes.

Future plans

There were courses I would love to run in the future (like data visualisation and digital forensics), but I did not have the time to develop them. I’d like to place them on a roadmap for future training, as well as reaching out more to the Oxford colleges, museums and other departments. I would also like to tailor the introductory course a bit more for different audiences.

I’d like to get involved with developing courses like Digital Preservation Carpentry that the University of Melbourne is working on. The hands-on workshops excited and challenged me the most. Not only did others learn a lot, but so did I. I would like to build on that.

At the end of this pilot, I have seven courses that I will finalise and make available through a Creative Commons licence. What I learned when trying to develop these courses is that there aren’t always a lot of good templates available on the Internet to use as a starting point—you have to ask around for people willing to share.

So, I am hoping to take the work that I’ve done and share it with the digital preservation community. I hope they will be useful resources that can be reused and repurposed. Or at the very least, I hope it can be used as a starting point for inspiration (basic speakers notes included).

These will be available via the DPOC website sometime this summer, once I have been able to make the changes necessary to the slides and exercises—along with course guidance material. It has been a rewarding experience (as well as an exhausting one); I look forward to developing and delivering more digital preservation training in the future.