Digital preservation with limited resources

What should my digital preservation strategy be if I do not have access to repository software or a DAMS?

At Oxford, we recently received this question from a group of information professionals working for smaller archives. This will be a familiar scenario for many – purchasing and running repository software requires a regular dedicated budget, which many archives in the UK do not currently have available to them.

So what interim solutions could an archive put in place to reduce its chances of losing digital collection content in the meantime? This blog post summarises some key points from our meeting with the archivists, and we hope that these may be useful for other organisations asking the same question.


Protect yourself against human error

CC-BY KateMangoStar, Freepik

Human error is one of the major risks to digital content. It is not uncommon for users to inadvertently drag files or folders, or to delete content by mistake. It is therefore important to have strict user restrictions in place which limit who can delete, move, and edit digital collections. For this purpose you need to ensure that you have defined an “archives directory” which is separate from any “working directories” where users can still edit and actively work with content.

If you have IT support available to you, then speak to them about setting up new user restrictions.


Monitor fixity

CC-BY Dooder, Freepik

However, even with strong user restrictions in place, human error can occur. In addition to enforcing stronger user restrictions in the “archives directory”, tools like Fixity from AVP can be used to spot if content has been moved between folders, deleted, or edited. By running regular Fixity reports an archivist can spot any suspicious looking changes.

We are aware that time constraints are a major factor which inhibits staff from adding additional tasks to their workload, but luckily Fixity can be set to run automatically on a weekly basis, providing users with an email report at the end of the week.
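At heart, fixity reports like these boil down to comparing checksum manifests over time. As a rough illustration of the principle (a minimal sketch, not the AVP Fixity tool itself), a Python script could build a manifest of a directory and flag anything added, removed, or changed since the last run:

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(directory):
    """Map each file's path (relative to the directory) to its checksum."""
    root = Path(directory)
    return {str(p.relative_to(root)): sha256_of(p)
            for p in root.rglob("*") if p.is_file()}

def compare_manifests(old, new):
    """Report files that have appeared, disappeared, or changed content."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    changed = {p for p in set(old) & set(new) if old[p] != new[p]}
    return added, removed, changed
```

A real deployment would persist the manifest between runs and schedule the comparison (for example weekly, as Fixity does), emailing the archivist when any of the three sets is non-empty.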


Understand how your organisation does back-ups

CC-BY Shayne_ch13, Freepik

A common IT retention period for back-ups of desktop computers is 14 days. The two week period enables disaster recovery of working environments, to ensure that business can continue as usual. However, a 14 day back-up is not the same as preservation storage and it is not a suitable solution for archival collections.

In this scenario, where content is stored on a file system with no versioning, the archivist only has 14 days to spot any issues and retrieve an older back-up before it is too late. So please don’t go on holiday or get ill for long! Even with tools like Fixity, fourteen days is an unrealistic turnaround time (if the issue is spotted at all).

If possible, try to make the case to your organisation that you require more varied types of back-ups for the “archival directory”. These should include back-ups which are retained for at least a year. Using a mix of tape storage and/or cloud service providers can be a less expensive way of storing additional back-ups which do not require ongoing access. It is an investment worth making.

As a note of warning though – you are still only dealing in back-ups. This is not archival storage. If there are issues with multiple back-ups (due to, for example, transfer or hardware errors) you can still lose content. The longer-term goal, once better back-ups are in place, should be to monitor the fixity of multiple copies of content from the “archival directory”. (For more information about the difference between back-ups used for regular IT purposes and storage for digital preservation, see the DPC Handbook.)


Check that your back-ups work
Once you have got additional copies of your collection content, remember to check that you can retrieve them again from storage.

Many organisations have been in the position where they think they have backed up their content – only to find out that their back-ups were not created properly when they need them. By testing retrieval you can protect your collections against this particular risk.
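Retrieval testing can be as simple as restoring a sample of files and comparing them byte-for-byte with the originals. The sketch below is illustrative only (it reads whole files into memory when hashing, so it suits modest collections):

```python
import hashlib
import random
from pathlib import Path

def checksum(path):
    """SHA-256 of a file's contents (whole file read at once; fine for small files)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def spot_check_restore(source_dir, restored_dir, sample_size=5):
    """Pick a random sample of files from the source directory and verify
    that the restored copies exist and match byte-for-byte.
    Returns a list of relative paths that failed the check (empty = all good)."""
    source = Path(source_dir)
    files = [p for p in source.rglob("*") if p.is_file()]
    sample = random.sample(files, min(sample_size, len(files)))
    failures = []
    for p in sample:
        restored = Path(restored_dir) / p.relative_to(source)
        if not restored.is_file() or checksum(p) != checksum(restored):
            failures.append(str(p.relative_to(source)))
    return failures
```

Running a spot check like this on a schedule, against both tape and cloud copies, would catch back-ups that were never written correctly well before the retention window closes.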


But… what do I do if my organisation does not do back-ups at all?
Although the 14 day back-up retention is common in many businesses, it is far from the reality within which certain types of archives operate. A small community organisation may, for example, do all its business on a laptop or workstation which is shared by all staff (including the archive).

This is a dangerous position to be in, as hardware failure can cause immediate and total loss. There is no magic bullet for solving this issue, but some of the advice which Sarah (Training and Outreach Fellow at Bodleian Libraries) has provided in her Personal Digital Archiving course could apply.

Considerations from Sarah’s course include:

  • Create back-ups on additional removable hard drive(s) and store them in a different geographical location from the main laptop/workstation
  • Make use of free cloud storage limits (do check the licenses though to see what you are agreeing to – it’s not where you would want to put your HR records!)
  • Again – remember to check your back-ups!
  • For digitized images and video, consider using the Internet Archive’s Gallery as an additional copy (note that this is open to the public, and requires assigning a CC-BY license)  (If you like the work that the Internet Archive does – you can donate to them here )
  • Apply batch-renaming tools to file names to ensure that they contain understandable metadata in case they are separated from their original folders
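The last point about batch renaming can be sketched in a few lines of Python. The prefix in the usage note below is a hypothetical collection identifier, and the dry-run default lets you review planned renames before committing to them:

```python
from pathlib import Path

def batch_rename(directory, prefix, dry_run=True):
    """Prepend a collection/date prefix to every file name, so files stay
    identifiable even if they are separated from their original folders.
    Returns the list of (old_path, new_path) pairs; with dry_run=True
    nothing is changed on disk."""
    renames = []
    # Materialise the file list first so we never rename while iterating.
    for p in sorted(Path(directory).rglob("*")):
        if p.is_file() and not p.name.startswith(prefix):
            target = p.with_name(f"{prefix}_{p.name}")
            renames.append((p, target))
            if not dry_run:
                p.rename(target)
    return renames
```

For example, `batch_rename("scans", "communityarchive-2018", dry_run=False)` would turn `photo1.jpg` into `communityarchive-2018_photo1.jpg` (the directory and prefix here are invented for illustration).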

(Email us if you would like to get a copy of Sarah’s lecture slides with more information)


Document all of the above

CC-BY jcomp, Freepik

Make sure to write down all the decisions you have made regarding back-ups, monitoring, and other activities. This allows for succession planning and ensures that you have a paper trail in place.


Stronger in numbers

CC-BY, Kjpargeter, Freepik

Licenses, contracts and ongoing management are expensive. Another avenue to consider is looking to peer organisations to lower some of these costs. This could include entering into joint contracts with tape storage providers, or consortium models for using repository software. An example of an initiative which has done this is the NEA (Network Electronic Archive) group, which has run an established repository supporting 28 small Danish archives for over ten years.


Summary:
These are some of the considerations which may lower the risk of losing digital collections. Do you have any other ideas (or practical experience) of managing and preserving digital collections with limited resources, and without using a repository or DAMS system?

Project update

A project update from Edith Halvarsson, Policy and Planning Fellow at Bodleian Libraries. 


Ms Arm.e.1, Folio 23v

Bodleian Libraries’ new digital preservation policy is now available to view on our website, after having been approved by Bodleian Libraries’ Round Table earlier this year.

The policy articulates Bodleian Libraries’ approach and commitment to digital preservation:

“Bodleian Libraries preserves its digital collections with the same level of commitment as it has preserved its physical collections over many centuries. Digital preservation is recognized as a core organizational function which is essential to Bodleian Libraries’ ability to support current and future research, teaching, and learning activities.”

 

Click here to read more of Bodleian Libraries’ policies and reports.

In other related news, we are currently in the process of ratifying a GLAM (Gardens, Libraries and Museums) digital preservation strategy which is due for release after the summer. Our new digitization policy is also in the pipeline and will be made publicly available. Follow the DPOC blog for future updates.

Digital Preservation Roadshow – Part 2

Building on the success of CUL’s digital preservation roadshow kit, the Oxford fellows have begun assembling a local version. The kit is a mixture of samples of old hardware, storage technology, quiz activities, and general “digital preservation swag”.

Pens, pins, and a BBC Micro

We were able to trial run it as part of a GLAM (Gardens, Libraries and Museums) showcase at the Weston Library this January. Among the showcase attendees’ favourite items were an early floppy disk camera (c.1998) and our BBC Micro computer (1981).

Sony Digital Mavica (MVC-FD7) 

Technical Fellow James Mooney at the Oxford GLAM Showcase

Our floppy disk camera was among the first in the Mavica “FD” series from Sony. Sony produced 3.5″ floppy disk cameras from late 1997 until 2002 (when it moved on to the Mavica CD line). The MVC-FD7 takes 8-bit images which can easily be transferred to a home computer. This is one of the reasons the Mavica FD series was so popular: the FAT12 file system and the widespread adoption of 3.5″ floppy disk drives in computers made transfer a simple and quick task.

It is easy to forget that the floppy disk camera is really the grandfather of the microSD card!

 

 

BBC Micro

The BBC Micro is well known by most British people who went to school in the 1980s and ’90s – but even today some UK classrooms will feature a BBC Micro for more nostalgic reasons. The BBC Microcomputer series was designed and built by Acorn for the BBC Computer Literacy Project. Most schools in the UK adopted the system, and for many children BBC BASIC was the first programming language they learnt.

There is to this day a cult following of BBC Micro educational games, such as Granny’s Garden (1983).


The kit will be displayed in different Oxford libraries throughout 2018 to promote the DPOC training programme and raise awareness of Bodleian Libraries’ new digital preservation policy.


Advocating for digital preservation

Bodleian Libraries and Cambridge University Library are entering into the last phase of the DPOC project, where they are starting to write up business cases for digital preservation. In preparation, the Fellows attended DPC’s “advocacy briefing day” in London.  Policy and Planning Fellow, Edith, blogs about some of the highlights and lessons from the day.


This week I had the pleasure of attending DPC’s advocacy training day. It was run by Catherine Heaney, the founder of DHR Communications and a veteran when it comes to advocating for digital heritage. Before the event I thought I had a clear idea of what advocacy means in broad terms: you invite yourself into formal meetings and try to deliver measured facts and figures which will be compelling to the people in front of you – right?

Well… not quite, it turns out. Many of these assumptions were turned on their head during the session. Here are my four favourite pieces of (sometimes surprising) advocacy advice from Catherine.

Tip 1: Advocacy requires tenaciousness

The scenario described above is what communications professionals might call “the speech” – but it is only one small part of effective advocacy. “The digital preservation speech” is important, but it is not necessarily where you will get the most buy-in for digital preservation. Research has shown that one-off communications like these are usually not effective.

In fact, all of those informal connections and conversations you have with colleagues also count as advocacy, and may reap greater benefits due to their frequency. And if one of these colleagues is themselves talented at influencing others, they can be invaluable in advocating for digital preservation when you are not there in person.

Lesson learnt: you need to keep communicating the message whenever and wherever you can if you want it to seep into people’s consciousness. Since digital preservation issues do not crop up that often in popular culture and the news, it is up to us to deliver, re-deliver… and then re-deliver the message if we want it to stick.

Tip 2: Do your background research

When you know that you will be interacting with colleagues and senior management, it is important to do your background research and find out which argument will most appeal to the person you are meeting. Having a bog-standard ‘speech’ about digital preservation which you pull out on all occasions is not the most effective approach. To make your case, the problem you are attempting to solve should reflect the goals and challenges facing the person you are advocating to.

The aspects which appeal about digital preservation will be different depending on the role, concerns and responsibilities of the person you are advocating to. Are they concerned with:

  • Legal or reputational risk?
  • Financial costs and return on investment?
  • Being seen as someone at the forefront of the digital preservation field?
  • Creating reproducible research?
  • Collecting unique collections?
  • Or perhaps about the opportunity to collaborate cross-institutionally?

Tip 3: Ensure that you have material for a “stump speech” ready

Tailoring your message to the audience is important, and this will be easier if you have material ready at hand which you can pick and choose from. Catherine suggested preparing a folder of stories, case studies, data and facts about digital preservation which you can cut and paste from to suit the occasion.

What is interesting though is the order of that list of “things to collect”:

  1. Stories
  2. Case studies
  3. Data and facts

The ranking is intentional. We tend to think that statistics and raw data will convince people, as these appeal to their logic. In fact, your argument will be stronger if your pitch starts with a narrative (a story) about WHY we need digital preservation, with case studies to illustrate your point. Catherine advises that it is then, when the audience is listening, that you bring out the data and facts. This approach is both more memorable and more effective in capturing your audience’s attention.

Tip 4: Personalise your follow up

This connects to tip 2 – about knowing your audience. Catherine advised that, although it may feel strange at first, writing a personalised follow up message is a very effective tool. When you do have the chance to present your case to an important group within your organisation, the follow up message can further solidify that initial pitch (again – see tip 1 about repeated communication).

By taking notes on the concerns or points made during a meeting, you have the opportunity to write personalised messages which capture and refer back to the concerns raised by each particular person. The personalised message also has the additional benefit of opening up a channel for future communication.


This was just a small subsection of all the interesting things we talked about on the advocacy briefing day. For some more information have a look at the hashtag for the day #DPAdvocacy.

Using ePADD with Josh Schneider

Edith, Policy and Planning Fellow at Bodleian Libraries, writes about her favourite features in ePADD (an open source software for email archives) and about how the tool aligns with digital preservation workflows.


At iPres a few weeks ago I had the pleasure of attending an ePADD workshop run by Josh Schneider from Stanford University Libraries. The workshop was one of the major highlights of the conference for me, as I have been keen to try out ePADD since first hearing about it at DPC’s Email Preservation Day. I wrote a blog post about that event back in July, and have now finally taken the time to review ePADD using my own email archive.

ePADD is primarily a tool for appraisal and delivery rather than for digital preservation. However, as a potential component in ingest workflows to an institutional repository, ensuring that email content retains integrity during processing in ePADD is paramount. The creators of ePADD are therefore thinking about how to enhance current features to make the tool fit better into digital preservation workflows. I will discuss these features later in this post, but first I wanted to show some of ePADD’s capabilities. I can definitely recommend having a play with this tool yourself, as it is very addictive!

ePADD: Appraisal module dashboard

Josh, our lovely workshop leader, recommends that new ePADD users go home and try it on their own email collections. As you know your own material fairly well, it is a good way of learning about both what ePADD does well and its limits. So I decided to feed my work emails from the past year into ePADD – and found some interesting trends in my own working patterns.

ePADD consists of four modules, although I will only be showing features from the first two in this blog:

Module 1: Appraisal (Module used by donors for annotation and sensitivity review of emails before delivering them to the archive)

Module 2: Processing (A module with some enhanced appraisal features used by archivists to find additional sensitive information which may have been missed in the first round of appraisal)

Module 3: Discovery (A module which provides users with limited keyword searching for entities in the email archive)

Module 4: Delivery (This module provides more enhanced viewing of the content of the email archive – including a gallery for viewing images and other document attachments)

Note that ePADD only supports MBOX files, so if you are an Outlook user like myself you will first need to convert from PST to MBOX. After you have created an MBOX file, setting up ePADD is fairly simple and quick. Once the first ePADD module (“Appraisal”) was up and running, processing my 1,500 emails and 450 attachments took around four minutes, including the natural language processing. ePADD recognises and indexes various “entities” – including persons, places and events – and presents these in a digestible way.
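Because MBOX is a widely supported aggregation format, a converted file can also be inspected outside ePADD before ingest. As an illustration (using Python's standard-library `mailbox` module, not part of ePADD itself), you could sanity-check the message and attachment counts of a converted archive:

```python
import mailbox

def summarise_mbox(path):
    """Count messages and attachments in an MBOX file.
    Parts that carry a filename are treated as attachments."""
    box = mailbox.mbox(path)
    messages = 0
    attachments = 0
    for msg in box:
        messages += 1
        for part in msg.walk():
            if part.get_filename():
                attachments += 1
    return messages, attachments
```

Comparing these counts against what the source email client reports (1,500 emails and 450 attachments in my case) gives a quick first check that the PST-to-MBOX conversion did not silently drop content.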

ePADD: Appraisal module processing MBOX file

Looking at the entities recognised by ePADD, I was able to see who I have been speaking with (and about) during the past year. There were some unsurprising figures that popped up (such as my DPOC colleagues James Mooney and Dave Gerrard). However, curiously, I also seem to have received a lot of messages about the “black spider” this year (it turns out they were emails from the Libraries’ Dungeons and Dragons group).

ePADD entity type: Person (some details removed)

An example of why you need to look deeper at the results of natural language processing was evident when I looked under the “place entities” list in ePADD:

ePADD entity type: Place

San Francisco comes highest on the list of places mentioned in my inbox. I was initially quite surprised by this result. Looking a bit closer, all 126 emails containing a mention of San Francisco turned out to be from Slack. Slack is an instant messaging service used by the DPOC team, and it has its headquarters in San Francisco; every email digest from Slack contains the head office address!

Another one of my favourite things about ePADD is its ability to track frequency of messages between email accounts. Below is a graph showing correspondence between myself and Sarah Mason (outreach and training fellow on the DPOC project). The graph shows that our peak period of emailing each other was during the PASIG conference, which DPOC hosted in Oxford at the start of September this year. It is easy to imagine how this feature could be useful to academics using email archives to research correspondence between particular individuals.

ePADD displaying correspondence frequency over time between two users

The last feature I wanted to talk about is “sensitivity review” in ePADD. Although I annotate personal data I receive, I thought that the one year mark of the DPOC project would also be a good time to run a second sensitivity review of my own email archive. Using ePADD’s “lexicon hits search” I was able to sift through a number of potentially sensitive emails. See the image below for the categories identified, which cover everything from employment to health. These all turned out to be false positives, but it is a feature I believe I will make use of again.

ePADD processing module: Lexicon hits for sensitive data

So now on to the digital preservation bit. There are currently three risks of using ePADD, in preservation terms, which stand out to me.

1) For practical reasons, MBOX is currently the only email format supported by ePADD. If MBOX is not the preferred preservation format of an archive, it may end up running multiple migrations between email formats, resulting in progressive loss of data.

2) No checksums are generated when you download content from one ePADD module in order to copy it into the next. This could be an issue, as emails are copied multiple times without any monitoring of the integrity of the email archive files.

3) There is currently limited support for assigning multiple identifiers to archives in ePADD. This could become an issue when trying to aggregate email archives from different institutions: local identifiers could clash, so additional unique identifiers would also be required.

Note however that these concerns are already on the ePADD roadmap, so they are likely to improve or even be solved within the next year.

To watch out for ePADD updates, or just to have a play with your own email archive (it is loads of fun!), check out the ePADD website.

PASIG 2017 Twitter round-up

After many months of planning it feels quite strange to us that PASIG 2017 is over. Hosting the PASIG conference in Oxford has been a valuable experience for the DPOC fellows and a great chance for Bodleian Libraries’ staff to meet with and listen to presentations by digital preservation experts from around the world.

In the end 244 conference delegates made their way to Oxford and the Museum of Natural History. The delegates came from 130 different institutions and every continent of the world was represented (…well, apart from Antarctica).

What was especially exciting, though, were all the new faces. In fact, two-thirds of the delegates this year had not been to a PASIG conference before! Is this perhaps a sign that interest in digital preservation is on the rise?

As always at PASIG, Twitter was ablaze with discussion, in spite of an at times flaky Wi-Fi connection. Over three days #PASIG17 was mentioned a whopping 5,300 times on Twitter and had a “reach” of 1.7 million. Well done everyone on some stellar outreach! The most active tweeting came from the UK, the USA and Austria.

Twitter activity by country using #PASIG17 (Talkwalker statistics)

Although it is hard to choose favourites among all the Tweets, a few of the DPOC project’s personal highlights included:

Cambridge Fellow Lee Pretlove lists “digital preservation skills” and why we cannot be an expert in all areas. Tweet by Julian M. Morley

Bodleian Fellow James makes some insightful observations about the incompatibility between tar pits and digital preservation.

Cambridge Fellow Somaya Langley presents in the last PASIG session on the topic of “The Future of Digital Preservation”.  

What were some of your favourite talks and Twitter conversations? What would you like to see more of at PASIG 2018? #futurePASIG

Visit to the Parliamentary Archives: Training and business cases

Edith Halvarsson, Policy and Planning Fellow at Bodleian Libraries, writes about the DPOC project’s recent visit to the Parliamentary Archives.


This week the DPOC fellows visited the Parliamentary Archives in London. Thank you very much to Catherine Hardman (Head of Preservation and Access), Chris Fryer (Digital Archivist) and Grace Bell (Digital Preservation Trainee) for having us. Shamefully, I have to admit that we have been very slow to make this trip; Chris first invited us to visit all the way back in September last year! However, our tardiness in making our way to Westminster was in the end aptly timed with the completion of year one of the DPOC project and planning for year two.

Like CUL and Bodleian Libraries, the Parliamentary Archives first began their own digital preservation project back in 2010. Their project has since, as of 2015, transitioned into an ongoing digital preservation programme. As CUL and Bodleian Libraries will begin drafting business cases for moving from project to programme in year two, meeting with Chris and Catherine was a good opportunity to talk about how to start making that tricky transition.

Of course, every institution has its own drivers and risks which influence business cases for digital preservation, but certain things will sound familiar to a lot of organisations. For example, what the Parliamentary Archives have found over the past seven years is that advocacy for digital collections and training staff in digital preservation skills are ongoing activities. Implementing solutions is one thing; maintaining them is another. This, in addition to staff who have received digital preservation training eventually moving on to new institutions, means that you constantly need to stay on top of advocacy and training. Making “the business case” is therefore not a one-off task.

Another central challenge in building business cases is how you frame digital preservation as a service rather than as “an added burden”. The idea of “seamless preservation” with no human intervention is a very appealing one to already burdened staff, but in reality workflows need to be supervised and maintained. To sell digital preservation, that extra work must therefore be perceived as something which adds value to collection material and the organisation. It is clear that physical preservation adds value to collections, but the argument for digital preservation can be a harder sell.

Catherine had, however, some encouraging comments on how we can turn advice about digital preservation into something perceived as value-adding. Being involved with and talking to staff early on in the design of new project proposals – rather than as an extra add-on after processes are already in place – is one example.

Image by James Mooney

All in all, it was a valuable and encouraging visit to the Parliamentary Archives. The DPOC fellows look forward to keeping in touch – particularly to hear more about the great work the Parliamentary Archives have been doing to provide digital preservation training to staff!

Email preservation: How hard can it be?

Policy and Planning Fellow Edith summarises some highlights from the Digital Preservation Coalition’s briefing day on email preservation. See the full schedule of speakers on DPC’s website.


Yesterday Sarah and I attended DPC’s briefing day on email preservation at the National Archives (UK) in Kew, London. We were keen to go and hear about the latest findings from the Email Preservation Task Force, as Sarah will be developing a course dedicated to email preservation for the DPOC teaching programme. An internal survey circulated to staff in Bodleian Libraries earlier this year showed a real appetite for learning about email preservation. It is an issue which evidently spans several areas of our organisation.

The subheading of the event “How hard can it be?” turned out to be very apt. Before even addressing preservation, we were asked to take a step back and ask ourselves:

“Do I actually know what email is?”

As Kate Murray from the Library of Congress put it: “email is an object, several things and a verb”. In this sense email has much in common with the World Wide Web, as they are heavily linked and complex objects. Retention decisions must be made, not only about text content but also about email attachments and external web links. In addition, supporting features (such as instant messaging and calendars) are increasingly integrated into email services and potential candidates for capture.

Thinking about email “as a verb” also highlights that it is a cultural and social practice. Capturing relationships and structures of communication is an additional layer to preserve. Anecdotally, some participants on the Email Preservation day had found that data mining, including the ability to undertake analysis across email archives, is increasingly in demand from historians using big data research techniques.

Anthea Seles, National Archives (UK), talks about visualisation of email archives.

What are people doing?

So what are organisations currently doing to preserve email? A strength of the Email Preservation Task Force’s new draft report is that it draws together sample workflows currently in use by other organisations (primarily US-based). Additional speakers from Preservica, the National Archives and the British Library supplemented these with some local examples from the UK throughout the day.

The talks and the report show that migration is by far the most common approach to email preservation in the institutions consulted. EML and Mbox are the most common target formats; they take different approaches to storage, with EML storing single messages and Mbox aggregating messages in a single database file. (However, beware that Mbox is a whole family of formats with varying levels of documentation!)

While some archives choose to ingest Mbox and EML files into their repositories without further processing, others choose to unpack content within these files. Unpacking content provides a mode of displaying emails, as well as the ability to normalise content within them.

The British Library, for example, have chosen to unpack email files using Aid4Mail, and are attempting to replicate the message hierarchy within a folder structure. Using Aid4Mail, they migrate text from email messages to PDF/A-2b, which is displayed alongside folders containing any email attachments. The PDF/A-2b files can then be validated using veraPDF or other tools. A CSV manifest is also generated and entered into relevant catalogues. Preservica’s out-of-the-box workflow is very similar to the British Library’s, although they choose to migrate text content to HTML or UTF-8 encoded text files.

Another tantalising example (which I can imagine will gain more traction in the future) came from one institution which has used Emulation as a Service to provide access to one of its email collections. An emulation approach allows it to provide access to content within the original operating environment used by the donor of the email archive. Its particular strength is that email attachments, such as images and word processing files, can be viewed in contemporary software (provided licenses can be acquired for the software itself).

Finally, a tool which was considered or already in use by many of the contributors is ePADD. ePADD is an open source tool developed by Stanford University Libraries. It provides functions for processing and appraisal of Mbox files, but also has many interesting features for exploring the social and cultural aspect of email. ePADD can mine emails for subjects such as places, events and people. En masse, these subjects provide researchers with a much richer insight into trends and topics within large email archives. (Tip: why not have a look at the ePADD discovery module to see it in practice?)

What do we still need to explore?

It is encouraging that people are already undertaking preservation of email and that there are workflows out there which other organisations can adopt. However, there are many questions and issues still to explore.

  1. Current processes cannot fully capture the interlinked nature of email archives. Questions were raised during the day about the potential of describing archives using linked open data in order to amalgamate separate collections. Email archives may become more valuable to historians as they acquire critical mass.
  2. Other questions were raised around whether or not archives should also crawl web links within emails. Links to external content may be crucial for understanding the context of a message, but this becomes a very tricky issue if emails are accessioned years after creation. If webpages are crawled and associated with an email message years after it was sent, serious doubt is raised about the reliability of the email as a record.
  3. The issue of web links also raises the question of when email harvesting should occur. Would it be better if emails were continually harvested into the archive/records management system, rather than waiting until a member of staff leaves their position? The good news is that many email providers are increasingly documenting and providing APIs to their services, meaning that continual harvesting may become more feasible in the future.
  4. As seen in many of the sample workflows from the Email Preservation Task Force report, email files are often migrated multiple times. Especially as ePADD works with Mbox, some organisations end up adding an additional migration step in order to use the tool before normalising to EML. There is currently very little available literature on the impact of migrations, and indeed multiple migrations, on the information content of emails.

What can you do now to help?    

So while there are some big technical and philosophical challenges, the good news is that there are things you can do to contribute right now. You can:

  • Become a “Friend of the Email Preservation Task Force” and help them review new reports and outputs
  • Contribute your organisation’s workflows to the Email Preservation Task Force report, so that they can be shared with the community
  • Run trial migrations between different email formats such as PST, Mbox and EML, and blog about your findings
  • Support open source tools such as ePADD through either financial aid or (if you are technically savvy) your time. We rely heavily on these tools and need to work together to make them sustainable!

Overall, the Email Preservation day was very inspiring and informative, and I cannot wait to hear more from the Email Preservation Task Force. Were you also at the event, and do you have other highlights to add? Please comment below!

Policy ramblings

For the second stage of the DPOC project Oxford and Cambridge have started looking at policy and strategy development. As part of the DPOC deliverables, the Policy and Planning Fellows will be collaborating with colleagues to produce a digital preservation policy and strategy for their local institutions. Edith (Policy and Planning Fellow at Oxford) blogs about what DPOC has been up to so far.


Last Friday I met with Somaya (Policy and Planning Fellow) and Sarah (Training and Outreach Fellow) at the British Library in London. We spent the day discussing the review of digital preservation policies which DPOC has undertaken so far. The meeting also gave us a chance to outline an action plan for consulting stakeholders at CUL and Bodleian Libraries on future digital preservation policy development.

Step 1: Policy review work
Much work has already gone into researching digital preservation policy development [see for example the SCAPE project and OSUL’s policy case study]. As considerable effort has been exerted in this area, we want to make sure we are not reinventing the wheel while developing our own digital preservation policies. We therefore started by reading as many digital preservation policies from other organisations as we could possibly get our hands on. (Once we ran out of policies in English, I started feeding promising-looking documents into Google Translate, with mixed results.) The policy review drew attention to aspects of policies which we felt were particularly successful, and which could potentially be re-purposed for the local CUL and Bodleian Libraries contexts.

My colleague Sarah helped me with the initial policy review work. Between the two of us we read 48 policies dating from 2008 to 2017. However, determining which documents were actual policies was trickier than we had first anticipated. We found that documents named ‘strategy’ sometimes read as policy, and documents named ‘policy’ sometimes read more like low-level procedures. For this reason, we decided to add to the review another 12 strategy documents which had strong elements of policy in them. This brought us up to a round 60 documents in total.

So we began reading… but we soon found that once you are on your tenth policy of the day, you start to get them muddled up. To better organise our review work, we decided to use a classification system developed by Kirsten Snawder (2011) and adapted by Madeline Sheldon (2013). Snawder and Sheldon identified nineteen common topics from digital preservation policies, ranging from ‘access and use’ to ‘preservation planning’ [for the full list of topics, see Sheldon’s article on The Signal from 2013]. I was interested in seeing how many policies would make direct reference to the Open Archival Information System (OAIS) reference model, so I added this as an additional topic to the original nineteen identified by Snawder and Sheldon.
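Once each policy is annotated against a topic list like this, the tallying itself is easy to automate. The short Python sketch below counts topic coverage across a set of annotated policies; the policy names and topic strings are purely illustrative, not the actual review data.

```python
from collections import Counter

def topic_coverage(annotations):
    """Count how many policies mention each classification topic.

    `annotations` maps a policy name to the set of topics spotted
    in that document; each topic is counted once per policy.
    """
    tally = Counter()
    for topics in annotations.values():
        tally.update(set(topics))
    return tally

# Hypothetical annotations for two reviewed policies:
example = {
    "Policy A": {"access and use", "preservation planning", "OAIS"},
    "Policy B": {"access and use"},
}
# topic_coverage(example)["access and use"] == 2
```

A tally like this makes it simple to ask questions such as "how many of the 60 documents reference OAIS?" without re-reading the annotations by hand.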

Reviewing digital preservation policies written between 2008-2017

Step 2: Looking at findings
Interestingly, after we finished annotating the policy documents we did not find a correlation between covering all of Snawder and Sheldon’s nineteen topics and having what we perceived as an effective policy. Effective in this context was defined as the ability of the policy to clearly guide and inform preservation decisions within an organisation. In fact, the opposite was more common as we judged several policies which had good coverage of topics from the classification system to be too lengthy, unclear, and sometimes inaccessible due to heavy use of digital preservation terminology.

In terms of OAIS, another interesting finding was that 33 out of 60 policies made direct reference to the OAIS. In addition to these 33, several of the ones which did not make an overt reference to the model still used language and terminology derived from it.

So while we found that the taxonomy was not able to guide us on which policy topics were an absolute essential in all circumstances, using it was a good way of arranging and documenting our thoughts.

Step 3: Thinking about guiding principles for policy writing
What this foray into digital preservation policies has shown us is that there is no ‘one size fits all’ approach or magic formula of topics which makes a policy successful. What works in the context of one institution will not necessarily work in another. What ultimately makes a policy successful also comes down to how well it is communicated and taken up across the organisation. However, there are a number of high-level principles which the three of us felt strongly about and which we would like to guide future digital preservation policy development at our local institutions.

Principle 1: Policy should be accessible to a broad audience. Contrary to findings from the policy review, we believe that digital preservation specific language (including OAIS) should be avoided at policy level if possible. While reviewing policy statements we regularly asked ourselves:

“Would my mother understand this?”

If the answer is yes, the statement gets to stay. If it is no, maybe consider re-writing it. (Of course, this does not apply if your mother works in digital preservation.)

Principle 2: Policy also needs to be high-level enough that it does not require constant re-writing in order to make minor procedural changes. In general, including individuals’ names or prescribing specific file formats can make a policy go out of date quickly. It is easier to change these if they are included in lower level procedures and guidelines.

Principle 3: Digital preservation requires resources. Securing financial commitment at policy level to invest in staff is important. It takes time to build organisational expertise in digital preservation, but losing it can happen a lot more quickly. Even if you choose to outsource several aspects of digital preservation, it is important that staff have the skills to understand and critically assess the work of external digital preservation service providers.

What are your thoughts? Do you have other principles guiding digital preservation policy development in your organisations? Do you agree or disagree with our high-level principles?

Over 20 years of digitization at the Bodleian Libraries

Policy and Planning Fellow Edith writes an update on some of her findings from the DPOC project’s survey of digitized images at the Bodleian Libraries.


Between August and December 2016 I collated information about Bodleian Libraries’ digitized collections. As an early adopter of digitization technology, the Bodleian Libraries have made digital surrogates of their collections available online since the early 1990s. A particular favourite of mine, and a landmark among the Bodleian Libraries’ early digital projects, is the Toyota Transport Digitization Project (1996). [Still up and running here]

At the time of the Toyota project, digitization was still highly specialised, and the Bodleian Libraries opted to outsource the digital part to Laser Bureau London. Laser Bureau ‘digitilised’ 35mm image negatives supplied by the Bodleian Libraries’ imaging studio and sent the files over on a big bundle of CDs: 1,244 images in all, which was a massive achievement at the time. It is staggering to think that we could now produce the same many times over in just a day!

Since the Toyota project’s completion twenty years ago, the Bodleian Libraries have continued large-scale digitization activities in-house via their commercial digitization studio, outsourced to third-party suppliers, and in project partnerships. With generous funding from the Polonsky Foundation, the Bodleian Libraries are now set to add over half a million image surrogates of Special Collections manuscripts to their image portal, Digital.Bodleian.

What happens to 20 years’ worth of digitized material? Since 1996 both the Bodleian Libraries and digitization standards have changed massively. Early challenges around storage alone meant that content was inevitably squirreled away in odd locations, and created to the varied standards of the time. Profiling our old digitized collections is the first step to figuring out how they can be brought into line with current practice and made more visible to library users.

“So what is the extent of your content?”, librarians from other organisations have asked me several times over the past few months. In the hope that it will be useful for other organisations trying to profile their legacy digitized collections, I thought I would present some figures here on the DPOC blog.

When tallying up our survey data, I came to a total of approximately 134 million master images, primarily in TIFF and JP2 format. In very early digitization projects, however, the idea of ‘master files’ had not yet developed, and master and access files will in these cases often be one and the same.

The largest proportion of the content, some 127,000,000 compressed JP2s, was created as part of the Google Books project up to 2009 and is available via Search Oxford Libraries Online. These add up to 45 TB of data. The library further holds three archives of digitized image content, totalling 5.8 million images (99.4 TB), primarily created in TIFF by the Bodleian Libraries’ in-house digitization studio. These figures do not include back-ups, with which we start getting into quite big numbers.

Of the remaining 7 million digitized images which are not from the Google Books project, 2,395,000 are currently made available on a Bodleian Libraries website. In total, the survey examined content from 40 website applications and 24 exhibition pages. At the time of the survey, 44% of the images made available online were hosted on Digital.Bodleian, 4% on ODL Greenstone and 1% on Luna. The latter two are currently in the process of being moved onto Digital.Bodleian. At least 6% of content from the sample was duplicated across multiple website applications and is a candidate for deduplication. Another interesting finding from the survey is that JPEG, JP2 (transformed to JPEG on delivery) and GIF are by far the most common access/derivative formats on Bodleian Libraries’ website applications.

The final digitized image survey report has now been reviewed by the Digital Preservation Coalition and is being looked at internally. Stay tuned to hear more in future blog posts!