Validating half a million TIFF files. Part Two.

Back in May, I wrote a blog post about preparing the groundwork for the process of validating over 500,000 TIFF files which were created as part of a Polonsky Digitization Project which started in 2013. You can read Part One here on the blog.

Restoring the TIFF files from tape

Stack of backup tapes. Photo: Amazon

For the digitization workflow we used Goobi, and within that process the master TIFF files from the project were written to tape. To actually check these files, it was obvious we would need to restore all of the content to spinning disk. I duly made a request to our system administration team and waited.

As I mentioned in Part One, we had set up a new virtualised server with access to a chunk of network storage. The Polonsky TIFF files were restored to this network storage; however, midway through the restoration from tape, the tape server’s operating system crashed… disaster.

After reviewing the failure, it appeared that a bug within the Red Hat operating system had caused the problem. This issue proved to be a good lesson: a tape backup copy is only useful if you can actually restore it!

Question for you. When was the last time you tried to restore a large quantity of data from tape?

After some head scratching, patching and a review of the related systems, a second attempt at restoring all the TIFF content from tape commenced and this time all went well and the files were restored to the network storage. Hurrah!

JHOVE to validate those TIFFs

I decided that for the initial validation of the TIFF files – checking the files were well-formed and valid – JHOVE would provide a good baseline report.

As I mentioned in another blog post, Customizable JHOVE TIFF output handler anyone?, JHOVE’s XML output is rather unwieldy, so I planned to transform the XML using xsltproc (a command line XSLT processor) with a custom XSLT stylesheet. This would allow us to select any of the attributes from the file which we might want to report on later, and would produce a simple CSV output.
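As an illustration, a single-file version of that pipeline might look something like the sketch below (the stylesheet name tiff-to-csv.xsl is hypothetical, and the exact JHOVE invocation will depend on your installation):

```bash
# Validate one TIFF with JHOVE's TIFF module, emit the XML report to stdout,
# then flatten it into one CSV row with a custom (hypothetical) stylesheet.
jhove -h XML -m TIFF-hul master_0001.tif | xsltproc tiff-to-csv.xsl -
```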

On a side note, work on adding a CSV output handler to JHOVE is in progress! This would mean the above process would be much simpler and quicker.

Parallel processing for the win.

What’s better than one JHOVE process validating TIFF content? Two! (well actually for us, sixteen at once works out quite nicely.)

It was clear from some initial testing with a 10,000 sample set of TIFF files that a single JHOVE process was going to take a long time to process 520,000+ images (around two and a half days!).

So I started to look for a simple way to run many JHOVE processes in parallel. Using GNU Parallel seemed like a good way to go.

I created a command line BASH script which would take a list of directories to scan and then utilise GNU Parallel to fire off many JHOVE + XSLT processes, resulting in a CSV output with one line per TIFF file processed.
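The core of such a script might look roughly like the following (a sketch only, assuming 16 jobs and the same hypothetical tiff-to-csv.xsl stylesheet as above; the real script would also need logging and error handling):

```bash
#!/bin/bash
# Usage: ./validate-tiffs.sh dirs.txt output.csv
# dirs.txt lists one directory to scan per line.
DIR_LIST="$1"
OUTPUT_CSV="$2"

# Find every TIFF under the listed directories, then run JHOVE + XSLT on each
# file, 16 at a time. GNU Parallel groups each job's output, so CSV rows from
# different files do not interleave.
xargs -a "$DIR_LIST" -I{} find {} -type f -name '*.tif*' \
  | parallel -j 16 "jhove -h XML -m TIFF-hul {} | xsltproc tiff-to-csv.xsl -" \
  > "$OUTPUT_CSV"
```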

As our validation server was virtualised, it meant that I could scale the memory and CPU cores in this machine to do some performance testing. Below is a chart showing the number of images that the parallel processing system could handle per minute vs. the number of CPU cores enabled on the virtual server. (For all of the testing the memory in the server remained at 4 GB.)

So with 16 CPU cores, the estimate was that it would take around 6-7 hours to process all the Polonsky TIFF content – a nice improvement on a single process.

At the start of this week, I ran a full production test, validating all 520,000+ TIFF files. Four and a half hours later the process was complete and a 100 MB+ CSV file had been generated with 520,000+ rows of data. Success!

For Part Three of this story I will write up how I plan to visualise the CSV data in Qlik Sense and the further analysis of those few files which failed the initial validation.

Visit to the Parliamentary Archives: Training and business cases

Edith Halvarsson, Policy and Planning Fellow at Bodleian Libraries, writes about the DPOC project’s recent visit to the Parliamentary Archives.


This week the DPOC fellows visited the Parliamentary Archives in London. Thank you very much to Catherine Hardman (Head of Preservation and Access), Chris Fryer (Digital Archivist) and Grace Bell (Digital Preservation Trainee) for having us. Shamefully, I have to admit that we have been very slow to make this trip; Chris first invited us to visit all the way back in September last year! However, our tardy trip to Westminster was in the end aptly timed with the completion of year one of the DPOC project and planning for year 2.

Like CUL and Bodleian Libraries, the Parliamentary Archives began their own Digital Preservation Project back in 2010, and as of 2015 it has transitioned into digital preservation in a more programmatic capacity. As CUL and Bodleian Libraries will begin drafting business cases for moving from project to programme in year 2, meeting with Chris and Catherine was a good opportunity to talk about how you start making that tricky transition.

Of course, every institution has its own drivers and risks which influence business cases for digital preservation, but there are certain things which will sound familiar to a lot of organisations. For example, what the Parliamentary Archives have found over the past seven years is that advocacy for digital collections and training staff in digital preservation skills are ongoing activities. Implementing solutions is one thing; maintaining them is another. This, in addition to staff who have received digital preservation training eventually moving on to new institutions, means that you constantly need to stay on top of advocacy and training. Making “the business case” is therefore not a one-off task.

Another central challenge in terms of building business cases is how you frame digital preservation as a service rather than as “an added burden”. The idea of “seamless preservation” with no human intervention is a very appealing one to already burdened staff, but in reality workflows need to be supervised and maintained. To sell digital preservation, that extra work must therefore be perceived as something which adds value to collection material and the organisation. It is clear that physical preservation adds value to collections, but the argument for digital preservation can be a harder sell.

Catherine had, however, some encouraging comments on how we can attempt to turn advice about digital preservation into something which is perceived as value-adding. Being involved with and talking to staff early on in the design of new project proposals – rather than as an extra add-on after processes are already in place – is an example of this.

Image by James Mooney

All in all, it has been a valuable and encouraging visit to the Parliamentary Archives. The DPOC fellows look forward to keeping in touch – particularly to hear more about the great work the Parliamentary Archives have been doing to provide digital preservation training to staff!

What is holding us back from change?

There are worse spots for a meeting. Oxford. Photo by: S. Mason

Every 3 months the DPOC team gets together in person in either Oxford, Cambridge or London (there’s also been talk of taking a meeting at Bletchley Park sometime). As this is a collaborative effort, these meetings offer a rare opportunity to work face-to-face instead of via Skype, with its endless issues around screen sharing and poor connections. Good ideas come when we get to sit down together.

As our next joint board meeting is next week, it was important to look over the work of the past year and make sure we are happy with the plan for year two. Most importantly, we wanted to discuss the messages we need to give our institutions as we look towards the sustainability of our digital preservation activities. How do we ensure that the earlier work, and the work being done by us, does not get repeated in 2-5 years’ time?

Silos in institutions

This is especially complicated when dealing with institutions like Oxford and Cambridge. We are big and old institutions with teams often working in silos. What does siloing affect? Well, everything. Communication, effort, research – it all suffers. Work done previously is done again. Over and over.

The same problems are being tackled within different silos; this is duplicated and wasted effort if they are not communicating their work to each other. This means that digital preservation efforts can be fractured and imbalanced if institutional collaboration is ignored. We have an opportunity and responsibility in this project to get people together and to get them to talk openly about the digital preservation problems they are each trying to tackle.

Managers need to lead the culture change in the institution

While not always the case, it is important that managers do not just sit back and say “you will never get this to work” or “it has always been this way.” We need them on our side; they are often the gatekeepers of silos. We have to bring them together in order to start opening the silos.

It is within their power to be the agents of change; we have to empower them to believe in changing the habits of our institution. They have to believe that digital preservation is worth it if their teams are to believe it too.

This might be the ‘carrot and stick’ approach or the ‘carrot’ only, but whatever approach is used, there are a number of points we agreed needed to be made clear:

  • our digital collections are significant and we have made assurances about their preservation and long term access
  • our institutional reputation plays a role in the preservation of our digital assets
  • digital preservation is a moving target and we must be moving with it
  • digital preservation will not be “solved” through this project, but we can make a start; it is important that this is not then the end.

Roadmap to sustainable digital preservation

Backing up any messages is the need for a sustainable roadmap. If you want change to succeed and if you want digital preservation to be a core activity, then steps must be actionable and incremental. Find out where you are, where you want to go and then outline the timeline of steps it will take to get there. Consider using maturity models to set goals for your roadmap, such as Kenney and McGovern’s, Brown’s or the NDSA model. Each is slightly different and some might be more suitable for your institution than others, so have a look at all of them.

It’s like climbing a mountain. I don’t look at the peak as I walk; it’s too far away and too unattainable. Instead, I look at my feet and the nearest landmark. Every landmark I pass is a milestone and I turn my attention to the next one. Sometimes I glance up at the peak, still in the distance—over time it starts to grow closer. And eventually, my landmark is the peak.

It’s only when I get to the top that I see all of the other mountains I also have to climb. And so I find my landmarks and continue on. I consider digital preservation a bit of the same thing.

What are your suggestions for breaking down the silos and getting fractured teams to work together? 

Operational Pragmatism in Digital Preservation: a discussion

From Somaya Langley, Policy and Planning Fellow at Cambridge: In September this year, six digital preservation specialists from around the world will be leading a panel and audience discussion. The panel is titled Operational Pragmatism in Digital Preservation: establishing context-aware minimum viable baselines. This will be held at the iPres International Digital Preservation Conference in Kyoto, Japan.


Panellists

Panellists include:

  • Dr. Anthea Seles – The National Archives, UK
  • Andrea K Byrne – Rensselaer Polytechnic Institute, USA
  • Dr. Dinesh Katre – Centre for Development of Advanced Computing (C-DAC), India
  • Dr. Jones Lukose Ongalo – International Criminal Court, The Netherlands
  • Bertrand Caron – Bibliothèque nationale de France
  • Somaya Langley – Cambridge University Library, UK

Panellists have been invited based on their knowledge of a breadth of digital creation, archiving and preservation contexts and practices including having worked in non-Western, non-institutional and underprivileged communities.

Operational Pragmatism

What does ‘operational pragmatism’ mean? For the past year or two I’ve been pondering ‘what corners can we cut’? For over a decade I have witnessed an increasing amount of work in the digital preservation space, yet I haven’t seen the increase in staffing and resources to handle this work. Meanwhile deadlines for transferring digital (and analogue audiovisual) content from carriers are just around the corner (e.g. Deadline 2025).

Panel Topic

Outside of the First World and national institutional/top-tier university context, individuals in the developing world struggle to access basic technology and resources to be able to undertake archiving and preservation of digital materials. Privileged First World institutions (who still struggle with deeply ingrained under-resourcing) are considering Trusted Digital Repository certification, while in the developing world meeting these standards is just not feasible. (This is evidenced by work that has taken place in the POWRR project, in Anthea Seles’ PhD thesis and more.)

How do we best prioritise our efforts so we can plan effectively (with the current resources we have)? How do we strategically develop these resources in methodical ways while ensuring the critical digital preservation work gets done before it is simply too late?

Approach

This panel discussion will take the form of a series of provocations addressing topics including: fixity, infrastructure and storage, preconditioning, pre-ingest processes, preservation metadata, scalability (including bi-directional scalability), technical policies, tool error reporting and workflows.

Each panellist will present their view on a different topic. Audience involvement in the discussion will be strongly encouraged.

Outcomes

The intended outcome is a series of agreed-upon ‘baselines’ tailored to different cultural, organisational and contextual situations, with the hope that these can be used for digital preservation planning and strategy development.

Further Information

The Panel Abstract is included below.

iPres Digital Preservation Conference program information can be found at: https://ipres2017.jp/program/.

We do hope you’ll be able to join us.


Panel Abstract

Undertaking active digital preservation, holistically and thoroughly, requires substantial infrastructure and resources. National archives and libraries across the Western world have established, or are working towards maturity in digital preservation (often underpinned by legislative requirements). On the other hand, smaller collectives and companies situated outside of memory institution contexts, as well as organisations in non-Western and developing countries, are struggling with the basics of managing their digital materials. This panel continues the debate within the digital preservation community, critiquing the development of digital preservation practices typically from within positions of privilege. Bringing together individuals from diverse backgrounds, the aim is to establish a variety of ‘bare minimum’ baselines for digital preservation efforts, while tailoring these to local contexts.

Six Priority Digital Preservation Demands

Somaya Langley, Cambridge Policy and Planning Fellow, talks about her top 6 demands for a digital preservation system.


Photo: Blazej Mikula, Cambridge University Library

As a former user of one digital preservation system (Ex Libris’ Rosetta), I have spent a few years frustrated by the gap between what activities need to be done as part of a digital stewardship end-to-end workflow – including packaging and ingesting ‘information objects’ (files and associated metadata) – and the maturity level of digital preservation systems.

Digital Preservation Systems Review

At Cambridge, we are looking at different digital preservation systems and what each one can offer. This has involved talking to both vendors and users of systems.

When I’m asked what my top current or future digital preservation system requirements are, it’s excruciatingly hard to limit myself to a handful of things. However, having previously been involved in a digital preservation system implementation project, there are some high-level takeaways from past experiences that remain with me.

Shortlist

Here’s the current list of my six top ‘digital preservation demands’ (aka user requirements):

Integration (with various other systems)

A digital preservation ‘system’ is only one cog in a wheel within a much larger machine; one piece of a much larger puzzle. There is an entire ‘digital ecosystem’ that this ‘system’ should exist within, and end-to-end digital stewardship workflows are of primary importance. The right amount of metadata and/or files should flow from one system to another. We must also know where the ‘source of truth’ is for each bit.

Standards-based

This seems like a no-brainer. We work in Library Land. Libraries rely on standards. We also work with computers and other technologies that require standard ways (protocols etc.) of communicating.

For files and metadata to flow from one system to another – whether via import, ingest, export, migration or an exit strategy from a system – we already spend a bunch of time creating mappings and crosswalks from one standard (or implementation of a standard) to another. If we don’t use (or fully implement) existing standards, this means we risk mangling data, context or meaning; potentially losing or not capturing parts of the data; or just wasting a whole lot of time.

Error Handling (automated, prioritised)

There’s more work to be done in managing digital materials than there are people to do it. Content creation is increasing at exponential rates; meanwhile, the number of staff (with the right skills) just isn’t keeping pace. We have to be smart about how we work. This requires prioritisation.

We need to have smarter systems that help us. This includes helping to prioritise where we focus our effort. Digital preservation systems are increasingly incorporating new third-party tools. We need to know which tool reports each error and whether these errors are show-stoppers or not. (For example: is the content no longer renderable versus a small piece of non-critical descriptive metadata that is missing?) We have to accept that, for some errors, we will never get around to addressing them.

Reporting

We need to be able to report to different audiences. The different classes of reporting include (but are not limited to):

  1. High-level reporting – annual reports, monthly reports, reports to managers, projections, costings etc.
  2. Collection and preservation management reporting – reporting on successes and failures, overall system stats, rolling checksum verification etc.
  3. Reporting for preservation planning purposes – based on preservation plans, we need to be able to identify subsections of our collection (configured around content types, context, file format and/or whatever other parameters we choose to use) and report on potential candidates that require some kind of preservation action.

Provenance

We need to best support – via metadata – where a file has come from. This, for want of a better approach, is currently being handled by the digital preservation community through documenting changes as Provenance Notes. Digital materials acquired into our collections are not just the files, they’re also the metadata. (Hence, why I refer to them as ‘information objects’.) When an ‘information object’ has been bundled, and is ready to be ingested into a system, I think of it as becoming an ‘information package’.

There’s a lot of metadata (administrative, preservation, structural, technical) that appears along the path from an object’s creation until the point at which it becomes an ‘information package’. We need to ensure we’re capturing and retaining the important components of this metadata. Those components we deem essential must travel alongside their associated files into a preservation system. (Not all files will have any, or even the right, metadata embedded within the file itself.) Standardised ways of handling information held in Provenance Notes (whether these are from ‘outside of the system’ or created by the digital preservation system) and event information, so that it can be interrogated and reported on, are crucial.

Managing Access Rights

Facilitating access is not black and white. Collections are not simply ‘open’ or ‘closed’. We have a myriad of ways that digital material is created and collected; we need to ensure we can provide access to this content in a variety of ways that support both the content and our users. This can include access from within an institution’s building, via a dedicated computer terminal, online access to anyone in the world, mediated remote access, access to only subsets of a collection, support for embargo periods, ensuring we respect cultural sensitivities or provide access to only the metadata (perhaps as large datasets) and more.

We must set a goal of working towards providing access to our users in the many different (and new-ish) ways they actually want to use our content.

It’s imperative to keep in mind the whole purpose of preserving digital materials is to be able to access them (in many varied ways). Provision of content ‘viewers’ and facilitating other modes of access (e.g. to large datasets of metadata) are essential.

Final note: I never said addressing these concerns was going to be easy. We need to factor each in and make iterative improvements, one step at a time.

Email preservation: How hard can it be?

Policy and Planning Fellow Edith summarises some highlights from the Digital Preservation Coalition’s briefing day on email preservation. See the full schedule of speakers on DPC’s website.


Yesterday Sarah and I attended DPC’s briefing day on email preservation at the National Archives (UK) in Kew, London. We were keen to go and hear about the latest findings from the Email Preservation Task Force, as Sarah will be developing a course dedicated to email preservation for the DPOC teaching programme. An internal survey circulated to staff in Bodleian Libraries earlier this year showed a real appetite for learning about email preservation. It is an issue which evidently spans several areas of our organisation.

The subheading of the event “How hard can it be?” turned out to be very apt. Before even addressing preservation, we were asked to take a step back and ask ourselves:

“Do I actually know what email is?”

As Kate Murray from the Library of Congress put it: “email is an object, several things and a verb”. In this sense email has much in common with the World Wide Web, as they are heavily linked and complex objects. Retention decisions must be made, not only about text content but also about email attachments and external web links. In addition, supporting features (such as instant messaging and calendars) are increasingly integrated into email services and potential candidates for capture.

Thinking about email “as a verb” also highlights that it is a cultural and social practice. Capturing relationships and structures of communication is an additional layer to preserve. Anecdotally, some participants on the Email Preservation day had found that data mining, including the ability to undertake analysis across email archives, is increasingly in demand from historians using big data research techniques.

Anthea Seles, National Archives (UK), talks about visualisation of email archives.

What are people doing?

So what are organisations currently doing to preserve email? A strength of the Email Preservation Task Force’s new draft report is that it draws together samples of workflows currently in use by other organisations (primarily US-based). Additional speakers from Preservica, the National Archives and the British Library supplemented these with some local examples from the UK throughout the day.

The talks and report show that migration is by far the most common approach to email preservation in the institutions consulted. EML and Mbox are the most common formats migrated to. Each takes a different approach: EML stores single messages, while Mbox aggregates messages in a single database file. (However, beware that Mbox is a whole family of formats with varying levels of documentation!)
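One way to see the structural difference from the shell is sketched below (the file and directory names are hypothetical, and the Mbox count is only approximate, since quoted ‘From ’ lines inside message bodies can inflate it):

```bash
# Roughly count the messages in an Mbox file by counting the "From "
# separator lines that introduce each message.
grep -c '^From ' mailbox-export.mbox

# EML, by contrast, holds one message per file, so a directory listing
# gives the message count directly.
ls migrated-eml/ | wc -l
```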

While some archives choose to ingest Mbox and EML files into their repositories without further processing, others choose to unpack content within these files. Unpacking content provides a mode of displaying emails, as well as the ability to normalise content within them.

The British Library, for example, have chosen to unpack email files using Aid4Mail, and are attempting to replicate the message hierarchy within a folder structure. Using Aid4Mail, they migrate text from email messages to PDF/A-2b files, which are displayed alongside folders containing any email attachments. The PDF/A-2b files can then be validated using veraPDF or other tools. A CSV manifest is also generated and entered into the relevant catalogues. Preservica’s out-of-the-box workflow is very similar to the British Library’s, although they choose to migrate text content to HTML or UTF-8 encoded text files.
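As a rough illustration of that validation step, veraPDF can be run from the command line over a batch of migrated files (a sketch only; the paths are made up and the exact invocation may differ between veraPDF versions):

```bash
#!/bin/bash
# Validate each migrated file against the PDF/A-2b profile with veraPDF,
# saving one XML validation report per message for later review.
mkdir -p reports
for f in migrated-pdfa/*.pdf; do
  verapdf --flavour 2b "$f" > "reports/$(basename "$f" .pdf).xml"
done
```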

Another tantalising example (which I can imagine will gain more traction in the future) came from one institution which has used Emulation as a Service to provide access to one of its collections of email. By using an emulation approach, it is able to provide access to content within the original operating environment used by the donor of the email archive. This has particular strength in that email attachments, such as images and word processing files, can be viewed on contemporary software (provided licenses can be acquired for the software itself).

Finally, a tool which was considered or already in use by many of the contributors is ePADD. ePADD is an open source tool developed by Stanford University Libraries. It provides functions for processing and appraisal of Mbox files, but also has many interesting features for exploring the social and cultural aspect of email. ePADD can mine emails for subjects such as places, events and people. En masse, these subjects provide researchers with a much richer insight into trends and topics within large email archives. (Tip: why not have a look at the ePADD discovery module to see it in practice?)

What do we still need to explore?

It is encouraging that people are already undertaking preservation of email and that there are workflows out there which other organisations can adopt. However, there are many questions and issues still to explore.

  1. Current processes cannot fully capture the interlinked nature of email archives. Questions were raised during the day about the potential of describing archives using linked open data in order to amalgamate separate collections. Email archives may be more valuable to historians as they acquire critical mass
  2. Other questions were raised around whether or not archives should also crawl web links within emails. Links to external content may be crucial for understanding the context of a message, but this becomes a very tricky issue if emails are accessioned years after creation. If webpages are crawled and associated with the email message years after it was sent, serious doubt is raised around the reliability of the email as a record
  3. The issue of web links also brings up the question of when email harvesting should occur. Would it be better if emails were continually harvested into the archive/records management system, rather than waiting until a member of staff leaves their position? The good news is that many email providers are increasingly documenting and providing APIs to their services, meaning that doing so may become more feasible in the future
  4. As seen in many of the sample workflows from the Email Preservation Task Force report, email files are often migrated multiple times. Especially as ePADD works with Mbox, some organisations end up adding an additional migration step in order to use the tool before normalising to EML. There is currently very little available literature on the impact of migrations, and indeed multiple migrations, on the information content of emails.

What can you do now to help?    

So while there are some big technical and philosophical challenges, the good news is that there are things you can do to contribute right now. You can:

  • Become a “Friend of the Email Preservation Task Force” and help them review new reports and outputs
  • Contribute your organisation’s workflows to the Email Preservation Task Force report, so that they can be shared with the community
  • Run trial migrations between different email formats such as PST, Mbox and EML and blog about your findings
  • Support open source tools such as ePADD through either financial aid or (if you are technically savvy) your time. We rely heavily on these tools and need to work together to make them sustainable!

Overall the Email Preservation day was very inspiring and informative, and I cannot wait to hear more from the Email Preservation Task Force. Were you also at the event and have some other highlights to add? Please comment below!  

C4RR – Containers for Reproducible Research Conference

James shares his thoughts after attending the C4RR Containers for Reproducible Research Conference at the University of Cambridge (27 – 28 June).


At the end of June both Dave and I, the Technical Fellows, attended the C4RR conference/workshop hosted by The Software Sustainability Institute in Cambridge. This event brought together researchers, developers and educators to explore best practices when using containers and the future of research software with containers.

Containers, especially Docker and Singularity, are the ‘in’ thing at the moment, and it was interesting to hear from a variety of research projects that are using them for reproducible research.

Containers are another form of server virtualisation, but they are lighter than a virtual machine. Containers and virtual machines have similar resource isolation and allocation benefits, but function differently because containers virtualise the operating system instead of the hardware; this makes containers more portable and efficient.

 
Comparison of VM vs Container (Images from docker.com)

Researchers described how they were using Docker, one of the container implementations, to package the software used in their research so they could easily reproduce their computational environment across several different platforms (desktop, server and cluster). Others were using Singularity, another container technology, when implementing containers on an HPC (High-Performance Computing) cluster, due to Docker’s requirement for root access. It was clear from the talks that these technologies are developing rapidly and that the computing environments involved are increasingly complex, which does make me worry about how they might be preserved.
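To give a flavour of what ‘packaging the software’ looks like in practice, a reproducible run might be as simple as the following sketch (the image name, tag and paths are hypothetical and purely for illustration):

```bash
#!/bin/bash
# Pull a specific, versioned image so the computational environment is pinned,
# then run the analysis with the project's data mounted read-only inside it.
docker pull example/analysis-env:1.2.0
docker run --rm \
  -v "$PWD/data:/data:ro" \
  -v "$PWD/results:/results" \
  example/analysis-env:1.2.0 \
  python /opt/analysis/run_analysis.py --input /data --output /results
```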

Near the end of the second day, Dave and I gave a 20 minute presentation to encourage the audience to think more about preservation. As the audience were all evangelists for container technology it made sense to try to tap into them to promote building preservation into their projects.


Image By Raniere Silva

One aim was to get people to think about their research after the project was over. There is often a lack of motivation to think about how others might reproduce the work, whether that’s six months into the future, let alone 15+ years from now.

Another area we briefly covered related to depositing research data. We use DROID, which relies on the PRONOM technical registry, to scan our repositories and identify file formats. We put out a plea to the audience to ask for help with creating new file signatures for unknown file formats.

I had some great conversations with others over the two days, and my main takeaway from the event was that we should look to attend more non-preservation-specific conferences with a view to promoting preservation in other computer-related areas of study.

Our slides from the event have been posted by The Software Sustainability Institute via Google.

Transcribing interviews

The second instalment of Lee’s experience running a skills audit at Cambridge University Library. He explains what is needed to be able to transcribe the lengthy and informative interviews with staff.


There’s no ground-breaking digital preservation goodness contained within this post so you have permission to leave this page now. However, this groundwork is crucial to gaining an understanding of how institutions can prepare for digital preservation skills and knowledge development. It may also be useful to anyone who is preparing to transcribe recorded interviews.

Post-interview: transcribing the recording

Once you have interviewed your candidates and made sure that you have all the recordings (suitably backed up three times onto private, network-free storage like an encrypted USB stick, so as to respect privacy wishes), it is time to transcribe.

So, what do you need?

  • A very quiet room. Preferably silence, where there are no distractions and where you can’t distract people. You may wish to choose the dictation path, and if you do that in an open plan office, you may attract attention. You will also be reciting information that you have assured participants will remain confidential.
  • Audio equipment. You will need a device that can play your audio files and has audio playback controls built into it. You can use your device’s speakers or headphones (preferably with a control device built into the wire), or a foot pedal.
  • Time. Bucket loads of it. If you are doing other work, this needs to become the big rock in your time planning; everything else should be mere pebbles and sand. This is where manager support is really helpful, as is…
  • Understanding. The understanding that this will rule your working life for the next month or two, and the understanding of those around you of the size of the task you are undertaking. Having an advocate with previous experience of this type of work is invaluable.
  • Patience. Of a saint.
  • Simple transcription rules. Given the timeframes of the project, complex transcription would have been too time consuming. Please see the following work, as used by the University of California, San Diego; it’s really useful, with nice big text.
    Dresing, Thorsten/Pehl, Thorsten/Schmieder, Christian (2015): Manual (on) Transcription. Transcription Conventions, Software Guides and Practical Hints for Qualitative Researchers. 3rd English Edition. Marburg. Available online: http://www.audiotranskription.de/english/transcription-practicalguide.htm
    (Last accessed: 27.06.2017). ISBN: 978-3-8185-0497-7.

Cropped view of person hands typing on laptop computer. Image credit: Designed by Freepik

What did you do?

Using a Mac environment, I imported the audio files for transcription into a desktop folder and created a play list in iTunes. I reduced the iTunes application to the mini player view and opened up Word to type into. I plugged in my headphones and pressed play and typed as I was listening.

If you get tired of typing, the Word application on my Mac has a nifty voice recognition package; it’s uncannily good now. I tried to route the output sound straight into the mic using Soundflower (a waste of time: when the transcription did yield readable text, it used words worthy of inciting a Mary Whitehouse campaign), but I did find that dictation provided a rest for weary fingers. After a while, you will probably need to rest a weary voice, so you can switch back to typing.

When subjects started talking quickly, I needed a way to slow them down, as constantly pressing pause and rewind got onerous. A quick fix for this was to download Audacity, which has a function to slow down your sound files. Once the comedic effect of the voice alteration has worn off, it becomes easier to transcribe as you don’t have to pause and rewind as much.
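If you prefer the command line, a similar effect can be had outside Audacity; for example, ffmpeg’s atempo filter can slow a recording without changing its pitch (a hypothetical sketch; the file names are made up and the 0.75 speed factor is just a starting point):

```bash
# Slow an interview recording to 75% of its original speed, keeping the
# original pitch, so it is easier to keep up while typing.
ffmpeg -i interview_03.m4a -filter:a "atempo=0.75" interview_03_slow.m4a
```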

Process wise, it doesn’t sound like much, and it isn’t. It’s just the sheer hours of audio that need to be made legible through listening, rewinding and typing.

How can the process be made (slightly) easier?

  • Investigate transcription technology and processes. Investigate beforehand which technologies you can access. I wish I had done this rather than relying on the expectation that I would just be listening and typing. I didn’t find a website with the answer, but a thoughtful web search can help you with certain parts of the transcription method.
  • Talk slowly. This one doesn’t apply to the transcription process but to the interview process. Try to ask the questions a little more slowly than you usually would, as the respondent will subconsciously mimic your speed of delivery and slow themselves down.

Hang on in there, it’s worth it

Even if you choose to incorporate the suggestions above, be under absolutely no illusions: transcription is a gruelling task. That’s not a slight against the participants’ responses for they will be genuinely interesting and insightful. No, it’s a comment on the frustration of the process and sheer mental grind of getting through it. I must admit I had only come to a reasonably happy transcription method by the time I had reached number fourteen (of fifteen). However, the effort is completely worth it. In the end, I now have around 65,000 quality words (research data) to analyse to understand what existing digital skills, knowledge, ways of learning and managing change exist within my institution that can be fed into the development of digital preservation skills and knowledge.

DPASSH: Getting close to producers, consumers and digital preservation

Sarah shares her thoughts after attending the DPASSH (Digital Preservation in the Arts, Social Sciences and Humanities) Conference at the University of Sussex (14 – 15 June).


DPASSH is a conference that the Digital Repository of Ireland (DRI) puts on with a host organisation. This year, it was hosted by the Sussex Humanities Lab at the University of Sussex, Brighton. What is exciting about this digital preservation conference is that it brings together creators (producers) and users (consumers) with digital preservation experts. Most digital preservation conferences end up being a bit of an echo chamber, full of practitioners and vendors only. But what about the creators and the users? What knowledge can we share? What can we learn?

DPASSH is a small conference, but it was an opportunity to see what researchers are creating and how they are engaging with digital collections. For example in Stefania Forlini’s talk, she discussed the perils of a content-centric digitisation process where unique print artefacts are all treated the same; the process flattens everything into identical objects though they are very different. What about the materials and the physicality of the object? It has stories to tell as well.

To Forlini, books span several domains of sensory experience and our digitised collections should reflect that. With the Gibson Project, Forlini and project researchers are trying to find ways to bring some of those experiences back through the Speculative W@nderverse. They are currently experimenting with embossing different kinds of paper with a code that can be read by a computer. The computer can then bring up the science fiction pamphlets that are made of that specific material. Then a user can feel the physicality of the digitised item and then explore the text, themes and relationships to other items in the collection using generous interfaces. This combines a physical sensory experience with a digital experience.

For creators, the decision of what research to capture and preserve is sometimes difficult; often they lack the tools to capture the information. Other times, creators do not have the skills to perform proper archival selection. Athanasios Velios offered a tool solution for digital artists called Artivity. Artivity can capture the actions performed on a digital artwork in certain programs, like Photoshop or Illustrator. This allows the artist to record their creative process and gives future researchers the opportunity to study the creative process. Steph Taylor from CoSector suggested in her talk that creators are archivists now, because they are constantly appraising their digital collections and making selection decisions.  It is important that archivists and digital preservation practitioners empower creators to make good decisions around what should be kept for the long-term.

As a bonus to the conference, I was awarded the ‘Best Tweet’ award by the DPC and DPASSH. It was a nice way to round out two good, informative days. I plan to purchase many books with my gift voucher!

I certainly hope they hold the conference next year, as I think it is important for researchers in the humanities, arts and social sciences to engage with digital preservation experts, archivists and librarians. There is a lot to learn from each other. How often do we get our creators and users in one room with us digital preservation nerds?

Policy ramblings

For the second stage of the DPOC project Oxford and Cambridge have started looking at policy and strategy development. As part of the DPOC deliverables, the Policy and Planning Fellows will be collaborating with colleagues to produce a digital preservation policy and strategy for their local institutions. Edith (Policy and Planning Fellow at Oxford) blogs about what DPOC has been up to so far.


Last Friday I met with Somaya (Policy and Planning Fellow) and Sarah (Training and Outreach Fellow) at the British Library in London. We spent the day discussing review work which DPOC has done of digital preservation policies so far. The meeting also gave us a chance to outline an action plan for consulting stakeholders at CUL and Bodleian Libraries on future digital preservation policy development.

Step 1: Policy review work
Much work has already gone into researching digital preservation policy development [see for example the SCAPE project and OSUL’s policy case study]. As considerable effort has been exerted in this area, we want to make sure we are not reinventing the wheel while developing our own digital preservation policies. We therefore started by reading as many digital preservation policies from other organisations as we could possibly get our hands on. (Once we ran out of policies in English, I started feeding promising looking documents into Google Translate, with a mixed bag of results.) The policy review drew attention to aspects of policies which we felt were particularly successful, and which could potentially be re-purposed for the local CUL and Bodleian Libraries contexts.

My colleague Sarah helped me with the initial policy review work. Between the two of us we read 48 policies dating from 2008-2017. However, determining which documents were actual policies was trickier than we had first anticipated. We found that documents named ‘strategy’ sometimes read as policy, and documents named ‘policy’ sometimes read as more low-level procedures. For this reason, we decided to add to the review another 12 strategy documents which had strong elements of policy in them. This brought us up to a round 60 documents in total.

So we began reading…. But we soon found that once you are on your 10th policy of the day, you start to get them muddled up. To better organise our review work, we decided to put them into a classification system developed by Kirsten Snawder (2011) and adapted by Madeline Sheldon (2013). Snawder and Sheldon identified nineteen common topics from digital preservation policies. The topics range from ‘access and use’ to ‘preservation planning’ [for the full list of topics, see Sheldon’s article on The Signal from 2013]. I was interested in seeing how many policies would make direct reference to the Open Archival Information System (OAIS) reference model, so I added this in as an additional topic to the original nineteen identified by Snawder and Sheldon.

Reviewing digital preservation policies written between 2008-2017

Step 2: Looking at findings
Interestingly, after we finished annotating the policy documents we did not find a correlation between covering all of Snawder and Sheldon’s nineteen topics and having what we perceived as an effective policy. Effective in this context was defined as the ability of the policy to clearly guide and inform preservation decisions within an organisation. In fact, the opposite was more common as we judged several policies which had good coverage of topics from the classification system to be too lengthy, unclear, and sometimes inaccessible due to heavy use of digital preservation terminology.

In terms of OAIS, another interesting finding was that 33 out of 60 policies made direct reference to the OAIS. In addition to these 33, several of the ones which did not make an overt reference to the model still used language and terminology derived from it.

So while we found that the taxonomy was not able to guide us on which policy topics were absolutely essential in all circumstances, using it was a good way of arranging and documenting our thoughts.

Step 3: Thinking about guiding principles for policy writing
What this foray into digital preservation policies has shown us is that there is no ‘one size fits all’ approach or magic formula of topics which makes a policy successful. What works in the context of one institution will not work in another. What ultimately makes a successful policy also comes down to communication of the policy and organisational uptake. However, there are a number of high-level principles which the three of us all felt strongly about and which we would like to guide future digital preservation policy development at our local institutions.

Principle 1: Policy should be accessible to a broad audience. Contrary to findings from the policy review, we believe that digital preservation specific language (including OAIS) should be avoided at policy level if possible. While reviewing policy statements we regularly asked ourselves:

“Would my mother understand this?”

If the answer is yes, the statement gets to stay. If it is no, maybe consider re-writing it. (Of course, this does not apply if your mother works in digital preservation.)

Principle 2: Policy also needs to be high-level enough that it does not require constant re-writing in order to make minor procedural changes. In general, including individuals’ names or prescribing specific file formats can make a policy go out of date quickly. It is easier to change these if they are included in lower level procedures and guidelines.

Principle 3: Digital preservation requires resources. Getting financial commitment at policy level to invest in staff is important. It takes time to build organisational expertise in digital preservation, but losing it can happen a lot more quickly. Even if you choose to outsource several aspects of digital preservation, it is important that staff have skills which enable them to understand and critically assess the work of external digital preservation service providers.

What are your thoughts? Do you have other principles guiding digital preservation policy development in your organisations? Do you agree or disagree with our high-level principles?