What is holding us back from change?

There are worse spots for a meeting. Oxford. Photo by: S. Mason

Every 3 months the DPOC team gets together in person in either Oxford, Cambridge or London (there’s also been talk of taking a meeting at Bletchley Park sometime). As this is a collaborative effort, these meetings offer a rare opportunity to work face-to-face instead of via Skype, with its endless issues around screen sharing and poor connections. Good ideas come when we get to sit down together.

With our next joint board meeting coming up next week, it was important to look over the work of the past year and make sure we are happy with the plan for year two. Most importantly, we wanted to discuss the messages we need to give our institutions as we look towards the sustainability of our digital preservation activities. How do we ensure that the earlier work, and the work we are doing now, does not have to be repeated in 2-5 years’ time?

Silos in institutions

This is especially complicated when dealing with institutions like Oxford and Cambridge. We are big, old institutions with teams often working in silos. What does siloing affect? Well, everything. Communication, effort, research—it all suffers. Work done previously is done again. Over and over.

The same problems are being tackled within different silos; this is duplicated and wasted effort if they are not communicating their work to each other. This means that digital preservation efforts can be fractured and imbalanced if institutional collaboration is ignored. We have an opportunity and responsibility in this project to get people together and to get them to talk openly about the digital preservation problems they are each trying to tackle.

Managers need to lead the culture change in the institution

While not always the case, it is important that managers do not just sit back and say “you will never get this to work” or “it has always been this way.” We need them on our side; they are often the gatekeepers of silos. We have to bring them together in order to start opening the silos.

It is within their power to be the agents of change; we have to empower them to believe in changing the habits of our institutions. They have to believe that digital preservation is worth it if their teams are to believe it too.

This might be the ‘carrot and stick’ approach or the ‘carrot’ only, but whatever approach is used, there are a number of points we agreed needed to be made clear:

  • our digital collections are significant and we have made assurances about their preservation and long term access
  • our institutional reputation plays a role in the preservation of our digital assets
  • digital preservation is a moving target and we must be moving with it
  • digital preservation will not be “solved” through this project, but we can make a start; it is important that this is not then the end.

Roadmap to sustainable digital preservation

Backing up any messages is the need for a sustainable roadmap. If you want change to succeed and if you want digital preservation to be a core activity, then steps must be actionable and incremental. Find out where you are, where you want to go and then outline the timeline of steps it will take to get there. Consider using maturity models to set goals for your roadmap, such as Kenney and McGovern’s, Brown’s or the NDSA model. Each is slightly different and some might be more suitable for your institution than others, so have a look at all of them.

It’s like climbing a mountain. I don’t look at the peak as I walk; it’s too far away and too unattainable. Instead, I look at my feet and the nearest landmark. Every landmark I pass is a milestone and I turn my attention to the next one. Sometimes I glance up at the peak, still in the distance—over time it starts to grow closer. And eventually, my landmark is the peak.

It’s only when I get to the top that I see all of the other mountains I also have to climb. And so I find my landmarks and continue on. I consider digital preservation a bit of the same thing.

What are your suggestions for breaking down the silos and getting fractured teams to work together? 

Operational Pragmatism in Digital Preservation: a discussion

From Somaya Langley, Policy and Planning Fellow at Cambridge: In September this year, six digital preservation specialists from around the world will be leading a panel and audience discussion. The panel is titled Operational Pragmatism in Digital Preservation: establishing context-aware minimum viable baselines. This will be held at the iPres International Digital Preservation Conference in Kyoto, Japan.


Panellists

Panellists include:

  • Dr. Anthea Seles – The National Archives, UK
  • Andrea K Byrne – Rensselaer Polytechnic Institute, USA
  • Dr. Dinesh Katre – Centre for Development of Advanced Computing (C-DAC), India
  • Dr. Jones Lukose Ongalo – International Criminal Court, The Netherlands
  • Bertrand Caron – Bibliothèque nationale de France
  • Somaya Langley – Cambridge University Library, UK

Panellists have been invited based on their knowledge of a breadth of digital creation, archiving and preservation contexts and practices including having worked in non-Western, non-institutional and underprivileged communities.

Operational Pragmatism

What does ‘operational pragmatism’ mean? For the past year or two I’ve been pondering ‘what corners can we cut’? For over a decade I have witnessed an increasing amount of work in the digital preservation space, yet I haven’t seen the increase in staffing and resources to handle this work. Meanwhile deadlines for transferring digital (and analogue audiovisual) content from carriers are just around the corner (e.g. Deadline 2025).

Panel Topic

Outside of the First World and national institutional/top-tier university context, individuals in the developing world struggle to access basic technology and resources to be able to undertake archiving and preservation of digital materials. Privileged First World institutions (who still struggle with deeply ingrained under-resourcing) are considering Trusted Digital Repository certification, while in the developing world meeting these standards is just not feasible. (Evidenced by work that has taken place in the POWRR project and Anthea Seles’ PhD thesis and more.)

How do we best prioritise our efforts so we can plan effectively (with the current resources we have)? How do we strategically develop these resources in methodical ways while ensuring the critical digital preservation work gets done before it is simply too late?

Approach

This panel discussion will take the form of a series of provocations addressing topics including: fixity, infrastructure and storage, preconditioning, pre-ingest processes, preservation metadata, scalability (including bi-directional scalability), technical policies, tool error reporting and workflows.

Each panellist will present their view on a different topic. Audience involvement in the discussion will be strongly encouraged.

Outcomes

The intended outcome is a series of agreed-upon ‘baselines’ tailored to different cultural, organisational and contextual situations, with the hope that these can be used for digital preservation planning and strategy development.

Further Information

The Panel Abstract is included below.

iPres Digital Preservation Conference program information can be found at: https://ipres2017.jp/program/.

We do hope you’ll be able to join us.


Panel Abstract

Undertaking active digital preservation, holistically and thoroughly, requires substantial infrastructure and resources. National archives and libraries across the Western world have established, or are working towards maturity in digital preservation (often underpinned by legislative requirements). On the other hand, smaller collectives and companies situated outside of memory institution contexts, as well as organisations in non-Western and developing countries, are struggling with the basics of managing their digital materials. This panel continues the debate within the digital preservation community, critiquing the development of digital preservation practices typically from within positions of privilege. Bringing together individuals from diverse backgrounds, the aim is to establish a variety of ‘bare minimum’ baselines for digital preservation efforts, while tailoring these to local contexts.

Six Priority Digital Preservation Demands

Somaya Langley, Cambridge Policy and Planning Fellow, talks about her top 6 demands for a digital preservation system.


Photo: Blazej Mikula, Cambridge University Library

As a former user of one digital preservation system (Ex Libris’ Rosetta), I have spent a few years frustrated by the gap between what activities need to be done as part of a digital stewardship end-to-end workflow – including packaging and ingesting ‘information objects’ (files and associated metadata) – and the maturity level of digital preservation systems.

Digital Preservation Systems Review

At Cambridge, we are looking at different digital preservation systems and what each one can offer. This has involved talking to both vendors and users of systems.

When I’m asked what my top current or future digital preservation system requirements are, it’s excruciatingly hard to limit myself to a handful of things. However, having previously been involved in a digital preservation system implementation project, there are some high-level takeaways from past experiences that remain with me.

Shortlist

Here’s the current list of my six top ‘digital preservation demands’ (aka user requirements):

Integration (with various other systems)

A digital preservation ‘system’ is only one cog within a much larger machine; one piece of a much larger puzzle. There is an entire ‘digital ecosystem’ that this ‘system’ should exist within, and end-to-end digital stewardship workflows are of primary importance. The right amount of metadata and/or files should flow from one system to another. We must also know where the ‘source of truth’ is for each bit.

Standards-based

This seems like a no-brainer. We work in Library Land. Libraries rely on standards. We also work with computers and other technologies that also require standard ways (protocols etc.) of communicating.

For files and metadata to flow from one system to another – whether via import, ingest, export, migration or an exit strategy from a system – we already spend a bunch of time creating mappings and crosswalks from one standard (or implementation of a standard) to another. If we don’t use (or fully implement) existing standards, this means we risk mangling data, context or meaning; potentially losing or not capturing parts of the data; or just wasting a whole lot of time.
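
To give a concrete flavour of what these crosswalks look like in practice, here is a minimal sketch in Python. The field names and target paths are invented for illustration only; real mappings are far larger and need careful review so that nothing falls through the gaps.

  # A hypothetical crosswalk between two metadata implementations.
  # The field names and target paths below are illustrative only.
  CROSSWALK = {
      "dc:title": "mods:titleInfo/mods:title",
      "dc:creator": "mods:name/mods:namePart",
      "dc:date": "mods:originInfo/mods:dateIssued",
  }

  def map_record(record):
      """Translate a flat source record, flagging anything the crosswalk doesn't cover."""
      mapped, unmapped = {}, {}
      for field, value in record.items():
          target = CROSSWALK.get(field)
          if target:
              mapped[target] = value
          else:
              unmapped[field] = value  # at risk of being lost or mangled
      return {"mapped": mapped, "unmapped": unmapped}

  print(map_record({"dc:title": "Example item", "local:shelfmark": "MS 123"}))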

Error Handling (automated, prioritised)

There’s more work to be done in managing digital materials than there are people to do it. Content creation is increasing at exponential rates, while the number of staff (with the right skills) isn’t. We have to be smart about how we work. This requires prioritisation.

We need to have smarter systems that help us. This includes helping to prioritise where we focus our effort. Digital preservation systems are increasingly incorporating new third-party tools. We need to know which tool reports each error and whether these errors are show-stoppers or not. (For example: is the content no longer renderable versus a small piece of non-critical descriptive metadata that is missing?) We have to accept that, for some errors, we will never get around to addressing them.
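
As a rough illustration of what I mean by prioritising errors, the sketch below (in Python) sorts reported errors so that show-stoppers surface first. The tool names, error codes and severity levels are all hypothetical; a real system would take its severities from the tools and policies actually in use.

  # Hypothetical error severities; a real mapping would come from local policy.
  SEVERITY = {
      "file-not-renderable": "show-stopper",
      "checksum-mismatch": "show-stopper",
      "missing-descriptive-field": "minor",
  }

  def triage(errors):
      """Sort reported errors so that show-stoppers come before minor issues."""
      order = {"show-stopper": 0, "minor": 1}
      return sorted(errors, key=lambda e: order[SEVERITY.get(e["error"], "minor")])

  errors = [
      {"tool": "validator-x", "file": "item42.tif", "error": "missing-descriptive-field"},
      {"tool": "validator-x", "file": "item07.tif", "error": "file-not-renderable"},
  ]
  for e in triage(errors):
      print(e["file"], e["error"], "->", SEVERITY.get(e["error"], "minor"))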

Reporting

We need to be able to report to different audiences. The different reporting classes include (but are not limited to):

  1. High-level reporting – annual reports, monthly reports, reports to managers, projections, costings etc.
  2. Collection and preservation management reporting – reporting on successes and failures, overall system stats, rolling checksum verification etc.
  3. Reporting for preservation planning purposes – based on preservation plans, we need to be able to identify subsections of our collection (configured around content types, context, file format and/or whatever other parameters we choose to use) and report on potential candidates that require some kind of preservation action (see the sketch after this list).
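
By way of illustration, the small Python sketch below picks preservation-action candidates out of a collection inventory by file format. The record structure and format names are assumptions made up for this example, not a description of any particular system.

  # A toy inventory; the record structure and format names are invented for illustration.
  inventory = [
      {"id": "obj-001", "format": "TIFF", "content_type": "image"},
      {"id": "obj-002", "format": "JPEG", "content_type": "image"},
      {"id": "obj-003", "format": "TIFF", "content_type": "image"},
  ]

  def candidates(records, target_format):
      """Return the records in a format earmarked for a preservation action."""
      return [r for r in records if r["format"] == target_format]

  print(candidates(inventory, "TIFF"))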

Provenance

We need to document – via metadata – where a file has come from. This, for want of a better approach, is currently being handled by the digital preservation community through documenting changes as Provenance Notes. Digital materials acquired into our collections are not just the files, they’re also the metadata. (Hence why I refer to them as ‘information objects’.) When an ‘information object’ has been bundled, and is ready to be ingested into a system, I think of it as becoming an ‘information package’.

There’s a lot of metadata (administrative, preservation, structural, technical) that appears along the path from an object’s creation until the point at which it becomes an ‘information package’. We need to ensure we’re capturing and retaining the important components of this metadata. Those components we deem essential must travel alongside their associated files into a preservation system. (Not all files will have any, or even the right, metadata embedded within the file itself.) It is also crucial that Provenance Notes (whether these come from ‘outside of the system’ or are created by the digital preservation system) and event information are handled in standardised ways, so that they can be interrogated and reported on.
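
For what it’s worth, here is a minimal sketch of the kind of event record I have in mind, loosely in the spirit of PREMIS-style events. The field names are my own and not taken from any particular system’s schema.

  from datetime import datetime, timezone

  # Loosely PREMIS-flavoured event record; the field names are illustrative only.
  def provenance_event(object_id, event_type, agent, note):
      return {
          "object_id": object_id,
          "event_type": event_type,  # e.g. "transfer", "format migration"
          "event_datetime": datetime.now(timezone.utc).isoformat(),
          "agent": agent,
          "provenance_note": note,
      }

  event = provenance_event(
      "info-package-0001",
      "transfer",
      "Example Agent",
      "Received from donor on external drive; checksums verified on receipt.",
  )
  print(event)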

Managing Access Rights

Facilitating access is not black and white. Collections are not simply ‘open’ or ‘closed’. We have a myriad of ways that digital material is created and collected; we need to ensure we can provide access to this content in a variety of ways that support both the content and our users. This can include access from within an institution’s building, via a dedicated computer terminal, online access to anyone in the world, mediated remote access, access to only subsets of a collection, support for embargo periods, ensuring we respect cultural sensitivities or provide access to only the metadata (perhaps as large datasets) and more.

We must set a goal of working towards providing access to our users in the many different (and new-ish) ways they actually want to use our content.

It’s imperative to keep in mind that the whole purpose of preserving digital materials is to be able to access them (in many varied ways). Provision of content ‘viewers’ and facilitating other modes of access (e.g. to large datasets of metadata) are essential.

Final note: I never said addressing these concerns was going to be easy. We need to factor each in and make iterative improvements, one step at a time.

C4RR – Containers for Reproducible Research Conference

James shares his thoughts after attending the C4RR Containers for Reproducible Research Conference at the University of Cambridge (27 – 28 June).


At the end of June both Dave and I, the Technical Fellows, attended the C4RR conference/workshop hosted by The Software Sustainability Institute in Cambridge. This event brought together researchers, developers and educators to explore best practices when using containers and the future of research software with containers.

Containers, especially Docker and Singularity, are the ‘in’ thing at the moment and it was interesting to hear from a variety of research projects that are using them for reproducible research.

Containers are another form of server virtualisation but are lighter than a virtual machine. Containers and virtual machines have similar resource isolation and allocation benefits, but function differently because containers virtualize the operating system instead of hardware; containers are more portable and efficient.

 
Comparison of VM vs Container (Images from docker.com)

Researchers described how they were using Docker, one of the container implementations, to package the software used in their research so they could easily reproduce their computational environment across several different platforms (desktop, server and cluster). Others were using Singularity, another container technology, when implementing containers on an HPC (High-Performance Computing) cluster, due to Docker’s requirement for root access. It was clear from the talks that these technologies are developing rapidly and that the computing environments involved are becoming ever more complex, which does make me worry about how they might be preserved.

Near the end of the second day, Dave and I gave a 20 minute presentation to encourage the audience to think more about preservation. As the audience were all evangelists for container technology it made sense to try to tap into them to promote building preservation into their projects.


Image By Raniere Silva

One aim was to get people to think about their research after the project was over. There is often a lack of motivation to think about how others might reproduce the work, whether that’s six months in the future, let alone 15+ years from now.

Another area we briefly covered related to depositing research data. We use DROID to scan our repositories to identify file formats, which relies on the PRONOM technical registry. We put out a plea to the audience to ask for help with creating new file signatures for unknown file formats.

I had some great conversations with others over the two days and my main takeaway from the event was that we should look to attend more non-preservation-specific conferences, with a view to promoting preservation in other computer-related areas of study.

Our slides from the event have been posted by The Software Sustainability Institute via Google.

DPASSH: Getting close to producers, consumers and digital preservation

Sarah shares her thoughts after attending the DPASSH (Digital Preservation in the Arts, Social Sciences and Humanities) Conference at the University of Sussex (14 – 15 June).


DPASSH is a conference that the Digital Repository of Ireland (DRI) puts on with a host organisation. This year, it was hosted by the Sussex Humanities Lab at the University of Sussex, Brighton. What is exciting about this digital preservation conference is that it brings together creators (producers) and users (consumers) with digital preservation experts. Most digital preservation conferences end up being a bit of an echo chamber, full of practitioners and vendors only. But what about the creators and the users? What knowledge can we share? What can we learn?

DPASSH is a small conference, but it was an opportunity to see what researchers are creating and how they are engaging with digital collections. For example, in Stefania Forlini’s talk, she discussed the perils of a content-centric digitisation process where unique print artefacts are all treated the same; the process flattens everything into identical objects even though they are very different. What about the materials and the physicality of the object? They have stories to tell as well.

To Forlini, books span several domains of sensory experience, and our digitised collections should reflect that. With the Gibson Project, Forlini and project researchers are trying to find ways to bring some of those experiences back through the Speculative W@nderverse. They are currently experimenting with embossing different kinds of paper with a code that can be read by a computer. The computer can then bring up the science fiction pamphlets that are made of that specific material. A user can then feel the physicality of the digitised item and explore the text, themes and relationships to other items in the collection using generous interfaces. This combines a physical sensory experience with a digital one.

For creators, the decision of what research to capture and preserve is sometimes difficult; often they lack the tools to capture the information. Other times, creators do not have the skills to perform proper archival selection. Athanasios Velios offered a tool solution for digital artists called Artivity. Artivity can capture the actions performed on a digital artwork in certain programs, like Photoshop or Illustrator. This allows the artist to record their creative process and gives future researchers the opportunity to study the creative process. Steph Taylor from CoSector suggested in her talk that creators are archivists now, because they are constantly appraising their digital collections and making selection decisions.  It is important that archivists and digital preservation practitioners empower creators to make good decisions around what should be kept for the long-term.

As a bonus to the conference, I was awarded the ‘Best Tweet’ award by the DPC and DPASSH. It was a nice way to round out two good, informative days. I plan to purchase many books with my gift voucher!

I certainly hope they hold the conference next year, as I think it is important for researchers in the humanities, arts and social sciences to engage with digital preservation experts, archivists and librarians. There is a lot to learn from each other. How often do we get our creators and users in one room with us digital preservation nerds?

Preserving research – update from the Cambridge Technical Fellow

Cambridge’s Technical Fellow, Dave, discusses some of the challenges and questions around preserving ‘research output’ at Cambridge University Library.


One of the types of content we’ve been analysing as part of our initial content survey has been labelled ‘research output’. We knew this was a catch-all term, but (according to the categories in Cambridge’s Apollo Repository), ‘research output’ potentially covers: “Articles, Audio Files, Books or Book Chapters, Chemical Structures, Conference Objects, Datasets, Images, Learning Objects, Manuscripts, Maps, Preprints, Presentations, Reports, Software, Theses, Videos, Web Pages, and Working Papers”. Oh – and of course, “Other”. Quite a bundle of complexity to hide behind one simple ‘research output’ label.

One of the categories in particular, ‘Dataset’, zooms the fractal of complexity in one step further. So far, we’ve only spoken in-depth to a small set of scientists (though our participation on Cambridge’s Research Data Management Project Group means we have a great network of people to call on). However, both meetings we’ve had indicate that ‘Datasets’ are a whole new Pandora’s box of complicated management, storage and preservation challenges.

However – if we pull back from the complexity a little, things start to clarify. One of the scientists we spoke to (Ben Steventon at the Steventon Group) presented a very clear picture of how his research ‘tiered’ the data his team produced, from 2-4 terabyte outputs from a Light Sheet Microscope (at the Cambridge Advanced Imaging Centre), via two intermediate layers of compression and modelling, to ‘delivery’ files only megabytes in size. One aspect of the challenge of preserving such research, then, would seem to be tiering preservation storage media to match the research design.

(I believe our colleagues at JISC, with whom Cambridge is working on the Research Data Management Shared Service Pilot Project, may be way ahead of us on this…)

Of course, tiering storage is only one part of the preservation problem for research data: the same issues of acquisition and retention that have always been part of archiving still apply… But that’s perhaps where the ‘delivery’ layer of the Steventon Group’s research design starts to play a role. In 50 or 100 years’ time, which sets of the research data might people still be interested in? It’s obviously very hard to tell, but perhaps it’s more likely to be the research that underpins the key model: the major finding?

Reaction to the ‘delivered research’ (which included papers, presentations and perhaps three or four more from the list above) plays a big role, here. Will we keep all 4TBs from every Light Sheet session ever conducted, for the entirety of a five or ten-year project? Unlikely, I’d say. But could we store (somewhere cold, slow and cheap) the 4TBs from the experiment that confirmed the major finding?

That sounds a bit more within the realms of possibility, mostly because it feels as if there might be a chance that someone might want to work with it again in 50 years’ time. One aspect of modern-day research that makes me feel this might be true is the complexity of the dependencies between pieces of modern science, and the software it uses in particular (Blender, for example, or Fiji). One could be pessimistic here and paint a negative scenario: what if a major bug is found in one of those apps that calls into question the science ‘above it in the chain’? But there’s an optimistic view here, too… What if someone comes up with an entirely new, more effective analysis method that replaces something current science depends on? Might there not be value in pulling the data from old experiments ‘out of the archive’ and re-running them with the new kit? What would we find?

We’ll be able to address some of these questions in a bit more detail later in the project. However, one of the more obvious things talking to scientists has revealed is that many of them seem to have large collections of images that need careful management. That seems quite relevant to some of the more ‘close to home’ issues we’re looking at right now in The Library.

When was that?: Maintaining or changing ‘created’ and ‘last modified’ dates

Sarah has recently been testing scenarios to investigate the question of changes in file ‘date created’ and ‘last modified’ metadata. When building training, it’s always best to test out your advice before giving it, and below are the results of Sarah’s research, with helpful screenshots.


Before doing some training that involved teaching better recordkeeping habits to staff, I ran some tests to be sure that I was giving the right advice when it came to created and last modified dates. I am often told by people in the field that these dates are always subject to change—but are they really? I knew I would tell staff to put created dates in file names or in document headers in order to retain that valuable information, but could the file maintain the correct embedded date anyway? I set out to test a number of scenarios on both my Mac OS X laptop and Windows desktop.
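
For anyone who wants to repeat these checks without clicking through file properties dialogues, a small script along the following lines will print the dates as the operating system reports them. This is only a sketch: note that st_ctime means creation time on Windows but metadata-change time on Linux, and macOS exposes creation time as st_birthtime.

  import os
  import sys
  from datetime import datetime

  def file_dates(path):
      """Print created and last modified times as the operating system reports them."""
      st = os.stat(path)
      modified = datetime.fromtimestamp(st.st_mtime)
      # On Windows st_ctime is the creation time; macOS exposes st_birthtime;
      # on Linux a true creation time is generally not available this way.
      created = datetime.fromtimestamp(getattr(st, "st_birthtime", st.st_ctime))
      print(f"{path}: created {created}, last modified {modified}")

  if __name__ == "__main__":
      for p in sys.argv[1:]:
          file_dates(p)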

Scenario 1: Downloading from cloud storage (Google Drive)

This was an ALL DATES change for both Mac OS X and Windows.

Scenario 2: Uploading to cloud storage (Google Drive)

Once again this was an ALL DATES change for both systems.

Note: I trialled this a second time with the Google Drive for PC application and in OS X, and found that created and last modified dates do not change when the file is uploaded to or downloaded from the Google Drive folder on the PC. However, when viewed in Google Drive via the website, the created date appears different (as the date/time of upload), though the ‘file info’ will confirm the date has not changed. Just to complicate things.

Scenario 3: Transfer from a USB

Mac OS X had no change to the dates. Windows showed an altered created date, but maintained the original last modified date.

Scenario 4: Transfer to a USB

Once again there was no change to the dates in Mac OS X. Windows showed an altered created date, but maintained the original last modified date.

Note: I looked into scenarios 3 and 4 for Windows a bit further and saw that Robocopy is a command prompt option that will copy directories across while maintaining those date attributes. I copied a ‘TEST’ folder containing the file from the Windows computer to the USB, and back again. It did what was promised and there were no changes to either date in the file. It is a bit annoying that an extra step is required (and one that many people would find technically challenging and therefore avoid).
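
For reference, a Robocopy call along the lines below is one way to copy a folder while keeping the file timestamps (wrapped in Python here only for consistency with the other sketches). The paths are hypothetical and the switches are quoted from my reading of the Robocopy documentation, so do check them against your version of Windows before relying on them.

  import subprocess

  # /E copies subdirectories (including empty ones), /COPY:DAT copies data,
  # attributes and timestamps, and /DCOPY:T keeps directory timestamps.
  # Paths are hypothetical; verify the switches for your Windows version.
  source = r"C:\Users\example\TEST"
  destination = r"E:\TEST"
  subprocess.run(
      ["robocopy", source, destination, "/E", "/COPY:DAT", "/DCOPY:T"],
      check=False,  # Robocopy exit codes 0-7 indicate varying degrees of success
  )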

Scenario 5: Moving between folders

No change on either system. This was a relief for me considering how often I move files around my directories.

Conclusions

When in doubt (and you should always be in doubt), test the scenario. Even when I tested these scenarios three or four times, they did not always come out with the same result. That alone should make one cautious. I still stick to putting the created date in the file name and in the document itself (where possible), but that doesn’t mean I always receive documents that way.

Creating a zip of files/folders before transfer is one method of preserving dates, but I had some weird issues trying to unzip the file in cloud storage that took a few tries before the dates remained preserved. It is also possible to use Quickhash for transferring files unchanged (and it generates a checksum).
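
If you do go down the zip route, something like the following sketch will bundle a folder before transfer. The ZIP format stores each entry’s last modified time inside the archive, though whether that date is restored on extraction depends on the tool used at the other end; the folder name here is a placeholder.

  import zipfile
  from pathlib import Path

  # Bundle a folder into a zip before transfer; the ZIP entries carry the files'
  # last modified times, but restoring them on extraction depends on the tool used.
  source_dir = Path("TEST")  # placeholder folder name
  with zipfile.ZipFile("TEST.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
      for f in source_dir.rglob("*"):
          if f.is_file():
              zf.write(f, str(f.relative_to(source_dir.parent)))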

I ignored the last accessed date during testing, because it was too easy to accidentally double-click a file and change it (as you can see happened to my Windows 7 test version).

Has anyone tested any other scenarios to assess when file dates are altered? Does anyone have methods for transferring files without causing any change to dates?

An approach to selecting case studies

Cambridge Policy & Planning Fellow, Somaya, writes about a case study approach developed by the Cambridge DPOC Fellows for CUL. Somaya’s first blog post about the case studies looks at the selection methodology the Cambridge DPOC fellows used to choose their final case studies.


Physical format digital carriers. Photo: Somaya Langley

Background & approach

Cambridge University Library (CUL) has moved to a ‘case study’ approach to the project. The case studies will provide an evidence-based foundation for writing a policy and strategy, developing a training programme and writing technical requirements within the time constraints of the project. The case studies we choose for the DPOC project will enable us to test hands-on the day-to-day tasks necessary for working with digital collection materials at CUL. They also need to be representative of our existing collections and future acquisitions, our Collection Development Policy Framework and Strategic Plan, and our current and future audiences, while considering the ‘preservation risk’ of the materials.

Classes of material

Based on the digital collections surveying work I’ve been doing, our digital collections fall into seven different ‘classes’:

  1. Unpublished born-digital materials – personal and corporate papers, digital archives of significant individuals or institutions
  2. Born-digital university archives – selected records of the University of Cambridge
  3. Research outputs – research data and publications (including compliance)
  4. Published born-digital materials – physical format carriers (optical media), eBooks, web archives, archival and access copies of electronic subscription services, etc.
  5. Digitised image materials – 2D photography (and 3D imaging)
  6. Digital (and analogue) audiovisual materials – moving image (film and video) and sound recordings
  7. In-house created content – photography and videography of events, lectures, photos of conservation treatments, etc.

Proposed case studies

Approximately 40 potential case studies suggested by CUL and Affiliated Library staff were considered. These proposed case studies were selected from digital materials in our existing collections, current acquisition offers, and requests for assistance with digital collection materials, from across Cambridge University. Each proposed case study would allow us to trial different tools (and digital preservation systems), approaches, workflow stages, and represent different ‘classes’ of material.

Digital lifecycle stages

The selected stages are based on a draft Digital Stewardship End-to-End Workflow I am developing. The workflow includes approximately a dozen different stages. It is based on the Digital Curation Centre’s Curation Lifecycle Model, and is also aligned with the Digital POWRR (Preserving Digital Objects with Restricted Resources) Tool Evaluation Grid.

There are also additional essential concerns, including:

  • data security
  • integration (with CUL systems and processes)
  • preservation risk
  • remove and/or delete
  • reporting
  • resources and resourcing
  • system configuration

Selected stages for Cambridge’s case studies

Dave, Lee and I discussed the stages and cut them down to the bare minimum required to test out various tasks as part of the case studies. These stages include:

  1. Appraise and Select
  2. Acquire / Transfer
  3. Pre-Ingest (including Preconditioning and Quality Assurance)
  4. Ingest (including Generate Submission Information Package)
  5. Preservation Actions (sub-component of Preserve)
  6. Access and Delivery
  7. Integration (with Library systems and processes) and Reporting

Case study selection

In order to produce a shortlist, I needed to work out the parameter best suited to ranking the proposed case studies from a digital preservation perspective. The initial parameter we decided on was complexity. Did the proposed case study provide enough technical challenges to fully test out what we needed to research?

We also took into account a Streams Matrix (still in development) that outlines the different tasks undertaken at each of the selected digital lifecycle stages. This would ensure different variations of activities were factored in at each stage.

We then revisited the case studies in ranked order and reviewed them, taking into account additional parameters. The additional parameters included:

  • Frequency and/or volume – how much of this type of material do we have/are we likely to acquire (i.e. is this a type of task that would need to be carried out often)?
  • Significance – how significant is the collection in question?
  • Urgency – does this case study fit within strategic priorities such as the current Cambridge University Library Strategic Plan and Collection Development Policy Framework etc.?
  • Uniqueness – is the case study unique and would it be of interest to our users (e.g. the digital preservation field, Cambridge University researchers)?
  • Value to our users and/or stakeholders – is this of value to our current and future users, researchers and/or stakeholders?

This produced a shortlist of eight case studies. We concluded that each presented different long-term digital preservation issues and carried a considerable degree of ‘preservation risk’.

Conclusion

This was a challenging and time-consuming approach; however, it ensures fairness in the selection process. The case studies will give us tangible evidence in which to ground the work of the rest of the project. The Cambridge University Library Polonsky Digital Preservation Project Board have agreed that we will undertake three case studies: a digitisation case study, a born-digital case study and one more, the details of which are still being discussed. Stay tuned for more updates.

Validating half a million TIFF files. Part One.

Oxford Technical Fellow, James, reports on the validation work he is doing with JHOVE and DPF Manager in Part One of this blog series on validation tools for auditing the Polonsky Digitization Project’s TIFF files.


In 2013, The Bodleian Libraries of the University of Oxford and the Biblioteca Apostolica Vaticana (Vatican Library) joined efforts in a landmark digitization project. The aim was to open up their repositories of ancient texts including Hebrew manuscripts, Greek manuscripts, and incunabula, or 15th-century printed books. The goal was to digitize over one and a half million pages. All of this was made possible by funding from the Polonsky Foundation.

As part of our own Polonsky funded project, we have been preparing the ground to validate over half a million TIFF files which have been created from digitization work here at Oxford.

Many in the digital preservation field have already written articles and blog posts on the tools available for validating TIFF files. Yvonne Tunnat (from the ZBW Leibniz Information Centre for Economics) wrote a blog post for the Open Preservation Foundation about these tools. I also had the pleasure of hearing Yvonne and Michelle Lindlar (from the TIB Leibniz Information Centre for Science and Technology) speak on this very subject in more detail at the IDCC 2017 conference, in their talk How Valid Is Your Validation? A Closer Look Behind The Curtain Of JHOVE.

The go-to validator for TIFF files?

Preparation for validation

In order to validate the master TIFF files, we first needed to retrieve them from our tape storage system; fortunately, around two-thirds of the images had already been restored to spinning disk storage as part of another internal project. When the master TIFF files were written to tape, MD5 hashes of the files were included, so as part of this validation work we will also confirm the fixity of all the files. Our network storage system had plenty of room to accommodate all the required files, so we began auditing what still needed to be recovered.
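
Re-computing and comparing those stored MD5 hashes is straightforward to script; a sketch of the sort of check involved is below. The file name and expected hash here are placeholders, and the real comparison will run against the hashes recorded when the files were written to tape.

  import hashlib

  def md5sum(path, chunk_size=1 << 20):
      """Stream a file through MD5 so large TIFFs never need to fit in memory."""
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              h.update(chunk)
      return h.hexdigest()

  def check_fixity(path, expected_md5):
      actual = md5sum(path)
      return actual.lower() == expected_md5.lower(), actual

  # Placeholder file name and hash, for illustration only.
  ok, actual = check_fixity("example.tif", "d41d8cd98f00b204e9800998ecf8427e")
  print("fixity OK" if ok else f"MISMATCH: {actual}")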

Whilst the auditing and retrieval were progressing, I set about validating a sample set of master TIFF files using both JHOVE and DPF Manager, to get an estimate of the time it would take to process the approximately 50 TB of files. I was also interested to compare the results of the two tools when faced with invalid or corrupted sample sets of files.
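
To give an idea of how the batch runs were driven, the sketch below loops JHOVE over a folder of TIFFs from Python. The directory paths are placeholders, and the module name (TIFF-hul) and command-line switches are taken from my reading of the JHOVE documentation, so treat them as assumptions to check against your own installation.

  import subprocess
  from pathlib import Path

  # Run JHOVE's TIFF module over a folder of masters, writing one XML report per file.
  # Paths are placeholders; the switches are assumptions to verify against your JHOVE install.
  def validate_tiffs(tiff_dir, report_dir):
      report_dir = Path(report_dir)
      report_dir.mkdir(parents=True, exist_ok=True)
      for tiff in sorted(Path(tiff_dir).glob("*.tif")):
          report = report_dir / (tiff.stem + ".xml")
          subprocess.run(
              ["jhove", "-m", "TIFF-hul", "-h", "xml", "-o", str(report), str(tiff)],
              check=True,
          )

  validate_tiffs("/data/masters/sample", "/data/jhove-reports")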

We set up a new virtual machine server to carry out the validation workload; this allowed us to scale the machine’s performance as required. Both validation tools would be run on a Red Hat Linux environment, from the command line.

It quickly became clear that JHOVE was going to be able to validate the TIFF files a lot quicker than DPF Manager. If DPF Manager is being used as part of one of your workflows, you may not have noticed any real time penalty when processing small numbers of files; however, with a large batch, the time difference between the two tools was noticeable.

Potential alternative for TIFF validation?

During the testing I noticed several issues with DPF Manager, including not being able to specify the number of threads the process could use, which I suspect resulted in the poor initial performance. I dutifully reported the bug on the DPF community GitHub and was pleased to see an almost instant response stating that it would be resolved in the next monthly release. I do love open source projects, and I think this highlights the importance of those using the tools taking responsibility for improving them. Without community engagement, these projects are liable to run out of steam and slowly die.

I’m going to reserve judgement on the tools until the next release of DPF Manager. We will then also be in a position to report back on our findings from this validation case study. So check back with our blog for Part Two.

I would be interested to hear from anyone else who has been faced with validating large batches of files. What tools are you using? What challenges have you faced? Do let me know!

Visit to the National Archives: herons and brutalism

An update from Edith Halvarsson about the DPOC team’s trip to visit the National Archives last week. Prepare yourself for a discussion about digital preservation, PRONOM, dark archives, and wildlife!


Last Thursday DPOC visited the National Archives in London. David Clipsham kindly put much time into organising a day of presentations with the TNA’s developers, digitization experts and digital archivists. Thank you Diana, David & David, Ron, Ian & Ian, Anna and Alex for all your time and interesting thoughts!

After some confusion, we finally arrived at the picturesque Kew Gardens station. The area around Kew is very sleepy, and our first thought on arrival was “is this really the right place?” However, after a bit more circling around Kew, you definitely cannot miss it. The TNA is located in an imposing brutalist building, surrounded by beautiful nature and ponds built as flood protection for the nation’s collections. They even have a tame heron!

After we all made it on site, the day kicked off with an introduction from Diana Newton (Head of Digital Preservation). Diana told us enthusiastically about the history of the TNA and its Digital Records Infrastructure (DRI). It was really interesting to hear how much has changed in just six years since DRI was launched – both in terms of file format proliferation and an increase in FOI requests.

We then had a look at TNA’s ingest workflows into Preservica and storage model with Ian Hoyle (Senior Developer) and David Underdown (Senior Digital Archivist). It was particularly interesting to hear about the TNA’s decision to store all master file content on offline tape, in order to bring down the archive’s carbon footprint.

After lunch with Ron Davies (Senior Project Manager), Anna de Sousa and Ian Henderson spoke to us about their work digitizing audiovisual material and 2D images. Much of our discussion focused on standards and formats (particularly around A/V). Alex Green and David Clipsham then finished off the day talking about born-digital archive accession streams and PRONOM/DROID developments. This was the first time we had seen the clever way a file format identifier is created – there is much detective work required on David’s side. David also encouraged us, and anyone else who relies on DROID, to have a go and submit something to PRONOM – he even promised it’s fun! Why not read Jenny Mitcham’s and Andrea Byrne’s articles for some inspiration?

Thanks for a fantastic visit and some brilliant discussions on how digital preservation work and digital collecting is done at the TNA!