Digital Preservation futurology

I fancy attempting futurology, so here’s a list of things I believe could happen to ‘digital preservation systems’ over the next decade. I’ve mostly pinched these ideas from folks like Dave Thompson, Neil Jefferies, and my fellow Fellows. But if you see one of your ideas, please claim it using the handy commenting mechanism. And because it’s futurology, it doesn’t have to be accurate, so kindly contradict me!

Ingest becomes a relationship, not a one-off event

Many of the core concepts underpinning how computers are perceived to work are crude, paper-based metaphors – e.g. ‘files’, ‘folders’, ‘desktops’, ‘wastebaskets’ etc – that don’t relate to what your computer’s actually doing. (The early players in office computing were typewriter and photocopier manufacturers, after all…) These metaphors have succeeded at getting everyone to use computers, but they’ve also suppressed various opportunities to work smarter.

The concept of ingesting (oxymoronic) ‘digital papers’ is obviously heavily influenced by this paper paradigm.  Maybe the ‘paper paradigm’ has misled the archival community about computers a bit, too, given that they were experts at handling ‘papers’ before computers arrived?

As an example of what I mean: in the olden days (25 whole years ago!), Professor Plum would amass piles of important papers until the day he retired / died, and then, and only then, could these personal papers be donated and archived. Computers, of course, make it possible for the Prof both to keep his ‘papers’ where he needs them and donate them at the same time, but the ‘ingest event’ at the centre of current digital preservation systems still seems to be underpinned by a core concept of ‘piles of stuff needing to be dealt with as a one-off task’. In future, the ‘ingest’ of a ‘donation’ will actually become a regular, repeated set of occurrences based upon ongoing relationships between donors and collectors, forged initially when Profs are but lowly postgrads. Personal Digital Archiving and Research Data Management will become key; and ripping digital ephemera from dying hard disks will become less necessary as those practices take hold.

The above depends heavily upon…

Object versioning / dependency management

Of course, if Dr. Damson regularly donates materials from her postgrad days onwards, some of these may be updates to things donated previously. Some of them might have mutated so much since the original donation that they can be considered ‘child’ objects, which may have ‘siblings’ with ‘common ancestors’ already extant in the archive. Hence preservation systems need to manage multiple versions of ‘digital objects’, and the relationships between them.

Some of the preservation systems we’ve looked at claim to ‘do versioning’ but it’s a bit clunky – just side-by-side copies of immutable ‘digital objects’, not records of the changes from one version to the next, and with no concept of branching siblings from a common parent. Complex structures of interdependent objects are generally problematic for current systems. The wider computing world has been pushing at the limits of the ‘paper-paradigm’ immutable object for a while now (think Git, Blockchain, various version control and dependency management platforms, etc). Digital preservation systems will soon catch up.
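As a rough illustration of where that catching-up might head, here is a minimal sketch (the class and field names are invented for this example, not drawn from any actual preservation system) of content-addressed versioning in which each version records its parent(s), so siblings branching from a common ancestor fall out naturally rather than being stored as unrelated side-by-side copies:

```python
import hashlib
import json

class VersionStore:
    """Toy content-addressed version store (invented for this sketch):
    each version records its parent versions, so branching – 'siblings' of a
    'common ancestor' – comes for free."""

    def __init__(self):
        self.versions = {}  # version id -> version record

    def commit(self, content: bytes, metadata: dict, parents=()):
        record = {
            "content_sha256": hashlib.sha256(content).hexdigest(),
            "metadata": metadata,
            "parents": list(parents),
        }
        # The version id is derived from the record itself, Git-style.
        version_id = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        self.versions[version_id] = record
        return version_id

store = VersionStore()
v1 = store.commit(b"draft thesis chapter", {"donor": "Dr. Damson"})
# Two later donations derived independently from the same original:
v2a = store.commit(b"published article", {"donor": "Dr. Damson"}, parents=[v1])
v2b = store.commit(b"conference paper", {"donor": "Dr. Damson"}, parents=[v1])
```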

Further blurring of the object / metadata boundary

What’s more important, the object or the metadata? The ‘paper-paradigm’ has skewed thinking towards the former (the sacrosanct ‘digital object’, comparable to the ‘original bit of paper’), but after you’ve digitised your rare book collection, what are Humanities scholars going to text-mine? It won’t be images of pages – it’ll be the transcripts of those (i.e. the ‘descriptive metadata’)*. Also, when seminal papers about these text mining efforts are published, how is this history of the engagement with your collection going to be recorded? Using a series of PREMIS Events (that future scholars can mine in turn), perhaps?
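For instance, a minimal sketch of recording that engagement as a PREMIS-style event might look like the following; the field names loosely follow the PREMIS data dictionary, while the identifiers and values are invented for illustration (real PREMIS is usually serialised as XML and has richer identifier structures):

```python
import json
from datetime import datetime, timezone

# A simplified, PREMIS-style event record describing a text-mining run
# over a digitised collection (identifiers below are hypothetical).
text_mining_event = {
    "eventType": "analysis",
    "eventDateTime": datetime.now(timezone.utc).isoformat(),
    "eventDetail": "Named-entity extraction run over OCR transcripts",
    "eventOutcome": "success",
    "linkingObjectIdentifier": "rare-books/ocr-corpus",       # hypothetical object
    "linkingAgentIdentifier": "humanities-text-mining-team",  # hypothetical agent
}

print(json.dumps(text_mining_event, indent=2))
```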

The above talk of text mining and contextual linking of secondary resources raises two more points…

* While I’m here, can I take issue with the term ‘descriptive metadata’? All metadata is descriptive. It’s tautological; like saying ‘uptight Englishman’. Can we think of a better name?

Ability to analyse metadata at scale

‘Delivery’ no longer just means ‘giving users a viewer to look at things one-by-one with’ – it now also means ‘letting people push their Natural Language or image processing algorithms to where the data sits, and then coping with vast streams of output data’.

Storage / retention informed by well-understood usage patterns

The fact that everything’s digital, and hence easier to disseminate and link together than physical objects, also means we can better understand how people use our material. This doesn’t just mean ‘wiring things up to Google Analytics’ – advances in bibliometrics that add social / mainstream media analysis, and so forth, to everyday citation counts present opportunities to judge the impact of our ‘stuff’ on the world like never before. Smart digital archives will inform their storage management and retention decisions with this sort of usage information, potentially in fully or semi-automated ways.

Ability to get data out, cleanly – all systems are only ever temporary!

Finally – it’s clear that there are no ‘long-term’ preservation system options. The system you procure today will merely be ‘custodian’ of your materials for the next ten or twenty years (if you’re lucky). This may mean moving heaps of content around in future, but perhaps it’s more pragmatic to think of future preservation systems as more like ‘lenses’ that are laid on top of more stable data stores to enable as-yet-undreamt-of functionality for future audiences?

(OK – that’s enough for now…)

Six Priority Digital Preservation Demands

Somaya Langley, Cambridge Policy and Planning Fellow, talks about her top 6 demands for a digital preservation system.


Photo: Blazej Mikula, Cambridge University Library

As a former user of one digital preservation system (Ex Libris’ Rosetta), I have spent a few years frustrated by the gap between the activities that need to be done as part of an end-to-end digital stewardship workflow – including packaging and ingesting ‘information objects’ (files and associated metadata) – and the maturity level of digital preservation systems.

Digital Preservation Systems Review

At Cambridge, we are looking at different digital preservation systems and what each one can offer. This has involved talking to both vendors and users of systems.

When I’m asked what my top current or future digital preservation system requirements are, it’s excruciatingly hard to limit myself to a handful of things. However, having previously been involved in a digital preservation system implementation project, there are some high-level takeaways from past experiences that remain with me.

Shortlist

Here’s the current list of my six top ‘digital preservation demands’ (aka user requirements):

Integration (with various other systems)

A digital preservation ‘system’ is only one cog in a much larger machine; one piece of a much larger puzzle. There is an entire ‘digital ecosystem’ that this ‘system’ should exist within, and end-to-end digital stewardship workflows are of primary importance. The right amount of metadata and/or files should flow from one system to another. We must also know where the ‘source of truth’ is for each bit.

Standards-based

This seems like a no-brainer. We work in Library Land. Libraries rely on standards. We also work with computers and other technologies that require standard ways (protocols etc.) of communicating.

For files and metadata to flow from one system to another – whether via import, ingest, export, migration or an exit strategy from a system – we already spend a bunch of time creating mappings and crosswalks from one standard (or implementation of a standard) to another. If we don’t use (or fully implement) existing standards, we risk mangling data, context or meaning; potentially losing or not capturing parts of the data; or just wasting a whole lot of time.
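To make the ‘mappings and crosswalks’ point concrete, here is a minimal sketch of a field-level crosswalk. The field names on both sides are illustrative only, not a complete or authoritative mapping between any two real standards; note that unmapped fields are kept rather than silently dropped, which is exactly the ‘mangling’ risk described above:

```python
# Illustrative field-level crosswalk from Dublin Core-ish fields to an
# internal schema; the names on both sides are examples only.
CROSSWALK = {
    "dc:title": "title",
    "dc:creator": "creator",
    "dc:date": "date_created",
    "dc:format": "mime_type",
}

def apply_crosswalk(record: dict) -> dict:
    mapped, unmapped = {}, {}
    for field, value in record.items():
        if field in CROSSWALK:
            mapped[CROSSWALK[field]] = value
        else:
            unmapped[field] = value  # keep, don't silently drop
    return {"mapped": mapped, "unmapped": unmapped}

print(apply_crosswalk({"dc:title": "Letters, 1992-1998", "dcterms:provenance": "Gift"}))
```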

Error Handling (automated, prioritised)

There’s more work to be done in managing digital materials than there are people to do it. Content creation is increasing at an exponential rate, while the number of staff (with the right skills) isn’t keeping pace. We have to be smart about how we work. This requires prioritisation.

We need to have smarter systems that help us. This includes helping to prioritise where we focus our effort. Digital preservation systems are increasingly incorporating new third-party tools. We need to know which tool reports each error and whether these errors are show-stoppers or not. (For example: is the content no longer renderable versus a small piece of non-critical descriptive metadata that is missing?) We have to accept that, for some errors, we will never get around to addressing them.
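As a sketch of what prioritised error handling might look like in practice (the severity scheme, error codes and tool names below are invented for illustration): errors reported by third-party tools are normalised and ranked, so staff attention goes to the show-stoppers first and the ‘never get around to it’ category becomes an explicit, conscious choice:

```python
# Invented severity scheme: which errors block ingest, which can wait,
# and which we may consciously never fix.
SEVERITY = {
    "file-not-renderable": "critical",
    "checksum-mismatch": "critical",
    "format-unidentified": "major",
    "missing-descriptive-field": "minor",
}

def triage(tool_reports):
    """tool_reports: list of (tool_name, error_code, object_id) tuples."""
    order = {"critical": 0, "major": 1, "minor": 2}
    return sorted(
        tool_reports,
        key=lambda r: order.get(SEVERITY.get(r[1], "minor"), 2),
    )

reports = [
    ("metadata-validator", "missing-descriptive-field", "obj-042"),
    ("fixity-checker", "checksum-mismatch", "obj-017"),
]
for tool, code, obj in triage(reports):
    print(f"{SEVERITY.get(code, 'minor'):>8}  {obj}  {code}  (reported by {tool})")
```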

Reporting

We need to be able to report to different audiences. The different reporting classes include (but are not limited to):

  1. High-level reporting – annual reports, monthly reports, reports to managers, projections, costings etc.
  2. Collection and preservation management reporting – reporting on successes and failures, overall system stats, rolling checksum verification etc.
  3. Reporting for preservation planning purposes – based on preservation plans, we need to be able to identify subsections of our collection (configured around content types, context, file format and/or whatever other parameters we choose to use) and report on potential candidates that require some kind of preservation action.
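As a minimal sketch of that third reporting class, here is what identifying candidates for a preservation action might look like; the inventory rows, format identifiers (PRONOM-style, used loosely) and proposed action are all invented for the example:

```python
# Hypothetical inventory rows: (object_id, format_id, content_type).
INVENTORY = [
    ("obj-001", "fmt/44", "image"),
    ("obj-002", "x-fmt/111", "text"),
    ("obj-003", "fmt/44", "image"),
]

# Preservation plan: formats flagged as at-risk and the proposed action.
WATCH_LIST = {"fmt/44": "migrate to TIFF"}

def candidates_for_action(inventory, watch_list):
    """Yield objects whose format appears on the preservation plan's watch list."""
    for object_id, format_id, content_type in inventory:
        if format_id in watch_list:
            yield object_id, format_id, watch_list[format_id]

for object_id, format_id, action in candidates_for_action(INVENTORY, WATCH_LIST):
    print(f"{object_id}: {format_id} -> {action}")
```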

Provenance

We need to record – via metadata – where a file has come from. This, for want of a better approach, is currently being handled by the digital preservation community through documenting changes as Provenance Notes. Digital materials acquired into our collections are not just the files, they’re also the metadata. (Hence why I refer to them as ‘information objects’.) When an ‘information object’ has been bundled, and is ready to be ingested into a system, I think of it as becoming an ‘information package’.

There’s a lot of metadata (administrative, preservation, structural, technical) that accumulates along the path from an object’s creation to the point at which it becomes an ‘information package’. We need to ensure we’re capturing and retaining the important components of this metadata. Those components we deem essential must travel alongside their associated files into a preservation system. (Not all files will have any, or even the right, metadata embedded within the file itself.) Standardised ways of handling the information held in Provenance Notes (whether these come from ‘outside the system’ or are created by the digital preservation system) and event information, so that it can be interrogated and reported on, are crucial.
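As a rough, BagIt-flavoured sketch of keeping that metadata travelling alongside its files (simplified for illustration; the layout and note fields are not a full or conformant BagIt implementation): the payload goes under data/, a checksum manifest is generated, and the provenance notes are written into the package rather than left behind in a source system:

```python
import hashlib
import json
from pathlib import Path

def build_package(files: dict, provenance_notes: list, dest: Path) -> None:
    """files: {relative name: bytes}. Writes a simplified, BagIt-flavoured
    'information package': payload under data/, a sha256 manifest, and the
    provenance notes serialised alongside the payload so they travel with it."""
    data_dir = dest / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in files.items():
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (dest / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    (dest / "provenance.json").write_text(json.dumps(provenance_notes, indent=2))

build_package(
    {"letter-1998.pdf": b"%PDF-1.4 ..."},
    [{"note": "Received from donor's laptop", "agent": "S. Langley", "date": "2017-06-02"}],
    Path("example-package"),
)
```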

Managing Access Rights

Facilitating access is not black and white. Collections are not simply ‘open’ or ‘closed’. We have a myriad of ways that digital material is created and collected; we need to ensure we can provide access to this content in a variety of ways that support both the content and our users. This can include access from within an institution’s building, via a dedicated computer terminal, online access to anyone in the world, mediated remote access, access to only subsets of a collection, support for embargo periods, ensuring we respect cultural sensitivities or provide access to only the metadata (perhaps as large datasets) and more.
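As a minimal sketch of how such ‘not black and white’ rules might be expressed as data rather than hard-coded as open/closed (the policy fields, collection names and user attributes below are invented for the example):

```python
from datetime import date

# Invented policy structure: each collection (or subset) carries its own rules.
POLICIES = {
    "collection-a": {"embargo_until": date(2027, 1, 1), "access": "onsite-terminal"},
    "collection-b": {"embargo_until": None, "access": "open"},
    "collection-c": {"embargo_until": None, "access": "metadata-only"},
}

def allowed(collection, user_location, wants, today=None):
    """wants is 'content' or 'metadata'; user_location is e.g. 'reading-room'."""
    today = today or date.today()
    rule = POLICIES[collection]
    if rule["embargo_until"] and today < rule["embargo_until"]:
        return False
    if rule["access"] == "open":
        return True
    if rule["access"] == "onsite-terminal":
        return user_location == "reading-room" and wants == "content"
    if rule["access"] == "metadata-only":
        return wants == "metadata"
    return False

print(allowed("collection-a", "remote", "content"))    # False: still embargoed
print(allowed("collection-c", "remote", "metadata"))   # True: metadata is open
```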

We must set a goal of working towards providing access to our users in the many different (and new-ish) ways they actually want to use our content.

It’s imperative to keep in mind that the whole purpose of preserving digital materials is to be able to access them (in many varied ways). Provision of content ‘viewers’ and facilitating other modes of access (e.g. to large datasets of metadata) are essential.

Final note: I never said addressing these concerns was going to be easy. We need to factor each in and make iterative improvements, one step at a time.

Designing digital preservation training – it’s more than just talking

Sarah, Oxford’s Outreach and Training Fellow, writes about the ‘training cycle’ and concludes that delivering useful training is more than just talking at learners.


We have all been there before: trying to keep our eyes open as someone drones on at the front of the room, while the PowerPoint slides seem to contain a novella you have to squint to read. That’s not how training is supposed to go.

Rather, engaging your learners in a variety of activities will help them retain knowledge. And in a field like digital preservation, the more hands-on the training, the better. So often we talk about concepts or technical tools, but we very rarely provide examples, demonstrate them, or (better yet) have staff experiment with them.

And delivery is just one small part of the training process. I’ve learned there are many steps involved in developing a course that will be of use to staff. Most of your time will not be spent in the training room.

Identifying Learners’ Needs

Often easier said than done. It’s better to prepare for all types of learners and pitch the material to a wide audience. With hands-on tasks, it’s possible to have additional work prepared for advanced learners, so they don’t get bored while other learners are still working through the task.

Part of the DPOC project has been about finding the gaps in digital preservation skills and knowledge, so that our training programmes can better meet staff’s needs. What I am learning is that I need to cast my net wide to reach everyone!

Planning and Preparation

The hard bit. Start with what your outcomes are going to be and try not to put too many into a session. It’s too easy to be over-ambitious. Once you have them, pick your activities, gather your materials (create that PowerPoint) and practise! Never underestimate the value of practising your session on your peers beforehand.

Teaching and Learning

The main event. It’s important to be confident, open and friendly as a trainer. I admit, I stand in the bathroom and do a “Power Pose” for a few minutes to psych myself up. You are allowed nerves as a trainer! It’s important to be flexible during the course.

Assessment

Because training isn’t just about Teaching and Learning. That only accounts for 1/5th of the training cycle. Assessment is another 1/5th and, if it’s going to happen during the course, it needs to be planned. Using a variety of the activities listed below will help with that. Be aware though: activities almost always take longer than you plan!

Activities to facilitate learning:

  • questioning
  • group activities such as case studies, card sorting, mindmapping, etc.
  • hands-on tasks with software
  • group discussions
  • quizzes and games
  • modelling and demonstrations followed by an opportunity to practise the skill

Evaluation

Your evaluation is crucial to this. Make notes after your session on what you liked and what you need to fix. Peer evaluation is also important and sending out surveys immediately after will help with response rates. However, if you can do a paper evaluation at the end of the course, your response rates will be higher. Use that feedback to improve the course, tweak activities and content, so that you can start all over again.

Planning is a verb

In her DPC webinar on October 19, Nancy McGovern (MIT Libraries) spoke about ‘Preservation Planning and Maturity Modelling’. Maturity models are a great way to measure our progress as we look to solve some of our institutions’ digital preservation issues. Without them, digital preservation would be an unending task with no benchmarks and no goals. And among the things that stuck out in the talk were these words of wisdom from Nancy:

Planning is a verb, it is not something you can do once and you’re done.

This is something that I think sits at the heart of digital preservation: this is not something we “do” and we’re done. Technology is constantly changing and requires continual monitoring for new tools, applications, and obsolescence. This constantly shifting environment means there is no single, one-time solution to digital preservation. It is a coordinated effort between “technology, decision-making, and people.” None of these things remains constant; all are ever-changing. Decision-making tools (such as policies) and people (skills) are also the hardest parts of digital preservation, because there is no one-size-fits-all for either. In comparison, technology is relatively easy to manage and plan for.

Having maturity models provides the stepping stones for developing technology, decision-making, and people. Viewed all at once, the task of implementing a sustainable digital preservation programme seems daunting, but following the steps makes it manageable and measurable. One such maturity model is The Five Organizational Stages of Digital Preservation (from Kenney & McGovern):

  1. Acknowledge: Understanding that digital preservation is a local concern;
  2. Act: Initiating digital preservation projects;
  3. Consolidate: Segueing from projects to programs;
  4. Institutionalize: Incorporating the larger environment; and
  5. Externalize: Embracing inter-institutional collaboration and dependency.

(This is just one of many maturity models available, but it was the one referenced in the webinar.)

And when Nancy spoke about this maturity model, she stressed that your organisation might reach level 5, but it might not stay at level 5 forever. The loss of an integral staff member, a shift in technology, or even starting a new digital collection or department could shift the balance again. This discussion only further reinforced for me that digital preservation is not something you can “set and forget,” but an ongoing process.

Planning is also an important function in the OAIS reference model (preservation planning sits over the entire model). It is about monitoring external environments and recommending revisions or changes where necessary. Planning is essentially the “safeguard against a constantly evolving user and technology environment” (Lavoie, 2014). Where people and technology are involved, we face an ever-changing future; we must continually monitor and plan in order to provide long-term access to our digital assets.

After all, planning is a verb, isn’t it?


What do you think? Is digital preservation a solution you can do once and be done with or does it require ongoing support and development? Or something else entirely? Join the discussion below:

Come Join the DPOC Team!

The Bodleian Libraries, University of Oxford are looking for the third Polonsky Fellow (Technical Officer/Research Software Engineer) to join the team! 

As a Technical Officer/Research Software Engineer at Oxford you will undertake research and training to build upon your expertise in the technical issues surrounding digital preservation and your awareness of the tools, systems and projects that seek to address these issues. You will also develop and/or implement digital preservation applications and services within the Bodleian Libraries, contribute to the development of a business case and sustainability plan for digital preservation operations, disseminate the key findings of your work to at least one conference and submit one journal article per year based on your work in collaboration with colleagues.

If you’re interested in joining this project and want more information, apply here.


Remember, you get to work with these great team members at Oxford and Cambridge!