The vision for a preservation repository

Over the last couple of months, work at Cambridge University Library has begun on what a potential digital preservation system might look like, considering the technical infrastructure, the key stakeholders and the policies that underpin it. Technical Fellow, Dave, tells us more about the holistic vision…


This post discusses some of the work we’ve been doing to lay foundations beneath the requirements for a ‘preservation system’ here at Cambridge. In particular, we’re looking at the core vision for the system. It comes with the standard ‘work in progress’ caveats – do not be surprised if the actual vision varies slightly (or more) from what’s discussed here. A lot of the below comes from Mastering the Requirements Process by Suzanne and James Robertson.

Also – it’s important to note that what follows is based upon a holistic definition of ‘system’ – a definition that’s more about what people know and do, and less about Information Technology, bits of tin and wiring.

Why does a system change need a vision?

New systems represent changes to the existing status quo. The vision is like the Pole Star for such a change effort – it ensures that people have something fixed to move towards when they’re buried under minute details. When confusion reigns, you can point to the vision for the system to guide you back to sanity.

Plus, as with all digital efforts, none of this is real: there’s no definite, obvious end point to the change. So the vision will help us recognise when we’ve achieved what we set out to do.

Establishing scope and context

Defining what the system change isn’t is a particularly good way of working out what it actually represents. This can be achieved by thinking about the systems around the area you’re changing and the information that’s going to flow in and out. This sort of thinking makes for good diagrams: one that shows how a preservation repository system might sit within the broader ecosystem of digitisation, research outputs / data, digital archives and digital published material is shown below.

System goals

Being able to concisely sum up the key goals of the system is another important part of the vision. This is a lot harder than it sounds and there’s something journalistic about it – what you leave out is definitely more important than what you keep in. Fortunately, the vision is about broad brush strokes, not detail, which helps at this stage.

I found some great inspiration in Sustainable Economics for a Digital Planet, which indicated goals such as: “the system should make the value of preserving digital resources clear”, “the system should clearly support stakeholders’ incentives to preserve digital resources” and “the functional aspects of the system should map onto clearly-defined preservation roles and responsibilities”.

Who are we implementing this for?

The final main part of the ‘vision’ puzzle is the stakeholders: who is going to benefit from a preservation system? Who might not benefit directly, but really cares that one exists?

Any significant project is likely to have a LOT of these, so the Robertsons suggest breaking the list down by proximity to the system (using Ian Alexander’s Onion Model), from the core team that uses the system, through the ‘operational work area’ (i.e. those with the need to actually use it) and out to interested parties within the host organisation, and then those in the wider world beyond. An initial attempt at thinking about our stakeholders this way is shown below.

One important thing that we realised was that it’s easy to confuse ‘closeness’ with ‘importance’: there are some very important stakeholders in the ‘wider world’ (e.g. Research Councils or historians) that need to be kept in the loop.

A proposed vision for our preservation repository

After iterating through all the above a couple of times, the current working vision (subject to change!) for a digital preservation repository at Cambridge University Library is as follows:

The repository is the place where the best possible copies of digital resources are stored, kept safe, and have their usefulness maintained. Any future initiatives that need the most perfect copy of those resources will be able to retrieve them from the repository, if authorised to do so. At any given time, it will be clear how the digital resources stored in the repository are being used, how the repository meets the preservation requirements of stakeholders, and who is responsible for the various aspects of maintaining the digital resources stored there.

Hopefully this will give us a clear concept to refer back to as we delve into more detail throughout the months and years to come…

Digital Preservation futurology

I fancy attempting futurology, so here’s a list of things I believe could happen to ‘digital preservation systems’ over the next decade. I’ve mostly pinched these ideas from folks like Dave Thompson, Neil Jefferies, and my fellow Fellows. But if you see one of your ideas, please claim it using the handy commenting mechanism. And because it’s futurology, it doesn’t have to be accurate, so kindly contradict me!

Ingest becomes a relationship, not a one-off event

Many of the core concepts underpinning how computers are perceived to work are crude, paper-based metaphors – e.g. ‘files’, ‘folders’, ‘desktops’, ‘wastebaskets’ etc – that don’t relate to what your computer’s actually doing. (The early players in office computing were typewriter and photocopier manufacturers, after all…) These metaphors have succeeded at getting everyone to use computers, but they’ve also suppressed various opportunities to work smarter.

The concept of ingesting (oxymoronic) ‘digital papers’ is obviously heavily influenced by this paper paradigm. Maybe the ‘paper paradigm’ has misled the archival community about computers a bit, too, given that they were experts at handling ‘papers’ before computers arrived?

As an example of what I mean: in the olden days (25 whole years ago!), Professor Plum would amass piles of important papers until the day he retired / died, and then, and only then, could these personal papers be donated and archived. Computers, of course, make it possible for the Prof both to keep his ‘papers’ where he needs them and to donate them at the same time, but the ‘ingest event’ at the centre of current digital preservation systems still seems to be underpinned by a core concept of ‘piles of stuff needing to be dealt with as a one-off task’. In future, the ‘ingest’ of a ‘donation’ will actually become a regular, repeated set of occurrences based upon ongoing relationships between donors and collectors, forged initially when Profs are but lowly postgrads. Personal Digital Archiving and Research Data Management will become key, and ripping digital ephemera from dying hard disks will become less necessary as they do.

The above depends heavily upon…

Object versioning / dependency management

Of course, if Dr. Damson regularly donates materials from her postgrad days onwards, some of these may be updates to things donated previously. Some of them might have mutated so much since the original donation that they can be considered ‘child’ objects, which may have ‘siblings’ with ‘common ancestors’ already extant in the archive. Hence preservation systems need to manage multiple versions of ‘digital objects’, and the relationships between them.

Some of the preservation systems we’ve looked at claim to ‘do versioning’ but it’s a bit clunky – just side-by-side copies of immutable ‘digital objects’, not records of the changes from one version to the next, and with no concept of branching siblings from a common parent. Complex structures of interdependent objects are generally problematic for current systems. The wider computing world has been pushing at the limits of the ‘paper-paradigm’ immutable object for a while now (think Git, Blockchain, various version control and dependency management platforms, etc). Digital preservation systems will soon catch up.
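
To make the idea a little more concrete, here’s a minimal sketch of a version graph – purely illustrative Python, with class and field names of my own invention rather than anything borrowed from an existing preservation system – in which each version records its parents, so that branches and common ancestors can be traversed:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectVersion:
    """One version of a preserved digital object (illustrative only)."""
    version_id: str
    payload_checksum: str                        # fixity value for this version's content
    parents: list = field(default_factory=list)  # an empty list means an original deposit

    def ancestors(self):
        """Walk back through every earlier version this one derives from."""
        seen, stack = set(), list(self.parents)
        while stack:
            version = stack.pop()
            if version.version_id not in seen:
                seen.add(version.version_id)
                yield version
                stack.extend(version.parents)

# Dr. Damson's postgrad donation, later revised, then branched into two 'sibling' objects
original  = ObjectVersion("v1",  "sha256:aaa...")
revised   = ObjectVersion("v2",  "sha256:bbb...", parents=[original])
sibling_a = ObjectVersion("v3a", "sha256:ccc...", parents=[revised])
sibling_b = ObjectVersion("v3b", "sha256:ddd...", parents=[revised])

# Both siblings share the original deposit (and the revision) as common ancestors
common = {v.version_id for v in sibling_a.ancestors()} & {v.version_id for v in sibling_b.ancestors()}
print(common)  # {'v1', 'v2'} (set ordering may vary)
```

A real system would need far more than this (deltas between versions, provenance, rights), but the point is that the relationships between versions become first-class, queryable data rather than an afterthought.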

Further blurring of the object / metadata boundary

What’s more important, the object or the metadata? The ‘paper-paradigm’ has skewed thinking towards the former (the sacrosanct ‘digital object’, comparable to the ‘original bit of paper’), but after you’ve digitised your rare book collection, what are Humanities scholars going to text-mine? It won’t be images of pages – it’ll be the transcripts of those (i.e. the ‘descriptive metadata’)*. Also, when seminal papers about these text mining efforts are published, how is this history of the engagement with your collection going to be recorded? Using a series of PREMIS Events (that future scholars can mine in turn), perhaps?
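
As a rough illustration of what such an event might capture – this is a simplified Python dictionary with made-up field names, not the actual PREMIS XML schema – recording a text-mining run against a digitised collection could look something like:

```python
# A simplified, PREMIS-like event record. Field names are illustrative only:
# real PREMIS events are expressed in XML, with structured identifiers,
# outcome information and formal links to agents and objects.
text_mining_event = {
    "event_type": "analysis",
    "event_date_time": "2025-03-01T14:22:05Z",  # hypothetical timestamp
    "event_detail": "Topic-modelling run over OCR transcripts of the digitised rare book collection",
    "linking_agents": ["example-researcher-id", "example-mining-tool v1.0"],  # hypothetical agents
    "linking_objects": ["transcript-0001", "transcript-0002"],                # the transcripts that were mined
}
```

A stream of records like this, preserved alongside the objects themselves, is exactly the kind of engagement history that future scholars could mine in turn.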

The above talk of text mining and contextual linking of secondary resources raises two more points…

* While I’m here, can I take issue with the term ‘descriptive metadata’? All metadata is descriptive. It’s tautological; like saying ‘uptight Englishman’. Can we think of a better name?

Ability to analyse metadata at scale

‘Delivery’ no longer just means ‘giving users a viewer to look at things one-by-one with’ – it now also means ‘letting people push their Natural Language or image processing algorithms to where the data sits, and then coping with vast streams of output data’.

Storage / retention informed by well-understood usage patterns

The fact that everything’s digital, and hence easier to disseminate and link together than physical objects, also means we can better understand how people use our material. This doesn’t just mean ‘wiring things up to Google Analytics’ – advances in bibliometrics that add social / mainstream media analysis, and so forth, to everyday citation counts present opportunities to judge the impact of our ‘stuff’ on the world like never before. Smart digital archives will inform their storage management and retention decisions with this sort of usage information, potentially in fully or semi-automated ways.
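
As a purely hypothetical sketch of what ‘semi-automated’ might look like here – the metrics, weights and thresholds below are invented for illustration, and any real policy would be set and regularly reviewed by curators – a repository could periodically suggest a storage tier for each object based on its usage signals:

```python
def suggest_storage_tier(views_last_year: int, citations: int, media_mentions: int) -> str:
    """Suggest a storage tier from simple usage signals.
    The weights and thresholds are invented for illustration only."""
    score = views_last_year + 10 * citations + 5 * media_mentions
    if score > 1000:
        return "fast online storage, multiple replicas"
    if score > 50:
        return "standard online storage"
    return "offline tape, flagged for review at the next appraisal"

# Example: a dataset that is rarely viewed but frequently cited stays online
print(suggest_storage_tier(views_last_year=12, citations=40, media_mentions=0))
# -> "standard online storage"
```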

Ability to get data out, cleanly – all systems are only ever temporary!

Finally – it’s clear that there are no ‘long-term’ preservation system options. The system you procure today will merely be ‘custodian’ of your materials for the next ten or twenty years (if you’re lucky). This may mean moving heaps of content around in future, but perhaps it’s more pragmatic to think of future preservation systems as more like ‘lenses’ that are laid on top of more stable data stores to enable as-yet-undreamt-of functionality for future audiences?

(OK – that’s enough for now…)

Visit to the National Archives: herons and brutalism

An update from Edith Halvarsson about the DPOC team’s trip to visit the National Archives last week. Prepare yourself for a discussion about digital preservation, PRONOM, dark archives, and wildlife!


Last Thursday DPOC visited the National Archives in London. David Clipsham kindly put much time into organising a day of presentations with the TNA’s developers, digitization experts and digital archivists. Thank you Diana, David & David, Ron, Ian & Ian, Anna and Alex for all your time and interesting thoughts!

After some confusion, we finally arrived at the picturesque Kew Gardens station. The area around Kew is very sleepy, and our first thought on arrival was “is this really the right place?” However, after a bit more circling around Kew, you definitely cannot miss it. The TNA is located in an imposing brutalist building, surrounded by beautiful nature and ponds built as flood protection for the nation’s collections. They even have a tame heron!

After we all made it on site, the day then kicked off with an introduction from Diana Newton (Head of Digital Preservation). Diana told us enthusiastically about the history of the TNA and its Digital Records Infrastructure (DRI). It was really interesting to hear how much has changed in just six years since DRI was launched – both in terms of file format proliferation and an increase in FOI requests.

We then had a look at TNA’s ingest workflows into Preservica and storage model with Ian Hoyle (Senior Developer) and David Underdown (Senior Digital Archivist). It was particularly interesting to hear about the TNA’s decision to store all master file content on offline tape, in order to bring down the archive’s carbon footprint.

After lunch with Ron Davies (Senior Project Manager), Anna de Sousa and Ian Henderson spoke to us about their work digitizing audiovisual material and 2D images. Much of our discussion focused on standards and formats (particularly around A/V). Alex Green and David Clipsham then finished off the day talking about born-digital archive accession streams and PRONOM/DROID developments. This was the first time we had seen the clever way a file format identifier is created – there is much detective work required on David’s side. David also encouraged us, and anyone else who relies on DROID, to have a go and submit something to PRONOM – he even promised it’s fun! Why not read Jenny Mitcham’s and Andrea Byrne’s articles for some inspiration?
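
To give a flavour of what a file format identifier involves – this is a toy illustration of byte-signature matching, not how DROID or PRONOM actually implement it – a signature essentially boils down to recognising characteristic byte sequences at known offsets:

```python
# Toy byte-signature matching, loosely in the spirit of PRONOM signatures.
# Real DROID signatures are far richer: regular-expression-like byte patterns,
# container signatures, priorities between overlapping formats, and so on.
SIGNATURES = {
    "PDF": {"offset": 0, "magic": b"%PDF-"},
    "PNG": {"offset": 0, "magic": b"\x89PNG\r\n\x1a\n"},
    "ZIP": {"offset": 0, "magic": b"PK\x03\x04"},
}

def identify(path: str) -> str:
    """Return a best-guess format name for a file, or 'unknown'."""
    with open(path, "rb") as f:
        header = f.read(64)
    for name, sig in SIGNATURES.items():
        start = sig["offset"]
        if header[start:start + len(sig["magic"])] == sig["magic"]:
            return name
    return "unknown"
```

The detective work David described lies in finding those characteristic sequences for formats nobody has documented yet.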

Thanks for a fantastic visit and some brilliant discussions on how digital preservation and digital collecting are done at the TNA!

Polonsky Fellows visit Western Bank Library at Sheffield University

Overview of DPOC’s visit to the Western Bank Library at Sheffield University by James Mooney, Technical Fellow at Bodleian Libraries, Oxford.

The Polonsky Fellows were invited to the Western Bank Library at Sheffield University to speak with Laura Peaurt and other members of the Library. The aim of the meeting was to discuss the experiences of using and implementing Ex Libris’ Rosetta product.

After arriving by train, we took a quick tram ride to the Western Bank campus at Sheffield University, then had the fun of using the paternoster lift in the Western Bank Library to reach our meeting – it’s great to see this technology has been preserved and is still in use.

Paternoster lifts still in use at the Western Library. Image Credit: James Mooney

We met with Laura Peaurt (Digital Preservation Manager), Chris Jones (Library Systems Manager) and Angus Taggart (Library Systems Manager – Research).

Andy Bussey, Head of Digital Services & Systems, was kind enough to give us an hour of his time at the start of the meeting, allowing us to discuss parts of the procurement and implementation process.

When working out the requirements for the system, Sheffield was able to collaborate with the White Rose University Consortium (the Universities of Leeds, Sheffield and York) to work out an initial scope.

When reviewing the options, both open source and proprietary products were considered. For the Western Library and the University back in 2014, after a skills audit, the open source options had to be ruled out due to a lack of technical and developmental skills to customise or support them. I’m sure that if this were revisited today the outcome might well be different, as the team has grown and gained experience and expertise. Many organisations may find it easier to budget for a software package and support contract with a vendor than to pursue the creation of several new employment positions.

With that said, as part of the implementation of Rosetta, Laura’s role was created to meet an obvious need for a Digital Preservation Manager. We then went on to discuss the timeframe of the project before moving on to the configuration of the product, with Laura providing a live demonstration whilst talking about the current setup, the scalability of the instances and the granularity of the sections within Rosetta.

During the demonstrations we discussed what content was held in Rosetta, how people had been trained with Rosetta and what feedback they had received so far. We reviewed the associated metadata which had been stored with the items that had been ingested and went over the options regarding integration with a Catalogue and/or Archival Management System.

After lunch we went on to discuss the workflows currently being used, with further demonstrations so we could see end-to-end examples, including what ingest rules and policies were in place, along with what tools were in use and what processes were carried out. We then looked at how problematic items were dealt with in the Technical Analysis Workbench, covering the common issues and how additional steps in the ingest process can minimise certain issues.

As part of reviewing the sections of Rosetta we also inspected Rosetta’s metadata model, the DNX (Digital Normalised XML), and discussed ingesting born-digital content and associated METS files.

Western Library. Image Credit: A J Buildings Library.

We visited Sheffield with many questions, and many of these were answered during the course of the discussions throughout the day, but as the day came to a close we had to wrap up the talks and head back to the train station. We all agreed it had been an invaluable meeting that sparked further areas of discussion. Having met face to face and gained an understanding of the environment at Sheffield will make future conversations that much easier.

On the core concepts of digital preservation

Cambridge’s Technical Fellow, Dave Gerrard, shares his learning on digital preservation from the PASIG 2016. As a newcomer to digital preservation, he is sharing his insights as he learns them.


As a relative newbie to Digital Preservation, I found attending PASIG 2016 an important step towards getting a picture of the state of the art in digital preservation. One of the most important things for a technician to do when entering a new domain is to get a high-level view of the overall landscape and build up an understanding of some of the overarching concepts, and last week’s PASIG conference provided a great opportunity to do this.

So this post is about some of those central overarching data preservation concepts, and how they might, or might not, map onto ‘real-world’ archives and archiving. I should also warn you that I’m going to be posing as many questions as answers here: it’s early days for our Polonsky project, after all, so we’re all still definitely in the ‘asking’ phase. (Feel free to answer, of course!) I’ll also be contrasting two particular presentations that were delivered at PASIG, which at first glance have little in common, but which I thought actually made the same point from completely different perspectives.

Perhaps the most obvious, key concept in digital preservation is ‘the archive’: a place where one deposits (or donates) things of value to be stored and preserved for the long term. This concept inevitably influences a lot of the theory and activity related to preserving digital resources, but is there really a direct mapping between how one would preserve ‘real’ objects, in a ‘bricks and mortar’ archive, and the digital domain? The answer appears to be ‘yes and no’: in certain areas (perhaps related to concepts such as acquiring resources and storing them, for example) it seems productive to think in broadly ‘real-world’ terms. Other ‘real-world’ concepts may be problematic when applied directly to digital preservation, however.

For example, my fellow Fellows will tell you that I take particular issue with the word ‘managing’: a term which in digital preservation seems to be used (at least by some people) to describe a particular small set of technical activities related to checking that digital files are still usable in the long-term. (‘Managing’ was used in this context in at least one PASIG presentation). One of the keys to working effectively with Information Systems is to get one’s terminology right, and in particular, to group together and talk about parts of a system that are on the same conceptual level. I.e. don’t muddle your levels of detail, particularly when modelling things. ‘Managing’ to me is a generic, high-level concept, which could mean anything from ‘making sure files are still usable’ to ‘ensuring public-facing staff answer the phone within five rings’ or even ‘making sure the staff kitchen is kept clean’. So I’m afraid that I think it’s an entirely inappropriate word to describe a very specific set of technical activities.

The trouble is, most of the other words we’ve considered for describing the process of ‘keeping files usable’ are similarly ‘higher-level’ concepts… One obvious one (preservation) once again applies to much more of the overall process, and so do many of its synonyms (‘stewardship’, ‘keeping custody of’, etc…) So these are all good terms at that high level of abstraction, but they’re for describing the big picture, not the details. Another term that is more specific, ‘fixity checking’, is maybe a bit too much like jargon…  (We’re still working on this: answers below please!) But the key point is: until one understands a concept well enough to be able to describe it in relatively simple terms, that make sense and fit together logically, building an information system and marshalling the related technology is always going to be tough.
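
For what it’s worth, the specific activity hiding behind that bit of jargon is simple enough to sketch – a minimal Python example, assuming a checksum was recorded for each file at the point of ingest:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fixity_check(path: str, checksum_recorded_at_ingest: str) -> bool:
    """True if the file still matches the checksum recorded when it was ingested."""
    return sha256_of(path) == checksum_recorded_at_ingest
```

It’s the naming of the wider concept, not the mechanics, that’s the hard part.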

Perhaps the PASIG topic that highlighted the biggest difference between ‘real world’ archiving and digital preservation, however, was the discussion regarding the increased rate at which preserved digital resources can be ‘touched’ by outside forces. Obviously, nobody stores things in a ‘real-world’ archive in the expectation that they will never be looked at again (do they?), but in the digital realm, there are potentially many more opportunities for resources to be linked directly to the knowledge and information that builds upon them.

This is where the two contrasting presentations came in. The first was Scholarly workflow integration: The key to increasing reproducibility and preservation efficacy, by Jeffrey Spies (@JeffSpies) from the Center for Open Science. Jeffrey clarified exactly how digital preservation in a research data management context can highlight, explicitly, how a given piece of research builds upon what went before, by enabling direct linking to the publications, and (increasingly) to the raw data, of peers working in the same field. Once digital research outputs and data are preserved, they are available to be linked to, reliably, in a manner that brings into play entirely new opportunities for archived research that never existed in the ‘real world’ of paper archives. Thus enabling the ‘discovery’ of preserved digital resources is not just about ensuring that resources are well-indexed and searchable; it’s about adding new layers of meaning and interpretation as future scholars use them in their own work. This in turn indicates how digital preservation is a function that is entirely integral to the (cyclical) research process – a situation which is well-illustrated in the 20th slide from Jeffrey’s presentation (if you download it – Figshare doesn’t seem to handle the animation in the slide too well – which sounds like a preservation issue in itself…).

By contrast, Symmetrical Archiving with Webrecorder, a talk by Dragan Espenschied (@despens), was at first glance completely unrelated to the topic of how preserved digital resources might have a greater chance of changing as time passes than their ‘real-world’ counterparts. Dragan was demonstrating the Webrecorder tool for capturing online works of art by recording visits to those works through a browser, and it was during the discussion afterwards that the question was asked: “how do you know that everything has been recorded ‘properly’ and nothing has been missed?”

For me, this question (and Dragan’s answer) struck at the very heart of the same issue. The answer was that each recording is a different object in itself, as the interpretation of the person recording the artwork is an integral part of the object. In fact, Dragan’s exact answer contained the phrase: “when an archivist adds an object to an archive, they create a new object”; the actual act of archiving changes an object’s meaning and significance (potentially subtly, though not always) to an extent that it is not the same object once it has been preserved. Furthermore, the object’s history and significance change once more with every visit to see it, and every time it is used as inspiration for a future piece of work.

Again – I’m a newbie, but I’m told by my fellow Fellows that this situation is well understood in archiving, and hence it may be more of a revelation to me than to most readers of this post. But what has changed is the way the digital realm gives us the opportunity not just to record how objects change as they’re used and referred to, but also a chance to make the connections to new knowledge gained from use of digital objects completely explicit and part of the object itself.

This highlights the final point I want to make about two of the overarching concepts of ‘real-world’ archiving and preservation which PASIG indicated might not map cleanly onto digital preservation. The first is the concept of ‘depositing’. According to Jeffrey Spies’s model, the ‘real world’ research workflow of ‘plan the research, collect and analyse the data, publish findings, gain recognition / significance in the research domain, and then finally deposit evidence of this ground-breaking research in an archive’ simply no longer applies. In the new model, the initial ‘deposit’ is made at the point a key piece of data is first captured, or a key piece of analysis is created. Works in progress, early drafts, important communications, grey literature, as well as the final published output, are all candidates for preservation at the point they are first created by the researchers. Digital preservation happens seamlessly in the background. The states of the ‘preserved’ objects change throughout.

The second is the concept of ‘managing’ (urgh!), or otherwise ‘maintaining the status quo’ of an object into the long-term future. In the digital realm, there doesn’t need to be a ‘status quo’ – in fact there just isn’t one. We can record when people search for objects, when they find them, when they cite them. We can record when preserved data is validated by attempts to reproduce experiments or re-used entirely in different contexts. We can note when people have been inspired to create new artworks based upon our previous efforts, or have interpreted the work we have preserved from entirely new perspectives. This is genuine preservation: preservation that will help fit the knowledge we preserve today into the future picture. This opportunity would be much harder to realise when storing things in a ‘real-world’ archive, and we need to be careful to avoid thinking too much ‘in real terms’ if we are to make the most of it.

What do you think? Is it fruitful to try and map digital preservation onto real world concepts? Or does doing so put us at risk of missing core opportunities? Would moving too far away from ‘real-world’ archiving put us at risk of losing many important skills and ideas? Or does thinking about ‘the digital data archive’ in terms that are too like ‘the real world’ limit us from making important connections to our data in future?

Where does the best balance between ‘real-world’ concepts and digital preservation lie?