Breaking through with Library Carpentry

Thursday 11th January saw the Cambridge University Library’s annual conference take place. This year, it was entitled ‘Breakthrough the Library’, and focused on cutting-edge innovation in libraries and archives. I can honestly say that this was the first ever conference I’ve been to where every single speaker I saw (including the ten or so who gave lightning talks) were absolutely excellent.

So it’s hard to pick the one that made the most impression. Of course, an honourable mention must go to the talk about Jasper the three legged cat, but if I had to plump for the one that was most pertinent to moving Digital Preservation forward, I’d have picked “Library Carpentry: software and data skills for librarian professionals”, from Dr James Baker of the University of Sussex.

I’d heard of the term ‘Library Carpentry’ (and the initiatives it stems from – Software Carpentry and Data Carpentry) and thus had an idea what the talk was about on the way in. Their web presence explains things far better than I can, too (see, so I’m going to skip the exposition and make a different point…

As a full-blown, time-served nerd who’s clearly been embittered by 20 years in the IT profession (though I’m pleased to report, not as much as most of my long-term friends and colleagues!), I went into the talk with a bit of a pessimistic outlook. This was because, in my experience, there are three stages one passes through when learning IT skills:

  • Stage 1: I know nothing. This computer is a bit weird and confuses me.
  • Stage 2: I know EVERYTHING. I can make this computer sing and dance, and now I have the power to conquer the world.
  • Stage 3: … er – hang on… The computer might not have been doing exactly what I thought it was, after all… Ooops! What did I just do?

Stage 1 is just something you get through (if you want – I have nothing but respect for happy Stage 1 dwellers, though). If so inclined, all it really takes is a bit of persistence and a dollop of enthusiasm to get through it. If you want to but think you might struggle, then have a go at this computer programming aptitude test from the University of Kent – you may be pleasantly surprised… In my own case, I got stuck there for quite a while until one day a whole pile of O Level algebra that was lurking in my brain suddenly rose out of the murk, and that was that.

Stage 2 people, on the other hand, tend to be really dangerous… I have personally worked with quite a few well-paid developers who are stuck in Stage 2, and they tend to be the ones who drop all the bombs on your system. So the faster you can get through to Stage 3, the better. This was at the root of my concern, as one of the ideas of Library Carpentry is to pick up skills quick, and then pass them on. But I needn’t have worried because…

When I asked Dr Baker about this issue, he reassured me that ‘questioning whether the computer has done what you expected’ is a core learning point that is central to Library Carpentry, too. He also declared the following (which I’m going to steal): “I make a point of only ever working with people with Impostor Syndrome”.

Hence it really does look as if getting to Stage 3 without even going through Stage 2 at all is what Library Carpentry is all about. I believe moves are afoot to get some of this good stuff going at Cambridge… I watch with interest and might even be able to find the time to join in..? I bet it’ll be fun.

The vision for a preservation repository

Over the last couple of months, work at Cambridge University Library has begun to look at what a potential digital preservation system will look like, considering technical infrastructure, the key stakeholders and the policies underpinning them. Technical Fellow, Dave, tells us more about the holistic vision…

This post discusses some of the work we’ve been doing to lay foundations beneath the requirements for a ‘preservation system’ here at Cambridge. In particular, we’re looking at the core vision for the system. It comes with the standard ‘work in progress’ caveats – do not be surprised if the actual vision varies slightly (or more) from what’s discussed here. A lot of the below comes from Mastering the Requirements Process by Suzanne and James Robertson.

Also – it’s important to note that what follows is based upon a holistic definition of ‘system’ – a definition that’s more about what people know and do, and less about Information Technology, bits of tin and wiring.

Why does a system change need a vision?

New systems represent changes to the existing status-quo. The vision is like the Pole Star for such a change effort – it ensures that people have something fixed to move towards when they’re buried under minute details. When confusion reigns, you can point to the vision for the system to guide you back to sanity.

Plus, as with all digital efforts, none of this is real: there’s no definite, obvious end point to the change. So the vision will help us recognise when we’ve achieved what we set out to.

Establishing scope and context

Defining what the system change isn’t is a particularly good a way of working out what it actually represents. This can be achieved by thinking about the systems around the area you’re changing and the information that’s going to flow in and out. This sort of thinking makes for good diagrams: one that shows how a preservation repository system might sit within the broader ecosystem of digitisation, research outputs / data, digital archives and digital published material is shown below.

System goals

Being able to concisely sum-up the key goals of the system is another important part of the vision. This is a lot harder than it sounds and there’s something journalistic about it – what you leave out is definitely more important than what you keep in. Fortunately, the vision is about broad brush strokes, not detail, which helps at this stage.

I found some great inspiration in Sustainable Economics for a Digital Planet, which indicated goals such as: “the system should make the value of preserving digital resources clear”, “the system should clearly support stakeholders’ incentives to preserve digital resources” and “the functional aspects of the system should map onto clearly-defined preservation roles and responsibilities”.

Who are we implementing this for?

The final main part of the ‘vision’ puzzle is the stakeholders: who is going to benefit from a preservation system? Who might not benefit directly, but really cares that one exists?

Any significant project is likely to have a LOT of these, so the Robertsons suggest breaking the list down by proximity to the system (using Ian Alexander’s Onion Model), from the core team that uses the system, through the ‘operational work area’ (i.e. those with the need to actually use it) and out to interested parties within the host organisation, and then those in the wider world beyond. An initial attempt at thinking about our stakeholders this way is shown below.

One important thing that we realised was that it’s easy to confuse ‘closeness’ with ‘importance’: there are some very important stakeholders in the ‘wider world’ (e.g. Research Councils or historians) that need to be kept in the loop.

A proposed vision for our preservation repository

After iterating through all the above a couple of times, the current working vision (subject to change!) for a digital preservation repository at Cambridge University Library is as follows:

The repository is the place where the best possible copies of digital resources are stored, kept safe, and have their usefulness maintained. Any future initiatives that need the most perfect copy of those resources will be able to retrieve them from the repository, if authorised to do so. At any given time, it will be clear how the digital resources stored in the repository are being used, how the repository meets the preservation requirements of stakeholders, and who is responsible for the various aspects of maintaining the digital resources stored there.

Hopefully this will give us a clear concept to refer back to as we delve into more detail throughout the months and years to come…

Putting ‘stuff’ in ‘context’: deep thoughts triggered by PASIG 2017

Cambridge Technical Fellow, Dave, delves a bit deeper into what PASIG 2017 talks really got him thinking further about digital preservation and the complexity of it.

After a year of studying digital preservation, my thoughts are starting to coalesce, and the presentations at PASIG 2017 certainly helped that. (I’ve already discussed what I thought were the most important talks, so the ones below some that stimulated me about preservation in particular)…

The one that matched my current thoughts on digital preservation generally was John Sheridan’s Creating and sustaining a disruptive digital archive. It was similar to another previous blog post, and to chats with fellow Fellow Lee too (some of which he’s captured in a blog post for the Digital Preservation Coalition)… I.e.: computing’s ‘paper paradigm’ makes little sense in relation to preservation, hierarchical / neat information structures don’t hold together as well digitally, we’re going to need to compute across the whole archive, and, well, ‘digital objects’ just aren’t really material ‘objects’, are they?

An issue with thinking about digital ‘stuff’ too much in terms of tangible objects is that opportunities related to the fact the ‘stuff’ is digital can be missed. Matt Zumwalt highlighted one such opportunity in Data together: Communities & institutions using decentralized technologies to make a better web when he introduced ‘content addressing’: using cryptographic hashing and Directed Acyclic Graphs (in this case, information networks that record content changing as time progresses) to manage many copies of ‘stuff’ robustly.

This addresses some of the complexities of preserving digital ‘stuff’, but perhaps thinking in terms of ‘copies’, and not ‘branches’ or ‘forks’ is an over simplification? Precisely because digital ‘stuff’ is rarely static, all ‘copies’ have the potential to deviate from the ‘parent’ or ‘master’ copy. What’s the ‘version of true record’ in all this? Perhaps there isn’t one? Matt referred to ‘immutable data structures’, but the concept of ‘immutability’ only really holds if we think it’s possible for data to ever be completely separated from its informational context, because the information does change, constantly. (Hold that thought).

Switching topics, fellow Polonsky Somaya often tries to warn me just how complicated working with technical metadata can get. Well, the pennies dropped further during Managing digital preservation metadata at Sound and Vision: A case on matching OAIS and PREMIS with the DPX file format from Annemieke De Jong and Josefien Schuurman. Space precludes going into the same level of detail they did regarding building a Preservation Metadata Dictionary (PMD) about just one, ‘relatively’ simple file format – but let’s say, well, it’s really complicated. (They’ve blogged about it and the whole PMD is online too). The conclusion: preserving files properly means drilling down deep into their formats, but it also got me thinking – shouldn’t the essence of a ‘preservation file format’ be its simplicity?

The need for greater simplicity in preservation was further emphasised by Mathieu Giannecchini’s The Eclair Archive cinema heritage use case: Rising to the challenges of complex formats at large scale. Again – space precludes me from getting into detail, but the key takeaway was that Mathieu has 2 million reels of film to preserve using the Digital Cinema Distribution Master (DCDM) format, and after lots of good work, he’s optimised the process to preserve 8tb a day, (with a target of 15tb). Now, we don’t know how much film is on each reel, but assuming a (likely over-) estimate of 10 minutes per reel, that’s roughly 180,000 films of 1 hour 50 mins in length. Based on Mathieu’s own figures, it’s going to take many decades, perhaps even a few hundred years, to get through all 2 million reels… So further, major optimisations are required, and I suspect DCDM (a format with a 155-page spec, which relies on TIFF, a format with a 122-page spec) might be one of the bottlenecks.

Of course, the trade-off with simplifying formats is that data will likely be ‘decontextualised’, so there must be a robust method for linking data back to context… Thoughts on this were triggered by Developing and applying principles for discovery and access for the UK Data Service by Katherine McNeill from the UK Data Archive, as Katherine discussed production of a next-generation access system based on a linked-data model with which, theoretically, single cells’ worth of data could be retrieved from research datasets.

Again – space precludes entering into the whole debate around the process of re-using data stripped of original context… Mauthner and Parry illustrate the two contrary sides well, and furthermore argue that merely entertaining the possibility of decontextualising data indicates a certain ‘foundational’ way of thinking that might be invalid from the start? This is where I link to William Kilbride’s excellent DPC blog post from a few months ago

William’s PASIG talk Sustainable digital futures was also one of two that got closer to what we know are the root of the preservation problem; economics. The other was Aging of digital: Managed services for digital continuity by Natasa Milic-Frayling, which flagged-up the current “imbalance in control and empowerment” between tech providers and content producers / owners / curators, an imbalance that means tech firms can effectively doom our digital ‘stuff’ into obsolescence, and we have to suck it up.

I think this imbalance in part exists because there’s too much technical context related to data, because it’s generally in the tech providers’ interests to bloat data formats to match the USPs of their software. So, is a pure ‘preservation format’ one in which the technical context of the data is generalised to the point where all that’s left is commonly-understood mathematics? Is that even possible? Do we really need 122-page specs to explain how raster image data is stored? (It’s just an N-dimensional array of pixel values…, isn’t it…?) I think perhaps we don’t need all the complexity – at the data storage level at least. Though I’m only guessing at this stage: much more research required.

Computers are the apogee of profligacy: a response to THE most important PASIG 2017 presentations

Following the PASIG conference, Cambridge Technical Fellow Dave Gerrard couldn’t simply wait to fire off his thoughts on the global context of digital preservation and how we need to better consider the world around us to work on a global solution and not just one that suits capitalist agenda. We usually preface these blogs with “enjoy” but in this instance, please, find a quiet moment, make yourself comfortable, read on and contemplate the global issues presented passionately presented here.

I’m going to work on a more technical blog about PASIG later, but first I want to get this one off my chest. It’s about the two most important presentations: Angeline Takawira’s Digital preservation at the United Nations Mechanism for International Criminal Tribunals and Keep your eyes on the information, Patricia Sleeman’s discussion of preservation work at the UN Refugee Agency (UNHCR).

Angeline Takawira described, in a very precise and formal manner, how the current best practice in Digital Preservation is being meticulously applied to preserving information from UN war crimes tribunals in The Hague (covering the Balkan conflict) and Arusha, Tanzania (covering the Rwandan genocide). As befitted her work, it was striking how calm Angeline was; how well the facts were stuck to, despite the emotive context. Of course, this has to be the case for work underpinning legal processes: intrusion of emotion into the capture of facts could let those trying to avoid justice escape it.

And the importance of maintaining a dispassionate outlook was echoed in the title of the other talk. “Keep your eyes on the information” was what Patricia Sleeman was told when learning to work with the UNHCR, as to engage too emotionally with the refugee crisis could make vital work impossible to perform. However, Patricia provided some context, in part by playing Head Over Heels, (Emi Mahmoud’s poem about the conflict and refugee crisis in Darfur), and by describing the brave, inspirational people she had met in Syria and Kurdistan. An emotionless response was impossible: the talk resulted in the conference’s longest and loudest applause.

Indeed, I think the audience was so stunned by Patricia’s words that questions were hard to formulate. However, my colleague Somaya at least asked the $64,000 one: how can we help? I’d like to tie this question back to one that Patricia raised in her talk, namely (and I paraphrase here): how do you justify expenditure on tasks like preservation when doing so takes food from the mouths of refugees?

So, now I’m less stunned, here’s my take: feeding refugees solves a symptom of the problem. Telling their stories helps to solve the problem, by making us engage our emotions, and think about how our lives are related to theirs, and about how we behave impacts upon them. And how can we help? Sure, we can help Patricia with her data management and preservation problems. But how can we really contribute to a solution? How can we stop refugee crises occurring in the first place?

We have a responsibility to recognise the connections between our own behaviour and the circumstances refugees find themselves in, and it all comes down, of course, to resources, and the profligate waste of them in the developed world. Indeed, Angeline and Patricia’s talks illustrated the borderline absurdity of a bunch of (mostly) privileged ‘Westerners’ / ‘Northerners’ (take your pick) talking about the ‘preservation’ of anything, when we’re products of a society that’s based upon throwing everything away.

And computers / all things ‘digital’ are at the apogee of this profligacy: Natasa Milic-Frayling highlighted this when she (diplomatically) referred to the way in which the ‘innovators’ hold all the cards, currently, in the relationship with ‘content producers’, and can hence render the technologies upon which we depend obsolete across ever-shorter cycles. Though, after Patricia’s talk, I’m inclined to frame this more in terms of ‘capitalist industrialists generating unnecessary markets at the expense of consumers’; particularly given that, while we were listening to Patricia, the latest iPhone was being launched in the US.

Though, of course, it’s not really the ‘poor consumers’ who genuinely suffer due to planned obsolescence… That would be the people in Africa and the Middle East whose countries are war zones due to grabs for oil or droughts caused by global warming. As the world’s most advanced tech companies, Apple, Google, Facebook, Amazon, Microsoft et al are the biggest players in a society that – at best indirectly, at worst carelessly – causes the suffering of the people Patricia and Angeline are helping and providing justice for. And, as someone typing a blog post using a Macbook Pro that doesn’t even let me add a new battery – I’m clearly part of the problem, not the solution.

So – in answer to Somaya’s question: how can we help? Well, for a start, we can stop fetishising the iPhone and start bigging up Fairphone and Phonebloks. However, keeping the focus on Digital Preservation, we’ve got to be really careful that our efforts aren’t used to support an IT industry that’s currently profligate way beyond moral acceptability. So rather than assuming (as I did above) that all the ‘best-practice’ of digital preservation flows from the ‘developed’ (ahem) world to the ‘developing’, we ought to seek some lessons in how to preserve technology from those who have fewer opportunities to waste it.

Somaya’s already on the case with her upcoming panel at iPres on the 28th September: Then we ought to continue down the road of holding PASIG in Mexico City next year by holding one in Africa as soon as possible. As long as – when we’re there, we make sure we shut up and listen.

Digital Preservation futurology

I fancy attempting futurology, so here’s a list of things I believe could happen to ‘digital preservation systems’ over the next decade. I’ve mostly pinched these ideas from folks like Dave Thompson, Neil Jefferies, and my fellow Fellows. But if you see one of your ideas, please claim it using the handy commenting mechanism. And because it’s futurology, it doesn’t have to be accurate, so kindly contradict me!

Ingest becomes a relationship, not a one-off event

Many of the core concepts underpinning how computers are perceived to work are crude, paper-based metaphors – e.g. ‘files’, ‘folders’, ‘desktops’, ‘wastebaskets’ etc – that don’t relate to what your computer’s actually doing. (The early players in office computing were typewriter and photocopier manufacturers, after all…) These metaphors have succeeded at getting everyone to use computers, but they’ve also suppressed various opportunities to work smarter, too.

The concept of ingesting (oxymoronic) ‘digital papers’ is obviously heavily influenced by this paper paradigm.  Maybe the ‘paper paradigm’ has misled the archival community about computers a bit, too, given that they were experts at handling ‘papers’ before computers arrived?

As an example of what I mean: in the olden days (25 whole years ago!), Professor Plum would amass piles of important papers until the day he retired / died, and then, and only then, could these personal papers be donated and archived. Computers, of course, make it possible for the Prof both to keep his ‘papers’ where he needs them, and donate them at the same time, but the ‘ingest event’ at the centre of current digital preservation systems still seems to be underpinned by a core concept of ‘piles of stuff needing to be dealt with as a one-off task’. In future, the ‘ingest’ of a ‘donation’ will actually become a regular, repeated set of occurrences based upon ongoing relationships between donors and collectors, and forged initially when Profs are but lowly postgrads. Personal Digital Archiving and Research Data Management will become key; and ripping digital ephemera from dying hard disks will become less necessary as they become so.

The above depends heavily upon…

Object versioning / dependency management

Of course, if Dr. Damson regularly donates materials from her postgrad days onwards, some of these may be updates to things donated previously. Some of them might have mutated so much since the original donation that they can be considered ‘child’ objects, which may have ‘siblings’ with ‘common ancestors’ already extant in the archive. Hence preservation systems need to manage multiple versions of ‘digital objects’, and the relationships between them.

Some of the preservation systems we’ve looked at claim to ‘do versioning’ but it’s a bit clunky – just side-by-side copies of immutable ‘digital objects’, not records of the changes from one version to the next, and with no concept of branching siblings from a common parent. Complex structures of interdependent objects are generally problematic for current systems. The wider computing world has been pushing at the limits of the ‘paper-paradigm’ immutable object for a while now (think Git, Blockchain, various version control and dependency management platforms, etc). Digital preservation systems will soon catch up.

Further blurring of the object / metadata boundary

What’s more important, the object or the metadata? The ‘paper-paradigm’ has skewed thinking towards the former (the sacrosanct ‘digital object’, comparable to the ‘original bit of paper’), but after you’ve digitised your rare book collection, what are Humanities scholars going to text-mine? It won’t be images of pages – it’ll be the transcripts of those (i.e. the ‘descriptive metadata’)*. Also, when seminal papers about these text mining efforts are published, how is this history of the engagement with your collection going to be recorded? Using a series of PREMIS Events (that future scholars can mine in turn), perhaps?

The above talk of text mining and contextual linking of secondary resources raises two more points…

* While I’m here, can I take issue with the term ‘descriptive metadata’? All metadata is descriptive. It’s tautological; like saying ‘uptight Englishman’. Can we think of a better name?

Ability to analyse metadata at scale

‘Delivery’ no longer just means ‘giving users a viewer to look at things one-by-one with’ – it now also means ‘letting people push their Natural Language or image processing algorithms to where the data sits, and then coping with vast streams of output data’.

Storage / retention informed by well-understood usage patterns

The fact that everything’s digital, and hence easier to disseminate and link together than physical objects, also means better understanding how people use our material. This doesn’t just mean ‘wiring things up to Google Analytics’ – advances in bibliometrics that add social / mainstream media analysis, and so forth, to everyday citation counts present opportunities to judge the impact of our ‘stuff’ on the world like never before. Smart digital archives will inform their storage management and retention decisions with this sort of usage information, potentially in fully or semi-automated ways.

Ability to get data out, cleanly – all systems are only ever temporary!

Finally – it’s clear that there are no ‘long-term’ preservation system options. The system you procure today will merely be ‘custodian’ of your materials for the next ten or twenty years (if you’re lucky). This may mean moving heaps of content around in future, but perhaps it’s more pragmatic to think of future preservation systems as more like ‘lenses’ that are laid on top of more stable data stores to enable as-yet-undreamt-of functionality for future audiences?

(OK – that’s enough for now…)

Digital preservation is a mature concept, but we need to pitch it better

Cambridge Technical Fellow, Dave, presents his thoughts on the OAIS and his own elevator pitch about digital preservation from the Pericles/DPC Acting on Change conference in London, last week.

Some of the best discussions at the Pericles / DPC Acting on Change conference came during the morning panel sessions. In the first, provocatively titled “Beyond the OAIS”, Barbara Sierman, from The KB National Library of the Netherlands, admitted that the OAIS can be confusing for newcomers… and as a newcomer to digital preservation, I agree!

Fellow panellist Barbara Reed, from Recordkeeping Innovation, suggested the OAIS’s Administration function as a potentially-confusing area, and this too struck a chord. I’ve gained some systems analysis and modelling experience over the years, and my first thought looking at the OAIS was that the Admin function looked like a place where much of the hard-to-model, human stuff had been separated from the technical, tool-based parts. (I’ve seen this happen before in other domains…)

There’s actually a hint that this is happening in the standard’s diagram for the Admin function – it’s busier and more information-packed than the other function diagrams, which tends to be a sign that it’s a bit of a ‘bucket’ which needs more modelling. This led me to an immediate concern that Admin doesn’t sit easily within the overall standard, and I think Barbara Reed had picked up on this too, suggesting that two more focused documents – one ‘technical’, one ‘human’ – might make the standard easier to use.

Then Artefactual Systems’ Dan Gillean asked who we should be talking to about the OAIS outside of the community? Barbara Reed answered ‘Enterprise Architects’; and two of the things Enterprise Architects use in their work are domain models and pattern languages. I was glad Barbara made this point, because I had already come to a similar conclusion.

AV Preserve’s Kara Van Malssen replied ‘communications experts’ to Dan’s question, suggesting Marketing in particular, though perhaps skilled science communicators might be even better? (Both Cambridge and Oxford – among others – put a lot of effort into public engagement with research, and there is a healthy body of research literature about it).

And the importance of communication was further emphasised by Nancy McGovern (MIT Libraries) and Neil Beagrie (Charles Beagrie Ltd) during the second day’s panel session (Preparing for Change). Nancy used the phrase ‘Technical Author’ at one stage – and it occurred that such input might be a very quick win for the OAIS Reference Implementation? Meanwhile, Neil talked about needing a short, pithy statement that explains what we do to funders…

So here’s an attempt at an Elevator Pitch:

Digital Preservation means sourcing computer-based material that is worthy of preservation, getting that material under control, and then maintaining the usefulness of that material, forever.

This Elevator Pitch is part of the pattern language I’m working on with my fellow Polonsky Fellows, and (I hope, soon) the broader Digital Preservation community. (We’re still thinking about that last ‘forever’, but considering how old some of the things in our libraries are, ‘forever’ seems an easy way of thinking about it).

The key point that Nancy McGovern made, however, was that we’re ready to take Digital Preservation to a wider audience. I think she’s right. The OAIS is confusing – it’s a real head-scrambler for a newcomer like me – but it has reached a level of maturity: it’s clear how much deep thought and expertise underpins it. And, of course, the same goes for the technology it has influenced over the previous decades. This supports what Arkivum’s Matthew Addis said in the second day’s keynote – the digital preservation community is ready to take their ideas to the world: we perhaps just need to pitch them a little better?

A digital preservation pattern language

Technical Fellow, Dave, shares his final update from PASIG NYC in October. It includes his opinions on digital preservation terminology and his development of an interpretation model for mapping processes.

Another of the sessions at the PASIG NYC conference we attended concerned standardisation. It started with Avoiding the 927 Problem: Standards, Digital Preservation, and Communities of Practice by Artefactual Systems’ Dan Gillean, which explained the relationships between De Jure / De Facto, and Open / Proprietary standards, and which introduced the major Digital Preservation standards. Then later in the session, Sibyl Schaefer (@archivelle) from the UCSD Chronopolis Network presented Here we go again down this road: Certification and Recertification, which covered the ISO standardisation terminology (e.g. Certification vs Accreditation) and went deeper into the formal (De Jure) standards, in particular the Open Archival Information System (OAIS) reference model (ISO 14721) and the Audit and Certification of Trustworthy Digital Repositories (ISO 16363).

One aspect of Dan Gillean’s presentation that resonated with me was his discussion of the Communities of Practice that had emerged around the Digital Preservation standards. This reminded me of a software development concept called design patterns, which has its roots in (real) architecture, and in particular a book called A Pattern Language: towns, buildings, construction, by Christopher Alexander (et al). This proposes that planners and architects develop a ‘language’ of architecture so that they can learn from each other and contribute their ideas to a more harmonious, better-planned whole of well-designed cities, towns and countryside. The key concept they propose is that of the ‘pattern’:

The elements of this [architectural] language are entities called patterns. Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice (Alexander et al, 1977:x).

Each pattern has a common structure, including details of the problem it solves, the forces at work, the start and end states of related resources, and relationships to other patterns. (James Coplein has provided a short overview of a typical pattern structure). The idea is to build up a playbook of (de facto) standard approaches to common problems, and the types of behaviour that might solve them, as a way of sharing and reusing knowledge.

I asked around at PASIG to see if anyone had created a reusable set of Digital Preservation Patterns (somebody please tell me if so, it’ll save me heaps of work!), but I drew a blank. So I grabbed the Alexander book (I work in a building containing 18 million books!), and also had a quick look online. The best online resource I found was – which contained lots of familiar names related to programming design patterns (e.g. Erich Gamma, Grady Booch, Martin Fowler, Ward Cunningham). But the original Alexander book also gave me an insight into patterns that I’d never heard of before, in particular the very straightforward way that its patterns related to each other from the general / high level (e.g. patterns about regional, city and town planning), via mid-level patterns (for neighbourhoods, streets and building design), to the extremely detailed (e.g. patterns for where to put beds, baths and kitchen equipment).

This helped me consider what I think are two issues with Digital Preservation: firstly, there’s a lot of jargon (e.g. ‘fixity’, ‘technical metadata’ or ‘file format migration’ – none of which are terms fit for normal conversation). Secondly, many of the Digital Preservation models mismatch concepts at different levels of abstraction and complexity: for example the OAIS places a discrete process labelled Data Management alongside another labelled Ingest, where Ingest is quite a specific, discrete step in the overall picture, but where there’s also a strong case for saying that the whole of Digital Preservation is ‘data management’, including Ingest itself.

Such issues of defining and labelling concepts are common in most computer-technology-related domains, of course, and they’re often harmful (contributing to the common story of failed IT projects and angry developers / customers etc). But the way in which A Pattern Language arranges its patterns at the same levels of abstraction and detail, and in doing so enables drilling-down through region / city / town / neighbourhood / street / building / room, provides an elegant example of how to avoid this trap.

Hence I’ve been working on a model of the Digital Preservation domain that has ‘elevator pitch’ and ‘plain English’ levels of detail before I get to the nitty-gritty of technical details. My intention is to group similarly-sized and equally-complex sets of Digital Preservation processes together in ways that help describe them in clear, jargon-free ways, hence forming a reusable set of patterns that help people work out how to implement Digital Preservation in their own organisational contexts. I will have an opportunity to share this model, and the patterns I derive from it, as it develops. Watch this space.

Alexander, C., Ishikawa, S., Silverstein, M., Jacobson, M., Fiksdahl-King, I. and Angel, S. (1977) A Pattern Language: towns, buildings, construction. 1st edn. New York: Oxford University Press.

Do you know of any work that’s been done to create a Digital Preservation Pattern Language? Would you like to contribute your ideas towards Dave’s idea of creating a playbook of Digital Preservation design patterns? Please let Dave know using the form below…

On the core concepts of digital preservation

Cambridge’s Technical Fellow, Dave Gerrard, shares his learning on digital preservation from the PASIG 2016. As a newcomer to digital preservation, he is sharing his insights as he learns them.

As a relative newbie to Digital Preservation, attending PASIG 2016 was an important step towards getting a picture of the state of the art in digital preservation. One of the most important things for a technician to do when entering a new domain is to get a high-level view of the overall landscape, and build up an understanding of some of the overarching concepts, and last week’s PASIG conference provided a great opportunity to do this.

So this post is about some of those central overarching data preservation concepts, and how they might, or might not, map onto ‘real-world’ archives and archiving. I should also warn you that I’m going to be posing as many questions as answers here: it’s early days for our Polonsky project, after all, so we’re all still definitely in the ‘asking’ phase. (Feel free to answer, of course!) I’ll also be contrasting two particular presentations that were delivered at PASIG, which at first glance have little in common, but which I thought actually made the same point from completely different perspectives.

Perhaps the most obvious, key concept in digital preservation is ‘the archive’: a place where one deposits (or donates) things of value to be stored and preserved for the long term. This concept inevitably influences a lot of the theory and activity related to preserving digital resources, but is there really a direct mapping between how one would preserve ‘real’ objects, in a ‘bricks and mortar’ archive, and the digital domain? The answer appears to be ‘yes and no’: in certain areas (perhaps related to concepts such as acquiring resources and storing them, for example) it seems productive to think in broadly ‘real-world’ terms. Other ‘real-world’ concepts may be problematic when applied directly to digital preservation, however.

For example, my fellow Fellows will tell you that I take particular issue with the word ‘managing’: a term which in digital preservation seems to be used (at least by some people) to describe a particular small set of technical activities related to checking that digital files are still usable in the long-term. (‘Managing’ was used in this context in at least one PASIG presentation). One of the keys to working effectively with Information Systems is to get one’s terminology right, and in particular, to group together and talk about parts of a system that are on the same conceptual level. I.e. don’t muddle your levels of detail, particularly when modelling things. ‘Managing’ to me is a generic, high-level concept, which could mean anything from ‘making sure files are still usable’ to ‘ensuring public-facing staff answer the phone within five rings’ or even ‘making sure the staff kitchen is kept clean’. So I’m afraid that I think it’s an entirely inappropriate word to describe a very specific set of technical activities.

The trouble is, most of the other words we’ve considered for describing the process of ‘keeping files usable’ are similarly ‘higher-level’ concepts… One obvious one (preservation) once again applies to much more of the overall process, and so do many of its synonyms (‘stewardship’, ‘keeping custody of’, etc…) So these are all good terms at that high level of abstraction, but they’re for describing the big picture, not the details. Another term that is more specific, ‘fixity checking’, is maybe a bit too much like jargon…  (We’re still working on this: answers below please!) But the key point is: until one understands a concept well enough to be able to describe it in relatively simple terms, that make sense and fit together logically, building an information system and marshalling the related technology is always going to be tough.

Perhaps the PASIG topic that highlighted the biggest difference between ‘real world’ archiving and digital preservation the most, however, was discussion regarding the increased rate at which preserved digital resources can be ‘touched’ by outside forces. Obviously, nobody stores things in a ‘real-world’ archive in the expectation that they will never be looked at again (do they?), but in the digital realm, there are potentially many more opportunities for resources to be linked directly to the knowledge and information that builds upon them.

This is where the two contrasting presentations came in. The first was Scholarly workflow integration: The key to increasing reproducibility and preservation efficacy, by Jeffrey Spies (@JeffSpies) from the Center for Open Science. Jeffrey clarified exactly how digital preservation in a research data management context can highlight, explicitly, how a given piece of research builds upon what went before, by enabling direct linking to the publications, and (increasingly) to the raw data of peers working in the same field. Once digital research outputs and data are preserved, they are available to be linked to, reliably, in a manner that brings into play entirely new opportunities for archived research that never existed in the ‘real world’ of paper archives. Thus enabling the ‘discovery’ of preserved digital resources is not just about ensuring that resources are well-indexed and searchable, it’s about adding new layers of meaning and interpretation as future scholars use them in their own work. This in turn indicates how digital preservation is a function that is entirely integral to the (cyclical) research process – a situation which is well-illustrated in the 20th slide from Jeffrey’s presentation (if you download it – Figshare doesn’t seem to handle the animation in the slide too well – which sounds like a preservation issue in itself…).

By contrast, Symmetrical Archiving with Webrecorder, a talk by Dragan Espenschied (@despens), was at first glance completely unrelated to the topic of how preserved digital resources might have a greater chance of changing as time passes than their ‘real-world’ counterparts. Dragan was demonstrating the Webrecorder tool for capturing online works of art by recording visits to those works through a browser, and it was during the discussion afterwards that the question was asked: “how do you know that everything has been recorded ‘properly’ and nothing has been missed?”

For me, this question (and Dragan’s answer) struck at the very heart of the same issue. The answer was that each recording is a different object in itself, as the interpretation of the person recording the artwork is an integral part of the object. In fact, Dragan’s exact answer contained the phrase: “when an archivist adds an object to an archive, they create a new object”; the actual act of archiving changes an object’s meaning and significance (potentially subtly, though not always) to an extent that it is not the same object once it has been preserved. Furthermore, the object’s history and significance change once more with every visit to see it, and every time it is used as inspiration for a future piece of work.

Again – I’m a newbie, but I’m told by my fellow Fellows this situation is well understood in archiving and hence may be more of a revelation to me than most readers of this post. But what has changed is the way the digital realm gives us the opportunity not just to record how objects change as they’re used and referred to, but also a chance to make the connections to new knowledge gained from use of digital objects completely explicit and part of the object itself.

This highlights the final point I want to make about two of the overarching concepts of ‘real-world’ archiving and preservation which PASIG indicated might not map cleanly onto digital preservation. The first is the concept of ‘depositing’. According to Jeffrey Spies’s model, the ‘real world’ research workflow of ‘plan the research, collect and analyse the data, publish findings, gain recognition / significance in the research domain, and then finally deposit evidence of this ground-breaking research in an archive’, simply no longer applies. In the new model, the initial ‘deposit’ is made at the point a key piece of data is first captured, or a key piece of analysis is created. Works in progress, early drafts, important communications, grey literature, as well as the final published output, are all candidates for preservation at the point they are first created by the researchers. digital preservation happens seamlessly in the background. The states of the ‘preserved’ objects change throughout.

The second is the concept of ‘managing’ (urgh!), or otherwise ‘maintaining the status quo’ of an object into the long-term future. In the digital realm, there doesn’t need to be a ‘status quo’ – in fact there just isn’t one. We can record when people search for objects, when they find them, when they cite them. We can record when preserved data is validated by attempts to reproduce experiments or re-used entirely in different contexts. We can note when people have been inspired to create new artworks based upon our previous efforts, or have interpreted the work we have preserved from entirely new perspectives. This is genuine preservation: preservation that will help fit the knowledge we preserve today into the future picture. This opportunity would be much harder to realise when storing things in a ‘real-world’ archive, and we need to be careful to avoid thinking too much ‘in real terms’ if we are to make the most of it.

What do you think? Is it fruitful to try and map digital preservation onto real world concepts? Or does doing so put us at risk of missing core opportunities? Would moving too far away from ‘real-world’ archiving put us at risk of losing many important skills and ideas? Or does thinking about ‘the digital data archive’ in terms that are too like ‘the real world’ limit us from making important connections to our data in future?

Where does the best balance between ‘real-world’ concepts and digital preservation lie?