PASIG 2017: honest reflections from a trainee digital archivist

A guest blog post by Kelly, one of the Bodleian Libraries’ graduate digital archivist trainees, on what she learned as a volunteer and attendee of PASIG 2017 Oxford.


Alongside digital preservation professionals from almost every continent and 130 institutions, my five traineeship colleagues and I took our places amongst the lecture theatre seats, annexe demos and the awesome artefacts of the Museum of Natural History for PASIG 2017, Oxford. Just six months into our traineeship, it was a brilliant opportunity not only to apply some of our new knowledge to our work at Special Collections, Bodleian Libraries, but also to gain a really current and relevant insight into the theories we have been studying as part of our distance-learning MSc in Digital Curation at Aberystwyth University. The first ‘Bootcamp’ day was exactly what I needed to throw myself in, and it really consolidated my confidence in my understanding of the shared language used across the profession (fixity checks, maturity models… as well as getting to grips with submission information packages, dissemination information packages and everything that occurs in between!).

My pen didn’t stop scribbling for all three days, except maybe for tea breaks. That said, the demo presentations were also a great time for the other trainees and me to ask questions specifically about the workflows and benefits of particular software, such as LibNova, Preservica and ResourceSpace.

For want of a better word (and because it really is the truth), PASIG 2017 was genuinely inspiring, and some messages were delivered so powerfully that I hope I stay grounded in them for my entire career. Here is what I was taught:

The Community is invaluable. Many of the speakers were quick to assert that sharing practice amongst the digital preservation community is key. This is a value I was already familiar with, yet witnessing it happen throughout the conference in such a sincere manner was something else entirely. I can assure you that the gratitude and affirmation which followed the presentation by Eduardo del Valle, University of the Balearic Islands – “Sharing my loss to protect your data: A story of unexpected data loss and how to do real preservation” – was as encouraging for someone new to the profession to witness as it was for all of the more experienced delegates present. As well as sharing practice, it was clear that the community needs to advocate on behalf of each other. It is time- and resource-consuming, but oh-so important.

Digital archives are preserving historical truths. Yes, the majority of the workflow is technological, but the objectives and functions are so much more than technology; to reduce digital preservation down to this is an oversimplification. It was so clear that the range of use cases presented at PASIG were all driven towards documenting social, political and historical information (and preserving that documentation) that will be of absolute necessity for society and infrastructure in the future. Right now, for example, Angeline Takawira and her colleagues at the UN Mechanism for International Criminal Tribunals (MICT) are working on a digital preservation programme to ensure absolute accountability and usability of the records of the International Criminal Tribunals for both Rwanda and the former Yugoslavia. I have written a more specific post on Angeline’s presentation here.

Due to the nature of technology and the digital world, the goalposts will always be moving. For example, Somaya Langley’s talk on the future of digital preservation and the mysteries of extracting data from smart devices will soon become (and maybe already is) a reality for those working with accessions of archives or information management. We should, then, embrace change, embrace the uncertain and ultimately ‘get over the need for tidiness’, as pointed out by John Sheridan from The National Archives during his presentation “Creating and sustaining a disruptive digital archive”. This can feel counter-intuitive, but as the saying goes, one of the most dangerous phrases to use is ‘we’ve always done it that way’.

The value of digital material outlives the software, so enabling the prolonged use of software is a real and current issue. Admittedly, this was a factor I had genuinely not even considered before: in my mind I had linked obsolescence with hardware, and hardware only. Dr. Natasa Milic-Frayling’s presentation on “Aging of Digital: Managed Services for digital continuity” therefore shed much light on the changing computing ecosystem and the gradual aging of software. What I found especially interesting about the proposed software-continuity plan was its transparency: the client can ask to see the software at any time whilst it is being stabilised and maintained.

Thank you so much PASIG 2017 and everybody involved!

One last thing… in closing, Cliff Lynch, CNI, brought up that there was comparatively little web archiving content this year. If anybody fancies taking a trainee to Mexico next year to do a (lightning) talk on Bodleian Libraries’ Web Archive, I am keen…

 

 

Computers are the apogee of profligacy: a response to THE most important PASIG 2017 presentations

Following the PASIG conference, Cambridge Technical Fellow Dave Gerrard couldn’t wait to fire off his thoughts on the global context of digital preservation and how we need to better consider the world around us, in order to work on a global solution and not just one that suits a capitalist agenda. We usually preface these blogs with “enjoy”, but in this instance, please find a quiet moment, make yourself comfortable, read on and contemplate the global issues passionately presented here.


I’m going to work on a more technical blog about PASIG later, but first I want to get this one off my chest. It’s about the two most important presentations: Angeline Takawira’s Digital preservation at the United Nations Mechanism for International Criminal Tribunals and Keep your eyes on the information, Patricia Sleeman’s discussion of preservation work at the UN Refugee Agency (UNHCR).

Angeline Takawira described, in a very precise and formal manner, how the current best practice in Digital Preservation is being meticulously applied to preserving information from UN war crimes tribunals in The Hague (covering the Balkan conflict) and Arusha, Tanzania (covering the Rwandan genocide). As befitted her work, it was striking how calm Angeline was; how well the facts were stuck to, despite the emotive context. Of course, this has to be the case for work underpinning legal processes: intrusion of emotion into the capture of facts could let those trying to avoid justice escape it.

And the importance of maintaining a dispassionate outlook was echoed in the title of the other talk. “Keep your eyes on the information” was what Patricia Sleeman was told when learning to work with the UNHCR, as engaging too emotionally with the refugee crisis could make vital work impossible to perform. However, Patricia provided some context, in part by playing Head Over Heels (Emi Mahmoud’s poem about the conflict and refugee crisis in Darfur), and by describing the brave, inspirational people she had met in Syria and Kurdistan. An emotionless response was impossible: the talk resulted in the conference’s longest and loudest applause.

Indeed, I think the audience was so stunned by Patricia’s words that questions were hard to formulate. However, my colleague Somaya at least asked the $64,000 one: how can we help? I’d like to tie this question back to one that Patricia raised in her talk, namely (and I paraphrase here): how do you justify expenditure on tasks like preservation when doing so takes food from the mouths of refugees?

So, now that I’m less stunned, here’s my take: feeding refugees addresses a symptom of the problem. Telling their stories helps to solve the problem, by making us engage our emotions, and think about how our lives are related to theirs and about how our behaviour impacts upon them. And how can we help? Sure, we can help Patricia with her data management and preservation problems. But how can we really contribute to a solution? How can we stop refugee crises occurring in the first place?

We have a responsibility to recognise the connections between our own behaviour and the circumstances refugees find themselves in, and it all comes down, of course, to resources, and the profligate waste of them in the developed world. Indeed, Angeline and Patricia’s talks illustrated the borderline absurdity of a bunch of (mostly) privileged ‘Westerners’ / ‘Northerners’ (take your pick) talking about the ‘preservation’ of anything, when we’re products of a society that’s based upon throwing everything away.

And computers / all things ‘digital’ are at the apogee of this profligacy: Natasa Milic-Frayling highlighted this when she (diplomatically) referred to the way in which the ‘innovators’ hold all the cards, currently, in the relationship with ‘content producers’, and can hence render the technologies upon which we depend obsolete across ever-shorter cycles. Though, after Patricia’s talk, I’m inclined to frame this more in terms of ‘capitalist industrialists generating unnecessary markets at the expense of consumers’; particularly given that, while we were listening to Patricia, the latest iPhone was being launched in the US.

Though, of course, it’s not really the ‘poor consumers’ who genuinely suffer due to planned obsolescence… That would be the people in Africa and the Middle East whose countries are war zones due to grabs for oil or droughts caused by global warming. As the world’s most advanced tech companies, Apple, Google, Facebook, Amazon, Microsoft et al are the biggest players in a society that – at best indirectly, at worst carelessly – causes the suffering of the people Patricia and Angeline are helping and providing justice for. And, as someone typing a blog post using a Macbook Pro that doesn’t even let me add a new battery – I’m clearly part of the problem, not the solution.

So – in answer to Somaya’s question: how can we help? Well, for a start, we can stop fetishising the iPhone and start bigging up Fairphone and Phonebloks. However, keeping the focus on Digital Preservation, we’ve got to be really careful that our efforts aren’t used to support an IT industry that’s currently profligate way beyond moral acceptability. So rather than assuming (as I did above) that all the ‘best-practice’ of digital preservation flows from the ‘developed’ (ahem) world to the ‘developing’, we ought to seek some lessons in how to preserve technology from those who have fewer opportunities to waste it.

Somaya’s already on the case with her upcoming panel at iPres on the 28th of September. Then we ought to continue down the road of holding PASIG in Mexico City next year by holding one in Africa as soon as possible – as long as, when we’re there, we make sure we shut up and listen.

PASIG 2017 Twitter round-up

After many months of planning it feels quite strange to us that PASIG 2017 is over. Hosting the PASIG conference in Oxford has been a valuable experience for the DPOC fellows and a great chance for Bodleian Libraries’ staff to meet with and listen to presentations by digital preservation experts from around the world.

In the end 244 conference delegates made their way to Oxford and the Museum of Natural History. The delegates came from 130 different institutions and every continent of the world was represented (…well, apart from Antarctica).

What was especially exciting, though, were all the new faces. In fact, two-thirds of the delegates this year had not been to a PASIG conference before! Is this perhaps a sign that interest in digital preservation is on the rise?

As always at PASIG, Twitter was ablaze with discussion in spite of an at-times flaky Wi-Fi connection. Over three days, #PASIG17 was mentioned a whopping 5,300 times on Twitter and had a “reach” of 1.7 million. Well done everyone on some stellar outreach! The most active Twittering came from the UK, the USA and Austria.

Twitter activity by country using #PASIG17 (Talkwalker statistics)

Although it is hard to choose favourites among all the Tweets, a few of the DPOC project’s personal highlights included:

Cambridge Fellow Lee Pretlove lists “digital preservation skills” and why we cannot be an expert in all areas. Tweet by Julian M. Morley

Bodleian Fellow James makes some insightful observations about the incompatibility between tar pits and digital preservation.

Cambridge Fellow Somaya Langley presents in the last PASIG session on the topic of “The Future of Digital Preservation”.  

What were some of your favourite talks and Twitter conversations? What would you like to see more of at PASIG 2018? #futurePASIG

Digital Preservation futurology

I fancy attempting futurology, so here’s a list of things I believe could happen to ‘digital preservation systems’ over the next decade. I’ve mostly pinched these ideas from folks like Dave Thompson, Neil Jefferies, and my fellow Fellows. But if you see one of your ideas, please claim it using the handy commenting mechanism. And because it’s futurology, it doesn’t have to be accurate, so kindly contradict me!

Ingest becomes a relationship, not a one-off event

Many of the core concepts underpinning how computers are perceived to work are crude, paper-based metaphors – e.g. ‘files’, ‘folders’, ‘desktops’, ‘wastebaskets’ etc. – that don’t relate to what your computer’s actually doing. (The early players in office computing were typewriter and photocopier manufacturers, after all…) These metaphors have succeeded at getting everyone to use computers, but they’ve also suppressed various opportunities to work smarter.

The concept of ingesting (oxymoronic) ‘digital papers’ is obviously heavily influenced by this paper paradigm.  Maybe the ‘paper paradigm’ has misled the archival community about computers a bit, too, given that they were experts at handling ‘papers’ before computers arrived?

As an example of what I mean: in the olden days (25 whole years ago!), Professor Plum would amass piles of important papers until the day he retired / died, and then, and only then, could these personal papers be donated and archived. Computers, of course, make it possible for the Prof both to keep his ‘papers’ where he needs them and to donate them at the same time, but the ‘ingest event’ at the centre of current digital preservation systems still seems to be underpinned by a core concept of ‘piles of stuff needing to be dealt with as a one-off task’. In future, the ‘ingest’ of a ‘donation’ will actually become a regular, repeated set of occurrences, based upon ongoing relationships between donors and collectors that are forged initially when Profs are but lowly postgrads. Personal Digital Archiving and Research Data Management will become key, and ripping digital ephemera from dying hard disks will become less necessary as those practices take hold.

The above depends heavily upon…

Object versioning / dependency management

Of course, if Dr. Damson regularly donates materials from her postgrad days onwards, some of these may be updates to things donated previously. Some of them might have mutated so much since the original donation that they can be considered ‘child’ objects, which may have ‘siblings’ with ‘common ancestors’ already extant in the archive. Hence preservation systems need to manage multiple versions of ‘digital objects’, and the relationships between them.

Some of the preservation systems we’ve looked at claim to ‘do versioning’ but it’s a bit clunky – just side-by-side copies of immutable ‘digital objects’, not records of the changes from one version to the next, and with no concept of branching siblings from a common parent. Complex structures of interdependent objects are generally problematic for current systems. The wider computing world has been pushing at the limits of the ‘paper-paradigm’ immutable object for a while now (think Git, Blockchain, various version control and dependency management platforms, etc). Digital preservation systems will soon catch up.
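
To make that concrete, here is a minimal sketch (in Python, not modelled on any particular preservation system, and with every name in it purely illustrative) of what recording parent/child relationships between versions – rather than storing side-by-side copies – might look like:

from dataclasses import dataclass
from typing import Optional, List, Dict

@dataclass
class ObjectVersion:
    """One version of a 'digital object', linked to the version it was derived from."""
    version_id: str
    parent_id: Optional[str]   # None for the first deposit
    change_note: str           # what changed since the parent, not just a fresh copy

class VersionGraph:
    """A toy store of versions that can answer simple ancestry questions."""
    def __init__(self) -> None:
        self.versions: Dict[str, ObjectVersion] = {}

    def add(self, version: ObjectVersion) -> None:
        self.versions[version.version_id] = version

    def ancestors(self, version_id: str) -> List[str]:
        """Walk back up the chain of parents to the original deposit."""
        chain = []
        current = self.versions[version_id].parent_id
        while current is not None:
            chain.append(current)
            current = self.versions[current].parent_id
        return chain

    def siblings(self, version_id: str) -> List[str]:
        """Other versions derived from the same parent (branches, not duplicates)."""
        parent = self.versions[version_id].parent_id
        return [v.version_id for v in self.versions.values()
                if v.parent_id == parent and v.version_id != version_id]

# Example: a donor's chapter is deposited, revised, and later forked into a paper.
graph = VersionGraph()
graph.add(ObjectVersion("v1", None, "initial donation as a postgrad"))
graph.add(ObjectVersion("v2", "v1", "figures redrawn, two sections rewritten"))
graph.add(ObjectVersion("v3", "v1", "reworked into a conference paper"))
print(graph.ancestors("v2"))   # ['v1']
print(graph.siblings("v3"))    # ['v2']

The point is simply that ‘which version came from which’ becomes something the system can answer, rather than something buried in filenames.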

Further blurring of the object / metadata boundary

What’s more important, the object or the metadata? The ‘paper-paradigm’ has skewed thinking towards the former (the sacrosanct ‘digital object’, comparable to the ‘original bit of paper’), but after you’ve digitised your rare book collection, what are Humanities scholars going to text-mine? It won’t be images of pages – it’ll be the transcripts of those (i.e. the ‘descriptive metadata’)*. Also, when seminal papers about these text mining efforts are published, how is this history of the engagement with your collection going to be recorded? Using a series of PREMIS Events (that future scholars can mine in turn), perhaps?
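
For example, a text-mining run over a digitised collection might be recorded as an event along these lines – a hand-rolled sketch that borrows a few PREMIS semantic unit names (eventType, eventDateTime, eventDetail and the linking identifiers) as plain Python data. The identifiers and values are invented, and real PREMIS would normally be serialised as XML alongside the object’s other metadata:

import json
from datetime import datetime, timezone

# One illustrative 'engagement' event recorded against a digitised collection.
# Field names echo PREMIS semantic units; the values are entirely made up.
text_mining_event = {
    "eventIdentifier": {"type": "local", "value": "event-2017-0042"},
    "eventType": "analysis",
    "eventDateTime": datetime(2017, 9, 12, 14, 30, tzinfo=timezone.utc).isoformat(),
    "eventDetail": "Topic modelling run over OCR transcripts of the rare book collection",
    "linkingAgentIdentifier": {"type": "ORCID", "value": "0000-0000-0000-0000"},
    "linkingObjectIdentifiers": ["rarebooks-transcripts-v2"],
    "eventOutcome": "success",
}

print(json.dumps(text_mining_event, indent=2))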

The above talk of text mining and contextual linking of secondary resources raises two more points…

* While I’m here, can I take issue with the term ‘descriptive metadata’? All metadata is descriptive. It’s tautological; like saying ‘uptight Englishman’. Can we think of a better name?

Ability to analyse metadata at scale

‘Delivery’ no longer just means ‘giving users a viewer to look at things one-by-one with’ – it now also means ‘letting people push their Natural Language or image processing algorithms to where the data sits, and then coping with vast streams of output data’.

Storage / retention informed by well-understood usage patterns

The fact that everything’s digital, and hence easier to disseminate and link together than physical objects, also means we can better understand how people use our material. This doesn’t just mean ‘wiring things up to Google Analytics’ – advances in bibliometrics that add social / mainstream media analysis, and so forth, to everyday citation counts present opportunities to judge the impact of our ‘stuff’ on the world like never before. Smart digital archives will inform their storage management and retention decisions with this sort of usage information, potentially in fully or semi-automated ways.
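
As a toy illustration of the semi-automated end of that spectrum (the metrics, weights and thresholds below are entirely invented), a usage-informed storage rule might look something like this:

from dataclasses import dataclass

@dataclass
class UsageStats:
    """Invented usage metrics for one archived item over the past year."""
    item_id: str
    downloads: int
    citations: int
    social_mentions: int

def suggest_storage_tier(stats: UsageStats) -> str:
    """Suggest (not decide!) a storage tier from usage; a curator reviews the output."""
    score = stats.downloads + 10 * stats.citations + stats.social_mentions
    if score > 500:
        return "hot (fast, replicated, costly)"
    if score > 50:
        return "warm (standard preservation storage)"
    return "cold (cheap, slow to retrieve) - flag for retention review"

for item in [UsageStats("thesis-1987-021", 3, 0, 1),
             UsageStats("dataset-2016-114", 420, 12, 200)]:
    print(item.item_id, "->", suggest_storage_tier(item))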

Ability to get data out, cleanly – all systems are only ever temporary!

Finally – it’s clear that there are no ‘long-term’ preservation system options. The system you procure today will merely be ‘custodian’ of your materials for the next ten or twenty years (if you’re lucky). This may mean moving heaps of content around in future, but perhaps it’s more pragmatic to think of future preservation systems as more like ‘lenses’ that are laid on top of more stable data stores to enable as-yet-undreamt-of functionality for future audiences?

(OK – that’s enough for now…)

What is holding us back from change?

There are worse spots for a meeting. Oxford. Photo by: S. Mason

Every three months the DPOC team gets together in person in either Oxford, Cambridge or London (there’s also been talk of holding a meeting at Bletchley Park sometime). As this is a collaborative effort, these meetings offer a rare opportunity to work face-to-face instead of via Skype, with its endless issues around screen sharing and poor connections. Good ideas come when we get to sit down together.

As our next joint board meeting is next week, it was important to look over the work of the past year and make sure we are happy with the plan for year two. Most importantly, we wanted to discuss the messages we need to give our institutions as we look towards the sustainability of our digital preservation activities. How do we ensure that the earlier work, and the work now being done by us, does not have to be repeated in 2-5 years’ time?

Silos in institutions

This is especially complicated when dealing with institutions like Oxford and Cambridge. We are big and old institutions with teams often working in silos. What does siloing have an effect on? Well, everything. Communication, effort, research—it all suffers. Work done previously is done again. Over and over.

The same problems are being tackled within different silos; this is duplicated and wasted effort if they are not communicating their work to each other. This means that digital preservation efforts can be fractured and imbalanced if institutional collaboration is ignored. We have an opportunity and responsibility in this project to get people together and to get them to talk openly about the digital preservation problems they are each trying to tackle.

Managers need to lead the culture change in the institution

While it is not always the case, it is important that managers do not just sit back and say “you will never get this to work” or “it has always been this way.” We need them on our side; they are often the gatekeepers of the silos. We have to bring them together in order to start opening the silos up.

It is within their power to be the agents of change; we have to empower them to believe in changing the habits of our institution. They have to believe that digital preservation is worth it if their teams are to believe it too.

This might be the ‘carrot and stick’ approach or the ‘carrot’ only, but whatever approach is used, there are a number of points we agreed needed to be made clear:

  • our digital collections are significant and we have made assurances about their preservation and long term access
  • our institutional reputation plays a role in the preservation of our digital assets
  • digital preservation is a moving target and we must be moving with it
  • digital preservation will not be “solved” through this project, but we can make a start; it is important that this is not then the end.

Roadmap to sustainable digital preservation

Backing up any messages is the need for a sustainable roadmap. If you want change to succeed, and if you want digital preservation to be a core activity, then the steps must be actionable and incremental. Find out where you are, where you want to go, and then outline the timeline of steps it will take to get there. Consider using maturity models to set goals for your roadmap, such as Kenney and McGovern’s, Brown’s or the NDSA model. Each is slightly different and some might be more suitable for your institution than others, so have a look at all of them.

It’s like climbing a mountain. I don’t look at the peak as I walk; it’s too far away and too unattainable. Instead, I look at my feet and the nearest landmark. Every landmark I pass is a milestone and I turn my attention to the next one. Sometimes I glance up at the peak, still in the distance—over time it starts to grow closer. And eventually, my landmark is the peak.

It’s only when I get to the top that I see all of the other mountains I also have to climb. And so I find my landmarks and continue on. I consider digital preservation a bit of the same thing.

What are your suggestions for breaking down the silos and getting fractured teams to work together? 

Operational Pragmatism in Digital Preservation: a discussion

From Somaya Langley, Policy and Planning Fellow at Cambridge: In September this year, six digital preservation specialists from around the world will be leading a panel and audience discussion. The panel is titled Operational Pragmatism in Digital Preservation: establishing context-aware minimum viable baselines. This will be held at the iPres International Digital Preservation Conference in Kyoto, Japan.


Panellists

Panellists include:

  • Dr. Anthea Seles – The National Archives, UK
  • Andrea K Byrne – Rensselaer Polytechnic Institute, USA
  • Dr. Dinesh Katre – Centre for Development of Advanced Computing (C-DAC), India
  • Dr. Jones Lukose Ongalo – International Criminal Court, The Netherlands
  • Bertrand Caron – Bibliothèque nationale de France
  • Somaya Langley – Cambridge University Library, UK

Panellists have been invited based on their knowledge of a breadth of digital creation, archiving and preservation contexts and practices including having worked in non-Western, non-institutional and underprivileged communities.

Operational Pragmatism

What does ‘operational pragmatism’ mean? For the past year or two I’ve been pondering ‘what corners can we cut?’ For over a decade I have witnessed an increasing amount of work in the digital preservation space, yet I haven’t seen a corresponding increase in the staffing and resources to handle this work. Meanwhile, deadlines for transferring digital (and analogue audiovisual) content from carriers are just around the corner (e.g. Deadline 2025).

Panel Topic

Outside of the First World and the national institutional/top-tier university context, individuals in the developing world struggle to access the basic technology and resources needed to undertake archiving and preservation of digital materials. Privileged First World institutions (who still struggle with deeply ingrained under-resourcing) are considering Trusted Digital Repository certification, while in the developing world meeting these standards is simply not feasible. (This is evidenced by work that has taken place in the POWRR project, in Anthea Seles’ PhD thesis and elsewhere.)

How do we best prioritise our efforts so we can plan effectively (with the current resources we have)? How do we strategically develop these resources in methodical ways while ensuring the critical digital preservation work gets done before it is simply too late?

Approach

This panel discussion will take the form of a series of provocations addressing topics including: fixity, infrastructure and storage, preconditioning, pre-ingest processes, preservation metadata, scalability (including bi-directional scalability), technical policies, tool error reporting and workflows.

Each panellist will present their view on a different topic. Audience involvement in the discussion will be strongly encouraged.

Outcomes

The intended outcome is a series of agreed-upon ‘baselines’ tailored to different cultural, organisational and contextual situations, with the hope that these can be used for digital preservation planning and strategy development.

Further Information

The Panel Abstract is included below.

iPres Digital Preservation Conference program information can be found at: https://ipres2017.jp/program/.

We do hope you’ll be able to join us.


Panel Abstract

Undertaking active digital preservation, holistically and thoroughly, requires substantial infrastructure and resources. National archives and libraries across the Western world have established, or are working towards maturity in digital preservation (often underpinned by legislative requirements). On the other hand, smaller collectives and companies situated outside of memory institution contexts, as well as organisations in non-Western and developing countries, are struggling with the basics of managing their digital materials. This panel continues the debate within the digital preservation community, critiquing the development of digital preservation practices typically from within positions of privilege. Bringing together individuals from diverse backgrounds, the aim is to establish a variety of ‘bare minimum’ baselines for digital preservation efforts, while tailoring these to local contexts.

Six Priority Digital Preservation Demands

Somaya Langley, Cambridge Policy and Planning Fellow, talks about her top 6 demands for a digital preservation system.


Photo: Blazej Mikula, Cambridge University Library

As a former user of one digital preservation system (Ex Libris’ Rosetta), I have spent a few years frustrated by the gap between what activities need to be done as part of a digital stewardship end-to-end workflow – including packaging and ingesting ‘information objects’ (files and associated metadata) – and the maturity level of digital preservation systems.

Digital Preservation Systems Review

At Cambridge, we are looking at different digital preservation systems and what each one can offer. This has involved talking to both vendors and users of systems.

When I’m asked about what my top digital preservation system current or future requirements are, it’s excruciatingly hard to limit myself to a handful of things. However, having previously been involved in a digital preservation system implementation project, there are some high-level takeaways from past experiences that remain with me.

Shortlist

Here’s the current list of my six top ‘digital preservation demands’ (aka user requirements):

Integration (with various other systems)

A digital preservation ‘system’ is only one cog in a much larger machine; one piece of a much larger puzzle. There is an entire ‘digital ecosystem’ that this ‘system’ should exist within, and end-to-end digital stewardship workflows are of primary importance. The right amount of metadata and/or files should flow from one system to another. We must also know where the ‘source of truth’ is for each bit.

Standards-based

This seems like a no-brainer. We work in Library Land. Libraries rely on standards. We also work with computers and other technologies that require standard ways (protocols etc.) of communicating with each other.

For files and metadata to flow from one system to another – whether via import, ingest, export, migration or an exit strategy from a system – we already spend a bunch of time creating mappings and crosswalks from one standard (or implementation of a standard) to another. If we don’t use (or fully implement) existing standards, this means we risk mangling data, context or meaning; potentially losing or not capturing parts of the data; or just wasting a whole lot of time.
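
As a tiny example of the kind of crosswalk being described – a sketch only, mapping a handful of real Dublin Core element names onto MODS-style paths; any real mapping is far larger and full of awkward edge cases, and the sample record at the end is also invented:

# A minimal, illustrative crosswalk from simple Dublin Core elements to
# MODS-style element paths. Real crosswalks have many more fields, repeatable
# elements, and rules for what to do when a source value doesn't quite fit.
DC_TO_MODS = {
    "dc:title":   "mods:titleInfo/mods:title",
    "dc:creator": "mods:name/mods:namePart",
    "dc:date":    "mods:originInfo/mods:dateIssued",
    "dc:format":  "mods:physicalDescription/mods:internetMediaType",
}

def crosswalk(record: dict) -> dict:
    """Map a flat DC record onto MODS paths, keeping track of anything unmapped."""
    mapped, unmapped = {}, {}
    for key, value in record.items():
        target = DC_TO_MODS.get(key)
        if target:
            mapped[target] = value
        else:
            unmapped[key] = value   # don't silently drop data we can't place
    return {"mapped": mapped, "unmapped": unmapped}

print(crosswalk({"dc:title": "Minute book, 1911-1923",
                 "dc:creator": "Oxford Canal Company",
                 "dc:rights": "In copyright"}))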

Error Handling (automated, prioritised)

There’s more work to be done in managing digital materials than there are people to do it. Content creation is increasing at an exponential rate; meanwhile, the number of staff (with the right skills) isn’t keeping pace. We have to be smart about how we work. This requires prioritisation.

We need to have smarter systems that help us. This includes helping to prioritise where we focus our effort. Digital preservation systems are increasingly incorporating new third-party tools. We need to know which tool reports each error and whether these errors are show-stoppers or not. (For example: is the content no longer renderable versus a small piece of non-critical descriptive metadata that is missing?) We have to accept that, for some errors, we will never get around to addressing them.
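
Here is a minimal sketch of what ‘prioritised error handling’ could mean in practice; the tools, error codes and severity levels are invented for illustration and are not taken from any particular preservation system:

from enum import Enum

class Severity(Enum):
    SHOW_STOPPER = 1   # e.g. content no longer renderable, checksum mismatch
    FIX_LATER = 2      # e.g. a non-critical descriptive field is missing
    IGNORE = 3         # noise we accept we may never get around to

# Invented mapping of (tool, error code) -> severity; a real system would need
# to know which third-party tool raised the error and what it actually means.
ERROR_TRIAGE = {
    ("format-validator", "NOT_RENDERABLE"): Severity.SHOW_STOPPER,
    ("fixity-checker", "CHECKSUM_MISMATCH"): Severity.SHOW_STOPPER,
    ("metadata-extractor", "MISSING_DC_SUBJECT"): Severity.FIX_LATER,
    ("metadata-extractor", "NON_ASCII_FILENAME"): Severity.IGNORE,
}

def triage(tool: str, code: str) -> Severity:
    """Default unknown errors to FIX_LATER so a human eventually looks at them."""
    return ERROR_TRIAGE.get((tool, code), Severity.FIX_LATER)

errors = [("metadata-extractor", "NON_ASCII_FILENAME"),
          ("fixity-checker", "CHECKSUM_MISMATCH")]
for tool, code in sorted(errors, key=lambda e: triage(*e).value):
    print(triage(tool, code).name, "-", tool, code)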

Reporting

We need to be able to report to different audiences. The different types of reporting classes include (but are not limited to):

  1. High-level reporting – annual reports, monthly reports, reports to managers, projections, costings etc.
  2. Collection and preservation management reporting – reporting on successes and failures, overall system stats, rolling checksum verification etc.
  3. Reporting for preservation planning purposes – based on preservation plans, we need to be able to identify subsections of our collection (configured around content types, context, file format and/or whatever other parameters we choose to use) and report on potential candidates that require some kind of preservation action.

Provenance

We need to document, as best we can via metadata, where a file has come from. This, for want of a better approach, is currently being handled by the digital preservation community by documenting changes as Provenance Notes. Digital materials acquired into our collections are not just the files; they’re also the metadata. (Hence why I refer to them as ‘information objects’.) When an ‘information object’ has been bundled and is ready to be ingested into a system, I think of it as becoming an ‘information package’.

There’s a lot of metadata (administrative, preservation, structural, technical) that accumulates along the path from an object’s creation until the point at which it becomes an ‘information package’. We need to ensure we’re capturing and retaining the important components of this metadata, and those components we deem essential must travel alongside their associated files into a preservation system. (Not all files will have any, or even the right, metadata embedded within the file itself.) Standardised ways of handling the information held in Provenance Notes (whether these come from ‘outside of the system’ or are created by the digital preservation system) and event information, so that it can be interrogated and reported on, are crucial.
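
To give a flavour of what bundling an ‘information package’ with its provenance can look like on disk, here is a deliberately simplified, BagIt-flavoured sketch (the data/ payload plus checksum manifest loosely follow the BagIt idea; the provenance.json file, the function and the sample content are all invented for illustration):

import hashlib
import json
from pathlib import Path

def build_package(package_dir: str, files: dict, provenance_notes: list) -> None:
    """Write payload files, a checksum manifest and a provenance record to disk.

    files: {relative_name: bytes}; provenance_notes: list of dicts (who/when/what).
    """
    root = Path(package_dir)
    data = root / "data"
    data.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in files.items():
        (data / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    # Checksum manifest, in the spirit of BagIt's manifest-sha256.txt
    (root / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")

    # Provenance notes travel with the package rather than living only in someone's inbox
    (root / "provenance.json").write_text(json.dumps(provenance_notes, indent=2))

build_package(
    "example_package",
    {"interview_01.wav": b"...audio bytes...", "interview_01.txt": b"transcript text"},
    [{"date": "2017-09-01", "agent": "depositor",
      "note": "Recorded on a handheld recorder; transcript produced by the donor."}],
)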

Managing Access Rights

Facilitating access is not black and white. Collections are not simply ‘open’ or ‘closed’. We have a myriad of ways that digital material is created and collected; we need to ensure we can provide access to this content in a variety of ways that support both the content and our users. This can include access from within an institution’s building, via a dedicated computer terminal, online access to anyone in the world, mediated remote access, access to only subsets of a collection, support for embargo periods, ensuring we respect cultural sensitivities or provide access to only the metadata (perhaps as large datasets) and more.

We must set a goal of working towards providing access to our users in the many different (and new-ish) ways they actually want to use our content.

It’s imperative to keep in mind the whole purpose of preserving digital materials is to be able to access them (in many varied ways). Provision of content ‘viewers’ and facilitating other modes of access (e.g. to large datasets of metadata) are essential.

Final note: I never said addressing these concerns was going to be easy. We need to factor each in and make iterative improvements, one step at a time.

C4RR – Containers for Reproducible Research Conference

James shares his thoughts after attending the C4RR Containers for Reproducible Research Conference at the University of Cambridge (27 – 28 June).


At the end of June both Dave and I, the Technical Fellows, attended the C4RR conference/workshop hosted by The Software Sustainability Institute in Cambridge. This event brought together researchers, developers and educators to explore best practices when using containers and the future of research software with containers.

Containers, especially Docker and Singularity, are the ‘in’ thing at the moment, and it was interesting to hear from a variety of research projects who are using them for reproducible research.

Containers are another form of server virtualisation but are lighter than a virtual machine. Containers and virtual machines have similar resource isolation and allocation benefits, but function differently because containers virtualize the operating system instead of hardware; containers are more portable and efficient.

 
Comparison of VM vs Container (Images from docker.com)

Researchers described how they were using Docker, one of the container implementations, to package the software used in their research so they could easily reproduce their computational environment across several different platforms (desktop, server and cluster). Others were using Singularity, another container technology, when implementing containers on an HPC (High-Performance Computing) cluster, due to Docker’s requirement for root access. It was clear from the talks that these technologies are developing rapidly and that the computing environments involved are becoming ever more complex, which does make me worry about how they might be preserved.

Near the end of the second day, Dave and I gave a 20 minute presentation to encourage the audience to think more about preservation. As the audience were all evangelists for container technology it made sense to try to tap into them to promote building preservation into their projects.


Image By Raniere Silva

One aim was to get people to think about their research after the project was over. There is often a lack of motivation to think about how others might reproduce the work, whether that’s six months into the future let alone 15+ years from now.

Another area we briefly covered related to depositing research data. We use DROID to scan our repositories and identify file formats, a process which relies on the PRONOM technical registry. We put out a plea to the audience to ask for help with creating new file signatures for unknown file formats.
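
The real identification work is done by DROID against PRONOM’s signature registry, but as a rough illustration of the underlying idea – matching a file’s leading ‘magic’ bytes against known signatures – here is a toy sketch. The two signatures shown are well known; everything else about it is heavily simplified, and it is not how DROID is actually invoked:

from pathlib import Path

# A couple of well-known leading-byte signatures. PRONOM holds thousands,
# including far more precise internal and positional signatures than this.
TOY_SIGNATURES = {
    b"%PDF": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
}

def identify(path: str) -> str:
    """Guess a format from the first few bytes; 'unknown' is the interesting case."""
    head = Path(path).read_bytes()[:16]
    for magic, label in TOY_SIGNATURES.items():
        if head.startswith(magic):
            return label
    return "unknown - a candidate for a new file signature submission"

# Example usage (assumes these files exist in the current directory):
# print(identify("report.pdf"))
# print(identify("mystery_file.dat"))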

I had some great conversations with others over the two days, and my main takeaway from the event was that we should look to attend more non-preservation-specific conferences, with a view to promoting preservation in other computer-related areas of study.

Our slides from the event have been posted by The Software Sustainability Institute via Google.

DPASSH: Getting close to producers, consumers and digital preservation

Sarah shares her thoughts after attending the DPASSH (Digital Preservation in the Arts, Social Sciences and Humanities) Conference at the University of Sussex (14 – 15 June).


DPASSH is a conference that the Digital Repository of Ireland (DRI) puts on with a host organisation. This year, it was hosted by the Sussex Humanities Lab at the University of Sussex, Brighton. What is exciting about this digital preservation conference is that it brings together creators (producers) and users (consumers) with digital preservation experts. Most digital preservation conferences end up being a bit of an echo chamber, full of practitioners and vendors only. But what about the creators and the users? What knowledge can we share? What can we learn?

DPASSH is a small conference, but it was an opportunity to see what researchers are creating and how they are engaging with digital collections. For example, in her talk Stefania Forlini discussed the perils of a content-centric digitisation process where unique print artefacts are all treated the same; the process flattens everything into identical objects even though they are very different. What about the materials and the physicality of the object? It has stories to tell as well.

To Forlini, books span several domains of sensory experience and our digitised collections should reflect that. With the Gibson Project, Forlini and project researchers are trying to find ways to bring some of those experiences back through the Speculative W@nderverse. They are currently experimenting with embossing different kinds of paper with a code that can be read by a computer. The computer can then bring up the science fiction pamphlets that are made of that specific material. Then a user can feel the physicality of the digitised item and then explore the text, themes and relationships to other items in the collection using generous interfaces. This combines a physical sensory experience with a digital experience.

For creators, the decision of what research to capture and preserve is sometimes difficult; often they lack the tools to capture the information. Other times, creators do not have the skills to perform proper archival selection. Athanasios Velios offered a tool solution for digital artists called Artivity. Artivity can capture the actions performed on a digital artwork in certain programs, like Photoshop or Illustrator. This allows the artist to record their creative process and gives future researchers the opportunity to study it. Steph Taylor from CoSector suggested in her talk that creators are archivists now, because they are constantly appraising their digital collections and making selection decisions. It is important that archivists and digital preservation practitioners empower creators to make good decisions around what should be kept for the long term.

As a bonus to the conference, I was awarded the ‘Best Tweet’ award by the DPC and DPASSH. It was a nice way to round out two good, informative days. I plan to purchase many books with my gift voucher!

I certainly hope they hold the conference next year, as I think it is important for researchers in the humanities, arts and social sciences to engage with digital preservation experts, archivists and librarians. There is a lot to learn from each other. How often do we get our creators and users in one room with us digital preservation nerds?

Preserving research – update from the Cambridge Technical Fellow

Cambridge’s Technical Fellow, Dave, discusses some of the challenges and questions around preserving ‘research output’ at Cambridge University Library.


One of the types of content we’ve been analysing as part of our initial content survey has been labelled ‘research output’. We knew this was a catch-all term, but (according to the categories in Cambridge’s Apollo Repository), ‘research output’ potentially covers: “Articles, Audio Files, Books or Book Chapters, Chemical Structures, Conference Objects, Datasets, Images, Learning Objects, Manuscripts, Maps, Preprints, Presentations, Reports, Software, Theses, Videos, Web Pages, and Working Papers”. Oh – and of course, “Other”. Quite a bundle of complexity to hide behind one simple ‘research output’ label.

One of the categories in particular, ‘Dataset’, zooms the fractal of complexity in one step further. So far, we’ve only spoken in-depth to a small set of scientists (though our participation on Cambridge’s Research Data Management Project Group means we have a great network of people to call on). However, both meetings we’ve had indicate that ‘Datasets’ are a whole new Pandora’s box of complicated management, storage and preservation challenges.

However, if we pull back from the complexity a little, things start to clarify. One of the scientists we spoke to (Ben Steventon of the Steventon Group) presented a very clear picture of how his research ‘tiered’ the data his team produced: from 2-4 terabyte outputs from a Light Sheet Microscope (at the Cambridge Advanced Imaging Centre), via two intermediate layers of compression and modelling, to ‘delivery’ files only megabytes in size. One aspect of the challenge of preserving such research, then, would seem to be one of tiering preservation storage media to match the research design.
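
A crude sketch of what that tiering could mean in practice, using the layering just described as the shape of the example (the sizes echo the figures above; the tier names and policies are invented):

# Invented mapping from the 'layers' of a research design to storage policies.
# Sizes follow the example in the text: raw microscope output is terabytes,
# delivery files are megabytes.
TIER_POLICY = {
    "raw":          {"storage": "cold tape / offline", "copies": 1, "review_after_years": 5},
    "intermediate": {"storage": "nearline object store", "copies": 2, "review_after_years": 10},
    "delivery":     {"storage": "online, replicated", "copies": 3, "review_after_years": None},
}

def plan_storage(layer: str, size_gb: float) -> str:
    policy = TIER_POLICY[layer]
    review = (f"review in {policy['review_after_years']} years"
              if policy["review_after_years"] else "retain indefinitely")
    return f"{layer} ({size_gb:g} GB): {policy['storage']}, copies={policy['copies']}, {review}"

for layer, size in [("raw", 4000), ("intermediate", 120), ("delivery", 0.05)]:
    print(plan_storage(layer, size))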

(I believe our colleagues at the JISC, who Cambridge are working with on the Research Data Management Shared Service Pilot Project, may be way ahead of us on this…)

Of course, tiering storage is only one part of the preservation problem for research data: the same issues of acquisition and retention that have always been part of archiving still apply… But that’s perhaps where the ‘delivery’ layer of the Steventon Group’s research design starts to play a role. In 50 or 100 years’ time, which sets of the research data might people still be interested in? It’s obviously very hard to tell, but perhaps it’s more likely to be the research that underpins the key model: the major finding?

Reaction to the ‘delivered research’ (which included papers, presentations and perhaps three or four more from the list above) plays a big role, here. Will we keep all 4TBs from every Light Sheet session ever conducted, for the entirety of a five or ten-year project? Unlikely, I’d say. But could we store (somewhere cold, slow and cheap) the 4TBs from the experiment that confirmed the major finding?

That sounds a bit more within the realms of possibility, mostly because it feels as if there might be a chance that someone might want to work with it again in 50 years’ time. One aspect of modern-day research that makes me feel this might be true is the complexity of the dependencies between pieces of modern science, and the software it uses in particular (Blender, for example, or Fiji). One could be pessimistic here and paint a negative scenario: what if a major bug is found in one of those apps that calls into question the science ‘above it in the chain’? But there’s an optimistic view, here, too… What if someone comes up with an entirely new, more effective analysis method that replaces something current science depends on? Might there not be value in pulling the data from old experiments ‘out of the archive’ and re-running them with the new kit? What would we find?

We’ll be able to address some of these questions in a bit more detail later in the project. However, one of the more obvious things talking to scientists has revealed is that many of them seem to have large collections of images that need careful management. That seems quite relevant to some of the more ‘close to home’ issues we’re looking at right now in The Library.