Self-archiving the DPOC research outputs

The Digital Preservation at Oxford and Cambridge project ended on the 31st of December 2018. Although follow-on digital preservation projects are continuing at both organisations, the initial DPOC project itself has been wrapped up. This also means that activity on the www.dpoc.ac.uk blog and our Twitter hashtag (#dp0c) is being wound down.

To give the outputs from the DPOC project a good chance of remaining accessible in the future, we have been planning our ‘project funeral’ over the past few months. Keep on reading to find out how we archived the DPOC project’s research outputs and how you can access them in the future.

This blog has two sections:

  • Section 1: Archiving of external project outputs
  • Section 2: Archiving of internal project outputs

SECTION 1: EXTERNAL PROJECT OUTPUTS

Making use of our institutional repositories

The DPOC blog, a WordPress site maintained by Bodleian Libraries’ Systems and Services (BDLSS), has been used to disseminate external project outputs over the past 2.5 years. While the WordPress platform is among the less complex applications for BDLSS to maintain, it is still an application-based platform that requires ongoing maintenance, and this maintenance may alter the functionality, look and feel of the DPOC blog over time. It cannot be guaranteed that files uploaded to the blog will remain accessible and persistently citable. This is a known issue for research websites (even digital preservation ones!). For this reason, any externally facing project outputs have instead been deposited with our institutional repositories ORA (Oxford) and Apollo (Cambridge). The repositories, rather than the DPOC blog, are the natural homes for the project’s outputs.

The deposits to ORA and Apollo include datasets, reports, abstracts, chapters and posters created by the DPOC Fellows. A full list of externally available outputs can be found on our resource page, or by searching for the keyword “DPOC” on ORA and Apollo.

Image Caption: Public data sets, journals, and other research outputs from the DPOC project can be accessed through Apollo and ORA

 

Archiving our social media

One of the deposited datasets covers our social media activities. The social media dataset contains exports of all WordPress blog posts, social media statistics, and Twitter data.

A full list of Tweets which have used the #dp0c tag between August 2016 and February 2019 can be downloaded by external users from ORA. Due to Twitter’s Terms of Service, only Tweet identifiers are available as part of the public dataset. However, full Tweets generated by the project team have also been retained under embargo for internal staff use only.

As part of wrapping up the DPOC project, the blog will also be amended to reflect that it is no longer actively updated. However, as we want to keep a record of the original look of the site before these edits, Bodleian Libraries’ Electronic Manuscripts and Archives are currently crawling the site. To view an archived version of dpoc.ac.uk please visit Bodleian Libraries’ archive.it page.

 


SECTION 2: INTERNAL PROJECT DOCUMENTATION

Appraising internal project documentation

Over the past 2.5 years the DPOC project has created a large body of internal documentation as an outcome of its research activities. We wanted to choose wisely what documentation to keep and what documentation to dispose of, so that other library staff can easily navigate and make use of the project outputs.

The communication plan which was created at the start of the project was valuable in the appraisal process, helping us both locate content and make decisions about what to keep. Our communication plan listed:

  1. How project decisions would be recorded
  2. How different communication platforms and project management tools (such as SharePoint, Asana and Slack) would be used and backed up
  3. And which standards for file naming and versioning the Fellows would use

 

Accessing internal project documentation

In October-December both organisations appraised the content which was on the joint DPOC SharePoint site, and moved material of enduring value into local SharePoint instances for each institution. This way the documentation could be made available to other library staff rather than DPOC project members only.

We had largely followed the file naming standards outlined in the communication plan, but work was still required to manually clean up some file names. Additional contextualising descriptions were added to make content more easily understandable by staff who have not previously come across the project.

Image Caption: SharePoint

Oxford also used its departmental Confluence page which integrates with the SharePoint instance. Code written during the project is managed in GitLab.

Image Caption: Confluence


SUCCESSION PLANNING

Oxford: Although some of the DPOC Fellows are continuing work on other digital preservation related projects at Bodleian Libraries, ownership of documents, repository datasets and the WordPress website was formalised and assigned to the Head of Digital Collections and Preservation. This role (or the successor of this role) will make curatorial and preservation decisions about any DPOC project outputs managed by Bodleian Libraries.

Cambridge: Preservation activities will continue at CUL following on from the DPOC project in 2019. Questions regarding DPOC datasets and internal documentation hosted at CUL should be addressed to digitialpreservation[AT]lib.cam[DOT]ac.uk


SUMMARY

  • For a list of publicly available project outputs, please visit the resource page or search for the keyword “DPOC” on ora.ox.ac.uk and repository.cam.ac.uk
  • An archived version of dpoc.ac.uk is available through Bodleian Libraries’ modern archives. Alternatively, the UK Web Archive and the Internet Archive also store crawled versions of the site.
  • If you are a CUL member of staff looking for internal project documentation, please contact  digitialpreservation[AT]lib.cam[DOT]ac.uk
  • If you are a Bodleian Libraries member of staff looking for internal project documentation, please contact digitalpreservation[AT]bodleian.ox[DOT]ac.uk

Planning your (digital) funeral: for projects

Cambridge Policy & Planning Fellow, Somaya, writes about her paper and presentation from the Digital Cultural Heritage Conference 2017. The conference paper, Planning for the End from the Start: an Argument for Digital Stewardship, Long-Term Thinking and Alternative Capture Approaches, argues for considering digital preservation at the start of a digital humanities project and provides useful advice for digital humanities researchers to use in their current projects.


In August I presented at the Digital Cultural Heritage 2017 international conference in Berlin (incidentally, my favourite city in the whole world).

Berlin – view from the river Spree. Photo: Somaya Langley

I presented the Friday morning Plenary session on Planning for the End from the Start: an Argument for Digital Stewardship, Long-Term Thinking and Alternative Capture Approaches. Otherwise known as: ‘planning for your funeral when you are conceived’. This presentation reflects challenges faced by both Oxford and Cambridge, and the thinking behind it was done collaboratively with my Oxford Policy & Planning counterpart, Edith Halvarsson.

We decided it was a good idea to present on this topic to an international digital cultural heritage audience, who are likely to experience challenges similar to those of our own researchers. It is based on some common digital preservation use cases that we are finding in each of our universities.

The Scenario

A Digital Humanities project receives project funding and develops a series of digital materials as part of the research project, and potentially some innovative tools as well. For one reason or another, ongoing funding cannot be secured and so the PIs/project team need to find a new home for the digital outputs of the project.

Example Cases

We have numerous examples of these situations at Cambridge and Oxford. Many projects containing digital content that needs to be ‘rehoused’ are created in the online environment, typically as websites. Some examples include:

Holistic Thinking

We believe that thinking holistically right at the start of a project can provide options further down the line, should an unfavourable funding outcome be received.

So it is important to consider holistic thinking, specifically a Digital Stewardship approach (incorporating Digital Curation & Digital Preservation).

Models for Preservation

Digital materials don’t necessarily exist in a static form and often they don’t exist in isolation. It’s important to think about digital content as being part of a lifecycle and managed by a variety of different workflows. Digital materials are also subject to many risks so these also need to be considered.

Some models to frame thinking about digital materials:

Documentation

It is incredibly important to document your project and, when handing over responsibility for your digital materials and data, to hand over that documentation as well. Whoever becomes responsible for hosting or preserving your digital project will need to rely on this information. It is equally important to ensure the implementation of standards, metadata schemas, persistent identifiers etc.

This can include providing associated materials, such as:

Data Management Plans

Some better uses of Data Management Plans (DMPs) could be:

  • Submitting DMPs alongside the data
  • Writing DMPs as dot-points rather than prose
  • Including Technical Specifications such as information about code, software, software versions, hardware and other dependencies

An example of a DMP from Cambridge University’s Dr Laurent Gatto: Data Management Plan for a Biotechnology and Biological Sciences Research Council

Borrowing from Other Disciplines

Rather than having to ‘rebuild the wheel’, we should also consider borrowing from other disciplines. For example, borrowing from the performing arts we might provide similar documents and information such as:

  • Technical Rider (a list of requirements for staging a music gig or theatre show)
  • Stage Plots (layout of instruments, performers and other equipment on stage)
  • Input Lists (ordered list of the different audio channels from your instruments/microphones etc. that you’ll need to send to the mixing desk)

For digital humanities projects and other complex digital works, providing simple and straightforward information about data flows (including inputs and outputs) will greatly assist digital preservationists in determining where something has broken in the future.

Several examples of Technical Riders can be found here:

Approaches

Here are some approaches to consider for the interim digital preservation of digital materials:

Bundling & Bitstream Preservation

The simplest and most basic approach may be to just zip up files and undertake bitstream preservation. Bitstream preservation only ensures that the zeroes and ones that went into a ‘system’ come out as the same zeroes and ones. Nothing more.
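As a rough illustration of that idea, the sketch below zips a project folder, records a SHA-256 checksum for the resulting bundle, and later checks that the bundle still matches it. This is only a minimal sketch: the function names (`bundle`, `is_unchanged`) and the choice of SHA-256 are assumptions made for this example, not something prescribed by any particular preservation system.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle(source_dir: Path, archive: Path) -> str:
    """Zip every file under source_dir and return the archive's checksum.

    The archive should live outside source_dir so it is not zipped into itself.
    """
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for item in sorted(source_dir.rglob("*")):
            if item.is_file():
                zf.write(item, item.relative_to(source_dir).as_posix())
    return sha256_of(archive)


def is_unchanged(archive: Path, recorded_checksum: str) -> bool:
    """Fixity check: do the stored bits still match the recorded checksum?"""
    return sha256_of(archive) == recorded_checksum


def demo() -> None:
    """Bundle a throwaway folder and verify it (hypothetical file names)."""
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"
        project.mkdir()
        (project / "readme.txt").write_text("project documentation")
        archive = Path(tmp) / "project.zip"
        checksum = bundle(project, archive)
        print("bundle intact:", is_unchanged(archive, checksum))


if __name__ == "__main__":
    demo()
```

Note that a matching checksum says nothing about whether the files inside the bundle can still be opened or understood, which is exactly the limitation of bitstream preservation described above.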

Exporting / Migrating

Consider exporting digital materials and/or data plus metadata into recognised standards as a means of migrating into another system.

For databases, the SIARD (Software Independent Archiving of Relational Databases) standard may be of use.

Hosting Code

Consider hosting code within your own institutional repository or digital preservation system (if your organisation has access to this option) or somewhere like GitHub or other services.

Packing it Down & ‘Putting on Ice’

You may need to consider ‘packing up’ your digital materials in a way that lets you ‘put them on ice’, so that, when funding is secured in the future, they can be brought back to life relatively simply.

An example of this is the work that Peter Sefton, from the University of Sydney in Australia, has been trialling. Based on Omeka, he has created a version of code called OzMeka. This is an attempt at a standardised way of being able to handle research project digital outputs that have been presented online. One example of this is Dharmae.

Alternatively, King’s Digital Lab provides infrastructure for eResearch and Digital Humanities projects that ensures the foundations of digital projects are stable from the get-go and mitigates risks regarding the longer-term sustainability of digital content created as part of the projects.

Maintaining Access

This could be done through traditional web archiving approaches, such as using web archiving tools (Heritrix or HTTrack) or downloading video materials using Video Download Helper. Alternatively, if you are part of an institution, the Internet Archive’s Archive-It service may be something you want to consider, working with your institution to implement it.

Hosted Infrastructure Arrangements

Another option is to find an organisation willing to take on the hosting of your service. If you do manage to negotiate this, you will need to put in place either a contract or a Memorandum of Understanding (MOU), as well as hand over the various documentation mentioned earlier.

Video Screen Capture

A simple way of attempting to document a journey through a complex digital work (not necessarily online, this can apply to other complex interactive digital works as well) may be to record a Video Screen Capture.

Kymata Atlas – Video Screen Capture still

Alternatively, you can record a journey through an interactive website using Webrecorder, developed by Rhizome, which produces WARC web archive files.

Documenting in Context

Another means of understanding complex digital objects is to document the work in the context in which it was experienced. One example of this is the work of Robert Sakrowski and Constant Dullaart, netart.database.

An example of this is the work of Dutch and Belgian net.artists JODI (Joan Heemskerk & Dirk Paesmans) shown here.

JODI – netart.database

Borrowing from documenting and archiving in the arts, an approach of ‘documenting around the work’ might be suitable – for example, photographing and videoing interactive audiovisual installations.

Web Archives in Context

Another opportunity to understand websites – if they have been captured by the Internet Archive – is viewing these websites using another tool developed by Rhizome, oldweb.today.

An example of the Cambridge University Library website from 1997, shown in a Netscape 3.04 browser.

Cambridge University Library website in 1997 via oldweb.today

Conclusions

While there is no one perfect solution and each approach has its own pros and cons, combining different methods might keep your digital materials available beyond the lifespan of your project. These methods will help ensure that digital material is suitably documented, preserved and potentially accessible – so that both you and others can use the data in an ongoing manner.

Consider:

  • How do you want to preserve the data?
  • How do you want to provide access to your digital material?
  • Developing a strategy including several different methods.

Finally, I think this excerpt is relevant to how we approach digital stewardship and digital preservation:

“No man is an island entire of itself; every man is a piece of the continent, a part of the main” – Meditation XVII, John Donne

We are all in this together and, rather than each having to troubleshoot alone and build our own separate solutions, it would be great if we could work to our strengths in collaborative ways, while sharing our knowledge and skills with others.

Six Priority Digital Preservation Demands

Somaya Langley, Cambridge Policy and Planning Fellow, talks about her top 6 demands for a digital preservation system.


Photo: Blazej Mikula, Cambridge University Library

As a former user of one digital preservation system (Ex Libris’ Rosetta), I have spent a few years frustrated by the gap between what activities need to be done as part of a digital stewardship end-to-end workflow – including packaging and ingesting ‘information objects’ (files and associated metadata) – and the maturity level of digital preservation systems.

Digital Preservation Systems Review

At Cambridge, we are looking at different digital preservation systems and what each one can offer. This has involved talking to both vendors and users of systems.

When I’m asked what my top current or future requirements for a digital preservation system are, it’s excruciatingly hard to limit myself to a handful of things. However, having previously been involved in a digital preservation system implementation project, there are some high-level takeaways from past experiences that remain with me.

Shortlist

Here’s the current list of my six top ‘digital preservation demands’ (aka user requirements):

Integration (with various other systems)

A digital preservation ‘system’ is only one cog in a wheel within a much larger machine; one piece of a much larger puzzle. There is an entire ‘digital ecosystem’ that this ‘system’ should exist within, and end-to-end digital stewardship workflows are of primary importance. The right amount of metadata and/or files should flow from one system to another. We must also know where the ‘source of truth’ is for each bit.

Standards-based

This seems like a no-brainer. We work in Library Land. Libraries rely on standards. We also work with computers and other technologies that also require standard ways (protocols etc.) of communicating.

For files and metadata to flow from one system to another – whether via import, ingest, export, migration or an exit strategy from a system – we already spend a bunch of time creating mappings and crosswalks from one standard (or implementation of a standard) to another. If we don’t use (or fully implement) existing standards, this means we risk mangling data, context or meaning; potentially losing or not capturing parts of the data; or just wasting a whole lot of time.

Error Handling (automated, prioritised)

There’s more work to be done in managing digital materials than there are people to do it. Content creation is increasing at exponential rates; meanwhile, the number of staff (with the right skills) just isn’t. We have to be smart about how we work. This requires prioritisation.

We need to have smarter systems that help us. This includes helping to prioritise where we focus our effort. Digital preservation systems are increasingly incorporating new third-party tools. We need to know which tool reports each error and whether these errors are show-stoppers or not. (For example: is the content no longer renderable versus a small piece of non-critical descriptive metadata that is missing?) We have to accept that, for some errors, we will never get around to addressing them.
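To make the idea of prioritised error handling a little more concrete, here is a minimal sketch of how errors reported by different tools could be ranked so that show-stoppers surface first and the rest are explicitly deferred. The severity levels, tool names and object identifiers are all made up for illustration; no real digital preservation system is being described.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SHOW_STOPPER = 0  # e.g. the content can no longer be rendered
    MAJOR = 1
    MINOR = 2         # e.g. a non-critical descriptive field is missing


@dataclass(frozen=True)
class ToolError:
    tool: str          # which third-party tool reported the error
    object_id: str
    message: str
    severity: Severity


def triage(errors, capacity):
    """Split errors into (to_fix, deferred), most severe first.

    `capacity` is how many errors staff can realistically address.
    """
    ranked = sorted(errors, key=lambda e: (e.severity, e.tool, e.object_id))
    return ranked[:capacity], ranked[capacity:]


# Hypothetical errors from two hypothetical tools.
errors = [
    ToolError("metadata-validator", "obj-002", "description missing", Severity.MINOR),
    ToolError("format-identifier", "obj-001", "file cannot be rendered", Severity.SHOW_STOPPER),
    ToolError("metadata-validator", "obj-003", "date malformed", Severity.MAJOR),
]
to_fix, deferred = triage(errors, capacity=2)
for err in to_fix:
    print(f"{err.severity.name}: {err.object_id} ({err.tool}) {err.message}")
```

Keeping the deferred list, rather than discarding it, reflects the point above: some errors will never be addressed, but we should still know they exist and which tool raised them.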

Reporting

We need to be able to report to different audiences. The different types of reporting classes include (but are not limited to):

  1. High-level reporting – annual reports, monthly reports, reports to managers, projections, costings etc.
  2. Collection and preservation management reporting – reporting on successes and failures, overall system stats, rolling checksum verification etc.
  3. Reporting for preservation planning purposes – based on preservation plans, we need to be able to identify subsections of our collection (configured around content types, context, file format and/or whatever other parameters we choose to use) and report on potential candidates that require some kind of preservation action.
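The third kind of report can be sketched in a few lines: count the collection by file format, then list the objects that fall into formats a preservation plan has flagged. This is only a sketch under assumptions: the record fields (`id`, `format`, `content_type`), the sample holdings and the "at risk" set are invented for this example.

```python
from collections import Counter


def format_counts(holdings):
    """Count how many objects are held in each file format."""
    return Counter(item["format"] for item in holdings)


def preservation_candidates(holdings, at_risk_formats, content_type=None):
    """List object ids in at-risk formats, optionally within one content type."""
    return [
        item["id"]
        for item in holdings
        if item["format"] in at_risk_formats
        and (content_type is None or item["content_type"] == content_type)
    ]


# Hypothetical holdings; a real system would query its own catalogue.
holdings = [
    {"id": "obj-001", "format": "application/pdf", "content_type": "report"},
    {"id": "obj-002", "format": "application/x-shockwave-flash", "content_type": "website"},
    {"id": "obj-003", "format": "image/tiff", "content_type": "image"},
    {"id": "obj-004", "format": "application/x-shockwave-flash", "content_type": "artwork"},
]
# Assumption for illustration: the preservation plan flags this format.
at_risk = {"application/x-shockwave-flash"}

print(format_counts(holdings).most_common())
print(preservation_candidates(holdings, at_risk))
print(preservation_candidates(holdings, at_risk, content_type="website"))
```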

Provenance

We need to best support – via metadata – where a file has come from. This, for want of a better approach, is currently being handled by the digital preservation community through documenting changes as Provenance Notes. Digital materials acquired into our collections are not just the files, they’re also the metadata. (Hence, why I refer to them as ‘information objects’.) When an ‘information object’ has been bundled, and is ready to be ingested into a system, I think of it as becoming an ‘information package’.

There’s a lot of metadata (administrative, preservation, structural, technical) that appears along the path from an object’s creation until the point at which it becomes an ‘information package’. We need to ensure we’re capturing and retaining the important components of this metadata. Those components we deem essential must travel alongside their associated files into a preservation system. (Not all files will have any or even the right metadata embedded within the file itself.) Standardised ways of handling information held in Provenance Notes (whether these are from ‘outside of the system’ or created by the digital preservation system) and event information, so that it can be interrogated and reported on, are crucial.
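One way to picture provenance information that can be interrogated and reported on is as a list of small structured event records rather than free-text notes. The sketch below is an assumption-laden illustration: the field names and event types are invented here and are not taken from any metadata standard or existing system.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    object_id: str
    event_type: str   # e.g. "transfer", "checksum-verified" (illustrative values)
    agent: str        # person or tool responsible for the event
    timestamp: str    # ISO 8601 in UTC, so plain string sorting is chronological
    source: str       # "external" (before ingest) or "system"
    note: str = ""


def history(events, object_id):
    """Return one object's events in chronological order."""
    own = [e for e in events if e.object_id == object_id]
    return sorted(own, key=lambda e: e.timestamp)


def to_json(events):
    """Serialise events so they can travel alongside the files."""
    return json.dumps([asdict(e) for e in events], indent=2)


# Hypothetical events for two hypothetical objects.
events = [
    ProvenanceEvent("obj-001", "checksum-verified", "ingest-tool",
                    "2018-11-02T09:30:00Z", "system"),
    ProvenanceEvent("obj-001", "transfer", "donor",
                    "2018-10-15T14:00:00Z", "external", "received on USB drive"),
    ProvenanceEvent("obj-002", "transfer", "donor",
                    "2018-10-15T14:05:00Z", "external"),
]
print(to_json(history(events, "obj-001")))
```

Because external and system-generated events share one shape, a report can ask the same question of both, for example "which objects have no recorded transfer event?".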

Managing Access Rights

Facilitating access is not black and white. Collections are not simply ‘open’ or ‘closed’. We have a myriad of ways that digital material is created and collected; we need to ensure we can provide access to this content in a variety of ways that support both the content and our users. This can include access from within an institution’s building, via a dedicated computer terminal, online access to anyone in the world, mediated remote access, access to only subsets of a collection, support for embargo periods, ensuring we respect cultural sensitivities or provide access to only the metadata (perhaps as large datasets) and more.
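The point that access is more than ‘open’ or ‘closed’ can be illustrated with a small decision function that combines an access mode with an optional embargo date. This is a simplified sketch: the mode names, the `AccessPolicy` class and the rule that mediated access always needs staff approval are assumptions for this example, and real policies (for instance around cultural sensitivities) need far richer handling.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Illustrative access modes, not drawn from any real system.
MODES = {"open", "onsite", "mediated", "metadata_only", "closed"}


@dataclass(frozen=True)
class AccessPolicy:
    mode: str
    embargo_until: Optional[date] = None

    def __post_init__(self):
        if self.mode not in MODES:
            raise ValueError(f"unknown access mode: {self.mode}")


def can_view_content(policy, user_location, today):
    """Decide whether a user may view the content itself right now."""
    if policy.embargo_until is not None and today < policy.embargo_until:
        return False  # still under embargo
    if policy.mode == "open":
        return True
    if policy.mode == "onsite":
        return user_location == "onsite"
    # "mediated" needs staff approval; "metadata_only" and "closed" never
    # expose the content through this route.
    return False


def can_view_metadata(policy):
    """Metadata stays visible unless the material is fully closed."""
    return policy.mode != "closed"


embargoed = AccessPolicy("open", embargo_until=date(2030, 1, 1))
print(can_view_content(embargoed, "remote", date(2025, 6, 1)))
print(can_view_content(AccessPolicy("onsite"), "onsite", date(2025, 6, 1)))
```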

We must set a goal of working towards providing access to our users in the many different (and new-ish) ways they actually want to use our content.

It’s imperative to keep in mind the whole purpose of preserving digital materials is to be able to access them (in many varied ways). Provision of content ‘viewers’ and facilitating other modes of access (e.g. to large datasets of metadata) are essential.

Final note: I never said addressing these concerns was going to be easy. We need to factor each in and make iterative improvements, one step at a time.