Preserving research – update from the Cambridge Technical Fellow

Cambridge’s Technical Fellow, Dave, discusses some of the challenges and questions around preserving ‘research output’ at Cambridge University Library.


One of the types of content we’ve been analysing as part of our initial content survey has been labelled ‘research output’. We knew this was a catch-all term, but (according to the categories in Cambridge’s Apollo Repository), ‘research output’ potentially covers: “Articles, Audio Files, Books or Book Chapters, Chemical Structures, Conference Objects, Datasets, Images, Learning Objects, Manuscripts, Maps, Preprints, Presentations, Reports, Software, Theses, Videos, Web Pages, and Working Papers”. Oh – and of course, “Other”. Quite a bundle of complexity to hide behind one simple ‘research output’ label.

One of the categories in particular, ‘Dataset’, zooms the fractal of complexity in one step further. So far, we’ve only spoken in depth to a small set of scientists (though our participation in Cambridge’s Research Data Management Project Group means we have a great network of people to call on). However, both meetings we’ve had indicate that ‘Datasets’ are a whole new Pandora’s box of complicated management, storage and preservation challenges.

However – if we pull back from the complexity a little, things start to clarify. One of the scientists we spoke to (Ben Steventon at the Steventon Group) presented a very clear picture of how his research ‘tiered’ the data his team produced, from 2-4 terabyte outputs from a Light Sheet Microscope (at the Cambridge Advanced Imaging Centre), via two intermediate layers of compression and modelling, to ‘delivery’ files only megabytes in size. One aspect of the challenge of preserving such research, then, would seem to be one of tiering preservation storage media to match the research design.

(I believe our colleagues at the JISC, who Cambridge are working with on the Research Data Management Shared Service Pilot Project, may be way ahead of us on this…)

Of course, tiering storage is only one part of the preservation problem for research data: the same issues of acquisition and retention that have always been part of archiving still apply… But that’s perhaps where the ‘delivery’ layer of the Steventon Group’s research design starts to play a role. In 50 or 100 years’ time, which sets of the research data might people still be interested in? It’s obviously very hard to tell, but perhaps it’s more likely to be the research that underpins the key model: the major finding?

Reaction to the ‘delivered research’ (which included papers, presentations and perhaps three or four more from the list above) plays a big role, here. Will we keep all 4TBs from every Light Sheet session ever conducted, for the entirety of a five or ten-year project? Unlikely, I’d say. But could we store (somewhere cold, slow and cheap) the 4TBs from the experiment that confirmed the major finding?

That sounds a bit more within the realms of possibility, mostly because it feels as if there might be a chance that someone might want to work with it again in 50 years’ time. One aspect of modern-day research that makes me feel this might be true is the complexity of the dependencies between pieces of modern science, and the software it uses in particular (Blender, for example, or Fiji). One could be pessimistic here and paint a negative scenario: what if a major bug is found in one of those apps that calls into question the science ‘above it in the chain’? But there’s an optimistic view here, too… What if someone comes up with an entirely new, more effective analysis method that replaces something current science depends on? Might there not be value in pulling the data from old experiments ‘out of the archive’ and re-running them with the new kit? What would we find?

We’ll be able to address some of these questions in a bit more detail later in the project. However, one of the more obvious things talking to scientists has revealed is that many of them seem to have large collections of images that need careful management. That seems quite relevant to some of the more ‘close to home’ issues we’re looking at right now in The Library.

When was that?: Maintaining or changing ‘created’ and ‘last modified’ dates

Sarah has recently been testing scenarios to investigate the question of changes in file ‘date created’ and ‘last modified’ metadata. When building training, it’s always best to test out your advice before giving it, and below is the result of Sarah’s research, with helpful screenshots.


Before doing some training that involved teaching better recordkeeping habits to staff, I ran some tests to be sure that I was giving the right advice when it came to created and last modified dates. I am often told by people in the field that these dates are always subject to change – but are they really? I knew I would tell staff to put created dates in file names or in document headers in order to retain that valuable information, but could the file maintain the correct embedded date anyway? I set out to test a number of scenarios on both my Mac OS X laptop and Windows desktop.
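
For anyone who wants to check these dates for themselves before and after a transfer, below is a minimal sketch of reading them programmatically. It uses Python, which is my assumption about your environment rather than anything used in these tests; note that the ‘created’ date lives in different places on different systems (st_birthtime on Mac OS X, st_ctime on Windows).

    import os
    import sys
    from datetime import datetime

    def file_dates(path):
        """Return (created, last modified) for a file, as reported by the operating system."""
        info = os.stat(path)
        modified = datetime.fromtimestamp(info.st_mtime)
        if hasattr(info, "st_birthtime"):      # Mac OS X exposes a true creation time
            created = datetime.fromtimestamp(info.st_birthtime)
        else:                                  # on Windows, st_ctime holds the creation time
            created = datetime.fromtimestamp(info.st_ctime)
        return created, modified

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            created, modified = file_dates(path)
            print(path)
            print("  created:      ", created)
            print("  last modified:", modified)

Running this on a file before a transfer, and again afterwards, gives a quick way of confirming whether a scenario has altered either date.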

Scenario 1: Downloading from cloud storage (Google Drive)

This was an ALL DATES change for both Mac OS X and Windows.

Scenario 2: Uploading to cloud storage (Google Drive)

Once again this was an ALL DATES change for both systems.

Note: I trialled this a second time with the Google Drive for PC application and in OS X, and found that created and last modified dates do not change when the file is uploaded to or downloaded from the Google Drive folder on the PC. However, when viewed in Google Drive via the website, the created date shown is different (the date/time of upload), though the ‘file info’ will confirm the date has not changed. Just to complicate things.

Scenario 3: Transfer from a USB

Mac OS X had no change to the dates. Windows showed an altered created date, but maintained the original last modified date.

Scenario 4: Transfer to a USB

Once again there was no change to the dates in Mac OS X. Windows showed an altered created date, but maintained the original last modified date.

Note: I looked into scenarios 3 and 4 for Windows a bit further and saw that Robocopy, an option run from the command prompt, will allow directories to be copied across while maintaining those date attributes. I copied a ‘TEST’ folder containing the file from the Windows computer to the USB, and back again. It did what was promised and there were no changes to either date in the file. It is a bit annoying that an extra step is required (one that many people would find technically challenging and therefore avoid).
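
For anyone who wants to script that extra step, here is a minimal sketch of wrapping Robocopy in Python so the copy can be repeated consistently. The flags shown (/E for subfolders, /COPY:DAT and /DCOPY:T to carry file and folder timestamps across) and the drive letters are my assumptions and worth checking against robocopy /? on your own machine.

    import subprocess

    def robocopy_keeping_dates(source, destination):
        """Copy a folder with Robocopy, asking it to keep file and folder timestamps."""
        cmd = [
            "robocopy", source, destination,
            "/E",         # include subfolders, even empty ones
            "/COPY:DAT",  # copy Data, Attributes and Timestamps for each file
            "/DCOPY:T",   # copy folder timestamps as well
        ]
        # Robocopy exit codes 0-7 indicate success; 8 and above indicate failure
        return subprocess.run(cmd).returncode < 8

    if __name__ == "__main__":
        ok = robocopy_keeping_dates(r"C:\TEST", r"E:\TEST")
        print("Copy succeeded" if ok else "Copy reported errors")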

Scenario 5: Moving between folders

No change on either system. This was a relief for me considering how often I move files around my directories.

Conclusions

When in doubt (and you should always be in doubt), test the scenario. Even when I tested these scenarios three or four times, they did not always come out with the same result. That alone should make one cautious. I still stick to putting the created date in the file name and in the document itself (where possible), but it doesn’t mean I always receive documents that way.

Creating a zip of files/folders before transfer is one method of preserving dates, but I had some weird issues trying to unzip the file in cloud storage that took a few tries before the dates remained preserved. It is also possible to use Quickhash for transferring files unchanged (and it generates a checksum).

I ignored the last accessed date during testing, because it was too easy to accidentally double-click a file and change it (as you can see happened to my Windows 7 test version).

Has anyone tested any other scenarios to assess when file dates are altered? Does anyone have methods for transferring files without causing any change to dates?

An approach to selecting case studies

Cambridge Policy & Planning Fellow, Somaya, writes about a case study approach developed by the Cambridge DPOC Fellows for CUL. Somaya’s first blog post about the case studies looks at the selection methodology the Cambridge DPOC fellows used to choose their final case studies.


Physical format digital carriers. Photo: Somaya Langley

Background & approach

Cambridge University Library (CUL) has moved to a ‘case study’ approach to the project. The case studies will provide an evidence-based foundation for writing a policy and strategy, developing a training programme and writing technical requirements within the time constraints of the project. The case studies we choose for the DPOC project will enable us to test the hands-on, day-to-day tasks necessary for working with digital collection materials at CUL. They also need to be representative of our existing collections and future acquisitions, our Collection Development Policy Framework and Strategic Plan, and our current and future audiences, while considering the ‘preservation risk’ of the materials.

Classes of material

Based on the digital collections surveying work I’ve been doing, our digital collections fall into seven different ‘classes’:

  1. Unpublished born-digital materials – personal and corporate papers, digital archives of significant individuals or institutions
  2. Born-digital university archives – selected records of the University of Cambridge
  3. Research outputs – research data and publications (including compliance)
  4. Published born-digital materials – physical format carriers (optical media), eBooks, web archives, archival and access copies of electronic subscription services, etc.
  5. Digitised image materials – 2D photography (and 3D imaging)
  6. Digital (and analogue) audiovisual materials – moving image (film and video) and sound recordings
  7. In-house created content – photography and videography of events, lectures, photos of conservation treatments, etc.

Proposed case studies

Approximately 40 potential case studies suggested by CUL and Affiliated Library staff were considered. These proposed case studies were selected from digital materials in our existing collections, current acquisition offers, and requests for assistance with digital collection materials from across Cambridge University. Each proposed case study would allow us to trial different tools (and digital preservation systems), approaches and workflow stages, and would represent different ‘classes’ of material.

Digital lifecycle stages

The selected stages are based on a draft Digital Stewardship End-to-End Workflow I am developing. The workflow includes approximately a dozen different stages. It is based on the Digital Curation Centre’s Curation Lifecycle Model, and is also aligned with the Digital POWRR (Preserving Digital Objects with Restricted Resources) Tool Evaluation Grid.

There are also additional essential concerns, including:

  • data security
  • integration (with CUL systems and processes)
  • preservation risk
  • remove and/or delete
  • reporting
  • resources and resourcing
  • system configuration

Selected stages for Cambridge’s case studies

Dave, Lee and I discussed the stages and cut the list down to the bare minimum required to test out various tasks as part of the case studies. These stages include:

  1. Appraise and Select
  2. Acquire / Transfer
  3. Pre-Ingest (including Preconditioning and Quality Assurance)
  4. Ingest (including Generate Submission Information Package)
  5. Preservation Actions (sub-component of Preserve)
  6. Access and Delivery
  7. Integration (with Library systems and processes) and Reporting

Case study selection

In order to produce a shortlist, I needed to work out the parameter best suited to ranking the proposed case studies from a digital preservation perspective. The initial parameter we decided on was complexity. Did the proposed case study provide enough technical challenges to fully test out what we needed to research?

We also took into account a Streams Matrix (still in development) that outlines the different tasks undertaken at each of the selected digital lifecycle stages. This would ensure different variations of activities were factored in at each stage.

We revisited the case studies once in ranked order and reviewed them, taking into account additional parameters. The additional parameters included:

  • Frequency and/or volume – how much of this type of material do we have/are we likely to acquire (i.e. is this a type of task that would need to be carried out often)?
  • Significance – how significant is the collection in question?
  • Urgency – does this case study fit within strategic priorities such as the current Cambridge University Library Strategic Plan and Collection Development Policy Framework etc.?
  • Uniqueness – is the case study unique and would it be of interest to our users (e.g. the digital preservation field, Cambridge University researchers)?
  • Value to our users and/or stakeholders – is this of value to our current and future users, researchers and/or stakeholders?

This produced a shortlist of eight case studies. We concluded that each presented different long-term digital preservation issues and carried a considerable degree of ‘preservation risk’.

Conclusion

This was a challenging and time-consuming approach; however, it ensured fairness in the selection process. The case studies will give us tangible evidence in which to ground the work of the rest of the project. The Cambridge University Library Polonsky Digital Preservation Project Board have agreed that we will undertake three case studies: a digitisation case study, a born-digital case study and one more – the details of which are still being discussed. Stay tuned for more updates.

Validating half a million TIFF files. Part One.

Oxford Technical Fellow, James, reports on the validation work he is doing with JHOVE and DPF Manager in Part One of this blog series on validation tools for auditing the Polonsky Digitization Project’s TIFF files.


In 2013, The Bodleian Libraries of the University of Oxford and the Biblioteca Apostolica Vaticana (Vatican Library) joined efforts in a landmark digitization project. The aim was to open up their repositories of ancient texts, including Hebrew manuscripts, Greek manuscripts and incunabula (15th-century printed books), and to digitize over one and a half million pages. All of this was made possible by funding from the Polonsky Foundation.

As part of our own Polonsky funded project, we have been preparing the ground to validate over half a million TIFF files which have been created from digitization work here at Oxford.

Many in the digital preservation field have already written articles and blog posts on the tools available for validating TIFF files; Yvonne Tunnat (from ZBW Leibniz Information Centre for Economics) wrote a blog post for the Open Preservation Foundation regarding the tools. I also had the pleasure of hearing Yvonne and Michelle Lindlar (from TIB Leibniz Information Centre for Science and Technology) discuss this very subject in more detail at the IDCC 2017 conference, in their talk on JHOVE, How Valid Is Your Validation? A Closer Look Behind The Curtain Of JHOVE.

The go-to validator for TIFF files?

Preparation for validation

In order to validate the master TIFF files, we first needed to retrieve them from our tape storage system; fortunately, around two-thirds of the images had already been restored to spinning disk storage as part of another internal project. When the master TIFF files were written to tape, MD5 hashes of the files were included, so as part of this validation work we will also confirm the fixity of all the files. Our network storage system had plenty of room to accommodate all the required files, so we began auditing what still needed to be recovered.

Whilst the auditing and retrieval were progressing, I set about validating a sample set of master TIFF files using both JHOVE and DPF Manager, to get an estimate of the time it would take to process the approximately 50 TB of files. I was also interested to compare the results of both tools when faced with invalid or corrupted sample sets of files.

We set up a new virtual machine server in order to carry out the validation workload; this allowed us to scale the machine’s performance as required. Both validation tools were going to be run in a Red Hat Linux environment, from the command line.
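
To give a flavour of how a batch run can be handled, below is a minimal sketch of driving JHOVE over a folder of TIFFs from Python. It is not our production script, and the directory paths and the JHOVE options shown (-m TIFF-hul for the TIFF module, -h XML and -o for an XML report) are assumptions to be checked against the version installed locally.

    import pathlib
    import subprocess

    TIFF_DIR = pathlib.Path("/storage/master_tiffs")    # hypothetical location of the masters
    REPORT_DIR = pathlib.Path("/storage/jhove_reports")
    REPORT_DIR.mkdir(parents=True, exist_ok=True)

    for tiff in sorted(TIFF_DIR.rglob("*.tif*")):
        report = REPORT_DIR / (tiff.stem + ".xml")
        # Validate with the TIFF module and write one XML report per file
        subprocess.run(
            ["jhove", "-m", "TIFF-hul", "-h", "XML", "-o", str(report), str(tiff)],
            check=False,
        )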

It quickly became clear that JHOVE was going to be able to validate the TIFF files a lot quicker than DPF Manager. If DPF Manager is being used as part of one of your workflows, you may not have noticed any real time penalty when processing small numbers of files; however, with a large batch, the time difference between the two tools was noticeable.

Potential alternative for TIFF validation?

During the testing I noticed there were several issues with DPF Manager, including not being able to specify the number of threads the process could use, which I suspect resulted in the poor initial performance. I dutifully reported the bug on the DPF community GitHub and was pleased to see an almost instant response stating that it would be resolved in the next monthly release. I do love Open Source projects, and I think this highlights the importance of those using the tools taking responsibility for improving them. Without community engagement, these projects are liable to run out of steam and slowly die.

I’m going to reserve judgement on the tools until the next release of DPF Manager. We will then also be in a position to report back on our findings from this validation case study. So check back with our blog for Part Two.

I would be interested to hear from anyone else who has been faced with validating large batches of files. What tools are you using? What challenges have you faced? Do let me know!

Visit to the National Archives: herons and brutalism

An update from Edith Halvarsson about the DPOC team’s trip to visit the National Archives last week. Prepare yourself for a discussion about digital preservation, PRONOM, dark archives, and wildlife!


Last Thursday DPOC visited the National Archives in London. David Clipsham kindly put much time into organising a day of presentations with the TNA’s developers, digitization experts and digital archivists. Thank you Diana, David & David, Ron, Ian & Ian, Anna and Alex for all your time and interesting thoughts!

After some confusion, we finally arrived at the picturesque Kew Gardens station. The area around Kew is very sleepy, and our first thought on arrival was “is this really the right place?” However, after a bit more circling around Kew, you definitely cannot miss it. The TNA is located in an imposing brutalist building, surrounded by beautiful nature and ponds built as flood protection for the nation’s collections. They even have a tame heron!

After we all made it on site, the day then kicked off with an introduction from Diana Newton (Head of Digital Preservation). Diana told us enthusiastically about the history of the TNA and its Digital Records Infrastructure (DRI). It was really interesting to hear how much has changed in just six years since DRI was launched – both in terms of file format proliferation and an increase in FOI requests.

We then had a look at TNA’s ingest workflows into Preservica and storage model with Ian Hoyle (Senior Developer) and David Underdown (Senior Digital Archivist). It was particularly interesting to hear about the TNA’s decision to store all master file content on offline tape, in order to bring down the archive’s carbon footprint.

After lunch with Ron Davies (Senior Project Manager), Anna de Sousa and Ian Henderson spoke to us about their work digitizing audiovisual material and 2D images. Much of our discussion focused on standards and formats (particularly around A/V). Alex Green and David Clipsham then finished off the day talking about born-digital archive accession streams and PRONOM/DROID developments. This was the first time we had seen the clever way a file format identifier is created – there is much detective work required on David’s side. David also encouraged us, and anyone else who relies on DROID, to have a go and submit something to PRONOM – he even promised it’s fun! Why not read Jenny Mitcham’s and Andrea Byrne’s articles for some inspiration?

Thanks for a fantastic visit and some brilliant discussions on how digital preservation work and digital collecting is done at the TNA!

Training begins: personal digital archiving

Outreach & Training Fellow, Sarah, has officially begun training and capacity building with a session on personal digital archiving at the Bodleian Libraries. Below, Sarah describes how the first session went and shares some personal digital archiving tips.


Early Tuesday morning and the Weston Library had just opened to readers. I got to town earlier than usual, stopping to get a Melbourne-style flat white at one of my favourite local cafes – to get me in the mood for public speaking. By 9am I was in the empty lecture theatre, fussing over cords, adjusting lighting and panicking over the fact that I struggled to log in to the laptop.

At 10am, twenty-one interested faces were seated with pens at the ready; there was nothing else to do but take a deep breath and begin.

In the 1.5 hour session, I covered the DPOC project, digital preservation and personal digital archiving. The main section of the training was learning about the personal digital archiving and preservation lifecycle, and the best practice steps to follow to save your digital stuff!

The steps of the Personal Digital Archiving & Preservation Lifecycle are intended to help with keeping your digital files organised, findable and accessible over time. It’s not prescriptive advice, but it is a good starting point for better habits in your personal and work lives. Below are tips for every stage of the lifecycle that will help build better habits and preserve your valuable digital files.

Keep Track and Manage:

  • Know where your digital files are and what digital files you have: make a list of all of the places you keep your digital files
  • Find out what is on your storage media – check the label, read the file and folder names, open the file to see the content
  • Most importantly: delete or dispose of things you no longer need.
    • This includes: things with no value, duplicates, blurry images, previous document versions (if not important) and so on.

Organise:

  • Use best practice for file naming:
    • No spaces: use underscores (_) and hyphens (-) instead
    • Put ‘Created Date’ in the file name using yyyymmdd format
    • Don’t use special characters <>,./:;'”\|[]()!@£$%^&*€#`~
    • Keep the name concise and descriptive
    • Use a version control system for drafts (e.g. yyyymmdd_documentname_v1.txt)
  • Use best practice for folder naming:
    • Concise and descriptive names
    • Use dates where possible (yyyy or yyyymmdd)
    • Keep file paths short and avoid a deep hierarchy
    • Choose structures that are logical to you and to others
  • To rename large groups of image files, consider using batch rename software (or a short script like the sketch below)
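
If you are comfortable with a little scripting, here is a minimal sketch of a date-prefixing batch rename in Python. The folder path is a placeholder, and using the last modified date as a stand-in for the created date is an assumption you would adjust for your own files.

    import pathlib
    from datetime import datetime

    PHOTO_DIR = pathlib.Path("~/Pictures/to_sort").expanduser()   # hypothetical folder

    for image in PHOTO_DIR.glob("*.jpg"):
        # Build a yyyymmdd prefix from the last modified date (swap in a created date if you track one)
        stamp = datetime.fromtimestamp(image.stat().st_mtime).strftime("%Y%m%d")
        if not image.name.startswith(stamp):                      # avoid renaming a file twice
            image.rename(image.with_name(f"{stamp}_{image.name}"))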

Describe:

  • Add important metadata directly into the body of a text document
    • creation date & version dates
    • author(s)
    • title
    • access rights & version
    • a description about the purpose or context of the document
  • Create a README.txt file of metadata for document collections (a minimal sketch of one way to generate this follows after this list)
    • Be sure to list the folder names and file names to preserve the link between the metadata and the text file
    • Include information about the context of the collection, dates, subjects and relevant information
    • This is a quick method for creating metadata around digital image collections
  • Embed the metadata directly in the file
    • For image and video files, be sure to add subjects, location and a description of the trip or event
  • Add tags to documents and images to aid discoverability
  • Consider saving the ‘Creation Date’ in the file name, a free text field in the metadata, in the document header or in a README text file if it is important to you. In some cases transferring the file (copying to new media, uploading to cloud storage) will change the creation date and the original date will be lost. The same goes for saving as a different file type. Always test before transfer or ‘Save As’ actions or record the ‘Creation Date’ elsewhere.
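
As promised above, here is a minimal sketch of generating a README.txt manifest for a folder of documents in Python. The folder path and the wording of the header lines are placeholders to adapt; the point is simply to record the file names alongside some free-text context.

    import pathlib
    from datetime import date

    COLLECTION = pathlib.Path("~/Documents/holiday_2016").expanduser()   # hypothetical folder

    lines = [
        f"Collection: {COLLECTION.name}",
        f"README created: {date.today().isoformat()}",
        "Description: <add the context of the collection, dates and subjects here>",
        "",
        "Files:",
    ]
    for item in sorted(COLLECTION.rglob("*")):
        if item.is_file() and item.name != "README.txt":
            # Record the relative path so the link between the metadata and the file is preserved
            lines.append(f"  {item.relative_to(COLLECTION)}")

    (COLLECTION / "README.txt").write_text("\n".join(lines), encoding="utf-8")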

Store:

  • Keep two extra backups in two geographically different locations
  • Diversify your backup storage media to protect against potential hardware faults
  • Try to save files in formats better suited to long-term access (for advice on how to choose file formats, visit Stanford University Libraries)
  • Refresh your storage media every three to five years to protect against loss from hardware failure
  • Do annual spot checks, including checking all backups (a minimal checksum-based sketch follows below). This will help identify any loss, corruption or damaged backups. Also consider checking all of the different file types in your collection, to ensure they are still accessible, especially if not saved in a recommended long-term file format.
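
For those annual spot checks, one hedged example of a checksum-based approach is sketched below: run it once to create a manifest, keep the manifest with your backups, then run it again later to see whether anything has changed or gone missing. The manifest file name and the choice of MD5 are assumptions; any stable checksum would do.

    import hashlib
    import json
    import pathlib
    import sys

    def md5_of(path):
        """Compute the MD5 checksum of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def build_manifest(folder):
        folder = pathlib.Path(folder)
        return {str(p.relative_to(folder)): md5_of(p)
                for p in sorted(folder.rglob("*")) if p.is_file()}

    if __name__ == "__main__":
        folder = sys.argv[1]
        manifest_file = pathlib.Path("manifest.json")
        current = build_manifest(folder)
        if manifest_file.exists():
            recorded = json.loads(manifest_file.read_text())
            for name, checksum in recorded.items():
                if current.get(name) != checksum:
                    print("CHANGED OR MISSING:", name)
        else:
            manifest_file.write_text(json.dumps(current, indent=2))
            print("Manifest written for", len(current), "files")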

Even I can admit I need better personal archiving habits. How many photographs are still on my SD cards, waiting for transfer, selection/deletion and renaming before saving in a few choice safe backup locations? The answer is: too many. 

Perhaps now that my first training session is over, I should start planning my personal side projects. I suspect clearing my backlog of SD cards is one of them.

Useful resources on personal digital archiving:

DPC Technology Watch Report, “Personal digital archiving” by Gabriela Redwine

DPC Case Note, “Personal digital preservation: Photographs and video”, by Richard Wright

Library of Congress “Personal Archiving” website, which includes guidance on preserving specific digital formats, videos and more

 

A view from the basement – a visit to the DPC in Glasgow

Last Monday, Sarah, Edith and Lee visited the Digital Preservation Coalition (DPC) at their Glasgow office on University Gardens. The aim of the visit was to understand how the DPC has supported, and will continue to support, the DPOC project. The DPOC team is very fortunate in having the DPC’s expertise, resources and services at their disposal as a supporting partner in the project, and we were keen to find out more.

Plied with tea, coffee and Sharon McMeekin’s awesome lemon cake, William Kilbride gave us an overview of the DPC, explaining that they are a not-for-profit, membership-based organisation which used to cater mainly for the UK and Ireland. However, international agencies are now welcome (the UN, NATO and the ICC, to name a few) and this has changed the nature of their programme and the features that they offer (website, streaming, event recording). They are vendor neutral but do have a ‘Commercial Supporter’ community to help support events and raise funds for digital preservation work. They have six members of staff working from the DPC Glasgow and DPC York offices. They focus upon four main areas:

  • Workforce Development, Training and Skills
  • Communication and Advocacy
  • Research and Practice
  • Partnerships and Sustainability

William explained the last three areas and Sharon gave us an overview of the work that she does for developing workforce skills and offering training events, especially the ‘Getting Started in Digital Preservation’ and ‘Making Progress’ workshops. The DPC also provide Leadership Scholarships to help develop knowledge and CPD in digital preservation, so please do apply for those if you are working somewhere that can spare your time out of the office but can’t fund you.

In terms of helping DPOC, the DPC can help with hosting events (such as PASIG 2017) and provide supporting training resources for our organisations. They can also help with procurement processes and auditing, as well as offering the wealth of advice gained from their six members of staff.

We left feeling that, despite working as a collaborative team with colleagues we can already bounce ideas off, we had a wider support network that we could call on to guide us and help us share our work more widely. From a skills and training perspective, the fact that they are happy to review, comment on and suggest further avenues for the skills needs analysis toolkit, to ensure it will benefit the wider community, is of tremendous use. This is just one example; help with procurement, policy development and auditing is also something they are willing to offer the project.

It is reassuring that the DPC are there and have plenty of experience to share in the digital preservation sphere. Tapping into networks, sharing knowledge and collaborating really is the best way to help achieve a coherent, sustainable approach to digital preservation and helps those working in it to focus on specific tasks rather than try and ‘reinvent the wheel’ when somebody else has already spent time on it.

DPC Student Conference – What I Wish I Knew Before I Started

At the end of January, I went to the Chancellor’s Hall at the University of London’s Art Deco style Senate House. Near to the entrance of the Chancellor’s Hall was Room 101. Rumours circulated amongst the delegates keenly awaiting the start of the conference that the building and the room were the inspiration for George Orwell’s Nineteen Eighty-Four.

Instead of facing my deepest and darkest digital preservation fears in Senate House, I was keen to see and hear what the leading digital preservation trainers and invited speakers at different stages of their careers had to say. For the DPOC project, I wanted to see what types of information were included in introductory digital preservation training talks, to witness the styles of delivery and what types of questions the floor would raise to see if there were any obvious gaps in the delivery. For the day’s programme, presenters’ slides and Twitter Storify, may I recommend that you visit the DPC webpage for this event:

http://www.dpconline.org/events/past-events/wiwik-2017

The take-away lesson from the day is: just do something, don’t be afraid to start. Sharon McMeekin showed us how much the DPC can help (see their new website – it’s chock full of digital preservation goodness) and Steph Taylor from CoSense showed us that you can achieve a lot in digital preservation just through keeping an eye on emerging technologies, and that you spend most of your time advocating that digital preservation is not just backing up. Steph also reinforced to the student delegation that you can approach members of the digital preservation community – they are all very friendly!

From the afternoon session, Dave Thompson reminded those assembled that we also need to think about the information age that we live in, how people use information, how they are their own gatekeepers to their digital records and how recordkeepers need to react to these changes, which will require a change in thinking from traditional recordkeeping theory and practice. As Adrian Brown put it, “digital archivists are archivists with superpowers”. One of those superpowers is the ability to adapt to your working context and the technological environment. Digital preservation is a constantly changing field and the practitioner needs to be able to adapt and change with the environment around them in a chameleon-like manner to get their institution’s work preserved. Jennifer Febles reminded us that it is also OK to say “you don’t know” when training people; you can go away and learn, or even learn from other colleagues. As for the content of the day, there were no real gaps; the programme was spot on as far as I could tell from the delegates.

Whilst reflecting on the event on the journey back on the train (and whilst simultaneously being packed into the stifling hot carriage like a sweaty sardine), the one thing that I really wanted to find out was what the backgrounds of the delegates were. More specifically, what ‘information schools’ they were attending, what courses they were undertaking, how much their modules concerned digital recordkeeping and their preservation, and, most importantly, what they are being taught in those modules.

My thoughts then drifted towards thinking of those who have been given the label of ‘digital preservation experts’. They have cut their digital preservation teeth after their formal qualifications and training in an ostensibly different subject. Through a judicious blending of discipline-specific learning and learning about related fields, they then apply this knowledge to their specific working context. Increasingly, in the digital world, those from a recordkeeping background need to embrace computer science skills and applications, especially those for whom coding and command-line operation are not skills they were brought up with. We seem to be at a point where the leading digital preservation practitioners are plying their trade (as they should) and not teaching their trade in a formal education setup. A very select few are doing both, but if we pulled practitioners into formal digital preservation education programmes, would we then drain the discipline of innovative practice? Should digital preservation skills (which DigCurV has done well to define) be better suited to one big ‘on the job’ learning programme rather than more formal programmes? A mix of both would be my suggestion, but this discussion will never close.

Starting out in digital preservation may seem terribly daunting, with so much to learn and so much going on. I think that the ‘information schools’ can equip students with the early skills and knowledge, but from then on, the experience and skills are learned on the job. The thing that makes the digital preservation community stand out is that people are not afraid to share their knowledge and skills for the benefit of preserving cultural heritage for the future.

Data reproducibility, provenance capture and preservation

An update from the Cambridge Fellows about their visit to the Cambridge Computer Laboratory to learn about the team’s research on provenance metadata.


In amongst preparing reports for the powers that be and arranging vendor meetings, Dave and Lee took a trip over to the William Gates Building, which houses the University of Cambridge’s Computer Laboratory. The purpose of the visit was to find out about the Digital Technology Group’s projects from one of their Senior Research Associates, Dr Ripduman Sohan.

The particular project was the FRESCO project, which stands for Fabric For Reproducible Computing. You can find out more about the strands of this project here: https://www.cl.cam.ac.uk/research/dtg/fresco. The link to the poster is especially useful, and it clearly and succinctly captures the key points of the meeting far better than my meeting notes.


FRESCO Poster. Image credit: Cambridge Computer Laboratory.

The discussion on provenance was of particular interest to me, coming from a recordkeeping background and hearing it discussed in computer science terms. What he was talking about and what archivists do really aren’t a million miles apart – it’s just that the provenance capture on the data happens in nanoseconds, on mind-blowing amounts of data.

Rip’s approach, to my ears at least, was refreshing. He believes that computer scientists should start to listen to, move across into and understand ‘other’ domains like the humanities. Computer science should be ‘computing for the future of the planet’ and not a subject that should impose itself on other disciplines which creates a binary choice of the CompSci way or the highway. This is so they can use their computer science skills to help both future research and the practitioners working with humanities information and data.

Polonsky Fellows visit Western Bank Library at Sheffield University

Overview of DPOC’s visit to the Western Bank Library at Sheffield University by James Mooney, Technical Fellow at Bodleian Libraries, Oxford.

The Polonsky Fellows were invited to the Western Bank Library at Sheffield University to speak with Laura Peaurt and other members of the Library. The aim of the meeting was to discuss the experiences of using and implementing Ex Libris’ Rosetta product.

After arriving by train, it was just a quick tram ride to the Western Bank campus at Sheffield University. Then we had the fun of using the paternoster lift in the Western Bank Library to arrive at our meeting; it’s great to see this technology has been preserved and is still in use.

Paternoster lifts still in use at the Western Library. Image Credit: James Mooney

We met with Laura Peaurt (Digital Preservation Manager), Chris Jones (Library Systems Manager) and Angus Taggart (Library Systems Manager – Research).

Andy Bussey, Head of Digital Services & Systems was kind enough to give us an hour of his time at the start of the meeting, allowing us to discuss parts of the procurement and implementation process.

When working out the requirements for the system, Sheffield was able to collaborate with the White Rose University Consortium (the Universities of Leeds, Sheffield and York) to work out an initial scope.

When reviewing the options both open source and proprietary products were considered. For the Western Library and the University back in 2014, after a skills audit, the open source options had to be ruled out due to a lack of technical and developmental skills to customise or support them. I’m sure if this was revisited today the outcome may well have been different as the team has grown and gained experience and expertise. Many organisations may find it easier to budget for a software package and support contract with a vendor than to pursue the creation of several new employment positions.

With that said, as part of the implementation of Rosetta, Laura’s role was created, as there was an obvious need for a Digital Preservation Manager. We then went on to discuss the timeframe of the project, before moving on to the configuration of the product, with Laura providing a live demonstration whilst talking about the current setup, the scalability of the instances and the granularity of the sections within Rosetta.

During the demonstrations we discussed what content was held in Rosetta, how people had been trained with Rosetta and what feedback they had received so far. We reviewed the associated metadata which had been stored with the items that had been ingested and went over the options regarding integration with a Catalogue and/or Archival Management System.

After lunch we went on to discuss the workflows currently being used, with further demonstrations so we could see end-to-end examples, including what ingest rules and policies were in place, along with what tools were in use and what processes were carried out. We then looked at how problematic items were dealt with in the Technical Analysis Workbench, covering the common issues and how additional steps in the ingest process can minimise certain issues.

As part of reviewing the sections of Rosetta we also inspected Rosetta’s metadata model, the DNX (Digital Normalised XML), and discussed ingesting born-digital content and associated METS files.

Western Library. Image Credit: A J Buildings Library.

We visited Sheffield with many questions, and during the course of the discussions throughout the day many of these were answered, but as the day came to a close we had to wrap up the talks and head back to the train station. We all agreed it had been an invaluable meeting and that it had sparked further areas of discussion. Having met face to face and gained an understanding of the environment at Sheffield will make future conversations that much easier.