About Sarah

Digital Preservation Specialist - Outreach and Training: Bodleian Libraries, Oxford University

Digital Preservation at Oxford Open Days

Oxford Fellow, Sarah, describes the DPOC team’s pop-up exhibition “Saving Digital,” held at the Radcliffe Science Library during Oxford Open Days #OxOpenDay. The post describes the equipment and games the team showcased over the two days and some of the goals they had in mind for this outreach work.


On 27 June and 28 June, Oxford ran Open Days for prospective students. The city was alive with open doors and plenty of activity. It was the perfect opportunity for us to take our roadshow kit out and meet with prospective students with a pop-up exhibition called “Saving Digital”. The Radcliffe Science Library (RSL) on Parks Road kindly hosted the DPOC team and all of our obsolete media for two days in their lounge area.

The pop-up exhibition hosted at the RSL

We set up our table with a few goals in mind:

  • to educate prospective students about the rapid pace of technological change and the concern about how we’re going to read digital data off older media in the future (we educated a few parents as well!)
  • to speak with library and university staff about their digital dilemmas and what we in the digital preservation team could do about them
  • to raise awareness about the urgency and need of digital preservation in all of our lives and to inform more people about our project (#DP0C)

To achieve this, we first drew people in with two things: retro gaming and free stuff.

Last minute marketing to get people to the display. It worked!

Our two main games were the handheld game, Galaxy Invader 1000, and Frak! for the BBC Micro.

Frak! on the BBC Micro. The yellow handheld console to the right is Galaxy Invader 1000.

Galaxy Invader 1000 by CGL (1980) is a handheld game, which plays a version of Space Invaders. This game features a large multi-coloured display and 3 levels of skill. The whole game was designed to fit in 2 kilobytes of memory. 

Frak! (1984) is a game released for the BBC Micro under the Aardvark software label. It was praised for excellent graphics and gameplay. In this side-scrolling game, you play a caveman named Trogg. The aim of the game is to cross a series of platforms while avoiding dangers that include various monsters named Poglet and Hooter. Trogg is armed with a yo-yo for defence.

Second, we gave them some digestible facts, both in poster form and by talking with them:

Saving Digital poster

Third, we filled the rest of the table with obsolete media and handheld devices from about the last forty years (just a small sample of what was available!). This let them hold some of the media of the past and marvel over how little it could hold, but how much it could do for the time. And then we asked them how they would read the data off it today. That probably concerned parents more than their kids, as several of them admitted to having important digital stuff either still on VHS or miniDV tapes, or on 3.5-inch disks! It got everyone thinking at least.

A lot of obsolete media all in one place.

Lastly, we had an enthusiastic team in branded t-shirts made to emulate our most popular 1st generation badge, which was pink with a 3.5-inch disk in the middle. We gave away our last one during Open Days! But don’t worry, we have some great 2nd generation badges to collect now.

An enthusiastic team always helps. Especially if they are willing to demo the equipment.


A huge thank you to the RSL for hosting us for two days—we’ll be back on the 16th of July if you missed us and want to visit the exhibition! We’ll have a few extra retro games on hand and some more obsolete storage media!

Our poster was found on display in the RSL.

Update on the training programme pilot

Sarah, Oxford’s Outreach and Training Fellow, has been busy since the new year designing and running a digital preservation training programme pilot in Oxford. It consisted of one introductory course on digital preservation and six other workshops. Below is an update on what she did for the pilot and what she has learnt over the past few months.


It’s been a busy few months for me, so I have been quiet on the blog. Most of my time and creative energy has been spent working on this training programme pilot. In total, there were seven courses and over 15 hours of material. In the end, I trialled the courses on over 157 people from Bodleian Libraries and the various Oxford college libraries and archives. Many attendees were repeats, but some were not.

The trial gave me an opportunity to test out different ideas and various topics. Attendees were good at giving feedback, both during the course and after via an online survey. It’s provided me with further ideas and given me the chance to see what works or what doesn’t. I’ve been able to improve the experience each time, but there’s still more work to be done. However, I’ve already learned a lot about digital preservation and teaching.

Below are some of the most important lessons I’ve learned from the training programme pilot.

Time: You always need more

I found that I almost always ran out of time at the end of a course; it left no time for questions or to finish that last demo. Most of my courses could have either benefited from less content, shorter exercises, or just being 30 minutes longer.

Based on feedback from attendees, I’ll be making adjustments to every course. Some will be longer. Some will have shorter exercises with more optional components and some will have slightly less content.

While you might budget 20 minutes for an activity, you will likely use 5-10 minutes more. But it varies every time due to the attendees. Some might have a lot of questions, but others will be quieter. It’s almost better to overestimate the time and end early than rush to cover everything. People need a chance to process the information you give them.

Facilitation: You can’t go it alone

In only one of my courses did I have to facilitate alone. I was run off my feet for the 2 hours because it was just me answering questions during exercises for 15 attendees. It doesn’t sound like a lot, but I had a hoarse voice by the end from speaking for almost 2 hours!

Always get help with facilitation—especially for workshops. Someone to help:

  • answer questions during exercises,
  • get some of the group idea exercises/conversations started,
  • make extra photocopies or print outs, and
  • load programs and files onto computers—and then help delete them after.

It is possible to run training courses alone, but having an extra person makes things run smoother and saves a lot of time. Edith and James have been invaluable support!

Demos: Worth it, but things often go wrong

Demos were vital to illustrate concepts, but they were also sometimes clunky and time consuming to manage. I wrote up demo sheets to help. The demos relied on software or the Internet, both of which can and will go wrong. Patience is key; so is accepting that sometimes things will not go right. Processes might take a long time to run or the course might conclude before the demo is over.

The more you practice on the computer you will be using, the more likely things will go right. But that’s not always an option. If it isn’t, always have a back up plan. Or just apologise, explain what should have happened and move on. Attendees are generally forgiving and sometimes it can be turned into a really good teaching moment.

Exercises: Optional is the way to go

Unless you put out a questionnaire beforehand, it is incredibly hard to judge the skill level of your attendees. It’s best to prepare for all levels. Start each exercise slow and have a lot of optional work built in for people that work faster.

In most of my courses I was too ambitious for the time allowed. I wanted them to learn and try everything. Sometimes I wasn’t asking the right questions on the exercises either. Testing exercises and timing people is the only way to tailor them. Now that I have run the workshops and seen the exercises in action, I have a clearer picture of what I want people to learn and accomplish—now I just have to make the changes.

Future plans

There were courses I would love to run in the future (like data visualisation and digital forensics) that I did not have the time to develop. I’d like to place them on a roadmap for future training, as well as reach out more to the Oxford colleges, museums and other departments. I would also like to tailor the introductory course a bit more for different audiences.

I’d like to get involved with developing courses like Digital Preservation Carpentry that the University of Melbourne is working on. The hands-on workshops excited and challenged me the most. Not only did others learn a lot, but so did I. I would like to build on that.

At the end of this pilot, I have seven courses that I will finalise and make available through a creative commons licence. What I learned when trying to develop these courses is that there aren’t always a lot of good templates available on the Internet to use as a starting point; you have to ask around for people willing to share.

So, I am hoping to take the work that I’ve done and share it with the digital preservation community. I hope the courses will be useful resources that can be reused and repurposed. Or at the very least, I hope they can be used as a starting point for inspiration (basic speaker’s notes included).

These will be available via the DPOC website sometime this summer, once I have been able to make the changes necessary to the slides and exercises—along with course guidance material. It has been a rewarding experience (as well as an exhausting one); I look forward to developing and delivering more digital preservation training in the future.

Email preservation 2: it is hard, but why?

A post from Sarah (Oxford) with input from Somaya (Cambridge) about the 24 January 2018 DPC event on email archiving from the Task Force on Technical Approaches to Email Archives.

The discussion of the day centred on what the task force had learnt during its year of work: that personal and public stories are buried in email; that considerable amounts of email have been lost over previous decades; that we should be treating email as data (it allows us to understand other datasets); that current approaches to collecting and preserving email don’t work because they’re not scalable; and that artificial intelligence and machine learning, including natural language processing functions, need to be integrated to address email archives (this is already taking place in legal professions with ‘predictive coding’ and clustering technologies).


Back in July, Edith attended the first DPC event on email preservation, presented by the Task Force on Technical Approaches to Email Archives. She blogged about it here. In January this year, Somaya and I attended the second event, hosted again by the DPC.

Under the framework of five working groups, this task force has spent 12 months (2017) focused on five separate areas of the final report, which is due out in around May this year:

  • The Why: Overview / Introduction
  • The When/Who/Where: Email Lifecycles Perspectives
  • The What: The Needs of Researchers
  • The How: Technical Approaches and Solutions
  • The Path Forward: Sustainability & Community Development

The approach being taken is technical, rather than on policy. Membership of the task force includes the DPC, representatives from universities and national institutions from around the world and technology companies including Google and Microsoft.

Chris Prom (from University of Illinois Urbana-Champaign, who authored the 2011 DPC Technology Watch Report on Preserving Email) and Kate Murray (Library of Congress and contributor to FADGI) presented the work they have been doing; you can view their slides here. Until the final report is published, I have been reviewing the preliminary draft (of June 2017) and available documents to help develop my email preservation training course for Oxford staff in April.

So, when it comes to email preservation, most of the tools and discussions focus on processing email archives. Very little of the discussion has to do with the preservation of email archives over time. There’s a very good reason for this. Processing email archives is the bottleneck in the process, the point at which most institutions are still stuck. It is hard to make decisions around preservation when there is no means for collecting email archives or processing them in a timely manner.

There were many excellent questions and proposed solutions from the speakers at the January event. Below are some of the major points from the day that have informed my thinking of how to frame training on email preservation:

Why are email archives so hard to process?

  1. They are big. Few people cull their emails and over time they build up. Reply and ‘reply all’ functions expand out email chains, and attachments are growing in size and diversity. It takes a donor a while to prepare their email archives, much less for an institution to transfer and process them.
  2. They are full of sensitive information, which is hard to find. Many open source technology assisted review (TAR) tools miss sensitive information. Software used for ‘predictive coding’ and machine learning for reviewing email archives is well out of budget for heritage institutions. Manual review is far too labour intensive.
  3. There is no one tool that can do it all. Email preservation requires ‘tool chaining’ in order to transfer, migrate and process email archives. There is a very wide variety of email software programs, which in turn create many different email file formats. Many of the tools used in email archive processing are not compatible with each of the different email file types; this requires multiple file format migrations to allow for processing. For a list of some of the current available tools, see the Task Force’s list here.
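To make the first point concrete, here is a minimal sketch of how one might take stock of an email archive before deciding how to process it. It assumes the archive has already been migrated to the mbox format and uses only Python’s standard `mailbox` module; the function name `summarise_mbox` is hypothetical and this is not one of the Task Force’s listed tools.

```python
import mailbox
from collections import Counter


def summarise_mbox(path):
    """Count messages and attachments in an mbox file (illustrative only)."""
    # create=False: fail loudly rather than silently creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    message_count = 0
    attachment_bytes = 0
    attachment_types = Counter()
    try:
        for message in box:
            message_count += 1
            for part in message.walk():
                # Assumption: only parts carrying a filename count as attachments.
                if part.get_filename() is None:
                    continue
                payload = part.get_payload(decode=True) or b""
                attachment_bytes += len(payload)
                attachment_types[part.get_content_type()] += 1
    finally:
        box.close()
    return {
        "messages": message_count,
        "attachment_bytes": attachment_bytes,
        "attachment_types": dict(attachment_types),
    }
```

A summary like this only describes size and variety; it does nothing to find sensitive information, which remains the harder problem.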

What are some of the solutions?

  1. Tool chaining will continue. It appears that, for now, tool chaining is here to stay, often mixing proprietary with open source tools to get workflows running smoothly. This means institutions will need to invest in establishing email processing workflows: the software, people who know how to handle different email formats etc.
  2. What about researchers? Access to emails is tightly controlled due to sensitivity restraints, but is there space to get researchers to help with the review? If they use the collection for research, could they also be responsible for flagging anything deemed as sensitive? How could this be done ethically?
  3. More automation. Better tool development to assist with TAR. Reviewing processes must become more automated if email archives are ever to be processed. The scale of work is increasing and traditional appraisal approaches (handling one document at a time) and record schedules are no longer suitable.
  4. Focus on bit-level preservation first. Processing of email archives can come later, but preserving them needs to start on transfer. (But we know users want access and our institutions want to provide this access to email archives.)
  5. Perfection is no longer possible. While archivists would like to be precise, in ‘scaling up’ email archive processing we need to think about it as ‘big data’ and take a ‘good enough’ approach.
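As an illustration of what starting bit-level preservation on transfer might involve, the sketch below records a checksum for every file in a transferred folder and later reports any file whose checksum no longer matches. It is a sketch under assumptions rather than a recommended workflow: SHA-256 is my choice of algorithm, and the function names `sha256_of`, `build_manifest` and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large email archives do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest):
    """List files from the manifest that are now missing or have changed."""
    current = build_manifest(root)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

The manifest would be built once at transfer and stored alongside the archive; running `changed_files` at intervals then gives a basic fixity check while processing waits.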

Institutional risk and born-digital content: the shutdown of DCist #IDPD17

Another post for today’s International Digital Preservation Day 2017. Outreach and Training Fellow, Sarah, discusses just how real institutional risk is and how it can lead to a loss of born-digital archives, a risk that digital-only sites like DCist have recently proven. Read more about the Gothamist’s website shutdowns this November.


In today’s world, so much of what we create and share exists only in digital form. These digital-only creations are referred to as born-digital — they were created digitally and they often continue in that way. And so much of our born-digital content is shared online. We often take for granted content on the Internet, assuming it will always be there. But is it? Likely it will at least be captured by the Internet Archive’s Wayback Machine or a library web archiving equivalent. But is that actually enough? Does it capture a complete, usable record? What happens when a digital-only creation, like a magazine or newspaper, is shut down?

Institutional risk is real. In the commercial world of born-digital content that persists only in digital form, the risk of loss is high.

Unfortunately, there’s recently been a very good example of this kind of risk when the Gothamist shut down its digital-only content sites such as the DCist. This happened in early November this year.

The sites and all the associated content were completely removed from the Internet by the morning of 3 November. Gone. Taken down and replaced with a letter from billionaire CEO, Joe Ricketts, justifying the shutdown because despite its enormous popularity and readership, it just wasn’t “economically successful.”

Wayback Machine’s capture of the redirect page and Ricketts’ letter

The DCist site and all of its content were gone completely; readers instead were redirected to another page entirely to read Joe Ricketts’ letter. Someone had literally pulled the plug on the whole thing.

Internet Archive’s 3 November 2017 capture, showing a redirect from the DCist.com page. DCist was gone from the Internet.

The access to content was completely lost, save for what the Internet Archive captured and what content was saved by creators elsewhere. But access to the archives of 13 years of DCist content was taken from the Internet and its millions of readers. At that point all we had were some web captures, incomplete records of the content left to us.

The Internet Archive’s web captures for DCist.com over the past 13 years.

What would happen to the DCist’s archive now? All over Twitter people were being sent to the Internet Archive or told to check Google’s cache to download the lost content. But as Benjamin Freed pointed out in his recent Washingtonian article:

“Those were noble recommendations, but would have been incomplete. The Wayback Machine requires knowledge about URLs, and versions stored in Google’s memory banks do not last long enough. And, sure, many of the subjects DCist wrote about were covered by others, but not all of them, and certainly not with the attitude with which the site approached the world.”

As Freed reminds us “A newspaper going out of business is tragic, but when it happens, we don’t torch the old issues or yank the microfilms from the local library.” In the world of born-digital content, simply unplugging the servers and leaving the digital archive to rot means that at best, we may only have an incomplete record of the 1,000s of articles and content of a community.

If large organisations are not immune to this kind of institutional risk, what about the small ones? The underfunded ones?

To be clear, I think web archiving is important and I have used it a number of times when a site is no longer available; it’s a valuable resource. But it only goes so far and sometimes the record of a website is incomplete. So what else can we do? How can we keep the digital archive alive? While Ricketts has put the DCist site back up as an “archive”, it’s more like a “digital graveyard” that he could pull the plug on again any time he wants. How do you preserve something so fragile, so at risk? The custodians of the digital content care little for it, so how will it survive for the future?

The good news is that the DCist archive may have another home, not just one that survives on the mercy of a CEO.

The born-digital archives of the DCist require more than just a functioning server over time to ensure access. Fortunately, there are places where digital preservation is happening to all kinds of born-digital collections and there are passionate people who are custodians of this content. These custodians care about keeping it accessible and understandable for future generations. Something that Joe Ricketts clearly does not.


What are your thoughts on this type of institutional risk and its impacts on digital preservation? How can we preserve this type of content in the future? Is web archiving enough or do we need a multi-prong approach? Share your thoughts below and on Twitter using the #IDPD17 hashtag.

 

International Digital Preservation Day 2017 #IDPD17

It is International Digital Preservation Day. Today, around the world we celebrate the field that is fighting against time and technology to make sure that our digital “things” survive. And in turn, we are trying to make time and technology work with us.


We’re the people that see a 5.25” floppy disk and think “I bet I can read that. I wonder what I’ll find?” and we’re already making a list of where we can find the hardware and software to read it. We’re already dating it to wonder what kind of files would be on it, what software created those files—can we still find them? We’re willing to try, because every day that disk is ageing and every day is the possibility that when we get around to reading it, the data might be corrupted.

We’re the people fighting against the inevitable technological obsolescence, juggling media carriers, file formats, technological failures, software obsolescence and hardware degradation. It is like a carefully coordinated dance, where one wrong thing can end up in some sort of error. A file can’t open, or if I can open it what am I even staring at? We’re trying to save our digital world, before it degrades and corrupts.

It’s not always that dire, but it’s the knowledge that if something gets overlooked, at some point – often in the blink of an eye – something will be lost. Something will be damaged. It’s like playing a kind of Russian roulette, except that those of us who are custodians of unique digital collections can’t take those chances. We cannot lose our digital assets, our digital “things” that we collect on behalf of the public, or for compliance reasons, or because we are keeping a record of the now for the future. After all, we have stories to tell, histories to save – what is it that we want to leave for the future?

If we don’t consider preserving our digital “things” now, then we might not leave a story behind to tell.

For some reason, while this is an issue we all struggle with (raise your hand if you’ve lost a digital file in your life or if your computer/tablet/phone has crashed and you lost everything and didn’t have a backup) digital preservation is still something people don’t know about or just don’t talk about. Why is something that we are all struggling with ignored so much? Is it because we’re not speaking up enough? Is it because people just lose their stuff and move on, forgetting about it? When so much of our lives’ records are now only digital, how can we just forget what we lose? How can we not care?

The truth is we should. And we should all be looking to digital preservation in one form or another. From individuals to big business, digital preservation matters. It’s not just for the cultural heritage and higher education institutions to “do” or to “worry” about. It involves you too.

The good news is that the world is starting to catch on. They are starting to look to us, the digital preservation practitioners, to see what they should do. They are starting to worry, starting to see the cracks in the digital world. Nothing lasts forever and sometimes in the digital world, it can be gone in a second with just a flick of a switch. Maybe it lives on somewhere, on those motionless hard drives, but without active management and commitment, even those hard drives will fail you some day. The events around the Gothamist’s shutdown of its online news sites (inc. DCist and LAist) have highlighted this. The recent Slate article on streaming-only services has us worried about preservation of TV and film content that is born digital and so centralised that it cannot rely on a LOCKSS-based approach (Lots of Copies Keeps Stuff Safe).

These are of course just some of the things we need to worry about. Just some of the things we’ll have to try to save. There’s still the other approximately 2.5 quintillion bytes (or roughly 2.5 exabytes or 2.5 billion gigabytes) of data being created around the world each day to worry about. We’re not going to keep it all, but we’re going to want to keep some of it. And that some of it is rapidly increasing.
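For anyone who wants to check the unit conversion, here is a small sketch. It assumes decimal (SI) units, where an exabyte is 10^18 bytes and a gigabyte is 10^9 bytes; the function name `describe_daily_volume` is hypothetical.

```python
BYTES_PER_EXABYTE = 10**18  # SI exabyte
BYTES_PER_GIGABYTE = 10**9  # SI gigabyte


def describe_daily_volume(total_bytes):
    """Express a byte count in exabytes and gigabytes."""
    return {
        "exabytes": total_bytes / BYTES_PER_EXABYTE,
        "gigabytes": total_bytes / BYTES_PER_GIGABYTE,
    }


# 2.5 quintillion bytes, written as an exact integer.
daily = describe_daily_volume(25 * 10**17)
print(daily)
```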

So this International Digital Preservation Day, I encourage everyone to think about their digital lives, at home and at work, and think about what you need to do to make your digital “things” last. There is a field of experts in the world who are here to help. We are no further than a tweet away. We survive by collaborating and helping each other. And we’re here to help you save the bits.


Want to learn more?

Visit the Digital Preservation Coalition for advice, reports and further information: http://www.dpconline.org/ 

Speak to the digital preservation hive mind on Twitter using any of these hashtags: #digitalpreservation #digipres #digpres

For more International Digital Preservation Day activities, visit: http://www.dpconline.org/events/international-digital-preservation-day or check out the hashtag #IDPD17

DPOC: 1 year on

Oxford’s Outreach & Training Fellow, Sarah, reflects on how the first year of the DPOC project has gone and looks forward to the big year ahead.


A lot can happen in a year.

A project can finally get a name, a website can launch and a year of auditing can finally reach completion. It has been a long year of lessons and finding things for the Oxford DPOC team.

While project DR@CO and PADLOC never got off the ground, we got the DPOC Project. And with it has come a better understanding of our digital preservation practices at Bodleian Libraries. We’re starting year two with plenty of informed ideas that will lead to roadmaps for implementation and a business case to help continue to move Oxford forward with a digital preservation programme.

Auditing our collections

For the past year, Fellows have been auditing the many collections. The Policy and Planning Fellow spent nearly 6 months tracking down the digitized content of Bodleian Libraries across tape storage and many legacy websites. There was more to be found on hard drives under desks, on network drives and CDs. What Edith found was 20 years of digitized images at Bodleian Libraries. From that came a roadmap and recommendations to improve storage, access and workflows. Changes have already been made to the digitization workflow (we use jpylyzer now instead of JHOVE) and more changes are in progress.

James, the Technical Fellow at Oxford, has been looking at validating and characterising the TIFFs we have stored on tape, especially the half a million TIFFs from the Polonsky Foundation Digitization Project. There were not only some challenges in recovering the files from tape to disk for the characterisation and validation process, but there was also an issue with customising the output from JHOVE in XML. James did find a workaround for getting the outputs into a reporting tool for assessment in the end, but not without plenty of trial and error. However, we’re learning more about our digitized collections (and the preservation challenges facing them) and during year 2 we’ll be writing more about that as we continue to roadmap our future digital preservation work.

Auditing our skills

I spoke to a lot of staff and ran an online survey to understand the training needs of Bodleian Libraries. It is clear that we need to develop a strong awareness about digital preservation and its fundamental importance to the long-term accessibility of our digital collections. We also need to create a strong shared language in order to have these important discussions; this is important when we are coming together from several different disciplines, each with a different language. As a result, some training has begun in order to get staff thinking about the risks surrounding the digital content we use every day, in order to later translate it into our collections. The training and skills gaps identified from the surveys done in year 1 will continue to inform the training work coming in year 2.

 

What is planned for year 2?

Now that we have a clearer picture of where we are and what challenges are facing us, we’ve been putting together roadmaps and risk registers. This is allowing us to look at what implementation work we can do in the next year to set us up for the work of the next 3, 5, 10, and 15 years. There are technical implementations we have placed into a roadmap to address the major risks highlighted in our risk register. This work is hopefully going to include things like implementing PREMIS metadata and file format validation. This work will prepare us for future preservation planning.

We also have a training programme roadmap and implementation timeline. While not all of the training can be completed in year 2 of the DPOC project, a start can be made and materials prepared for a future training programme. This includes developing a training roadmap to support the technical implementations roadmap and the overall digital preservation roadmap.

There is also the first draft of our digital preservation policy to workshop with key stakeholders and develop into a final draft. There are roles and responsibilities to review and key stakeholders to work with if we want to make sustainable changes to our existing workflows.

Ultimately, what we are working towards is an organisational change. We want more people to think about digital preservation in their work. We are putting forward sustainable recommendations to help develop an ongoing digital preservation programme. There is still a lot of work ahead of us (well beyond the final year of this project), but we are hoping that what we have started will keep going even after the project reaches completion.

 

 

What is holding us back from change?

There are worse spots for a meeting. Oxford. Photo by: S. Mason

Every 3 months the DPOC teams get together in person in either Oxford, Cambridge or London (there’s also been talk of taking a meeting at Bletchley Park sometime). As this is a collaborative effort, these meetings offer a rare opportunity to work face-to-face instead of via Skype with the endless issues around screen sharing and poor connections. Good ideas come when we get to sit down together.

As our next joint board meeting is next week, it was important to look over the work of the past year and make sure we are happy with the plan for year two. Most importantly, we wanted to discuss the messages we need to give our institutions as we look towards the sustainability of our digital preservation activities. How do we ensure that the earlier work and the work being done by us does not get repeated in 2-5 years’ time?

Silos in institutions

This is especially complicated when dealing with institutions like Oxford and Cambridge. We are big and old institutions with teams often working in silos. What does siloing have an effect on? Well, everything. Communication, effort, research—it all suffers. Work done previously is done again. Over and over.

The same problems are being tackled within different silos; this is duplicated and wasted effort if they are not communicating their work to each other. This means that digital preservation efforts can be fractured and imbalanced if institutional collaboration is ignored. We have an opportunity and responsibility in this project to get people together and to get them to talk openly about the digital preservation problems they are each trying to tackle.

Managers need to lead the culture change in the institution

While not always the case, it is important that managers do not just sit back and say “you will never get this to work” or “it has always been this way.” We need them on our side; they are often the gatekeepers of silos. We have to bring them together in order to start opening the silos.

It is within their power to be the agents of change; we have to empower them to believe in changing the habits of our institution. They have to believe that digital preservation is worth it if their teams are to believe it too.

This might be the ‘carrot and stick’ approach or the ‘carrot’ only, but whatever approach is used, there are a number of points we agreed needed to be made clear:

  • our digital collections are significant and we have made assurances about their preservation and long term access
  • our institutional reputation plays a role in the preservation of our digital assets
  • digital preservation is a moving target and we must be moving with it
  • digital preservation will not be “solved” through this project, but we can make a start; it is important that this is not then the end.

Roadmap to sustainable digital preservation

Backing up any messages is the need for a sustainable roadmap. If you want change to succeed and if you want digital preservation to be a core activity, then steps must be actionable and incremental. Find out where you are, where you want to go and then outline the timeline of steps it will take to get there. Consider using maturity models to set goals for your roadmap, such as Kenney and McGovern’s, Brown’s or the NDSA model. Each is slightly different and some might be more suitable for your institution than others, so have a look at all of them.

It’s like climbing a mountain. I don’t look at the peak as I walk; it’s too far away and too unattainable. Instead, I look at my feet and the nearest landmark. Every landmark I pass is a milestone and I turn my attention to the next one. Sometimes I glance up at the peak, still in the distance—over time it starts to grow closer. And eventually, my landmark is the peak.

It’s only when I get to the top that I see all of the other mountains I also have to climb. And so I find my landmarks and continue on. I consider digital preservation a bit of the same thing.

What are your suggestions for breaking down the silos and getting fractured teams to work together? 

DPASSH: Getting close to producers, consumers and digital preservation

Sarah shares her thoughts after attending the DPASSH (Digital Preservation in the Arts, Social Sciences and Humanities) Conference at the University of Sussex (14 – 15 June).


DPASSH is a conference that the Digital Repository of Ireland (DRI) puts on with a host organisation. This year, it was hosted by the Sussex Humanities Lab at the University of Sussex, Brighton. What is exciting about this digital preservation conference is that it brings together creators (producers) and users (consumers) with digital preservation experts. Most digital preservation conferences end up being a bit of an echo chamber, full of practitioners and vendors only. But what about the creators and the users? What knowledge can we share? What can we learn?

DPASSH is a small conference, but it was an opportunity to see what researchers are creating and how they are engaging with digital collections. For example, in her talk Stefania Forlini discussed the perils of a content-centric digitisation process where unique print artefacts are all treated the same; the process flattens everything into identical objects though they are very different. What about the materials and the physicality of the object? It has stories to tell as well.

To Forlini, books span several domains of sensory experience and our digitised collections should reflect that. With the Gibson Project, Forlini and project researchers are trying to find ways to bring some of those experiences back through the Speculative W@nderverse. They are currently experimenting with embossing different kinds of paper with a code that can be read by a computer. The computer can then bring up the science fiction pamphlets that are made of that specific material. Then a user can feel the physicality of the digitised item and then explore the text, themes and relationships to other items in the collection using generous interfaces. This combines a physical sensory experience with a digital experience.

For creators, the decision of what research to capture and preserve is sometimes difficult; often they lack the tools to capture the information. Other times, creators do not have the skills to perform proper archival selection. Athanasios Velios offered a tool for digital artists called Artivity. Artivity can capture the actions performed on a digital artwork in certain programs, like Photoshop or Illustrator. This allows the artist to record their creative process and gives future researchers the opportunity to study it. Steph Taylor from CoSector suggested in her talk that creators are archivists now, because they are constantly appraising their digital collections and making selection decisions. It is important that archivists and digital preservation practitioners empower creators to make good decisions around what should be kept for the long-term.

As a bonus to the conference, I was given the ‘Best Tweet’ award by the DPC and DPASSH. It was a nice way to round out two good, informative days. I plan to purchase many books with my gift voucher!

I certainly hope they hold the conference next year, as I think it is important for researchers in the humanities, arts and social sciences to engage with digital preservation experts, archivists and librarians. There is a lot to learn from each other. How often do we get our creators and users in one room with us digital preservation nerds?

Preserving research – update from the Cambridge Technical Fellow

Cambridge’s Technical Fellow, Dave, discusses some of the challenges and questions around preserving ‘research output’ at Cambridge University Library.


One of the types of content we’ve been analysing as part of our initial content survey has been labelled ‘research output’. We knew this was a catch-all term, but (according to the categories in Cambridge’s Apollo Repository), ‘research output’ potentially covers: “Articles, Audio Files, Books or Book Chapters, Chemical Structures, Conference Objects, Datasets, Images, Learning Objects, Manuscripts, Maps, Preprints, Presentations, Reports, Software, Theses, Videos, Web Pages, and Working Papers”. Oh – and of course, “Other”. Quite a bundle of complexity to hide behind one simple ‘research output’ label.

One of the categories in particular, ‘Dataset’, zooms the fractal of complexity in one step further. So far, we’ve only spoken in-depth to a small set of scientists (though our participation on Cambridge’s Research Data Management Project Group means we have a great network of people to call on). However, both meetings we’ve had indicate that ‘Datasets’ are a whole new Pandora’s box of complicated management, storage and preservation challenges.

However – if we pull back from the complexity a little, things start to clarify. One of the scientists we spoke to (Ben Steventon at the Steventon Group) presented a very clear picture of how his research ‘tiered’ the data his team produced, from 2-4 terabyte outputs from a Light Sheet Microscope (at the Cambridge Advanced Imaging Centre) via two intermediate layers of compression and modelling, to ‘delivery’ files only megabytes in size. One aspect of the challenge of preserving such research then, would seem to be one of tiering preservation storage media to match the research design.

(I believe our colleagues at the JISC, who Cambridge are working with on the Research Data Management Shared Service Pilot Project, may be way ahead of us on this…)

Of course, tiering storage is only one part of the preservation problem for research data: the same issues of acquisition and retention that have always been part of archiving still apply… But that’s perhaps where the ‘delivery’ layer of the Steventon Group’s research design starts to play a role. In 50 or 100 years’ time, which sets of the research data might people still be interested in? It’s obviously very hard to tell, but perhaps it’s more likely to be the research that underpins the key model: the major finding?

Reaction to the ‘delivered research’ (which included papers, presentations and perhaps three or four more from the list above) plays a big role, here. Will we keep all 4TBs from every Light Sheet session ever conducted, for the entirety of a five or ten-year project? Unlikely, I’d say. But could we store (somewhere cold, slow and cheap) the 4TBs from the experiment that confirmed the major finding?

That sounds a bit more within the realms of possibility, mostly because it feels as if there might be a chance that someone might want to work with it again in 50 years’ time. One aspect of modern-day research that makes me feel this might be true is the complexity of the dependencies between pieces of modern science, and the software it uses in particular (Blender, for example, or Fiji). One could be pessimistic here and paint a negative scenario: what if a major bug is found in one of those apps that calls into question the science ‘above it in the chain’? But there’s an optimistic view, here, too… What if someone comes up with an entirely new, more effective analysis method that replaces something current science depends on? Might there not be value in pulling the data from old experiments ‘out of the archive’ and re-running them with the new kit? What would we find?

We’ll be able to address some of these questions in a bit more detail later in the project. However, one of the more obvious things talking to scientists has revealed is that many of them seem to have large collections of images that need careful management. That seems quite relevant to some of the more ‘close to home’ issues we’re looking at right now in The Library.

When was that?: Maintaining or changing ‘created’ and ‘last modified’ dates

Sarah has recently been testing scenarios to investigate the question of changes in file ‘date created’ and ‘last modified’ metadata. When building training, it’s always best to test out your advice before giving it, and below is the result of Sarah’s research with helpful screenshots.


Before doing some training that involved teaching better recordkeeping habits to staff, I ran some tests to be sure that I was giving the right advice when it came to created and last modified dates. I am often told by people in the field that these dates are always subject to change, but are they really? I knew I would tell staff to put created dates in file names or in document headers in order to retain that valuable information, but could the file maintain the correct embedded date anyway? I set out to test a number of scenarios on both my Mac OS X laptop and Windows desktop.
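For anyone who wants to repeat these tests without relying on screenshots, here is a minimal sketch of one way to read the dates the operating system reports for a file. It uses Python's standard `os.stat`; the function name `read_dates` is hypothetical, and which attribute holds the created date varies by platform, as the comments note. This is not how I ran my own tests, just one possible approach.

```python
import os
from datetime import datetime, timezone


def read_dates(path):
    """Return the 'created' and 'last modified' dates the OS reports for a file."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)

    # st_birthtime exists on macOS (and on Windows from Python 3.12).
    # On older Windows Pythons, st_ctime holds the creation time instead.
    # On Linux, st_ctime is the metadata-change time, so we report None.
    birth = getattr(info, "st_birthtime", None)
    if birth is None and os.name == "nt":
        birth = info.st_ctime
    created = (
        datetime.fromtimestamp(birth, tz=timezone.utc) if birth is not None else None
    )

    return {"created": created, "modified": modified}
```

Calling `read_dates` on a file before and after a transfer, then comparing the two results, gives a quick record of which dates changed in each scenario.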

Scenario 1: Downloading from cloud storage (Google Drive)

This was an ALL DATES change for both Mac OS X and Windows.

Scenario 2: Uploading to cloud storage (Google Drive)

Once again this was an ALL DATES change for both systems.

Note: I trialled this a second time with the Google Drive for PC application and in OS X and found that created and last modified dates do not change when the file is uploaded to or downloaded from the Google Drive folder on the PC. However, when in Google Drive via the website, the created date is different (the date/time of upload), though the ‘file info’ will confirm the date has not changed. Just to complicate things.

Scenario 3: Transfer from a USB

Mac OS X had no change to the dates. Windows showed an altered created date, but maintained the original last modified date.

Scenario 4: Transfer to a USB

Once again there was no change to the dates in Mac OS X. Windows showed an altered created date, but maintained the original last modified date.

Note: I looked into scenarios 3 and 4 for Windows a bit further and saw that Robocopy is a command-line tool that will allow directories to be copied across and maintains those date attributes. I copied a ‘TEST’ folder containing the file from the Windows computer to the USB, and back again. It did what was promised and there were no changes to either date in the file. It is a bit annoying that an extra step is required (that many people would find technically challenging and therefore avoid).
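Robocopy is Windows-only. As a rough cross-platform alternative (a different technique, not what I tested above), Python's standard `shutil.copy2` copies a file and tries to carry its metadata across, including the last modified time. The sketch below wraps it in a hypothetical helper, `copy_keeping_modified_date`, that checks the result. Note that the created date of the copy may still be new, depending on the platform.

```python
import os
import shutil


def copy_keeping_modified_date(source, destination):
    """Copy a file and check that the last modified date survived the copy."""
    # copy2 copies the contents and then tries to copy metadata such as
    # the last modified time; the copy's created date may still be new.
    copied_path = shutil.copy2(source, destination)

    before = os.stat(source).st_mtime
    after = os.stat(copied_path).st_mtime
    # Some file systems (FAT-formatted USB sticks, for example) store
    # modified times with 2-second precision, so allow a small difference.
    if abs(before - after) > 2:
        raise RuntimeError(f"last modified date changed for {copied_path}")
    return copied_path
```

The explicit check afterwards is deliberate: given how inconsistent my results were, it seems safer to confirm the date after each copy than to trust the tool.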

Scenario 5: Moving between folders

No change on either system. This was a relief for me considering how often I move files around my directories.

Conclusions

When in doubt (and you should always be in doubt), test the scenario. Even when I tested these scenarios three or four times, it did not always come out with the same result. That alone should make one cautious. I still stick to putting the created date in the file name and in the document itself (where possible), but it doesn’t mean I always receive documents that way.
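As a small illustration of putting the created date in the file name, here is a sketch that builds a dated name. The helper `with_created_date` and the `YYYY-MM-DD_` prefix convention are assumptions for the example, not an established standard; use whatever naming rule your team has agreed.

```python
from datetime import date
from pathlib import Path


def with_created_date(path, created):
    """Return the path renamed so the created date leads the file name."""
    original = Path(path)
    # ISO 8601 (YYYY-MM-DD) sorts chronologically in a directory listing.
    return original.with_name(f"{created.isoformat()}_{original.name}")
```

For example, `with_created_date("notes/minutes.docx", date(2017, 6, 27))` gives `notes/2017-06-27_minutes.docx`. The function only builds the new name; renaming the file itself is left to the caller.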

Creating a zip of files/folders before transfer is one method of preserving dates, but I had some weird issues trying to unzip the file in cloud storage that took a few tries before the dates remained preserved. It is also possible to use Quickhash for transferring files unchanged (and it generates a checksum).
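The two ideas above can be sketched in Python, under some assumptions. A zip archive records each file's last modified time (to the nearest two seconds, in local time), but as far as I know Python's `zipfile` module does not reapply that time on extraction, so the hypothetical helper `extract_keeping_dates` sets it back by hand. The `sha256_of` function generates a checksum in the same spirit as QuickHash, though it is not QuickHash itself.

```python
import hashlib
import os
import time
import zipfile


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_keeping_dates(zip_path, target_dir):
    """Extract a zip and reset each file's last modified date from the archive."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            out_path = archive.extract(member, target_dir)
            if not member.is_dir():
                # date_time is a local-time tuple with 2-second precision.
                stamp = time.mktime(member.date_time + (0, 0, -1))
                os.utime(out_path, (stamp, stamp))
                extracted.append(out_path)
    return extracted
```

Comparing `sha256_of` for the original and the extracted file confirms the content is unchanged, while the last modified date can be checked separately. The created date is not stored in a standard zip entry, so this approach only helps with last modified.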

I ignored the last accessed date during testing, because it was too easy to accidentally double-click a file and change it (as you can see happened to my Windows 7 test version).

Has anyone tested any other scenarios to assess when file dates are altered? Does anyone have methods for transferring files without causing any change to dates?