Transcribing interviews

The second instalment of Lee’s experience running a skills audit at Cambridge University Library. He explains what is needed to be able to transcribe the lengthy and informative interviews with staff.


There’s no ground-breaking digital preservation goodness contained within this post, so you have permission to leave this page now. However, this groundwork is crucial to gaining an understanding of how institutions can prepare for digital preservation skills and knowledge development. It may also be useful to anyone who is preparing to transcribe recorded interviews.

Post-interview: transcribing the recording

Once you have interviewed your candidates and made sure that you have all the recordings (suitably backed up three times onto private, network-free storage like an encrypted USB stick, so as to respect privacy wishes), it is time to transcribe.
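A lightweight way to confirm that those backup copies really are identical to the original is to compare checksums. This is only a sketch in Python (the file paths and function names are my own invention, not part of any particular workflow); any checksum tool would do the same job:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 16):
    """Compute the SHA-256 checksum of a file, reading it in chunks
    so large audio recordings don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def backups_match(original, copies):
    """True only if every backup copy has the same checksum as the original."""
    reference = sha256_of(original)
    return all(sha256_of(copy) == reference for copy in copies)
```

Run against each of the three copies, a mismatch tells you immediately which backup needs re-doing before you delete anything.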

So, what do you need?

  • A very quiet room. Preferably silence, where there are no distractions and where you can’t distract people. You may wish to choose the dictation path, and if you do that in an open-plan office, you may attract attention. You will also be reciting information that you have assured participants will remain confidential.
  • Audio equipment. You will need a device that can play your audio files and has playback controls built into it. You can use your device’s speakers, headphones (preferably with a control built into the wire), or a foot pedal.
  • Time. Bucket loads of it. If you are doing other work, this needs to become the big rock in your time planning, everything else should be mere pebbles and sand. This is where manager support is really helpful, as is…
  • Understanding. The understanding that this will rule your working life for the next month or two, and the understanding of those around you of the size of the task. Having an advocate with prior experience of this type of work is invaluable.
  • Patience. Of a saint.
  • Simple transcription rules. Given the timeframes of the project, complex transcription would have been too time consuming. The following guide, as used by the University of California, San Diego, is really useful (with nice big text):
    Dresing, Thorsten / Pehl, Thorsten / Schmieder, Christian (2015): Manual (on) Transcription: Transcription Conventions, Software Guides and Practical Hints for Qualitative Researchers. 3rd English edition. Marburg. Available online: http://www.audiotranskription.de/english/transcription-practicalguide.htm (last accessed: 27.06.2017). ISBN: 978-3-8185-0497-7.

Cropped view of a person’s hands typing on a laptop computer. Image credit: Designed by Freepik

What did you do?

Using a Mac environment, I imported the audio files for transcription into a desktop folder and created a play list in iTunes. I reduced the iTunes application to the mini player view and opened up Word to type into. I plugged in my headphones and pressed play and typed as I was listening.

If you get tired typing, the Word application on my Mac has a nifty voice recognition package; it’s uncannily good now. I tried to route the output sound into the mic using Soundflower, but that turned out to be wasted time: when the transcription did yield readable text, it used words worthy of inciting a Mary Whitehouse campaign. I did find, though, that dictation provided a rest for weary fingers. After a while, you will probably need to rest a weary voice, so you can switch back to typing.

When subjects started talking quickly, I needed a way to slow them down, as constantly pressing pause and rewind got onerous. A quick fix for this was to download Audacity, which can slow down your sound files. Once the comedic effect of voice alteration has worn off, it becomes easier to transcribe as you don’t have to pause and rewind as much.
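Audacity does this interactively, but if you want to batch-process plain WAV files, the same trick can be sketched with Python’s standard-library wave module (the file names and function here are hypothetical, not how I actually did it): lowering the frame rate makes players play the file back more slowly, and it is also what produces the comedic pitch drop.

```python
import wave

def slow_down(src_path, dst_path, factor=0.75):
    """Write a copy of a WAV file with a reduced frame rate, so players
    play it back more slowly (pitch drops too, hence the comedy)."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        # Same audio frames, slower clock: playback stretches by 1/factor
        dst.setframerate(int(params.framerate * factor))
        dst.writeframes(frames)
```

At a factor of 0.75, a 60-minute recording takes 80 minutes to play, so there is a trade-off between typing comfort and listening time.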

Process-wise, it doesn’t sound much, and it isn’t. It’s just the sheer hours of audio that need to be made legible through listening, rewinding and typing.

How can the process be made (slightly) easier?

  • Investigate transcription technology and processes. Investigate beforehand what technologies you can access. I wish I had done this rather than relying on the expectation that I would just be listening and typing. I didn’t find a single website with the answer, but a thoughtful web search can help you with certain parts of the transcription method.
  • Talk slowly. This one doesn’t apply to the transcription process but to the interview process. Try to ask the questions a little more slowly than you usually would, as the respondent will subconsciously mimic your speed of delivery and slow themselves down.

Hang on in there, it’s worth it

Even if you choose to incorporate the suggestions above, be under absolutely no illusions: transcription is a gruelling task. That’s not a slight against the participants’ responses, for they will be genuinely interesting and insightful. No, it’s a comment on the frustration of the process and the sheer mental grind of getting through it. I must admit I had only arrived at a reasonably happy transcription method by the time I reached interview number fourteen (of fifteen). However, the effort is completely worth it. In the end, I now have around 65,000 quality words (research data) to analyse, to understand what existing digital skills, knowledge, ways of learning and ways of managing change exist within my institution that can be fed into the development of digital preservation skills and knowledge.

DPASSH: Getting close to producers, consumers and digital preservation

Sarah shares her thoughts after attending the DPASSH (Digital Preservation in the Arts, Social Sciences and Humanities) Conference at the University of Sussex (14 – 15 June).


DPASSH is a conference that the Digital Repository of Ireland (DRI) puts on with a host organisation. This year, it was hosted by the Sussex Humanities Lab at the University of Sussex, Brighton. What is exciting about this digital preservation conference is that it brings together creators (producers) and users (consumers) with digital preservation experts. Most digital preservation conferences end up being a bit of an echo chamber, full of practitioners and vendors only. But what about the creators and the users? What knowledge can we share? What can we learn?

DPASSH is a small conference, but it was an opportunity to see what researchers are creating and how they are engaging with digital collections. For example, in Stefania Forlini’s talk, she discussed the perils of a content-centric digitisation process where unique print artefacts are all treated the same; the process flattens everything into identical objects even though they are very different. What about the materials and the physicality of the object? It has stories to tell as well.

To Forlini, books span several domains of sensory experience and our digitised collections should reflect that. With the Gibson Project, Forlini and project researchers are trying to find ways to bring some of those experiences back through the Speculative W@nderverse. They are currently experimenting with embossing different kinds of paper with a code that can be read by a computer. The computer can then bring up the science fiction pamphlets that are made of that specific material. Then a user can feel the physicality of the digitised item and then explore the text, themes and relationships to other items in the collection using generous interfaces. This combines a physical sensory experience with a digital experience.

For creators, the decision of what research to capture and preserve is sometimes difficult; often they lack the tools to capture the information. Other times, creators do not have the skills to perform proper archival selection. Athanasios Velios offered a tool solution for digital artists called Artivity. Artivity can capture the actions performed on a digital artwork in certain programs, like Photoshop or Illustrator. This allows artists to record their creative process and gives future researchers the opportunity to study that process. Steph Taylor from CoSector suggested in her talk that creators are archivists now, because they are constantly appraising their digital collections and making selection decisions. It is important that archivists and digital preservation practitioners empower creators to make good decisions around what should be kept for the long-term.

As a bonus to the conference, I was awarded the ‘Best Tweet’ award by the DPC and DPASSH. It was a nice way to round out two good, informative days. I plan to purchase many books with my gift voucher!

I certainly hope they hold the conference next year, as I think it is important for researchers in the humanities, arts and social sciences to engage with digital preservation experts, archivists and librarians. There is a lot to learn from each other. How often do we get our creators and users in one room with us digital preservation nerds?

Policy ramblings

For the second stage of the DPOC project Oxford and Cambridge have started looking at policy and strategy development. As part of the DPOC deliverables, the Policy and Planning Fellows will be collaborating with colleagues to produce a digital preservation policy and strategy for their local institutions. Edith (Policy and Planning Fellow at Oxford) blogs about what DPOC has been up to so far.


Last Friday I met with Somaya (Policy and Planning Fellow) and Sarah (Training and Outreach Fellow) at the British Library in London. We spent the day discussing the review work DPOC has done so far on digital preservation policies. The meeting also gave us a chance to outline an action plan for consulting stakeholders at CUL and Bodleian Libraries on future digital preservation policy development.

Step 1: Policy review work
Much work has already gone into researching digital preservation policy development [see for example the SCAPE project and OSUL’s policy case study]. As considerable effort has been exerted in this area, we want to make sure we are not reinventing the wheel while developing our own digital preservation policies. We therefore started by reading as many digital preservation policies from other organisations as we could possibly get our hands on. (Once we ran out of policies in English, I started feeding promising looking documents into Google Translate, with a mixed bag of results.) The policy review drew attention to aspects of policies which we felt were particularly successful, and which could potentially be re-purposed for the local CUL and Bodleian Libraries contexts.

My colleague Sarah helped me with the initial policy review work. Between the two of us, we read 48 policies dating from 2008 to 2017. However, determining which documents were actual policies was trickier than we had first anticipated. We found that documents named ‘strategy’ sometimes read as policy, and documents named ‘policy’ sometimes read as more low-level procedures. For this reason, we decided to add to the review another 12 strategy documents which had strong elements of policy in them. This brought us up to a round 60 documents in total.

So we began reading… but we soon found that once you are on your 10th policy of the day, you start to get them muddled up. To better organise our review work, we decided to use a classification system developed by Kirsten Snawder (2011) and adapted by Madeline Sheldon (2013). Snawder and Sheldon identified nineteen common topics from digital preservation policies, ranging from ‘access and use’ to ‘preservation planning’ [for the full list of topics, see Sheldon’s article on The Signal from 2013]. I was interested in seeing how many policies would make direct reference to the Open Archival Information System (OAIS) reference model, so I added this as an additional topic to the original nineteen identified by Snawder and Sheldon.
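Tagging 60 documents against twenty topics quickly becomes unwieldy on paper, and even a crude tally helps when comparing coverage. A minimal sketch of the kind of bookkeeping involved (the policy names and topic tags here are invented for illustration, not our actual annotations):

```python
from collections import Counter

# Hypothetical annotations: each policy mapped to the topics it covers
annotations = {
    "Policy A": ["access and use", "preservation planning", "OAIS"],
    "Policy B": ["access and use", "metadata"],
    "Policy C": ["preservation planning", "OAIS"],
}

# Tally how many policies cover each topic
coverage = Counter(topic for topics in annotations.values() for topic in topics)

for topic, count in coverage.most_common():
    print(f"{topic}: {count} of {len(annotations)} policies")
```

Sorting by frequency makes it easy to see at a glance which topics (like an OAIS reference) appear in most policies and which are rare.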

Reviewing digital preservation policies written between 2008 and 2017

Step 2: Looking at findings
Interestingly, after we finished annotating the policy documents, we did not find a correlation between covering all of Snawder and Sheldon’s nineteen topics and having what we perceived as an effective policy. Effective in this context was defined as the ability of the policy to clearly guide and inform preservation decisions within an organisation. In fact, the opposite was more common: we judged several policies which had good coverage of topics from the classification system to be too lengthy, unclear, and sometimes inaccessible due to heavy use of digital preservation terminology.

In terms of OAIS, another interesting finding was that 33 out of 60 policies made direct reference to the OAIS. In addition to these 33, several of the ones which did not make an overt reference to the model still used language and terminology derived from it.

So while we found that the taxonomy was not able to guide us on which policy topics were an absolute essential in all circumstances, using it was a good way of arranging and documenting our thoughts.

Step 3: Thinking about guiding principles for policy writing
What this foray into digital preservation policies has shown us is that there is no ‘one size fits all’ approach or magic formula of topics which makes a policy successful. What works in the context of one institution will not work in another. What ultimately makes a successful policy also comes down to communication of the policy and organisational uptake. However, there are a number of high-level principles which the three of us all felt strongly about and which we would like to guide future digital preservation policy development at our local institutions.

Principle 1: Policy should be accessible to a broad audience. Contrary to findings from the policy review, we believe that digital preservation specific language (including OAIS) should be avoided at policy level if possible. While reviewing policy statements we regularly asked ourselves:

“Would my mother understand this?”

If the answer is yes, the statement gets to stay. If it is no, maybe consider re-writing it. (Of course, this does not apply if your mother works in digital preservation.)

Principle 2: Policy also needs to be high-level enough that it does not require constant re-writing in order to make minor procedural changes. In general, including individuals’ names or prescribing specific file formats can make a policy go out of date quickly. It is easier to change these if they are included in lower level procedures and guidelines.

Principle 3: Digital preservation requires resources. Getting financial commitment at policy level to invest in staff is important. It takes time to build organisational expertise in digital preservation, but losing it can happen a lot quicker. Even if you choose to outsource several aspects of digital preservation, it is important that staff have the skills which enable them to understand and critically assess the work of external digital preservation service providers.

What are your thoughts? Do you have other principles guiding digital preservation policy development in your organisations? Do you agree or disagree with our high-level principles?

Preserving research – update from the Cambridge Technical Fellow

Cambridge’s Technical Fellow, Dave, discusses some of the challenges and questions around preserving ‘research output’ at Cambridge University Library.


One of the types of content we’ve been analysing as part of our initial content survey has been labelled ‘research output’. We knew this was a catch-all term, but (according to the categories in Cambridge’s Apollo Repository), ‘research output’ potentially covers: “Articles, Audio Files, Books or Book Chapters, Chemical Structures, Conference Objects, Datasets, Images, Learning Objects, Manuscripts, Maps, Preprints, Presentations, Reports, Software, Theses, Videos, Web Pages, and Working Papers”. Oh – and of course, “Other”. Quite a bundle of complexity to hide behind one simple ‘research output’ label.

One of the categories in particular, ‘Dataset’, zooms the fractal of complexity in one step further. So far, we’ve only spoken in-depth to a small set of scientists (though our participation on Cambridge’s Research Data Management Project Group means we have a great network of people to call on). However, both meetings we’ve had indicate that ‘Datasets’ are a whole new Pandora’s box of complicated management, storage and preservation challenges.

However – if we pull back from the complexity a little, things start to clarify. One of the scientists we spoke to (Ben Steventon at the Steventon Group) presented a very clear picture of how his research ‘tiered’ the data his team produced, from 2-4 terabyte outputs from a Light Sheet Microscope (at the Cambridge Advanced Imaging Centre) via two intermediate layers of compression and modelling, to ‘delivery’ files only megabytes in size. One aspect of the challenge of preserving such research then, would seem to be one of tiering preservation storage media to match the research design.

(I believe our colleagues at Jisc, with whom Cambridge is working on the Research Data Management Shared Service Pilot Project, may be way ahead of us on this…)

Of course, tiering storage is only one part of the preservation problem for research data: the same issues of acquisition and retention that have always been part of archiving still apply… But that’s perhaps where the ‘delivery’ layer of the Steventon Group’s research design starts to play a role. In 50 or 100 years’ time, which sets of the research data might people still be interested in? It’s obviously very hard to tell, but perhaps it’s more likely to be the research that underpins the key model: the major finding?

Reaction to the ‘delivered research’ (which included papers, presentations and perhaps three or four more from the list above) plays a big role, here. Will we keep all 4TBs from every Light Sheet session ever conducted, for the entirety of a five or ten-year project? Unlikely, I’d say. But could we store (somewhere cold, slow and cheap) the 4TBs from the experiment that confirmed the major finding?

That sounds a bit more within the realms of possibility, mostly because it feels as if there might be a chance that someone might want to work with it again in 50 years’ time. One aspect of modern-day research that makes me feel this might be true is the complexity of the dependencies between pieces of modern science, and the software it uses in particular (Blender, for example, or Fiji). One could be pessimistic here and paint a negative scenario: what if a major bug is found in one of those apps that calls into question the science ‘above it in the chain’? But there’s an optimistic view here, too… What if someone comes up with an entirely new, more effective analysis method that replaces something current science depends on? Might there not be value in pulling the data from old experiments ‘out of the archive’ and re-running it with the new kit? What would we find?

We’ll be able to address some of these questions in a bit more detail later in the project. However, one of the more obvious things talking to scientists has revealed is that many of them seem to have large collections of images that need careful management. That seems quite relevant to some of the more ‘close to home’ issues we’re looking at right now in The Library.

When was that?: Maintaining or changing ‘created’ and ‘last modified’ dates

Sarah has recently been testing scenarios to investigate the question of changes in file ‘date created’ and ‘last modified’ metadata. When building training, it’s always best to test out your advice before giving it, and below is the result of Sarah’s research, with helpful screenshots.


Before doing some training that involved teaching better recordkeeping habits to staff, I ran some tests to be sure that I was giving the right advice when it came to created and last modified dates. I am often told by people in the field that these dates are always subject to change—but are they really? I knew I would tell staff to put created dates in file names or in document headers in order to retain that valuable information, but could the file maintain the correct embedded date anyway? I set out to test a number of scenarios on both my Mac OS X laptop and Windows desktop.
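The embedded dates can be read straight off the filesystem, which is how each scenario below was checked (via the Finder and file Properties dialogs in my case; the Python sketch here is just one illustrative way to do it, not the method I used). Note that ‘created’ is the slippery one, which is exactly why the results vary by platform:

```python
import os
from datetime import datetime

def file_dates(path):
    """Return a file's 'created' and 'last modified' dates.
    'Created' is platform-dependent: macOS exposes st_birthtime, and on
    Windows st_ctime is the creation time, but on Linux st_ctime is the
    last *metadata change* time, not creation at all."""
    st = os.stat(path)
    created = getattr(st, "st_birthtime", st.st_ctime)
    return {
        "created": datetime.fromtimestamp(created),
        "modified": datetime.fromtimestamp(st.st_mtime),
    }
```

Capturing these values before and after a transfer is enough to show whether the operation altered them.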

Scenario 1: Downloading from cloud storage (Google Drive)

This was an ALL DATES change for both Mac OS X and Windows.

Scenario 2: Uploading to cloud storage (Google Drive)

Once again this was an ALL DATES change for both systems.

Note: I trialled this a second time with the Google Drive for PC application and in OS X, and found that created and last modified dates do not change when the file is uploaded to or downloaded from the Google Drive folder on the PC. However, when viewing the file in Google Drive via the website, the created date shown is different (the date/time of upload), though the ‘file info’ will confirm the date has not changed. Just to complicate things.

Scenario 3: Transfer from a USB

Mac OS X had no change to the dates. Windows showed an altered created date, but maintained the original last modified date.

Scenario 4: Transfer to a USB

Once again there was no change to the dates in Mac OS X. Windows showed an altered created date, but maintained the original last modified date.

Note: I looked into scenarios 3 and 4 for Windows a bit further and saw that Robocopy, a command-prompt tool, will copy directories across while maintaining those date attributes. I copied a ‘TEST’ folder containing the file from the Windows computer to the USB, and back again. It did what was promised: there were no changes to either date on the file. It is a bit annoying that an extra step is required (one that many people would find technically challenging and therefore avoid).
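For scripted transfers, Python offers a similar trick to Robocopy: shutil.copy2 copies the file’s timestamp metadata along with its contents, where plain shutil.copy would reset the modified date. A sketch (the wrapper function and paths are my own, for illustration):

```python
import shutil

def copy_preserving_dates(src, dst):
    """Copy a file, keeping the source's last-modified time on the copy.
    shutil.copy2 is copy + copystat. Note the 'created' date is still
    assigned by the destination filesystem, as in scenarios 3 and 4."""
    shutil.copy2(src, dst)
    return dst
```

Like Robocopy, this only helps if people remember (and are willing) to take the extra step rather than dragging files in the file manager.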

Scenario 5: Moving between folders

No change across either system. This was a relief for me, considering how often I move files around my directories.

Conclusions

When in doubt (and you should always be in doubt), test the scenario. Even when I tested these scenarios three or four times, they did not always come out with the same result. That alone should make one cautious. I still stick to putting the created date in the file name and in the document itself (where possible), but it doesn’t mean I always receive documents that way.

Creating a zip of files/folders before transfer is one method of preserving dates, but I had some weird issues trying to unzip the file in cloud storage; it took a few tries before the dates remained preserved. It is also possible to use Quickhash for transferring files unchanged (it also generates a checksum).

I ignored the last accessed date during testing, because it was too easy to accidentally double-click a file and change it (as you can see happened to my Windows 7 test version).

Has anyone tested any other scenarios to assess when file dates are altered? Does anyone have methods for transferring files without causing any change to dates?