Automated transcription and some risks of machine interpretation

A few months ago, we announced we were working on Quirkos Transcribe - an automated transcription service built into Quirkos Cloud, to allow for quick, accurate and secure transcription of audio files straight into text in your Quirkos projects.

Last time we detailed some of the advantages of the system, but now we are going to go into some details about how it works and some of the important methodological issues to be aware of.

So how does it work?

Well, we are using an open-source transcription tool called Whisper, which is a fairly standard transformer-based machine learning model. But what makes it different is the data that was used to train the system to understand spoken words.

For Whisper to learn what language sounds like, the main training data was 680,000 hours of audio that already had accompanying transcriptions: things like TV shows and podcasts (with subtitles), court recordings, and political debates (think C-SPAN or Hansard). Previous speech-recognition systems tended to be trained on audio without much 'text' attached, so the system was expected to guess what was being said. Since the Whisper approach included the 'correct' answers (existing transcriptions) for what people were saying, it could do a much better job of learning. The audio recordings and text were in lots of different languages, and the result was a 'model' - an algorithm that can interpret spoken words in dozens of languages. This model has been scaled to different sizes, because the full version is huge: it has 1,550 million parameters and needs 12GB of RAM to load. There are smaller and less accurate models with just 39 million parameters, which with a bit of trickery you can get running on a powerful home computer, albeit quite slowly.
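For the technically curious, here's a rough sketch of what running the open-source Whisper model yourself can look like, using the openai-whisper Python package (this assumes the package and ffmpeg are installed, and shows only the underlying open-source tool - not how Quirkos Transcribe is packaged or deployed):

```python
# A minimal sketch of running the open-source Whisper model locally.
# Assumes: pip install openai-whisper, and ffmpeg available on the system.
import whisper

# "tiny" is the 39-million-parameter model; "large" is the 1,550-million-
# parameter one, which needs far more memory (and ideally a GPU) to run.
model = whisper.load_model("tiny")

# Transcribe an audio file (the filename here is just a placeholder).
result = model.transcribe("interview_recording.mp3")

print(result["language"])  # the language Whisper thinks it heard
print(result["text"])      # the full transcript as one string

# Each segment also carries timestamps, handy for checking against the audio.
for segment in result["segments"]:
    print(f"{segment['start']:.1f}s-{segment['end']:.1f}s: {segment['text']}")
```

Even this small example shows the trade-off mentioned above: the tiny model loads quickly on an ordinary machine, but its guesses are noticeably less accurate than the full-sized model's.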

Basically, all so-called AI (Artificial Intelligence) is nothing of the sort - all this training does not create a conscious, intelligent system, just a very complex algorithm (the model) that makes decisions to interpret data. But rather than being programmed directly by a human, machine learning (or ML, the better term) uses a few core mathematical approaches to build huge probabilistic models that perform a certain task.

Wait, what?

OK, I'm going to try and give a simplified and crude explanation.


Basically, a computer program has been given a lot of data and asked to find common rules that apply. So you give the computer thousands of recordings where different people, in different places, are saying the word tomato (or tomahto, depending on your accent). The computer 'listens' to all this data and builds a little branching model where, if the first sound is 'to', it's 10% likely to be tomato and 90% likely to be something else. If the next sound is 'ma', then it's 20% likely to be to-ma-to and 20% likely to be Tho-ma-s (even though Thomas starts with 'Th', phonetically it sounds similar).

So the computer builds, tests, rejects and refines these little probability branches on the training data it has, until it has millions and millions of these 'parameters'. When a bunch of these are put together, it will suggest that when it hears me say 'tomato' there's an 80% probability I'm saying tomato. It might also look at the context of the words around it, and decide that since I'm saying 'I love tomato' it could be 'tomato' or 'Thomas', but if I then say 'when they are ripe' it's more likely to be tomato. This context comes from the model 'learning' the kinds of things people typically say from the data it has heard. This is just one reason to treat such results with caution, as they tend to favour the more likely result over the more unusual one.
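To make that crude explanation a little more concrete, here's a toy sketch in Python. This is emphatically not how Whisper works internally - just the general idea of candidate words whose probabilities get nudged by context (the numbers and word lists are invented purely for illustration):

```python
# A toy illustration of context shifting probabilities. Real models learn
# millions of parameters from data, not a hand-written table like this.

# Rough guesses for what the sound "to-ma-..." could be on its own.
base_probabilities = {"tomato": 0.5, "Thomas": 0.5}

# Invented context words that make one reading more plausible than the other.
context_boosts = {
    "ripe": {"tomato": 2.0, "Thomas": 0.5},
    "love": {"tomato": 1.0, "Thomas": 1.0},  # ambiguous either way
}

def guess_word(context_words):
    scores = dict(base_probabilities)
    for word in context_words:
        for candidate, boost in context_boosts.get(word, {}).items():
            scores[candidate] *= boost
    total = sum(scores.values())  # normalise back into probabilities
    return {candidate: score / total for candidate, score in scores.items()}

print(guess_word(["love"]))          # still a coin toss between the two
print(guess_word(["love", "ripe"]))  # 'ripe' tips it to ~80% tomato
```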

It's also why these systems can be negatively affected by systematic cultural prejudice. The dataset of TV, podcasts and news used to train the system enacts privilege: the voices will likely be predominantly white and male, and even when they are not, still part of a dominant, discriminatory discourse. Thus the system may be more likely to guess that "People with disabilities can't" rather than "People with disabilities can". Prejudices can be inherent to human transcribers as well, but I still think it's important to understand that, just like us, these machine learning systems have learnt from our dominant global culture and society: a society that has an entangled history of colonisation and intersectionalities of oppression.

There are significant issues with the secretive way the data for training this (and other machine learning tools) was collected, almost certainly without any consent from the people whose voices were recorded. This is particularly problematic for the languages of marginalised communities, and I also want to argue that these tools are not neutral, and have been created as part of a continuing process of colonisation. I would urge people to read this article about the lack of consent from indigenous communities, the risk of appropriation of their language, and the potential use of these technologies for new forms of control and distortion of cultures.

Many of these arguments apply more to Whisper's ability to translate, as well as transcribe - something we are not intending to support in our implementation. Also, while we have not restricted the languages that our transcription tool will attempt to transcribe, we recommend, and only provide options for, the five languages with the best accuracy: Spanish, Italian, English, Portuguese and German. It's no coincidence that these are the languages of colonisers (Dutch and French only marginally miss this top five for word-error rate). I would always encourage researchers working from and with indigenous languages to include expert transcribers embedded in both the language and the culture, to better represent these voices.

I'm sometimes asked these days to talk about "AI" and its potential impact on qualitative research, especially the idea that tools like ChatGPT could replace researchers and interpret qualitative data. That's a topic probably worthy of its own blog post, but in general I am highly sceptical about these tools being useful for academic qualitative research. These tools do not, and cannot, understand lived experience, and so cannot meaningfully interpret the rich narrative qualitative data most researchers are looking at. They lack the nuance to understand emotion and culture, and cannot understand a research question, or the experience of injustice that is in some way a focus for so much qualitative research.

There are existing machine learning tools that can do a reasonable job at very basic thematic coding or sentiment analysis. They can guess that 'angry' is negative, and that 'bus' and 'train' are public transport - essentially working like an advanced thesaurus. But they cannot do the meaning and theory generation that is the main step (and challenge) of qualitative analysis. We've looked at such systems in the past, but they are limited, surprisingly racist(!), and of no use for meaningful, theory-generating qualitative research. I just find such tools methodologically problematic for the types of qualitative research myself and our users tend to do.
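To illustrate what I mean by 'an advanced thesaurus', here's a deliberately simple sketch of that style of automated coding - keyword lookups standing in for the learnt word associations a real tool would use (the themes and word lists here are invented):

```python
# A deliberately simple sketch of 'thesaurus-style' automated coding: words
# are mapped to themes by lookup, with no understanding of meaning, context
# or lived experience. The themes and word lists are invented examples.
theme_keywords = {
    "public transport": {"bus", "train", "tram"},
    "negative emotion": {"angry", "frustrated", "upset"},
}

def code_segment(text):
    words = {word.strip(".,!?").lower() for word in text.split()}
    return [theme for theme, keywords in theme_keywords.items()
            if words & keywords]

print(code_segment("I was so angry when the bus never came!"))
# -> ['public transport', 'negative emotion'] - but nothing about *why* the
#    participant felt that way, which is where real analysis begins.
```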

So why would we launch an automated transcription service, based on similar machine learning principles?

Well, I'd like users to see this tool as a transcription assistant, and not an instant final answer. It's still important to listen, correct and keep re-reading the data, and it's easy to edit and change your sources in Quirkos. Machine learning tools like this create approximate mathematical interpretations of spoken language, based on the data they've learnt from. And those training datasets have their own biases, errors and limitations, just like our own ears and brains. Just as you would with a professional transcription service, you need to check that the transcription matches what you hear and want to represent in the data. Transcription is always an interpretative act, even if you do it yourself.

And let's be honest, transcribing is a tedious and time-consuming process, and transcription services free up valuable time to spend on recruitment and more detailed analysis of qualitative data. I also don't think automated transcription services like this will make professional transcribers obsolete - many I know have been using assisted transcription systems for decades to speed up the process (remember Dragon NaturallySpeaking?!), and there will always be a market for higher-accuracy, expert-corrected transcripts.

No matter how, or by whom, the transcription is done, I always like to keep the audio close when analysing. It's good to keep going back to the recording, to hear how people are saying things and remember the emotions and nuances that are so vital to preserve in qualitative research.

But a text interpretation of recordings of interviews, focus groups and other types of data is incredibly useful in qualitative research: it makes it much easier to apply most types of analysis, and gives you quotes for writing up and sharing research findings. And in some situations an automated, secure service is a better option than a professional one - you can assure participants that no-one else will hear the recording, and when dealing with difficult research topics, you can avoid sharing traumatic testimony with transcribers. The way we have implemented our transcription system means it's end-to-end encrypted, and not shared with any other service or data providers. The ability to quickly get very accurate transcripts back (even before the participant has walked out the door), and the ease of transcribing hours or days of audio, changes how you think about what you can include as 'data'.

You can watch a webinar on these issues and benefits I gave as part of the CAQDAS Network Webinar series: https://www.youtube.com/watch?v=91S-Dm25gzM

Quirkos Transcribe is an optional add-on for subscribers to Quirkos Cloud, so you will need an active Quirkos Cloud subscription for it to work. It will cost an extra $12 a month when paid annually, for 50 audio hours of transcription a month - probably the cheapest option around. It's launching next month, when we will have another post with more details, and some of the ways that Quirkos Transcribe can improve your research and the way you think about data.

All technology and digital tools, from CAQDAS to Word, have an impact on qualitative research, and we feel the best approach is to understand it, be reflexive about it, and consider whether it is complementary to our epistemology. We hope that the accessibility, affordability, speed, accuracy and security of our transcription system will not only make life easier for qualitative researchers, but also change the way we think about data and research design.

You can try Quirkos right now, in your browser with a 14 day free trial, and tune in soon to learn more!