Outsourcing decision making: AI, ethics, and qualitative research

Last year, ChatGPT 3 (and 4) were made public, and people have made a lot of noise about using this chat-focused LLM for qualitative analysis. I find this a really bad idea for many reasons, so I think it’s worth listing my top eight reasons not to use ChatGPT for qualitative analysis:

Doesn’t give good, repeatable results

In my own testing with ChatGPT (3.5 Turbo) it almost seems to deliberately ignore some things. In an interview where a person clearly discussed a research project in Africa, ChatGPT refused to see the context, falsely claiming "The provided text does not explicitly mention the specific population". It also seems to guess what might be in the data as opposed to what was actually said, falsely claiming the interviewee felt that 'rigor and validity of the research findings' was important for qualitative research, when the interviewee clearly felt the opposite.

The themes generated by ChatGPT for the interview were plausible (only 5 long themes were suggested) but for one prompt ChatGPT claimed it couldn't provide quotes for those themes, stating it would only 'generate hypothetical examples based on the context' – not a very useful qualitative tool. However, in another nearly identical query it did provide a verbatim quote from the interview. These kinds of erratic responses are frequently reported for ChatGPT. In general the output seems convincing, but fails to hold up to detailed scrutiny, and is certainly not reliable enough to be something I would use in a research setting at the moment.
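One practical way to apply that scrutiny is to check whether any 'quote' a tool returns actually appears in the transcript. The following is only a minimal sketch: the transcript and quotes are invented for illustration (they are not from my test interview), and the function names are my own rather than part of any existing tool.

```python
import re

def normalise(text):
    """Lower-case, unify curly quotes and collapse whitespace,
    so trivial formatting differences don't cause a mismatch."""
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_is_verbatim(quote, transcript):
    """True only if the quote appears word-for-word in the transcript."""
    return normalise(quote) in normalise(transcript)

# Invented example data (hypothetical, for illustration only)
transcript = (
    "We spent two years on the project.  Honestly, I found the\n"
    "reporting requirements more tiring than the fieldwork itself."
)
claimed_quotes = [
    "I found the reporting requirements more tiring than the fieldwork",
    "the fieldwork was the most exhausting part of the project",
]

# Any quote that cannot be found verbatim needs checking by a human
unverified = [q for q in claimed_quotes if not quote_is_verbatim(q, transcript)]
print(unverified)
```

A check like this only catches invented quotes. It says nothing about whether a genuine quote has been attached to an appropriate theme, which still needs a human reader.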

Bias in the training data / uncertainty

Large language models (LLMs) like ChatGPT and Llama are created by giving an algorithm a huge amount of text data (think things like Wikipedia, Reddit posts, news articles, even whole novels) that it learns patterns from: grammar and word structure, and the most likely words that will be used in certain contexts and styles. This data includes illegally scraped copyrighted material, and the training relies on mass amounts of input from exploited workers in developing countries. Far from being automatic and clever, the system requires a huge amount of human intervention to try and correct ‘objectionable’ and incorrect outputs.

OpenAI refuses to detail the data sources that were used to train ChatGPT, likely because there are almost certainly legal and privacy issues, which have led to it being banned in many countries. There are also worrying security issues, for example certain prompts can get ChatGPT to leak its training data, including personally identifiable information. There was no consent from the creators of the work used to train the algorithm, so lawsuits from authors and news publications are currently in play. Image generating AI tools like Stable Diffusion were trained on datasets that included pictures of children being sexually abused, which had to be removed. It is unlikely that the datasets for training ChatGPT were chosen with more care.

We’ve talked before about the bias which can be inherent in machine learning systems when they are trained on data sources from the internet that reflect wider prejudice and bias in our society, and how these tools will happily reproduce racial stereotypes, for example. Professor Damien P. Williams describes these systems as ‘bias optimizers’, and notes their potential to amplify bias in their training data, and lead to racist and discriminatory interpretations of data.

However, one thing that they almost certainly weren't trained on is qualitative research, which almost never has publicly available participant data or coding frameworks that these tools could learn from. Journal articles and theses will be present in the training data, but these rarely include the complete transcripts of qualitative data. So a tool like ChatGPT likely has very little 'experience' with typical qualitative data, or how researchers interpret it.

So while we can’t know for sure what data was used to train ChatGPT (which is problematic in itself), we can be fairly sure that it does not include much qualitative data, and contains inherent bias. This lack of transparency and evidence of discriminatory output should be enough reason for most researchers not to use such a tool. We urge researchers to take a reflexive stance on their own interpretations, and at the moment we can’t do the same for ChatGPT.

You don’t know how the decisions are made

Should you use for academic research a tool that states: "ChatGPT may give you inaccurate information. It’s not intended to give advice"? Should you use a tool where the training data is secret, and there is no way to verify or understand how decisions in interpretation were made? A tool where the process is a black box and output is not replicable?

At the end of the day, the main decision making is on you, even after using the tool to summarise or code data – to decide which codes are relevant or appropriate. But there is no way to see why certain codes were suggested (or why some were not). It's also impossible to tell which codes are 'hallucinations' in the bizarre post-truth language of AI marketing – codes that were not present in the data at all, but appeared in the output as a statistical glitch. This leads to a failure of trust in the process, as even with a human 'interpreter' of the results, some of the original suggestions may be invalid.

Outputs are based on statistical / probabilistic models

How do these ‘Large Language Models’ like ChatGPT work? We’ve actually covered this a bit in a previous blog post, but essentially they are very large, branching probabilistic models, that aim to predict what the next word in a sentence would be. LLMs are based on a statistical model derived from reducing text training data to tokens: a representative unit of a word, part of a word, or concept (a bit like codes), and how often the tokens connect with each other (weights) to predict which things are likely to be related (tomato is strongly weighted to food and soup, but less so to spades).
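To make the idea of tokens and weights a little more concrete, here is a deliberately tiny sketch of next-word prediction. It is not how ChatGPT is actually built (real LLMs use neural networks with billions of parameters, not simple counts), and the 'training corpus' and function names are made up for illustration. But it shows the basic principle of turning text into counts, and counts into the most probable next token.

```python
from collections import Counter, defaultdict

# Tiny invented 'training corpus'; real models use billions of words
corpus = (
    "tomato soup is food . "
    "tomato soup is hot . "
    "tomato salad is food . "
    "the spade is a tool ."
)

def train_bigram_counts(text):
    """Count how often each token is immediately followed by each other token."""
    tokens = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

def next_token_probabilities(counts, token):
    """Turn raw follow-on counts into probabilities (the toy 'weights')."""
    followers = counts.get(token, Counter())
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}

def most_likely_next(counts, token):
    """Pick the single most probable next token, or None if the token is unseen."""
    probabilities = next_token_probabilities(counts, token)
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)

counts = train_bigram_counts(corpus)
print(next_token_probabilities(counts, "tomato"))
print(most_likely_next(counts, "tomato"))  # prints "soup"
```

In this toy model 'tomato' is followed by 'soup' two times out of three, and never by 'spade', so 'soup' is what it predicts. Nothing in the model knows what a tomato is; it only knows which tokens tend to appear together.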

In LLMs like ChatGPT the ‘input’ is the data you input for a session (say your interview data), and the queries you ask of it (in the form of a sentence). ChatGPT parses the data and your prompt, and using the ‘weights’ derived from its training data, delivers the most statistically probable response as a text string. Some have argued differently, but despite the inputs, outputs and training data being text based (and therefore qualitative, right?) LLMs are based on reducing qualitative text data to a statistical, probabilistic model. It’s quantitative, not qualitative. That’s OK, if it fits with your ontological and epistemological approach, but I’d strongly argue that it is a quantification of qualitative data, something most qualitative approaches abhor.

Why is this problematic? One of the most important reasons for me is actually also around weighting. If you are just counting the number of times someone says something negative, then “I don’t like this flavour of ice-cream” gets the same significance in the project as someone saying “I hate ice-cream because it killed my only daughter”. This is a ridiculous example, but one of these statements is more significant than the other, and both have very different meanings and motivations behind them. But for any system that is doing basic counting, unless one of those statements is ‘weighted’ so that it is given more significance than the other, these are just two people that dislike ice cream, possibly out of hundreds that do like ice cream, so they become statistically insignificant. In qualitative research, it’s unusual to write off any respondents as being insignificant. In fact, I often argue that understanding the unusual responses is most important to understanding the dataset as a whole.
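The ice-cream example can be sketched in a few lines of code. This is a hypothetical illustration with invented responses and labels, not output from any real tool. It shows how a simple tally treats both negative statements as identical, and how small they look next to the majority.

```python
from collections import Counter

# Invented responses, each already reduced to a sentiment label
responses = [
    ("I don't like this flavour of ice-cream", "negative"),
    ("I hate ice-cream because it killed my only daughter", "negative"),
] + [("I like ice-cream", "positive")] * 98

# Basic counting: the words of each response are thrown away,
# and only the label is kept
tally = Counter(label for _, label in responses)
negative_share = tally["negative"] / len(responses)

print(tally["negative"], tally["positive"])  # prints "2 98"
print(f"{negative_share:.0%} of responses are negative")  # prints "2% of responses are negative"
```

Once the data has been reduced to labels, the two negative responses are indistinguishable from each other and make up just 2% of the total. The difference between them only exists in the original words, which is exactly what a qualitative researcher would want to read.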

So if these models are by default trained to predict the most likely outcomes, when we are using them to help with interpreting qualitative data, can they surprise us? They might be able to suggest patterns in the data or interpretations we hadn’t thought of, but can they suggest new or innovative interpretations? It seems unlikely, and for me that limits their use in much qualitative analysis, where we are doing theory generation, and trying to create new understandings and often challenge the status quo. They also lack lived experience: they can’t empathise with pain, frustration and other things that I’d argue can be important to understanding a lot of qualitative research data.

Privacy / security concerns

OpenAI claims to be private, secure and GDPR compliant, but clearly is not if most reporting on it is correct. ChatGPT is still a 'mechanical turk', a tool that seems automatic but relies on human input to train, correct and report issues. None of the 3rd party contracting agencies used to check ChatGPT prompts are listed – and from these reports, 3rd party contractors are able to read your prompts to fix errors and report illegal usage. Similar scandals came to light for Microsoft and WhatsApp, a service that claimed to be end-to-end encrypted but was not. There is a legal requirement in GDPR to list all subcontractors that can access data, but even the ChatGPT signup process contains a GDPR violation – it uses Mandrill (a MailChimp product) to verify user emails, passing your data to them, but Mandrill is not listed as one of their sub-processors, and neither are many of the other contractors above that have worked on ChatGPT.

OpenAI claims that if you are using one of their (expensive) professional subscriptions, your data will not be retained for use in training the algorithm further. However, it does seem that data sharing would still apply to the ‘safety’ checks they run on prompts and data, which allow data to be shared with third-party subcontractors. And myriad ‘leaks’ and ‘bugs’ indicate that what goes into ChatGPT may be coming out elsewhere.

Exploitative business model

Currently, tools like ChatGPT are being offered essentially for free, but unquestionably the cost of these will increase in the future. We are currently in the 'loss-leading' stage of products like Bard and ChatGPT, but I’m certain that this will change. The model is very similar to Uber's: attract users by offering a convenient product below cost, to destroy existing competitors and force a reduction in labour costs.

When autonomous driving failed to materialise (promised to be next year for every year from 2016), and drivers stopped wanting to work for Uber, prices rose and availability decreased. But there are large numbers of entrenched users, and this keeps Uber's valuation high, while finally reaching profitability. The same has been true for AirBnB and Amazon – initially cheap, but now as expensive or more so than the rest of the market, but with lots of loyal users that don’t bother to look elsewhere.

ChatGPT will be no different: lots of money has been invested in the ease of using the system and API so that developers build tools around it, becoming locked into the ecosystem while prices are attractive. By all accounts, pricing is heavily subsidised at the moment, with losses of $540m a year caused by the huge cost of servers to run the systems offered at a loss (for now). This is unsustainable (for investors) and prices will increase to cover initial losses, and deliver profits to shareholders. OpenAI claims that continuing to run and train such models will require more computer chips than can currently be built, nuclear fusion reactors for power that don’t exist, and vast amounts of fresh water to cool them. This won’t come cheap, either to the environment or to end users, if the hype is to be believed.

Even now, I estimate tools like MAXQDA and Atlas.TI that have jumped on the OpenAI bandwagon are paying at least $1 to $12 per query for qualitative analysis (depending on project size). I expect this cost to greatly increase in the next few years, and it will be passed on to consumers. Cheaper or open-source alternatives will also proliferate, but ChatGPT based tools will be locked into these proprietary systems.

Some have claimed that the 'disruption' caused by these AI models is an unavoidable and inevitable influence that qualitative researchers must embrace. But disruption is just a Silicon Valley nonsense word, an enactment of capitalism that has been criticized at least as far back as Marx and Engels. I see the real cost and aim of 'disruption' as destroying existing markets to shift power to a small number of privileged white men. Like taxi-apps, renting e-scooters, self-driving cars, VR headsets, cryptocurrency and NFTs, these ‘disruptions’ are designed to maximise hype and company valuations in the early stages, allowing early investors to cash out before the true value of the market and product becomes clear, and the value crashes.

Make no mistake, the hype around OpenAI is no different, but it was essential to the Silicon Valley set to create a new tech product and keep company valuations high after the Bitcoin pyramid scheme collapsed.


OpenAI is a bad company

I think it's also important to understand OpenAI itself. It's technically two organisations: a non-profit (which releases research and open source tools like Whisper for transcription) and a commercial organisation that makes for-profit products. But recently the non-profit arm has been usurped, with an elaborate coup used to remove all women from the board and senior leadership, in what is becoming a systematic trend in the tech industry: consolidating power among white males who are open supporters of the anti-woke/white supremacy movement. Google did something similar a few years ago, firing women and black people from their now disbanded ‘safety’ teams when they raised warnings about AI tools. I would strongly recommend reading some of the work (or listening to some of the excellent talks) from Timnit Gebru and the DAIR Institute she now directs.

OpenAI is deliberately obtuse on transparency and accountability, and is breaking promises it has made on trust and public documentation. Since large investments from Microsoft, the focus seems to have shifted to rapidly commercialising existing products, precipitating a perceived AI arms race.

So the main question for me is, if we start using tools like ChatGPT to outsource our decision making, who are we outsourcing it to? What decisions are products that these organisations develop likely to make, and especially make in the future? In qualitative research, will they be making racist and sexist suggestions (as they already do for job applications, loan applications) in line with the political beliefs of their founders like Peter Thiel and Elon Musk?

It’s not ready.

What's the rush? The ease and speed of getting output can be seductive – you can ‘do’ your qualitative analysis in minutes! Use it as a chat-bot to bounce ideas off, without having to talk to a colleague! There's potential for these systems, but there is nothing magical about what exists at the moment. There's no reason that LLMs can't be developed with ethics, transparency, and rigour in mind. But that is not what we get from ChatGPT, and for academic research, its use seems questionable. There are already free and open-source trained alternatives that you can run on your own computer, without sharing data with anyone, and that give you much more information about their training data. I would be very wary of companies who dismiss problems with accuracy, state that there are only benefits, state that the use of AI in qualitative research is inevitable (don’t we get to choose methods any more?), and claim that criticisms can only come from people who haven't understood ‘AI’.

There are clearly some approaches where using a tool like ChatGPT would be fine, especially when working with large datasets in commercial and public sector areas where nuance and data security aren’t required. Many organisations are looking for quick, quantitative summaries of qualitative data elements in surveys and feedback forms, and ChatGPT tools can provide that, albeit with questionable accuracy at the moment.

But I don’t think it is a good fit for most academic qualitative research, and we won’t be adding it or similar tools to Quirkos. It’s also not new, as this great post from Christina Silver notes, and there’s a great series of discussions on AI-based tools. I’m not a Luddite (I run a software company!), but I think I know when something is over-hyped and when it isn’t right for a majority of our users.

Quirkos makes simple, accessible qualitative analysis software that helps you use your own mind to interpret, explore and share your qualitative data. Try for free!