Outsourcing decision making: AI, ethics, and qualitative research

This post is hosted by Quirkos, qualitative analysis software that makes your qualitative text analysis simple, fun and beautiful. Get closer to your data by starting a free trial today!

Since the rise of large language models (LLMs) like ChatGPT, many researchers may be considering using them for qualitative analysis. However, there are plenty of reasons you might want to avoid large language models in your qualitative research.

When we first started testing large language models to understand their capabilities, OpenAI had just released GPT-3.5 Turbo, which is no longer the latest version of ChatGPT. However, we've reviewed the evidence 3 years on, and our criticisms of LLMs below still stand. No matter how much data gets added to a large language model, and no matter how much user feedback is applied to create new versions, this process doesn't actually improve the LLM's analytical skills; it only makes it better at approximating the type of response you want to hear. Its understanding of your inputs will never get any better, but it will more easily deceive users into thinking it has the same level of understanding as a human interpreter. To explain this, it helps to understand how large language models work.

How do large language models work?

Essentially, large language models are very large, branching probabilistic models that aim to predict what the next word in a sentence would be. LLMs are based on a statistical model derived from reducing text training data to ‘tokens’: a representative unit of a word, part of a word, or concept (a bit like codes), and how often the tokens connect with each other (weights) to predict which things are likely to be related (‘tomato’ is strongly weighted to ‘food’ and ‘soup’, but less so to ‘spades’).
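To make this concrete, here is a toy sketch in Python of that idea: a tiny hand-written table of 'weights' stands in for the billions of learned parameters in a real model, and prediction is reduced to picking the most strongly connected token. All tokens and numbers here are invented purely for illustration.

```python
# Toy illustration of next-word prediction. The hand-written weights below
# stand in for the billions of learned parameters in a real LLM; the numbers
# are invented purely for illustration.
weights = {
    "tomato": {"soup": 0.45, "food": 0.35, "salad": 0.15, "spades": 0.05},
    "garden": {"spades": 0.40, "soil": 0.35, "tomato": 0.25},
}

def predict_next(token: str) -> str:
    """Return the most strongly weighted (most probable) next token."""
    candidates = weights.get(token, {})
    return max(candidates, key=candidates.get) if candidates else "<unknown>"

print(predict_next("tomato"))  # 'soup': the strongest weighted connection
print(predict_next("garden"))  # 'spades'
```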

In LLMs like ChatGPT, the ‘input’ is the data you input for a session (say, your interview data), and the queries you ask of it (in the form of a sentence). ChatGPT parses the data and your prompt, and using the ‘weights’ derived from its training data, delivers the most statistically probable response as a text string.
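In practice, tools built on ChatGPT do this through an API call: your data and your question go in as a text prompt, and a text continuation comes back. Below is a minimal sketch using the OpenAI Python client; the model name and prompt wording are placeholders, and the 'answer' returned is simply the statistically probable continuation described above.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in your environment

interview_excerpt = "...your interview transcript here..."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{
        "role": "user",
        "content": f"Suggest themes in this interview:\n{interview_excerpt}",
    }],
)

# The reply is just the most statistically probable text continuation,
# assembled token by token from the model's weights.
print(response.choices[0].message.content)
```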

Why are large language models and AI problematic for qualitative research?

Despite the inputs, outputs and training data being text-based (and therefore qualitative, right?) LLMs are based on reducing qualitative text data to a statistical, probabilistic model. It’s quantitative, not qualitative. That’s okay if it fits with your ontological and epistemological approach, but I’d strongly argue that it is a quantification of qualitative data, something most qualitative approaches abhor.

Large language models weight text in quantitative ways, not qualitative ones. If you are just counting the number of times someone says something negative, “I don’t like this flavour of ice-cream” gets the same significance in the project as someone saying “I hate ice-cream because it killed my only daughter”. This is a ridiculous example, but one of these statements is obviously more significant than the other, and both have very different meanings and motivations behind them. But for any system that is doing basic counting, unless one of those statements is ‘weighted’ so that it is given more significance than the other, these are just two people that dislike ice cream, possibly out of hundreds that do like ice cream, so they become statistically insignificant. In qualitative research, it’s unusual to write off any respondents as being insignificant, and actually I often argue that understanding the unusual responses is most important to understanding the dataset as a whole.
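To illustrate that point, here is a deliberately naive counting sketch (the responses and word list are invented): both ice-cream statements end up as the same tick in a tally, regardless of their very different weight and meaning.

```python
# Deliberately naive sentiment counting: both statements below are reduced
# to the same 'negative' tick in a tally, despite carrying very different
# significance. Responses and the word list are invented for illustration.
responses = [
    "I don't like this flavour of ice-cream",
    "I hate ice-cream because it killed my only daughter",
    # ...plus hundreds of positive responses
]
negative_markers = ("don't like", "hate", "dislike")

negative_count = sum(
    1 for r in responses if any(m in r.lower() for m in negative_markers)
)
print(f"{negative_count} of {len(responses)} responses coded as 'negative'")
# Against hundreds of positive responses, both respondents simply become
# 'statistically insignificant'.
```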

So if these models are by default trained to predict the most likely outcomes, when we are using them to help with interpreting qualitative data, can they surprise us? They might be able to suggest patterns in the data or interpretations we hadn’t thought of, but can they suggest new or innovative interpretations? It seems unlikely, and for me that limits their use in much qualitative analysis, where we are doing theory generation, trying to create new understandings and often challenging the status quo. They also lack lived experience – they can’t empathise with pain, frustration and other emotions that I’d argue are important to understanding a lot of qualitative research data.

Want to learn more about qualitative research? Try our free qualitative research course, an interactive journey through the whole research process from designing a good research question, to collecting qualitative data and qualitative analysis, through to writing up your research.

In the rest of this article, I will give you 9 reasons why you should avoid AI for your next qualitative research project.

Reason 1: Large language models can only guess at the context.

In my own testing with ChatGPT, it often ignores crucial context. After inputting an interview where a person clearly discussed a research project in Africa, ChatGPT refused to see the context, falsely claiming “The provided text does not explicitly mention the specific population”. It also guessed what might be in the data as opposed to what was actually said, falsely claiming the interviewee felt that ‘rigor and validity of the research findings’ was important for qualitative research — when the interviewee actually felt the opposite.

Reason 2: Large language models don't give repeatable results.

Large language models will often contradict themselves. The themes generated by ChatGPT for the interview were plausible (only 5 long themes were suggested) but for one prompt ChatGPT claimed it couldn’t provide quotations for those themes, stating it would only ‘generate hypothetical examples based on the context’ – not a very useful qualitative tool. However, in another nearly identical query it did provide a verbatim quotation from the interview. These kinds of erratic responses are frequently reported for ChatGPT. In general the output seems convincing, but fails to hold up to detailed scrutiny, and is certainly not reliable enough to be something I would use in a research setting at the moment.
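Part of the reason for this is that these models sample from a probability distribution over possible next tokens rather than always taking the single most likely one, so the identical query can wander down different paths on different runs. A toy sketch with invented tokens and probabilities:

```python
import random

# Toy illustration of sampling: instead of always choosing the most probable
# token, the model draws from a distribution, so repeated runs of the same
# query can diverge. Tokens and probabilities are invented for illustration.
next_token_probs = {"themes": 0.5, "quotations": 0.3, "hypothetical": 0.2}

for run in (1, 2, 3):
    token = random.choices(
        list(next_token_probs), weights=list(next_token_probs.values())
    )[0]
    print(f"Run {run}: the model continues with '{token}'")
```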

Reason 3: Large language models have bias in the training data.

Not only is there bias in how the models were trained, but there is also uncertainty about how the data was sourced. LLMs like ChatGPT and Llama are created by giving an algorithm a huge amount of text data (think Wikipedia, Reddit posts, news articles, and even whole novels) from which it learns patterns: grammar and word structure, and the most likely words to be used in certain contexts and styles. Gathering a large enough dataset has required the illegal scraping of copyrighted data, and mass amounts of input from exploited workers in developing countries to train the model, in a system that, far from being automatic and clever, requires a huge amount of human intervention to try and correct ‘objectionable’ and incorrect outputs.

Because machine learning systems are trained on data sources from the internet, they will inherently reflect the wider prejudices and biases in our society. As an example, these tools will happily reproduce racial stereotypes. Professor Damien P. Williams describes these systems as ‘bias optimizers’, and notes their potential to amplify bias in their training data, leading to racist and discriminatory interpretations of data.

One thing that they almost certainly weren’t trained on is qualitative research, which almost never has publicly available participant data or coding frameworks that these tools could learn from. Journal articles and theses will be present in the training data, but these rarely include complete transcripts of qualitative data. So an LLM likely has very little ‘experience’ with typical qualitative data, or with how researchers interpret it.

Reason 4: The data used to train most large language models was unethically gathered.

OpenAI refuses to detail the data sources used to train ChatGPT, likely because there are almost certainly legal and privacy issues, which have already led to it being banned in many countries. There are also worrying security issues; for example, certain prompts can get ChatGPT to leak its training data, including personally identifiable information. There was no consent from the creators of the work used to train the algorithm, so lawsuits from authors are currently in play (Stempel, 2023; Vallance, 2023). Image-generating AI tools like Stable Diffusion used training datasets that included pictures of children being sexually abused, which had to be removed (Montgomery, 2023). It is unlikely that the datasets used to train ChatGPT were chosen with more care.

So while we can’t know for sure what data was used to train ChatGPT (which is problematic in itself), we can be fairly sure that it does not include much qualitative data, and contains inherent bias. This lack of transparency and evidence of discriminatory output should be enough reason for most researchers not to use such a tool. We urge researchers to take a reflexive stance on their own interpretations, and at the moment we can’t ask ChatGPT to do the same.

Reason 5: Large language models are often inaccurate and don't explain how they came to their conclusions.

Should you use, for academic research, a tool that states: “ChatGPT may give you inaccurate information. It’s not intended to give advice”? Should you use a tool whose training data is secret, with no way to verify or understand how decisions in interpretation were made? A tool where the process is a black box and the output is not replicable?

At the end of the day, the main decision-making is still yours (even after using the tool to summarise or code data) to decide which codes are relevant or appropriate. But there is no way to see why certain codes were suggested (or why some were not). It’s also impossible to tell which codes are ‘hallucinations’, in the bizarre post-truth language of AI marketing – codes that were not present in the data at all, but appeared in the output as a statistical glitch. This leads to a failure of trust in the process, as even with a human ‘interpreter’ of the results, some of the original suggestions may be invalid.

Reason 6: Most large language models are not private or secure.

There are privacy and security concerns with many LLM tools, making them inappropriate for ethical research. OpenAI claims to be private, secure and GDPR compliant, but clearly isn’t if most reporting on it is correct. ChatGPT is still a ‘mechanical Turk’: a tool that seems automatic but relies on human input to train, correct and report issues. None of the third-party contracting agencies used to check ChatGPT prompts are listed, and based on these reports, third-party contractors are able to read your prompts to fix errors and report illegal usage. Similar scandals came to light for Microsoft and for WhatsApp, a service that claimed to be end-to-end encrypted but was not. There is a legal requirement in GDPR to list all subcontractors that can access data, yet even the ChatGPT sign-up process contains a GDPR violation: it uses Mandrill (a MailChimp product) to verify user emails, passing your data to them, but this is not listed as one of their sub-processors, nor are many of the other contractors that have worked on ChatGPT. (As of 6 April 2026, this remains the case!)

OpenAI claims that if you are using one of their (expensive) professional subscriptions, your data will not be retained for use in training the algorithm further. However, it does seem that data sharing would still apply to the ‘safety’ checks they run on prompts and data, and which allow data to be shared with third-party subcontractors. Additionally, myriad ‘leaks’ and ‘bugs’ indicate that what goes into ChatGPT may be coming out elsewhere (Belanger, 2025; Derico, 2023).

Reason 7: Many companies that produce large language models, including OpenAI, use exploitative business models.

The business models of for-profit AI companies like OpenAI are exploitative and unsustainable. Currently, tools like ChatGPT are being offered essentially for free, but unquestionably the cost of these will increase in the future. We are currently in the ‘loss-leading’ stage of products like Gemini and ChatGPT, but I’m certain that this will change in the near future. The model is very similar to Uber: attract users by offering a convenient product below cost, to destroy existing competitors and force a reduction in labour costs.

When autonomous driving failed to materialise (promised to arrive ‘next year’ every year from 2016), and drivers stopped wanting to work for Uber, prices rose and availability decreased. But there are large numbers of entrenched users, and this keeps Uber’s valuation high as it finally reaches profitability. The same has been true for AirBnB and Amazon: initially cheap, but now as expensive as or more expensive than the rest of the market, yet with lots of loyal users who don’t bother to look elsewhere.

ChatGPT will be no different: lots of money has been invested in the ease of using the system and its API, so that developers build tools around it and become locked into the ecosystem while prices are attractive. By all accounts, pricing is heavily subsidised at the moment, with the huge cost of the servers that run these systems absorbed as a loss (for now). As of 2026, OpenAI predicts it will not turn a profit until 2030. This is unsustainable (for investors), and prices will increase to cover billions of dollars in initial losses and to deliver profits to shareholders. OpenAI claims that continuing to run and train such models will require more computer chips than can currently be built, nuclear fusion reactors for power that don’t yet exist, and vast amounts of fresh water for cooling. This won’t come cheap, either to the environment or to end users, if the hype is to be believed.

Even now, I estimate tools like NVivo, MAXQDA and ATLAS.ti that have jumped on the OpenAI bandwagon are paying at least $1 to $12 per query for qualitative analysis (depending on project size). I expect this cost to increase greatly in the next few years, and it will be passed on to consumers. Cheaper or open-source alternatives will also proliferate, but ChatGPT-based tools will remain locked into these proprietary systems.
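As a back-of-the-envelope sketch of where per-query costs like that come from (every figure below is an assumption for illustration, not a published price):

```python
# Back-of-the-envelope cost estimate for pushing a qualitative project
# through a commercial LLM API. Every figure here is an illustrative
# assumption, not a published price.
words_per_transcript = 8_000        # roughly an hour-long interview
transcripts = 20                    # a modest qualitative project
tokens_per_word = 1.3               # rough rule of thumb for English text
price_per_1k_input_tokens = 0.01    # assumed USD price
price_per_1k_output_tokens = 0.03   # assumed USD price
output_tokens = 2_000               # themes/summary returned per query

input_tokens = words_per_transcript * transcripts * tokens_per_word
cost = (input_tokens / 1_000) * price_per_1k_input_tokens \
     + (output_tokens / 1_000) * price_per_1k_output_tokens
print(f"~${cost:.2f} per whole-project query")  # ~$2.14 with these figures
```

Scale the transcript count or query frequency up, and the per-project bill rises quickly, before any future price increases are factored in.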

Some have claimed that the ‘disruption’ caused by these AI models is an unavoidable and inevitable influence that qualitative researchers must embrace. But disruption is just a Silicon Valley nonsense word, an enactment of capitalism that has been criticised at least as far back as Marx and Engels. I see the real cost and aim of ‘disruption’ as destroying existing markets to shift power to a small number of privileged white men. Like taxi apps, rented e-scooters, self-driving cars, VR headsets, cryptocurrency and NFTs, these ‘disruptions’ are designed to maximise hype and company valuations in the early stages, allowing early investors to cash out before the true value of the market and product becomes clear and the value crashes.

Make no mistake, the hype around OpenAI is no different, but it was essential to the Silicon Valley set to create a new tech product and keep company valuations high after the Bitcoin pyramid scheme collapsed.

Reason 8: Many AI companies are run by unethical people.

I think it's also important to understand OpenAI itself. It is technically two organisations: a non-profit (which releases research and open-source tools like Whisper for transcription) and a commercial organisation that makes for-profit products. But recently the non-profit arm has been usurped, with an elaborate coup used to remove all women from the board and senior leadership, in what is becoming a systematic trend in the tech industry: consolidating power among white males who are open supporters of the anti-woke/white supremacy movement. Google did something similar a few years ago, firing women and black people from its now-disbanded ‘safety’ teams when they raised warnings about AI tools. I would strongly recommend reading some of the work (or listening to some of the excellent talks) from Timnit Gebru and the DAIR Institute she now directs.

OpenAI is deliberately obtuse on transparency and accountability, and is breaking promises it made on trust and public documentation. Since large investments from Microsoft, the focus seems to have shifted to rapidly commercialising existing products, precipitating a perceived AI arms race.

So the main question for me is, if we start using tools like ChatGPT to outsource our decision making, who are we outsourcing it to? What decisions are the products these organisations develop likely to make, now and especially in the future? In qualitative research, will they be making racist and sexist suggestions (as they already do for job applications and loan applications) in line with the political beliefs of their founders, like Peter Thiel – or Elon Musk, whose AI chatbot Grok has come under fire for making racist and antisemitic remarks?

Reason 9: The 'convenience' of large language models doesn't mix well with rigorous qualitative research methodologies.

What’s the rush? The ease and speed of getting output can be seductive – you can ‘do’ your qualitative analysis in minutes! Use it as a chatbot to bounce ideas off, without having to talk to a colleague! There’s potential for these systems, but there is nothing magical about what exists at the moment. There’s no reason that LLMs can’t be developed with ethics, transparency, and rigour in mind. But that is not what we get from ChatGPT, and for academic research, its use seems questionable. I would be very wary of companies who dismiss problems with LLM accuracy, state that there are only benefits or that the use of AI in qualitative research is inevitable (don’t we get to choose our methods anymore?), and claim that criticisms can only come from people who haven’t understood ‘AI’.

What are my options if I still want to use AI for qualitative research?

Methodologically, your best bet is to go for an open-source large language model. There are already free and open-source alternatives to ChatGPT that you can run on your own computer, without sharing your data with anyone, and which give you much more information about their training data. They often take more time and learning to set up, but they are likely the best option if you still want to use an LLM as part of your analysis. Convenience should not be the sole reason you choose a software solution for rigorous research, especially one with this many methodological drawbacks and ethical problems.
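As one example, here is a minimal sketch of running an open-weight model locally with the Hugging Face transformers library, so no data leaves your machine. The model name is just one of many openly available options, and a reasonably capable machine (or a smaller model) is needed to run it.

```python
# Minimal local-inference sketch using the Hugging Face transformers library.
# The model weights are downloaded once and everything then runs on your own
# machine; no prompts or transcripts are sent to an external service.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # one openly available option
)

prompt = (
    "Suggest possible themes in the following interview excerpt:\n"
    "'...your anonymised transcript text here...'"
)
result = generator(prompt, max_new_tokens=200)
print(result[0]["generated_text"])
```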

There are some qualitative approaches where using a tool like ChatGPT would be fine, especially when working with large datasets in commercial and public sector settings, where nuance and data security aren’t required. Many organisations are looking for quick, quantitative summaries of qualitative data from surveys and feedback forms, and LLM tools can provide that, albeit with questionable accuracy at the moment. But I don’t think it is a good fit for most academic qualitative research, and we won’t be adding it or similar tools to Quirkos.

AI and large language models are also not new, as this great post from Christina Silver notes, and there’s also a great series of discussions on AI-based tools on YouTube. I’m not a Luddite (I run a software company!), but I think I know when something is over-hyped and when it isn’t right for a majority of our users.

There are machine learning or 'AI' algorithms beyond large language models, which are more appropriate for qualitative analysis. Our automated transcription model is one of them.

If you're doing qualitative analysis, you probably have a lot of audio and video data to transcribe. But most AI transcription options are environmentally unfriendly and share your data with third parties, making them a poor option for qualitative research with confidentiality requirements. Get the automated transcription with a difference, with Quirkos Transcribe. We encrypt your data end-to-end on our in-house, solar-powered transcription server, so no one else can view or access it except you. Try it out for FREE with a Quirkos Cloud trial!
With flexible canvas views, Quirkos makes qualitative analysis easy, fun and beautiful. Try for free today!

Quirkos makes simple, accessible qualitative analysis software that helps you use your own mind to interpret, explore and share your qualitative data. Try for free!