The dangers of data mining for text

 Alexandre Dulaunoy CC - flickr.com/photos/adulau/12528646393

There is an interesting new article out, which looks at some of the commonly used algorithms in data mining, and finds that they are generally not very accurate, or even reproducible.

 

Specifically, the study by Lancichinetti et al. (2015) looks at automated topic classification using the widely used latent Dirichlet allocation (LDA) algorithm, a machine learning process which takes a probabilistic approach to categorising and filtering large groups of text. It is a common approach in data mining.

 

But the Lancichinetti et al. (2015) article finds that, even using a well structured source of data such as Wikipedia, the results are, to put it mildly, disappointing. Around 20% of the time, the results did not come back the same, and when looking at a more complex group of scientific articles, reliability was as low as 55%.

 

As the authors point out, there has been little attempt to test the accuracy and validity of these data mining approaches, and they warn that users should be cautious about relying on inferences drawn using these methods. They then go on to describe a method that produces much better levels of reliability. Yet until now, most analyses would have carried this unknown level of inaccuracy: even if the test had been re-run with the same data, there is a good chance the results would have been different!
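To make "re-running on the same data gives different results" a bit more concrete, here is a minimal sketch of one way to score how far two runs of a topic model agree. It is not the measure used by Lancichinetti et al.; it assumes each run simply gives every document a single topic label, and it uses the Rand index (the share of document pairs that both runs treat the same way), which ignores the arbitrary numbering of topics. The function name and the labels are made up for illustration.

```python
from itertools import combinations


def rand_index(run_a, run_b):
    """Fraction of document pairs that both runs treat the same way.

    A pair counts as agreement if both runs put the two documents in the
    same topic, or both put them in different topics. Topic numbering is
    arbitrary between runs, so only co-membership is compared.
    """
    if len(run_a) != len(run_b):
        raise ValueError("both runs must label the same documents")
    pairs = list(combinations(range(len(run_a)), 2))
    if not pairs:
        raise ValueError("need at least two documents")
    agree = sum(
        (run_a[i] == run_a[j]) == (run_b[i] == run_b[j]) for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical topic labels for six documents, from runs on identical data.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same grouping, topics merely renumbered
run_3 = [0, 0, 0, 1, 2, 2]  # the third document has moved topic

print(rand_index(run_1, run_2))  # identical groupings
print(rand_index(run_1, run_3))  # one document moved
```

A score of 1 means the two runs grouped the documents identically; anything lower means that some documents changed company between runs, even though nothing about the data changed.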

 

This underlines one of the perils of statistical attempts to mine large amounts of text data automatically: it's too easy to do without really knowing what you are doing. There is still no reliable alternative to having a trained researcher and their brain (or even an average person off the street) reading through text and telling you what it is about. The forums I engage with are full of people asking how they can do qualitative analysis automatically, and if there is some software that will do all their transcription for them, but the realistic answer is that nothing like this currently exists.

 

Data mining can be a powerful tool, but it is essentially all based on statistical probabilities, churned out by a computer that doesn't know what it is supposed to be looking at. Data mining is usually a process akin to giving your text to a large number of fairly dumb monkeys on typewriters. Sure, they'll get through the data quickly, but odds are most of it won't be much use! Like monkeys, computers don't have that much intuition, and can't guess what you might be interested in, or what parts are more emotionally important than others.

 

The closest we have come so far is probably a system like IBM's Watson computer, a natural language processing machine which requires a supercomputer with 2,880 CPU cores and 16 terabytes of RAM (16,384GB), and is essentially doing the same thing: a really, really large number of dumb monkeys, and a process that picks the best looking stats from a lot of numbers. If loads of really smart researchers programme it for months, it can then win a TV show like Jeopardy. But if you wanted to win Family Feud, you'd have to programme it again.

 

Now, a statistical overview can be a good place to start, but researchers need to understand what is going on, look at the results intelligently, and work out what parts of the output don't make sense. And to do this well, you still need to be familiar with some of the source material, and have a good grip on the topics, themes and likely outcomes. Since a human can't read and remember thousands of documents, I still think that for most cases, in-depth reading of a few dozen good sources probably gives better outcomes than statistically scan-reading thousands.

 

Algorithms will improve, as outlined above, and as computers get more powerful and data gets more plentiful, statistical inferences will improve too. But until then, most users are better off with a computer as a tool to aid their thought process, not to provide a single statistical answer to a complicated question.

 

Help us welcome Kristin to Quirkos!

So far, Quirkos users have mostly been based in the academic and university based research areas: perhaps not surprising considering where the project grew from. However, from very early on we got a lot of positive feedback from market research companies working with qualitative and text based data, who had many of the same frustrations and issues with qualitative research software that we had in the academic sphere. Indeed, some of the early alpha-testers of Quirkos were based in a typical small, independent market research firm.

 

But it's not really possible to lump all of these groups of researchers together: they have different needs, not just in terms of features in the software (although most of these are very similar), but also in terms of support and case studies. Qualitative market researchers need to engage with their clients in a different way, often using dynamic and visual approaches that Quirkos is ideally suited for.

 

So, to this end, we are very excited to announce a new recruit to the Quirkos offices: Kristin Schroeder, who will be focusing on market research and commercial users. Kristin studied Modern History at Merton College, Oxford, but is a native of the Baltic coast in Germany, and an avid sci-fi fan. She brings with her nearly a decade of sales experience working in Northern Ireland with large commercial clients for global automotive supplier Ryobi. Her extensive track record of international engagement will enable us to work better with users in the UK and abroad.

 

This will allow Daniel to continue his focus on supporting the researchers he knows best, in academia and the public sector, while Kristin can help Quirkos grow into new areas, helping more researchers across the globe to find answers to their questions.

 

New Leith offices for Quirkos

Just in time for the new year, Quirkos is growing!

 

We now need a bigger office to accommodate new hires, so we've moved to the 'Shore' at Leith, the seafront of Edinburgh. We've now got space to grow further, and to entertain visitors, all within walking distance of the sea and a short trip from the centre of Edinburgh. There are many exciting companies around us, and we are happy to be in such a nice part of town, with a different place for lunch and coffee every day of the month!

 

Our new address is:

27 Ocean Drive
Leith
Edinburgh
EH6 6JL

And we've got a new phone number too: 0131 555 3736.

 

If you are coming to visit us, please let us know in advance, but the best bet is to set your sat-nav for Tower Place, a little cul-de-sac next to us which usually has some parking. We are just on the corner with Ocean Drive.

Happy new year to you all, and hope to see you soon!

Don't share reports with clients, share your data!

When it comes to sharing findings and insight with colleagues and clients, the procedure is usually the same. Create a written summary report, deliver the PowerPoint presentation, field any questions, repeat until everyone is happy.

 

But this approach tends to produce static, uninspiring reports, and presentations that lack interaction. This often necessitates further sessions if clients or colleagues have questions that can't be directly answered, want additional clarifications, or want the data explored in a different way. And the final reports don't always have the life we'd want for them, ending up on a shelf, or buried in a bulging inbox.

 

But what if, rather than sharing a static report, you could actually share the whole research project with your clients? What if, rather than sending a PowerPoint deck, you could send them all of the data, and let them explore it for themselves? That way, if one of the clients is interested in looking at results from a particular demographic group, they can see it themselves, rather than asking for a report to be generated. If another client wants to see all the instances of negative words being used to describe their brand, they can see all the quotes in one click, and in another all the positive words.

 

In many situations, this would seem like an ideal way to engage with clients, but usually it cannot be facilitated. To send clients a copy of all the data in the project, transcripts, nodes, themes and all would be a huge burden for them to process. Researchers would also assume that few clients would be sufficiently versed in qualitative analysis software to be able to navigate the data themselves.

 

But Quirkos takes a different approach, which opens up new possibilities for sharing data with end users. As it is designed to be usable by complete novices at qualitative research, your project file and the software interface itself can be used as a feedback tool. Send your clients the project data in a Quirkos file, with a copy of the software that runs live from a USB stick. You can even give them an Android tablet with the data on, which they can explore with a touch interface. They can then quickly filter the data however they like, see all the responses you've coded, or even rearrange your themes or nodes in ways that make sense for them. The research team have collected the data, transcribed and coded it, but clients can get a real sense of the findings, running searches and queries to explore anything of interest to them.

 

And even when you are doing a presentation, while Quirkos will generate visual graphs and overviews of the data to include as static image files in PowerPoint, why not bring up Quirkos itself, and show the data as a live demonstration? You can show how themes are related, run queries for particular demographic segments, and start a really interactive discussion about the data, where you can field answers to queries in real time, generating easy to understand graphical displays on the fly. Finally, you can generate those static PDF or Word reports to share and cement your insights, but they will have come as the result of the discussion and exploration of the project you did as collaborators.

 

Isn't it time you stopped sharing dry reports, and started sharing answers?

 

Quirkos launch workshop

This week we had our official launch event for Quirkos, a workshop at the Institute of Education in London, but hosted by the University of Surrey CAQDAS network.

It was a great event, with tea and cake, and more than 30 people turning up on the day. Participants learnt about the philosophy behind Quirkos, how it fits in with the other qualitative analysis software packages on the market, and got an extensive interactive workshop session. We got some great feedback from participants, who seemed really enthusiastic about the potential for using Quirkos in their research, and lots of new ideas to take the project forward.

It is always invaluable to get feedback from new users, and the questions and suggestions raised will all be taken to heart in the next few months, helping us to improve our training and support, and add new features to make working with qualitative data even easier. Inevitably we also found a bug with creating new sources in the Mac version, and we are hoping to have a fix for this by the end of next week.

It was also a good time for reflection, with Quirkos having now been available for two months. Interest has been amazing, and we already have customers from the UK, USA, Canada, and Australia, and have exceeded the number of licences we expected to sell at this stage! However, this is just the beginning, and in the new year Quirkos will be growing, allowing us to offer a better service and more rapid improvements. We are moving into new offices, staying in Edinburgh, but moving down to the port of Leith, right on the sea front. We will also bring on new staff to focus on the commercial and market research sectors, and help us be better focused for users with different needs.

It's also interesting how enthusiastic people have been about the participatory opportunities which they can envisage using Quirkos, and this is going to be a major focus for us. We are going to start some open-data community projects in the new year, that will provide some great examples of how Quirkos can help participants get engaged with research.

Keep watching this space in the new year for more information about our move, and for introductions to the new faces who will be joining Quirkos HQ!

Is qualitative data analysis fracturing?

Having been to several international conferences on qualitative research recently, I have heard a lot of discussion about the future of qualitative research, and the changes happening in the discipline and society as a whole. A lot of people have been saying that acceptance of qualitative research is growing in general: not only are there a large number of well-established specialist journals, but mainstream publications are accepting more papers based on qualitative approaches.


At the same time, there are more students in the UK at all levels, but especially starting Masters and PhD studies as I’ve noted before. While some of these students will focus solely on qualitative methods, many more will adopt mixed methods approaches, and want to integrate a smaller amount of qualitative data. Thus there is a strong need, especially at the Masters by research level, for software that’s quicker to learn, and can be well integrated into the rest of a project.


There is also the increasing necessity for academic researchers to demonstrate impact for their research, especially as part of the REF. There are challenges involved with doing this with qualitative research, especially summarising large bodies of data, and making them accessible for the general public or for targeted end users such as policy makers or clinicians. Quirkos has been designed to create graphical outputs for these situations, as well as interactive reports that end-users can explore in their own time.


But another common theme that has emerged is the possibility of the qualitative field fracturing as it grows. It seems that there are at least three distinct user groups emerging: firstly there are the traditional users of in-depth qualitative research, the general focus of CAQDAS software. They are experts in the field, are experienced with a particular software package, and run projects collecting data with a variety of methods, such as ethnography, interviews, focus groups and document review.


Recently there has been increased interest in text analytics: the application of ‘big data’ to quantify qualitative sources of data. This is especially popular in social media, looking at millions of Tweets, texts, Facebook posts, or blogs on a particular topic. While commonly used in market research, there are also applications in social and political analysis, for example looking at thousands of newspaper articles for portrayal of social trends. This ‘big data’ quantitative approach has never been a focus of Quirkos’ approach, although there are many tools out there that work in this way.


Finally, there is increasing interest in qualitative analysis from more mainstream users, people who want to do small qualitative research projects as part of their own organisation or business. Increasingly, people working in public sector organisations, HR or legal have text documents they need to manage and gain a deep understanding of.


Increasingly it seems that a one-size-fits-all solution to training and software for qualitative data analysis is not going to be viable. It may even be the case that different factions of approaches and outcomes will emerge. In some ways this may not be too dissimilar to the different methodologies already used within academic research (i.e. grounded / emergent / framework analysis), but the numbers of ‘researchers’ and the variety of paradigms and fields of inquiry look to be increasing rapidly.


These are definitely interesting times to be working in qualitative research and qualitative data analysis. My only hope is that if such ‘splintering’ does occur, we keep learning from each other, and we keep challenging ourselves by exposure to alternative ways of working.

 

 

First Quirkos qualitative on-line workshop - 25th Nov 2014

Places are filling up now for our London launch and workshop on the 9th of December, but you can still come along for a free lunch by booking at this link.

 

However, we will soon be running the first of our monthly on-line workshops on the 25th of November, 4pm (GMT). That's 5pm for most of Western Europe, 11:00am EST on the East Coast of America, and 8:00am PST for those on the West Coast.
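If you are joining from elsewhere and want to double-check the start time, the conversion can be done with a short script. This is only a minimal sketch: it uses fixed winter-time UTC offsets, which is an assumption that holds for late November when none of these regions are on daylight saving time, and the variable names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter-time offsets from UTC (assumed: no daylight saving in
# late November for any of these regions).
GMT = timezone(timedelta(hours=0), "GMT")
zones = {
    "CET": timezone(timedelta(hours=1), "CET"),
    "EST": timezone(timedelta(hours=-5), "EST"),
    "PST": timezone(timedelta(hours=-8), "PST"),
}

# Workshop start: 25th of November 2014, 4pm GMT.
workshop = datetime(2014, 11, 25, 16, 0, tzinfo=GMT)

# Convert the same instant into each local time.
local_times = {
    name: workshop.astimezone(tz).strftime("%H:%M") for name, tz in zones.items()
}

for name, local in local_times.items():
    print(f"{name}: {local}")
```

To check another region, add its winter offset from UTC to the `zones` dictionary.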

 

You can get more information and watch the live stream by following this link to our Google Hangouts page. You don't need to register with Google to take part in the seminar, and you can just watch, or interact with video, audio or chat as you wish. We will be covering a very quick overview, but mostly focusing on some of the more advanced features, such as queries, source and node management, and customising interactive reports.

 

There will be plenty of time for questions and discussion, and you can follow along using the example files provided, or just watch and learn. For this workshop, we won't cover much qualitative theory or methodology, just using Quirkos for qualitative analysis.

 

In future, we will also host some workshops and discussions using Skype, just to try and make sure we are accessible to as many people as possible. We'll also try hosting workshops on different days of the week and at different times of the day, so that people in different parts of the world can take part.

 

Hope to see some of you there! In the meantime, you can always ask questions by e-mailing info@quirkos.com

 

 

QHR2014 and Victoria, BC

It's been a busy month, starting with our public launch, and including our first international conference, Qualitative Health Research 2014, hosted by the International Institute for Qualitative Methodology at the University of Alberta.

The conference created a great environment to present and discuss qualitative work, in a very supportive and productive atmosphere. I also presented research from the EEiC project with Ghazala Mir as part of a larger symposium on organisational ethnography, and the slides are now up on ResearchGate.

There was lots of interest in Quirkos, especially from educators looking to introduce qualitative research to beginners, and those wanting to include respondents in the analysis process, for end-to-end participatory research.

This last approach linked in with an engaging and challenging keynote by Dr Margarete Sandelowski, who talked about what she called 'member-checking', something I usually refer to as 'participant validation'. This is essentially the process of getting respondents to the research project to look through and validate outputs, which could be either themes, transcripts, conclusions, or reports. Dr Sandelowski raised many good points about the methodological issues that this can have for a research project. For example, she was concerned that participants could later change their minds about how they perceived events, retract permission for their data to be included, not understand conceptual frameworks being proposed by academics, or be overwhelmed or upset by seeing transcripts of interviews. There was unfortunately very little advice offered for overcoming these hurdles, a point raised in the question and answer session.

Personally, I think that qualitative research is always an ongoing dialogue between researcher and participants, and if participants change their minds about their opinions, or no longer want certain statements to be included, this is a 'no-brainer' situation - those statements are removed. I would never consider this to be a moral conflict - that data cannot be included, no argument. I have had many situations where participants have wanted certain sections of their interviews not to be included in the research, even when anonymised, sometimes because they felt a risk to their privacy or career. Sure, it was sad to have interesting bits of data removed from the project, but that was the right of the participant, and I would never have considered including it anyway!

Secondly, as my colleague Dr Ghazala Mir pointed out, it is the responsibility of researchers to explain the theories and models used in the research in a way that participants can understand. Ultimately, impactful research must eventually explain itself to a general public audience, and participant engagement can be a great way to test and trial this.

I believe that the more research is a collaboration with participants the better, both methodologically and ethically. I also think that consent should be part of a continuous dialogue, not just a one-off event. This will inevitably raise issues during the study, but these are not insurmountable obstacles: they are considerations that can be anticipated, with time set aside for dealing with them. For me, this is the key to being more inclusive, producing better results, and moving away from the narrow positivist approach so often associated with purely quantitative inquiry.

Quirkos is launched!


It's finally here!

From today, anyone can download the full 1.0 release version of Quirkos for Windows or Mac OS X! Versions for Linux and Android will be appearing later in the month, but since Windows and Mac account for most of our users, we didn't want people to wait any more.

Everyone can use the full version for free for one month, with no restrictions. At the end of the 30 day trial period, you'll need to order a licence to keep using Quirkos, which you can either do by raising a purchase order with us, or by making an immediate credit/debit card payment on the website, which will get a licence code e-mailed to you in just a few minutes.

I really want to thank everyone who has provided feedback, suggestions and critique over the last 14 months; Quirkos wouldn't be half as good as it is now without all that input. And it's really exciting to share it with everyone now, and to hear about the research projects people are already putting together around Quirkos. Watch this space for some great case studies in the next few months!

 

Announcing Pricing for Quirkos

At the moment (touch wood!) everything is in place for a launch next week, which is a really exciting place to be after many years of effort. From that day, anyone can download Quirkos, try it free for a month, and then buy a licence if it helps them in their work. We've set up the infrastructure so that people can either place purchase orders through their finance department, or make a direct purchase through the website by credit or debit card. We can then provide a licence code immediately, and users can unlock Quirkos and use it without any time limit. We don’t want to tie people into contracts or recurring payments; the licence will not expire, and will entitle you to any future updates for that version.

 

The interest we’ve had from users over the past few months has been overwhelming, and we want to have a flexible price structure that is appropriate for lots of different groups. One of my key aims has been to systematically remove the barriers to doing qualitative research – and price is a big hurdle at the moment. I’ve had conversations with so many people who have taken one look at the licence costs of the major qualitative analysis packages, and walked away. To really open up qualitative research for everyone, that needs to change. Our licence will cost roughly half that of our competitors', and we will offer a range of discounts for teams from different backgrounds.

 

First of all, we think Quirkos will be great for students, not just at a PhD level, but also at Masters or Undergraduate level, when there isn’t always the time to spend learning other qualitative research software. So, we are starting the student licence at £35 (roughly €45, US$60), so that people at all stages of learning can get started with qualitative research.

 

For professional academics and people working in the charity sector, we will heavily discount the licence cost to £180 (€230 / $290). We have already had beta-testers in the NHS and local government, and users in government institutions or NGOs can get a licence for just £320 (€400 / $516).

 

Finally, the full licence for commercial use will be £390 (€490 / $620) and comes with our highest level of customer support. Everyone will be able to access regularly updated discussion forums and on-line learning materials, and professional users will also have access to personal e-mail support with a rapid response rate.

 

We really want to encourage a new generation of qualitative researchers and we think we’ve set a fair price that makes access easy, while allowing us to continue to add new features, and provide a strong level of support. Then you can focus on your data and findings, and not just the tools that help you get results.

 

(These are initial indicative prices, subject to change, and currency rates, local sales tax or VAT may lead to some variation in these numbers)