The dangers of data mining for text


There is an interesting new article out, which looks at some of the algorithms commonly used in data mining, and finds that they are generally not very accurate, and not even reproducible.


Specifically, the study by Lancichinetti et al. (2015) looks at automated topic classification using latent Dirichlet allocation (LDA), a machine learning algorithm that takes a probabilistic approach to categorising and filtering large groups of text, and a common approach in data mining.


But the Lancichinetti et al. (2015) article finds that, even using a well-structured source of data such as Wikipedia, the results are, to put it mildly, disappointing. Around 20% of the time the results did not come back the same, and when looking at a more complex group of scientific articles, reliability was as low as 55%.


As the authors point out, there has been little attempt to test the accuracy and validity of these data mining approaches, and they warn that users should be cautious about relying on inferences made using these methods. They then go on to describe a method that produces much better levels of reliability. Yet until now, most analyses would have carried this unknown level of inaccuracy: even if the test had been re-run with the same data, there is a good chance the results would have been different!
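To make "the results did not come back the same" a little more concrete, here is a minimal sketch of one way to score how far two runs of a topic model agree. It is not the measure used by Lancichinetti et al.; it is the plain Rand index, and the function name `pairwise_agreement` and the two label lists are made up for illustration. It assumes each document has been given a single dominant topic in each run.

```python
from itertools import combinations


def pairwise_agreement(run_a, run_b):
    """Fraction of document pairs on which two runs agree (the Rand index).

    Two runs agree on a pair if both put the two documents in the same
    topic, or both put them in different topics. The topic numbers are
    arbitrary labels, so only co-membership is compared.
    """
    if len(run_a) != len(run_b):
        raise ValueError("both runs must label the same documents")
    pairs = list(combinations(range(len(run_a)), 2))
    if not pairs:
        return 1.0  # nothing to disagree about
    agree = sum(
        (run_a[i] == run_a[j]) == (run_b[i] == run_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical dominant-topic labels for six documents from two runs
# of the same model on the same data (illustrative values only).
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 0, 2]

print(pairwise_agreement(run_1, run_1))  # identical runs
print(pairwise_agreement(run_1, run_2))  # runs that partly disagree
```

A score of 1.0 means the two runs grouped the documents identically, even if the topic numbers were shuffled. Anything lower means that some pairs of documents were grouped together in one run and split apart in the other.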


This underlines one of the perils with statistical attempts to mine large amounts of text data automatically: it's too easy to do without really knowing what you are doing. There is still no reliable alternative to having a trained researcher and their brain (or even an average person off the street) reading through text and telling you what it is about. The forums I engage with are full of people asking how they can do qualitative analysis automatically, and if there is some software that will do all their transcription for them, but the realistic answer is that nothing like this currently exists.


Data mining can be a powerful tool, but it is essentially all based on statistical probabilities, churned out by a computer that doesn't know what it is supposed to be looking at. Data mining is usually a process akin to giving your text to a large number of fairly dumb monkeys on typewriters. Sure, they'll get through the data quickly, but odds are most of it won't be much use! Like monkeys, computers don't have that much intuition, and can't guess what you might be interested in, or what parts are more emotionally important than others.


The closest we have come so far is probably a system like IBM's Watson computer, a natural language processing machine which requires a supercomputer with 2,880 CPU cores and 16 terabytes of RAM (16,384GB), and is essentially doing the same thing: a really, really large number of dumb monkeys, and a process that picks the best looking stats from a lot of numbers. If loads of really smart researchers programme it for months, it can then win a TV show like Jeopardy. But if you wanted to win Family Feud, you'd have to programme it again.


Now, a statistical overview can be a good place to start, but researchers need to understand what is going on, look at the results intelligently, and work out what parts of the output don't make sense. And to do this well, you still need to be familiar with some of the source material, and have a good grip on the topics, themes and likely outcomes. Since a human can't read and remember thousands of documents, I still think that for most cases, in-depth reading of a few dozen good sources probably gives better outcomes than statistically scan-reading thousands.


Algorithms will improve, as outlined above, and as computers get more powerful and data gets more plentiful, statistical inferences will improve too. But until then, most users are better off with a computer as a tool to aid their thought process, not to provide a single statistical answer to a complicated question.


Why qualitative research?

There are lies, damn lies, and statistics

It’s easy to knock statistics for being misleading, or even misused to support spurious findings. In fact, there seems to be a growing backlash against the automatic way that significance tests in scientific papers are assumed to be the basis for proving findings (an article neatly rebutted here in the aptly named post “Give p a chance!”). However, I think most of the time statistics are actually undervalued. They are extremely good at conveying succinct summaries about large numbers of things. Not that there isn’t room for more public literacy about statistics, a charge that can be levelled at many academic researchers too.

But there is a clear limit to how far statistics can take us, especially when dealing with complex and messy social issues. These are often the result of intricately entangled factors, decided by fickle and seemingly irrational human beings. Statistics can give you an overview of what is happening, but they can’t tell you why. To really understand the behaviour and decisions of an individual, or a group of actors, we need to get an in-depth knowledge: one data point in a distribution isn’t going to have enough explanatory power.

Sometimes, to understand a public health issue like obesity, we need to know about everything from supermarket psychology that promotes unhealthy food, to how childhood depression can be linked with obesity. When done well, qualitative research allows us to look across societal and personal factors, integrating individuals’ stories into a social narrative that can explain important issues.

To do this, we can observe the behaviour of people in a supermarket, or interview people about their lives. But one of the key factors in some qualitative research is that we don’t always know what we are looking for. If we explicitly go into a supermarket with the idea that watching shoppers will prove that supermarket two-for-one offers are causing obesity, we might miss other issues: the shelf placement of junk food, or the high cost of fresh vegetables. In the same way, if we interview someone with set questions about childhood depression, we might miss factors like time needed for food preparation, or cuts to welfare benefits.

This open-ended, inductive analytical approach (sometimes called ‘semi-structured’) is one of the most difficult, but most powerful, methods of qualitative research. Collecting data first, and then using grounded theory in the analytic phase to discover underlying themes from which we can build hypotheses, sometimes seems like backward thinking. But when you don’t know what the right questions are, it’s difficult to find the right answers.

More on all this soon…