Unleash the Power of Text Mining

In 2017, we have a great toolbox of informative methods to help us analyse large volumes of text. Sentiment analysis, topic modelling, and named entity recognition are but a few of these exciting approaches. Computational power and storage capacity are not the limiting factors on what we could do with the 100 million or so journal articles that comprise the ever-growing research literature so far. But the continued observance of 17th-century limitations on how we can use research is simply jarring. Thanks to computers and the internet, we have the ability to do wonderful things, but the licensing and access restrictions placed on most of the research literature explicitly and artificially prevent most of us from trying. As a result, few researchers bother thinking about using text mining techniques. It is often simpler and easier to just farm out repetitive, large-scale literature analysis tasks to an array of student minions and volunteers to do by hand, even though computers could and perhaps should be doing these analyses for us.

Inadequate computational access to research has already caused us great harm. Just ask the Ministry of Health in Liberia: they were not pleased to discover, after a lethal Ebola virus outbreak, that “forgotten papers” published in the 1980s had clearly warned that the Ebola virus might be present in Liberia. This vital knowledge wasn’t in the title, keywords, metadata, or abstract; it was locked away in the full text, completely hidden behind a paywall. Full-text mining approaches would have easily found this buried knowledge and would have provided vital early warning that Ebola could come to Liberia, which might have prevented some deaths during the West African Ebola virus epidemic (2013–2016).

Some subscription-based publishers have been known to use ‘defence’ mechanisms such as ‘trap URLs’ that hinder text miners, making it even harder to do basic research. Other subscription publishers, like Royal Society Publishing, are helpfully supportive of text miners, as are open access publishers. Hindawi, for instance, allows anyone to download every single article they’ve ever published with a single mouse-click. Thanks to open licensing, aggregators like Europe PubMed Central can bring together the outputs of many different OA publishers, making millions of articles available with a minimum of fuss. It is “no bullshit” access. You want it? You can have it all. No need to beg permission, to spend months negotiating and signing additional contracts, or to use complicated publisher-controlled access APIs and their associated restrictions. Furthermore, OA publishers typically provide highly structured full-text XML files, which make things even easier for text miners. But only a small fraction of the research literature is openly licensed open access. It’s for these reasons and more that many of the best text-mining researchers operate on, and enrich our understanding of, open access papers only (e.g. Florez-Vargas et al. 2016).
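To give a flavour of why structured full-text XML matters, here is a minimal Python sketch. It assumes a JATS-style layout (`article-title`, `abstract`, and `body` elements), and the sample article, the helper names, and the search term are all invented for illustration; they are not taken from any real paper or publisher feed. It simply reports which parts of an article mention a term, which is roughly the kind of check that a title-and-abstract search cannot perform.

```python
import xml.etree.ElementTree as ET

# Invented toy article in a JATS-like layout (not a real paper).
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <title-group><article-title>A serological survey</article-title></title-group>
      <abstract><p>We report a survey of antibody prevalence.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec><title>Results</title>
      <p>Antibodies to virus X were detected in samples from region Y.</p>
    </sec>
  </body>
</article>"""


def section_text(root, path):
    """Return the whitespace-normalised text under the first element matching path."""
    node = root.find(path)
    if node is None:
        return ""
    return " ".join(" ".join(node.itertext()).split())


def where_mentioned(xml_string, term):
    """Report, per section, whether the term occurs (case-insensitive)."""
    root = ET.fromstring(xml_string)
    term = term.lower()
    sections = {
        "title": section_text(root, ".//article-title"),
        "abstract": section_text(root, ".//abstract"),
        "body": section_text(root, "./body"),
    }
    return {name: term in text.lower() for name, text in sections.items()}


hits = where_mentioned(SAMPLE_XML, "region Y")
print(hits)  # {'title': False, 'abstract': False, 'body': True}
```

In this toy case the term appears only in the body, so a search restricted to titles and abstracts would miss it entirely. Real publisher XML will vary in structure, so treat the element paths above as a starting assumption rather than a guarantee.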

So if I had but one wish this Christmas, it would be for the artificial, legally imposed restrictions on the bulk download and analysis of research texts to be unambiguously removed for everyone, worldwide, so that no researcher need fear imprisonment or other punitive action simply for doing justified and ethical academic research. Unchain the literature, and we might be able to properly unleash and apply the collected knowledge of humanity.

This post was written by Ross Mounce and originally posted here.