Machine learning ‘causing science crisis’

The recent post about the reproducibility crisis in science is, to someone who works in science, not big news. Difficulty reproducing experimental results is an age-old problem. Reproducibility matters because if an experiment’s results can be replicated by others, it confirms the validity of the research and its conclusions. On the other hand, a rerun of the experiment which yields different results casts serious doubt on the original findings. Such difficulties should be avoidable through rigorous experimental design, but this does not always happen, often as a result of compromises forced by budgetary constraints. As new analysis techniques become more widely adopted, new systematic sources of irreproducibility are emerging. Dr Genevera Allen from Rice University in Houston has warned that the increased use of machine learning is itself becoming a significant factor in the reproducibility of experiments.

Machine learning is an incredibly powerful tool, and is rightly being taken advantage of by researchers to help them analyse the large quantities of data generated by most modern experiments. Used in the right way, ML makes it possible to work with volumes of data which would previously have been too large to contemplate. Not only can ML techniques deal with large volumes of data, they are also renowned for identifying patterns which might elude more conventional analysis. These methods are undoubtedly helping to move scientific research forward in new directions.

But for all its virtues, ML is no magic bullet. For a machine learning project to be effective, the model needs to be trained with sufficient high-quality, well-distributed data. The training data may be real measurements, or synthetic data generated to augment small datasets using generative adversarial networks (GANs). Having enough good-quality data with which to train the model is vitally important, and the larger the training set, the better the result. But that very quantity means that verifying the data can be extremely costly. On top of this data preparation challenge, most published research papers devote a great deal of attention to the model architecture, since this can be an interesting (even glamorous) pursuit in its own right. Less time tends to be devoted to securing large datasets which represent the full domain in an accurate way. All of this can lead to flaws in the model; as always, bad data in, bad results out.
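To give a sense of how cheap some basic sanity checks on training data can be, here is a minimal sketch (the helper name and the 3:1 threshold are invented for illustration, not drawn from any particular pipeline) which flags a training set whose most common label heavily outnumbers its rarest:

```python
from collections import Counter

def check_label_balance(labels, max_ratio=3.0):
    """Return (balanced, counts): balanced is False when the most common
    class outnumbers the rarest by more than max_ratio."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return ratio <= max_ratio, counts

# A 90/10 split between two classes -- badly imbalanced.
ok, counts = check_label_balance(["benign"] * 90 + ["malignant"] * 10)
print(ok, dict(counts))  # False {'benign': 90, 'malignant': 10}
```

Even a check this simple would catch the 90/10 split above before any training time is spent; real pipelines would of course also need to inspect per-class label accuracy and data quality, which is far harder to automate.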

Google fell foul of this last year when it created a ‘racist algorithm’ because of mislabelled training data. That is bad enough, but what if a similar mistake led to a computer labelling a malignant lung tumour as benign, and somebody actually died as a result?

A better approach

A shift in priorities and working practices is sorely needed in order to deal with these pitfalls, and this is indeed beginning to take place. This shift will be accompanied by specialist tools to facilitate the task of data preparation and labelling in an efficient and focussed way. Using the right tool, what would otherwise be a slow and laborious task becomes an engaging and high-throughput process. Zegami is one such tool, which is being successfully applied by researchers both for the labelling of training sets of images, and for validating the classifications generated by trained models.

The default procedure for doing this would involve adding entries to a spreadsheet row by row, whilst repeatedly referring to a separate folder of image thumbnails. An alternative might be to build an in-house tool, but this diverts researchers away from doing any actual science, and scientists are often not great at building easy-to-use interfaces.

[Screenshot: the Zegami user interface]

Zegami, however, combines both images and data as a single entity, within a powerful but intuitive user interface. Every image in the dataset is arranged in a single zoomable plane along with any associated data values. Numerous tools allow these images to be arranged, grouped or filtered by any parameter available in the metadata, or by visual similarity metrics (themselves derived using unsupervised machine learning methods such as t-SNE). In this way, the set of images can be systematically explored, and entire clusters of images can be efficiently and accurately labelled with just a few clicks. Less clear-cut cases can be seamlessly zoomed to maximum resolution for full scrutiny.
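To give a flavour of the kind of similarity layout described above, here is a minimal sketch using scikit-learn’s t-SNE, with random vectors standing in for per-image feature embeddings (the clusters, dimensions and parameters here are invented purely for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-ins for per-image feature vectors (e.g. embeddings from a CNN):
# two loose clusters of 30 "images" each, in 64 dimensions.
features = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 64)),
    rng.normal(3.0, 0.5, size=(30, 64)),
])

# Project to 2-D so that visually similar images land near each other.
coords = TSNE(n_components=2, perplexity=10,
              random_state=0).fit_transform(features)
print(coords.shape)  # (60, 2)
```

Points that are close together in the high-dimensional feature space end up close together in the 2-D plane, which is what makes it possible to label visually similar images as a cluster rather than one at a time.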

[Figure: a collection of microscopy images viewed in Zegami, with problem images circled]

The above figure is a screenshot of a collection of microscopy images. With Zegami you can quickly get an overview of your dataset and look for problems. The bottom two circled images show badly photographed specimens, and the top circle shows a pair of tweezers which has accidentally been photographed. If you were using these data to train a machine learning model, such images could cause trouble!
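Spotting such stray images need not be entirely manual, either. As a rough sketch (assuming each image has already been reduced to a feature vector; the helper name and thresholds are invented for illustration), images whose nearest neighbours are unusually far away can be flagged for human review:

```python
import numpy as np

def flag_outliers(features, k=5, threshold=2.5):
    """Flag rows whose mean distance to their k nearest neighbours is
    unusually large (z-score above threshold) for the collection."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # ignore self-distances
    knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)
    z = (knn_mean - knn_mean.mean()) / knn_mean.std()
    return np.where(z > threshold)[0]

rng = np.random.default_rng(1)
normal = rng.normal(0.0, 1.0, size=(99, 16))      # ordinary specimens
oddball = np.full((1, 16), 10.0)                  # e.g. the stray tweezers shot
features = np.vstack([normal, oddball])
print(flag_outliers(features))  # [99]
```

In this toy collection only the one far-away “tweezers” vector is flagged; in practice the flagged images would be surfaced for a person to confirm, relabel or discard.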

Using Zegami alongside machine learning takes much of the guesswork out of annotation, helping to ensure that your datasets are complete, free from bias, and do not contain poor examples which may skew standard machine learning models.

Zegami’s other strength is that it works in any web browser, so we can easily make these datasets available online for our collaborators, or anyone else, to browse. This allows other people to look at and analyse the data, and provides a feedback mechanism for reporting mistakes.

Machine Learning is here to stay in academia, and also increasingly in business. But it’s important to cut through the hype and understand that, like any process, it relies on sound input. Effort expended on the algorithms themselves must be balanced by commensurate effort invested in the data which drives those algorithms. By shortcutting this step, scientists and businesses risk arriving at a model which falls well short of its potential value, or worse, one with no genuine value at all. However, by adopting the correct approach, aided by the right tools, the effort required to ensure a successful outcome can be dramatically reduced.