Automated image and data clustering

We’ve recently added a powerful unsupervised learning capability to Zegami to help our users gain additional insights from their image and data collections. Here at Zegami we’ve been using clustering techniques for a while, on both the images and the data of various collections we’ve made, and have found them to be incredibly useful tools.

Clustering provides a view whereby objects are laid out according to their similarity, which can be based on either data or images.  Similar items are positioned close together, while items that differ substantially are further apart. The result resembles a landscape or map of the data. 

An integrated solution 

Up until now, we’ve treated this technique as a data preprocessing step: something added to the data ‘offline’ before uploading it to Zegami. But clustering has proven so valuable that we’ve added it as an integrated step in our processing platform, so that it can be performed for every new collection created in Zegami.

This example collection contains 20k images of dogs of varying breeds. The image similarity data was generated automatically. 

What is it useful for? 

Besides offering a fascinating representation of the data, this view provides some immediate practical benefits: 

Outlier detection 

Items which differ significantly from the crowd often reflect faulty images or incorrect data rows, so finding them is important for maintaining data quality. Similarity clustering makes such outliers immediately visible by positioning them apart from the main body of items.
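As a rough illustration of the idea (not a description of Zegami’s internals), outliers in a 2D embedding can be flagged by how far each point sits from its nearest neighbours; the neighbour count and threshold below are arbitrary choices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_outliers(xy, k=10, quantile=0.99):
    """Flag points whose mean distance to their k nearest
    neighbours is unusually large. `xy` is an (n, 2) array of
    embedding coordinates."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(xy)
    dists, _ = nn.kneighbors(xy)        # column 0 is the point itself
    mean_dist = dists[:, 1:].mean(axis=1)
    return mean_dist > np.quantile(mean_dist, quantile)
```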

Efficient labelling/categorisation 

Many workflows involve tagging or categorising items, often to prepare data for machine learning training, or to assess the results of an experiment. 

Data clustering enables a more efficient process. By working on entire clusters of items rather than one item at a time, it’s possible to gain huge improvements in labelling productivity.
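As a minimal sketch of the idea (assuming scikit-learn and a 2D embedding; the cluster count and breed names are purely illustrative, not Zegami’s internals):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2D embedding coordinates; in practice these come
# from the dimensionality reduction step described below.
rng = np.random.default_rng(0)
xy = rng.normal(size=(1000, 2))

# Group the points into clusters, then label whole clusters at
# once instead of tagging items one by one.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(xy)

labels = np.full(len(xy), None, dtype=object)
cluster_names = {0: "terrier", 3: "spaniel"}  # assigned after visual inspection
for cluster_id, name in cluster_names.items():
    labels[clusters == cluster_id] = name
```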

Zegami’s cluster filter tool provides further benefits, allowing clusters to be selected on a scatter plot, while the members of the cluster can be examined using one of Zegami’s other views. 

Cluster exploration 

Combining the cluster view with our colour-by-column feature is a great way to examine which properties are influencing the clusters. Zegami’s easy-to-use filters and views provide another means of exploration.

In this example, we filter on a visual similarity region, with the results grouped by breed.
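Outside Zegami, a rough equivalent of the colour-by-column idea can be sketched with matplotlib; the coordinates and the ‘breed’ column below are hypothetical stand-ins for a real collection’s data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical metadata table: one row per image, with the x/y
# coordinates produced by the clustering step plus a data column.
df = pd.DataFrame({
    "x": [0.1, 0.3, 2.1, 2.4],
    "y": [1.0, 1.2, 0.2, 0.4],
    "breed": ["terrier", "terrier", "spaniel", "spaniel"],
})

# Colour each point by a categorical column to see how well that
# property lines up with the clusters.
codes = df["breed"].astype("category").cat.codes
plt.scatter(df["x"], df["y"], c=codes, cmap="tab10")
plt.show()
```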

How does it work? 

Our data clustering process uses the UMAP dimensionality reduction algorithm to distil the many columns present in the data down to a representative two.

UMAP is one of a class of dimensionality reduction algorithms. A dataset with N numerical columns can be thought of as a set of points residing in N-dimensional space. These algorithms use a variety of techniques to project these many dimensions onto lower-dimensional spaces whilst preserving, as far as possible, the distances between points, allowing effective visualisation of the data.

We use UMAP by default since it tends to give useful results for most data, whilst delivering those results in a reasonable amount of time and with reasonable computational resources.
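As a sketch, this step looks roughly like the following with the open-source umap-learn package (the parameters shown are the library’s defaults, not necessarily the ones our platform uses):

```python
import numpy as np
import umap  # pip install umap-learn

# Hypothetical high-dimensional data: one row per item, one
# column per numerical feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))

# Project the 40 columns down to a representative two.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
xy = reducer.fit_transform(X)  # shape (500, 2): the map coordinates
```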

Turning images to numbers 

These algorithms work well for transforming numerical data, but how do we cluster unstructured data such as images? To position images according to similarity, we first need to convert them into a numerical representation. A simple method is to use raw pixel values, perhaps just the brightness values from a low-resolution copy. This can sometimes be effective where the size, shape, and layout vary only a little between images, but gives disappointing results where images vary more widely.
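A minimal sketch of this naive pixel-based representation, using Pillow and NumPy (the resolution here is an arbitrary example):

```python
import numpy as np
from PIL import Image

def pixel_vector(path, size=(32, 32)):
    """Naive numeric representation: brightness values from a
    low-resolution greyscale copy of the image."""
    img = Image.open(path).convert("L").resize(size)
    return np.asarray(img, dtype=np.float32).ravel() / 255.0
```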

Much more useful numbers can be obtained by using a neural network. Typically, a neural network is used to ‘classify’ an image into one of a number of types it has been trained to recognise. However, internally the network first runs its own pipeline, examining the image’s pixels and extracting a set of 1000+ numbers which describe it. While these numbers are largely opaque to a human, they carry a rich amount of abstract, qualitative information about the contents of the image. For example, a network trained to recognise animals might have a number that loosely represents ‘horsiness’, and another which reflects ‘stripiness’. An item with a high value for both could end up classified as a zebra.

Pulling out these numerical ‘features’ just before the classification step gives us an excellent high-dimensional numerical representation of each image, which we can then reduce to x and y coordinates.

We selected the ResNet-50 network as our default, since it gives reasonable general-purpose results for a range of different input image types, from photographs to microscopy slides to medical imaging results. Future releases of Zegami will offer some choice over the model used, and even support custom models for feature extraction.
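As an illustration of the general technique, features like these can be pulled from a pretrained ResNet-50 with torchvision by replacing the final classification layer; the weights and preprocessing shown are torchvision’s standard ImageNet setup, not necessarily what our platform uses:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained ResNet-50 and drop the final classification
# layer, so the network outputs its 2048-dimensional features.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

# Standard ImageNet preprocessing for torchvision models.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def image_features(path):
    """Return a 2048-dimensional feature vector for one image."""
    batch = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(batch).squeeze(0).numpy()
```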

Zegami’s cloud processing platform automatically begins extracting features from images as soon as they are uploaded and then clusters the results when all the images have been handled. New images can be added to an existing collection at any time, and these will be processed on arrival and added into the clustering results. Likewise, updates to the data will automatically retrigger the data clustering algorithm. 

All you need to do is provide your images and, optionally, some metadata, and the results will appear once the processing is complete.