Different ways to tackle the data labelling bottleneck in machine learning

Data is the fuel of machine learning training. More and more of it is produced every day, but most of it is unlabelled. Labelling it manually is expensive and tedious.

There are different ways to tackle this problem.

Active learning

Active learning optimises labelling. Instead of labelling everything, it selects the data points that are most worth labelling.

The system requests manual labelling for the cases it selects. Which cases those are depends on the strategy you choose. I will cover only two of these strategies.

Uncertainty sampling

When you use a model, some of its predictions are more certain than others. The idea is to identify the least certain ones and label them manually.
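As a rough illustration, here is a minimal sketch of one common form of uncertainty sampling, often called least-confidence sampling. It assumes the model outputs one probability per class for each unlabelled sample; the function name `least_confident` and the sample probabilities are hypothetical.

```python
def least_confident(probabilities, k):
    """Return the indices of the k samples whose top class probability is lowest."""
    # The confidence of a prediction is the probability of its most likely class.
    confidence = [max(p) for p in probabilities]
    # Rank sample indices from least to most confident and keep the first k.
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]


# Hypothetical class probabilities predicted for four unlabelled samples.
predicted = [
    [0.98, 0.02],  # very certain
    [0.55, 0.45],  # uncertain
    [0.80, 0.20],
    [0.51, 0.49],  # most uncertain
]

to_label = least_confident(predicted, k=2)
print(to_label)  # [3, 1]
```

The two samples returned are the ones you would send to a human annotator first.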

Semi-supervised learning

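This method looks at how certain the model's predictions are too, but it targets the most certain examples. It treats the model's predictions on them as labels and uses them as labelled data.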
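One simple way to do this is sometimes called pseudo-labelling. The sketch below assumes the model outputs class probabilities and that a prediction counts as "certain" above a threshold of 0.9; both the threshold and the function name `pseudo_label` are illustrative choices, not fixed rules.

```python
def pseudo_label(probabilities, threshold=0.9):
    """Return (sample index, predicted class) pairs for confident predictions."""
    selected = []
    for i, p in enumerate(probabilities):
        # The predicted class is the one with the highest probability.
        best_class = max(range(len(p)), key=lambda c: p[c])
        # Keep the prediction as a label only if the model is confident enough.
        if p[best_class] >= threshold:
            selected.append((i, best_class))
    return selected


# Hypothetical class probabilities predicted for four unlabelled samples.
predicted = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.05, 0.95],
    [0.51, 0.49],
]

new_labels = pseudo_label(predicted)
print(new_labels)  # [(0, 0), (2, 1)]
```

The selected samples can then be added to the training set with their predicted class, and the model retrained. A wrong but confident prediction would be learned as if it were true, so the threshold deserves some care.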


Links to go further:

Reinforcement learning

In reinforcement learning, the model learns from experience.

You have a target. An agent tries to reach it by performing actions. Its successes and failures train your model in the right direction.
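To make this concrete, here is a minimal sketch using tabular Q-learning, one classic reinforcement learning algorithm. The environment is an assumption made up for the example: a corridor of five cells where the agent starts in cell 0 and the target is cell 4. The constants (`ALPHA`, `GAMMA`, `EPSILON`) are illustrative values.

```python
import random

N_STATES = 5        # corridor cells 0..4; the target is cell 4
ACTIONS = (-1, +1)  # action 0 moves left, action 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3  # learning rate, discount, exploration


def choose_action(q_row, rng):
    # Explore at random some of the time, and whenever both actions look equal.
    if rng.random() < EPSILON or q_row[0] == q_row[1]:
        return rng.randrange(2)
    return 0 if q_row[0] > q_row[1] else 1


def train(episodes=300, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = choose_action(q[state], rng)
            # Walls at both ends keep the agent inside the corridor.
            next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
            # The only reward is given for reaching the target.
            reward = 1.0 if next_state == N_STATES - 1 else 0.0
            # Q-learning update: move the estimate towards reward + discounted future value.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q_table = train()
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(policy)  # the learned action for cells 0 to 3
```

No labelled data is involved: the reward signal collected while acting is what steers the value estimates, and after training the agent prefers moving right in every cell.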

One day, someone told me:

Reinforcement learning is supervised learning with data. It just gets this data over time. You can use reinforcement learning when you interact with an environment and have limited data.

A tutorial to get started with reinforcement learning:

Data augmentation

From a single data point, you can derive several others.

Take a cat picture. Flip it, crop it, scale it and rotate it. From one labelled picture, you now have four new labelled pictures.
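The four operations can be sketched in plain Python on a tiny "image" stored as a list of pixel rows. This is only an illustration under that assumption: real pipelines use an image library, and the helper names below are hypothetical.

```python
def flip_horizontal(image):
    # Mirror each row from left to right.
    return [row[::-1] for row in image]


def rotate_90(image):
    # Rotate clockwise: reverse the row order, then transpose.
    return [list(row) for row in zip(*image[::-1])]


def crop(image, top, left, height, width):
    # Keep a height x width window starting at (top, left).
    return [row[left:left + width] for row in image[top:top + height]]


def scale_2x(image):
    # Nearest-neighbour upscaling: repeat every pixel twice in both directions.
    scaled = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(2)]
        scaled.append(wide)
        scaled.append(list(wide))
    return scaled


def augment(image, label):
    """Return four new (image, label) pairs derived from one labelled image."""
    variants = [
        flip_horizontal(image),
        crop(image, 0, 0, 2, 2),
        scale_2x(image),
        rotate_90(image),
    ]
    # Every variant inherits the label of the original picture.
    return [(variant, label) for variant in variants]


# A 3x3 stand-in for the cat picture.
cat = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

augmented = augment(cat, "cat")
print(len(augmented))  # 4
```

The label is simply copied, which only works when the transformation does not change what the picture shows.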

Beautiful cat

A reading about data augmentation:

A small library for image augmentation:

Unsupervised learning

Labelling is a requirement of supervised learning. To avoid it, you can sometimes use unsupervised learning instead. Whether this is possible depends on the problem you have.

Suppose you want to detect anomalies in a dataset. With a supervised approach, you would label some examples as normal or anomalous and then train a model on them.

Instead, you could cluster similar things together. Groups that differ strongly from the others could be anomalies.
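A minimal sketch of that idea, using a small hand-written K-Means on 2D points. Several things here are assumptions made for the example: the data, the deterministic farthest-point initialisation, and the rule that a cluster holding less than 30% of the points is treated as a candidate anomaly group.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters and return the clusters as lists of points."""
    # Deterministic initialisation: start from the first point, then repeatedly
    # add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters


def small_clusters(clusters, max_share=0.3):
    """Return the points that belong to clusters holding less than max_share of the data."""
    total = sum(len(cluster) for cluster in clusters)
    return [p for cluster in clusters if len(cluster) / total < max_share for p in cluster]


# Hypothetical measurements: six similar points and two that sit far away.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8), (1.0, 1.2),
    (8.0, 8.5), (8.3, 8.1),
]

anomalies = small_clusters(k_means(points, k=2))
print(anomalies)  # [(8.0, 8.5), (8.3, 8.1)]
```

No point was labelled by hand. The price is that "small and far away" is only a proxy for "anomalous", so the flagged points still need a look.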

It is a simple example. However, it is possible to imagine more complex solutions.

Examples of clustering with K-Means and K-Medoids: Clusterings in machine learning — K-Means and K-Medoids examples


Labelling is a bottleneck in machine learning. It is expensive and tedious.

You can keep the costs of labelling down with automatic solutions. I wanted to highlight some methods to do that.