Friday, 2 December 2016

Machine Learning & Representative Training Data

In this blog, I’d like to touch upon the subject of ‘Representative Training Data’ … as usual, I must issue the standard health warning that the views expressed are my own and not those of my employer.

With so much interest in Machine Learning, I am sure many of you will have heard about the importance of ‘Representative Training Data’.  The term is pretty self-explanatory: if you want to train a machine learning system you obviously need training data … and it would be a good idea if that data looked like the data the system is going to encounter in the real world.

So the principle is simple … or is it?

In real-world applications there are a few additional points you may wish to consider.

Firstly, it’s important to understand how the machine learning algorithm will use the data.  Most algorithms aim to achieve the optimum performance across the entire data set.  Consider a fraud detection system that looks at online spending profiles and decides whether or not each profile is fraudulent.  What happens if only 1 in 100 profiles is fraudulent?  If the learning algorithm minimises the error across the entire data set, the system can achieve 99% accuracy simply by declaring that every profile is “NOT FRAUDULENT”, which means it never catches a single fraud.  Yet if the data has to be representative of real-world observations, then surely it would be wrong to bias the training set by adding in more examples of fraudulent profiles?  In other words, the theory says that the training data must be representative, but in practice a representative data set can produce a system that ignores the very cases you care about.
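The arithmetic behind that 99% figure can be checked with a few lines of Python. This is only a sketch: the 10,000 profiles are made-up data chosen to match the 1-in-100 ratio, and the names `accuracy` and `fraud_recall` are my own helpers rather than anything from a particular library.

```python
# Illustrative data only: 10,000 profiles, 1 in 100 fraudulent
# (1 = fraudulent, 0 = not fraudulent).
labels = [1] * 100 + [0] * 9_900

# A "classifier" that declares every profile NOT FRAUDULENT.
always_not_fraud = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of profiles classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fraud_recall(y_true, y_pred):
    """Fraction of the truly fraudulent profiles that were flagged as fraud."""
    flagged = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(flagged) / len(flagged) if flagged else 0.0


baseline_accuracy = accuracy(labels, always_not_fraud)
baseline_recall = fraud_recall(labels, always_not_fraud)

print(f"accuracy:     {baseline_accuracy:.2%}")  # 99.00%
print(f"fraud recall: {baseline_recall:.2%}")    # 0.00%
```

The accuracy looks excellent while the recall on the fraudulent class is zero, which is why overall accuracy on its own is a poor guide when one class is rare.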

There are two common ways of dealing with this problem.  The first is to artificially bias the training data, for example by adding more fraudulent examples.  The second is to use a machine learning algorithm whose cost function penalises mistakes on the fraudulent cases more heavily than mistakes on the genuine ones, and therefore places a higher emphasis on accurately detecting fraud.

My personal preference would always be to look at the cost function, as biasing training data can often descend into a tail chase.  You find that increasing the number of data points in one class degrades the performance against another class … so you increase the number of data points in the other class … and so on.
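To make the cost function idea concrete, here is a minimal sketch that scores two sets of predictions on the same 1-in-100 data. Everything in it is an assumption for illustration: the helper `total_cost`, the imaginary detector that catches 80 of the 100 frauds at the price of 200 false alarms, and the choice to make a missed fraud 50 times as costly as a false alarm. In a real system the relative costs would come from the business, and the same weighting would be applied inside the training objective rather than just at evaluation time.

```python
def total_cost(y_true, y_pred, fn_cost=1.0, fp_cost=1.0):
    """Sum the cost of every error; correct predictions cost nothing.

    A false negative is a fraudulent profile (1) predicted as not fraudulent (0).
    A false positive is a genuine profile (0) predicted as fraudulent (1).
    """
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            cost += fn_cost
        elif truth == 0 and pred == 1:
            cost += fp_cost
    return cost


# Illustrative data only: 100 fraudulent profiles followed by 9,900 genuine ones.
labels = [1] * 100 + [0] * 9_900

# Baseline: declare every profile NOT FRAUDULENT.
always_not_fraud = [0] * len(labels)

# Hypothetical detector: catches 80 frauds, misses 20, raises 200 false alarms.
detector = [1] * 80 + [0] * 20 + [1] * 200 + [0] * 9_700

# With equal costs (plain error counting) the do-nothing baseline looks better.
plain_baseline = total_cost(labels, always_not_fraud)  # 100 errors
plain_detector = total_cost(labels, detector)          # 20 + 200 = 220 errors

# With a missed fraud costing 50x a false alarm, the ranking reverses.
weighted_baseline = total_cost(labels, always_not_fraud, fn_cost=50.0)  # 100 * 50
weighted_detector = total_cost(labels, detector, fn_cost=50.0)          # 20 * 50 + 200

print(plain_baseline, plain_detector)
print(weighted_baseline, weighted_detector)
```

The training data stays representative in both cases; only the penalty attached to each kind of mistake changes, which is what removes the incentive to declare everything “NOT FRAUDULENT”.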


Secondly, it’s very important to ensure that your training data remains representative once the system has gone live.  It is not unusual for the data to change whilst the system is in production.  For example, I once delivered an entity extraction system that was designed to work on English-language documents only.  The client loaded in a document set that contained a number of foreign-language documents and suddenly the system was generating all sorts of strange entities.  Fortunately, we had designed the system to monitor various statistics about both the input and the output … when erroneous entities started being generated, the statistics moved well outside their normal range and alerted us to the problem.  It is critical that any system monitors its source content to ensure it remains consistent with the representative data used in training.
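As a rough illustration of this kind of input monitoring, the sketch below flags documents that look unlike the training set. It is not the system described above: the statistic (the share of words that are common English function words), the tiny word list, the `tolerance` parameter and the function names are all assumptions picked to keep the example short.

```python
import re

# Hypothetical, deliberately tiny list of common English function words.
ENGLISH_STOPWORDS = frozenset({
    "the", "a", "an", "and", "of", "to", "in", "is",
    "was", "that", "it", "for", "on", "with", "as",
})


def stopword_rate(text):
    """Fraction of word tokens in `text` that are common English function words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not tokens:
        return 0.0
    return sum(token in ENGLISH_STOPWORDS for token in tokens) / len(tokens)


def drift_alerts(training_docs, live_docs, tolerance=0.5):
    """Return the live documents whose stopword rate falls below
    `tolerance` times the average rate seen in the training documents."""
    baseline = sum(stopword_rate(doc) for doc in training_docs) / len(training_docs)
    threshold = baseline * tolerance
    return [doc for doc in live_docs if stopword_rate(doc) < threshold]


training_docs = [
    "The report was sent to the client in the morning.",
    "It is a summary of the findings for the board.",
]
live_docs = [
    "The invoice was paid on the same day.",
    "Der Bericht wurde am Morgen an den Kunden geschickt.",
]

# Only the German sentence should be flagged as unlike the training data.
for doc in drift_alerts(training_docs, live_docs):
    print("ALERT, input looks unlike the training data:", doc)
```

A production system would track several statistics on both input and output and compare them over a window of documents rather than one at a time, but the principle is the same: record what normal looked like during training and raise an alert when live data stops looking like it.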