Tuesday 29 September 2015

Machine Learning : Training & Test Sets



In my last blog, I touched upon the importance of defining high quality training and test data when deploying a Machine Learning solution.

In this blog, as promised, I dig deeper into this subject.  Before I start, I must point out that the views and opinions expressed here are entirely my own and do not necessarily represent IBM’s positions, strategies or opinions.


Getting the data right is crucial in any machine learning solution and today I’d like to communicate three key messages:

  1. Understand the difference between academic testing and real world testing.
  2. Ensure you have representative data.
  3. Look at the data … take some time to read through it … you’d be surprised how revealing it can be.

So what do I mean by understanding the difference between academic testing and real world testing?

In the academic world, researchers want to measure how effective different algorithms are.  This is normally done using a training set, a test set and a blind test set.  Basically we take a set of ground truth data comprising example input data together with the outputs we would expect the solution to generate.  We split that data into three and then use the first third to train the machine learning solution.  Machine learning algorithms can normally be tuned in some way, so we use the second third as a Test Set.  This allows us to adjust the various training parameters, test and re-train until we are satisfied that we have the optimum configuration.  Finally, once we’re happy that we have the optimum solution, the Blind Test set is used to formally evaluate the solution's performance.  The Blind Test data has been kept completely isolated from the other data sets, so there is no chance of it contaminating the training and tuning process.
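
As a rough illustration of the three-way split described above, here is a minimal Python sketch.  The function name `three_way_split`, the fixed random seed and the placeholder question/answer pairs are my own assumptions for the example; real projects may well use different proportions or a library routine.

```python
import random


def three_way_split(examples, seed=0):
    """Shuffle the ground truth and cut it into three roughly equal parts."""
    shuffled = list(examples)
    # A fixed seed keeps the split repeatable between runs (an assumption of this sketch).
    random.Random(seed).shuffle(shuffled)
    third = len(shuffled) // 3
    train = shuffled[:third]            # used to train the solution
    test = shuffled[third:2 * third]    # used to tune parameters and re-train
    blind = shuffled[2 * third:]        # held back for the final evaluation only
    return train, test, blind


# Placeholder ground truth: (input, expected output) pairs.
ground_truth = [(f"question {i}", f"answer {i}") for i in range(9)]
train, test, blind = three_way_split(ground_truth)
print(len(train), len(test), len(blind))
```

The important property is that the three sets never overlap, so nothing in the blind set has influenced training or tuning.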

This method is ideal for an academic evaluation; however, in practical applications of Machine Learning there are other considerations.  Imagine you are developing a Question Answering solution for a bank.  What is most important to you?  Deploying the most effective machine learning solution, or deploying a solution that will always generate the correct answers?  The two are not necessarily the same thing.  Often we start projects with very little real world data and, by splitting that data into thirds, we immediately reduce the amount and quality of training data available.  Alternatively, if we simply use all the available data as training data, then we have no way of testing how the system will behave against previously unseen data.  The counter-counter-argument is that if our available data is really that limited, then even breaking out a blind test set does not give us confidence that the tool will work against previously unseen data.

Unless we are working in a perfect environment where we know there is a huge set of statistically significant representative data, I prefer a bootstrapping approach.  I like to build systems where the customer knows and understands that the system will work for data that is in the training set.  If we encounter previously unseen data, then we add it to the training data and continue.  In practical terms this means adopting a process along the following lines (for a QA system):

  1. Collect as many questions as possible … ideally from a live system.
  2. Train the machine learning solution using all data.
  3. Test the solution automatically using the same data and ensure it is generating the answers you expect.  Note: don’t assume that it will have learned to answer all questions correctly, as very few machine learning technologies do.
  4. Test with actual Users – real Users tend to misspell terms or enter different variants of questions.
  5. Identify any previously unseen questions, i.e. questions that did not exist in your ground truth, and add them to your ground truth.
  6. Re-train the solution and keep iterating until you are satisfied with the accuracy of the solution.
  7. Deploy to production and keep monitoring to ensure you pick up any previously unseen questions.
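
To make the loop in steps 2 to 6 concrete, here is a small Python sketch.  It is only an illustration: the "model" is a toy lookup table standing in for a real machine learning solution, and the names `normalise`, `train`, `answer`, `self_test` and `unseen`, together with the example questions and answer labels, are all hypothetical.

```python
def normalise(question):
    """Lower-case and collapse whitespace so trivial variants match."""
    return " ".join(question.lower().split())


def train(ground_truth):
    """Toy stand-in for training: memorise normalised question -> answer."""
    return {normalise(q): a for q, a in ground_truth.items()}


def answer(model, question):
    """Return the model's answer, or None if it has nothing to offer."""
    return model.get(normalise(question))


def self_test(model, ground_truth):
    """Step 3: list the training questions the model still gets wrong."""
    return [q for q, a in ground_truth.items() if answer(model, q) != a]


def unseen(model, user_questions):
    """Step 5: list the user questions the model cannot answer."""
    return [q for q in user_questions if answer(model, q) is None]


# Steps 1 and 2: collect questions and train on all of them.
ground_truth = {
    "What is my balance?": "balance_info",
    "How do I close my account?": "close_account",
}
model = train(ground_truth)

# Step 3: test against the same data; do not assume this list is empty.
failures = self_test(model, ground_truth)

# Steps 4 and 5: real users phrase things differently and ask new things.
user_questions = ["what is my  balance?", "How do I order a cheque book?"]
new_questions = unseen(model, user_questions)

# Step 6: a person supplies the expected answer, then we re-train.
for q in new_questions:
    ground_truth[q] = "order_cheque_book"
model = train(ground_truth)
```

With a real learning algorithm the self-test in step 3 would usually report some failures, which is exactly why it is worth running even though the data has already been seen.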

A key element of the process I outlined above is ensuring you have representative data.  This is vitally important in any machine learning application.  If the system has not been trained with representative data, you cannot expect it to perform well.  Gathering representative data is often challenging; how do you collect data for an operational solution before deploying that operational solution?  There are approaches you could consider.  My preferred approach is to start small with a set of data developed internally.  Note that this data will not be representative, as you have developed it yourself.  However, you can use that data to build a prototype that you then test with your employees and business partners.  They will enter data that is more realistic, though still not fully representative, and that will allow you to improve your prototype before field testing with end Users.  At that stage you will need to position the technology appropriately and ensure the Users understand that they are part of the development process.  Finally, once you are satisfied, you can deploy to a production environment ... but keep monitoring!

When working with your training data, it’s really important to take the time to look at the actual data.  I personally like to read through data in its most raw form.  Often you will get summary reports saying that a solution is only 70% accurate or that certain groups of Users are unhappy.  Look at the data!  See exactly what is being asked of the solution and that will help you to understand the real issues.  You should expect to see ambiguous questions and inputs that would be difficult for a human being to interpret.  That doesn’t mean that you should accept the inaccuracy in the system – just that you may need to work on the User Interface or the processes for handling ambiguity or some other aspect of the solution.  You can only make wise decisions if you really understand the data so don’t be seduced by summary performance reports.
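
As a small illustration of the difference between a summary figure and the raw data, the Python sketch below computes an accuracy score and then lists the individual failures behind it.  The records are invented purely for the example, and the field names `question`, `expected` and `actual` are assumptions rather than the format of any particular tool.

```python
def accuracy(results):
    """Fraction of records where the actual answer matched the expected one."""
    correct = sum(1 for r in results if r["expected"] == r["actual"])
    return correct / len(results)


def failures(results):
    """The raw records behind the headline number."""
    return [r for r in results if r["expected"] != r["actual"]]


# Invented example records, not real test results.
results = [
    {"question": "What is my balance?", "expected": "balance_info", "actual": "balance_info"},
    {"question": "card", "expected": "lost_card", "actual": "order_card"},
    {"question": "How do I close my acount?", "expected": "close_account", "actual": "close_account"},
    {"question": "transfer", "expected": "make_transfer", "actual": "balance_info"},
]

print(f"accuracy: {accuracy(results):.0%}")
for r in failures(results):
    print(f"{r['question']!r}: expected {r['expected']}, got {r['actual']}")
```

In this made-up case the failures are both one-word inputs that a person would also struggle to interpret, which points at the User Interface rather than the learning algorithm.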

In my next blog I will talk more about representative training data and how that data is actually used by machine learning algorithms.
