In my last blog post, I touched upon the importance of defining
high quality training and test data when deploying a Machine Learning solution,
and I promised to dig deeper into the subject. This post does that. Before I start, I must point
out that the views and opinions expressed here are entirely my own and do not
necessarily represent IBM’s positions, strategies or opinions.
Getting the data right is crucial in any machine learning solution and today I’d like to communicate three key messages:
- Understand the difference between academic
testing and real world testing.
- Ensure you have representative data.
- Look at the data … take some time to read
through it … you’d be surprised how revealing it can be.
So what do I mean by understanding
the difference between academic testing and real world testing?
In the academic world, researchers want to measure how effective different algorithms are. This is normally done using a training set, a test set and a blind test set. Basically, we take a set of ground truth data comprising example inputs together with the outputs we would expect the solution to generate. We split that data into three and use the first third to train the machine learning model. Machine learning algorithms can normally be tuned in some way, so we use the second third as a test set (often called a validation set elsewhere). This allows us to adjust the various training parameters, test and re-train until we have the best configuration we can find. Finally, once we’re happy with that configuration, the blind test set is used to formally evaluate the solution's performance. The blind test data has been kept completely isolated from the other data sets, so there is no chance of contamination in the process.
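As a rough illustration of the three-way split described above, here is a minimal Python sketch. The function name `three_way_split`, the fixed seed and the placeholder question/answer pairs are all assumptions made for the example, not part of any particular toolkit.

```python
import random


def three_way_split(examples, seed=0):
    """Shuffle ground-truth examples and split them into three parts.

    Returns (train, test, blind). The blind part is only used once,
    for the final evaluation, and never during training or tuning.
    """
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    third = len(shuffled) // 3
    train = shuffled[:third]
    test = shuffled[third:2 * third]
    blind = shuffled[2 * third:]           # takes any remainder as well
    return train, test, blind


# Placeholder ground truth: (input, expected output) pairs.
ground_truth = [(f"question {i}", f"answer {i}") for i in range(9)]

train, test, blind = three_way_split(ground_truth)
print(len(train), len(test), len(blind))
```

Shuffling before slicing matters when the ground truth was collected in some order (by date or by topic, for example), otherwise each part would cover a different slice of the problem.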
This method is ideal for an academic evaluation. However, in practical applications of Machine Learning there are other considerations. Imagine you are developing a Question Answering solution for a bank. What is most important to you? Deploying the most effective machine learning solution, or deploying a solution that will always generate the correct answers? The two are not necessarily the same thing. Often we start projects with very little real world data and, by splitting that data into thirds, we immediately reduce the amount and quality of training data available. Alternatively, if we simply use all the available data as training data, we have no way of testing how the system will behave against previously unseen data. Then again, if our available data is really that limited, even breaking out a blind test set does not give us confidence that the solution will work against previously unseen data.
Unless we are working in a perfect environment where we know there is a huge, statistically significant set of representative data, I prefer a bootstrapping approach. I like to build systems where the customer knows and understands that the system will work for data that is in the training set. If we encounter previously unseen data, then we add it to the training data and continue. In practical terms this means adopting a process along the
following lines (for a QA system):
- Collect as many questions as possible … ideally
from a live system.
- Train the machine learning solution using all
data.
- Test the solution automatically using the same data
and ensure it is generating the answers you expect. Note: don’t assume that it will have learned
to answer all questions correctly, as very few machine learning technologies do.
- Test with actual Users – real Users tend to
misspell terms or enter different variants of questions.
- Identify any previously unseen questions, i.e.
questions that did not exist in your ground truth, and add them to your ground
truth.
- Re-train the solution and keep iterating until
you are satisfied with the accuracy of the solution.
- Deploy to production and keep monitoring to
ensure you pick up any previously unseen questions.
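The steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the "model" is a plain lookup table standing in for a real machine learning solution, and the function names and the banking questions and answers are invented for the example.

```python
def normalise(question):
    """Lower-case and collapse whitespace so trivial variants match."""
    return " ".join(question.lower().split())


def train(ground_truth):
    """Stand-in for training: build a lookup from question to answer.

    A real solution would fit a machine learning model here instead.
    """
    return {normalise(q): a for q, a in ground_truth.items()}


def answer(model, question):
    """Return the model's answer, or None if the question is unknown."""
    return model.get(normalise(question))


def self_test(model, ground_truth):
    """Questions from the training data that the model still gets wrong."""
    return [q for q, a in ground_truth.items() if answer(model, q) != a]


def find_unseen(model, user_questions):
    """Questions from real users that the model cannot answer."""
    return [q for q in user_questions if answer(model, q) is None]


# Hypothetical ground truth collected so far.
ground_truth = {
    "How do I reset my PIN?": "Use the mobile app.",
    "What are your opening hours?": "9am to 5pm on weekdays.",
}

model = train(ground_truth)                 # train using all the data
failures = self_test(model, ground_truth)   # test against the same data

# Hypothetical questions gathered from testing with actual users.
user_questions = ["how do i reset my pin?", "Can I raise my overdraft limit?"]
unseen = find_unseen(model, user_questions)

# A person reviews each unseen question and supplies the expected answer,
# which is then added to the ground truth before re-training.
ground_truth["Can I raise my overdraft limit?"] = "Contact your branch."
model = train(ground_truth)

print("Training failures:", failures)
print("Unseen questions:", unseen)
```

With a lookup table the self-test trivially passes. With a real model it often will not, which is exactly why that step is in the process.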
A key element of the process I outlined above
is ensuring you have representative data. This is vitally important in any machine
learning application. If the system has
not been trained with representative data you cannot expect it to perform
well. Gathering representative data is
often challenging; how do you collect data for an operational solution before
deploying that operational solution?
There are approaches you could consider. My preferred approach is to start small with a set of data developed internally. Note that this data will not be representative, as you have developed it yourself. However, you can use it to build a prototype that you then test with your employees and business partners. They will enter data that is more realistic, though still not fully representative, and that will allow you to improve your prototype before field testing with end Users. At that stage you will need to position the technology appropriately and ensure the Users understand that they are part of the development process. Finally, once you are satisfied, you can deploy to a production environment ... but keep monitoring!
When working with your training data, it’s really
important to take the time to look at the actual data.
I personally like to read through data in its most raw form. Often you will get summary reports saying that a
solution is only 70% accurate or that certain groups of Users are unhappy. Look at the data! See exactly what is being asked of the
solution and that will help you to understand the real issues. You should expect to see ambiguous questions
and inputs that would be difficult for a human being to interpret. That doesn’t mean that you should accept the
inaccuracy in the system – just that you may need to work on the User Interface
or the processes for handling ambiguity or some other aspect of the
solution. You can only make wise decisions
if you really understand the data so don’t be seduced by summary performance
reports.
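To make the point concrete, here is a small Python sketch that prints the raw failing inputs alongside the summary figure. The records, field names and answer labels are made up for the illustration, and a real test log will look different.

```python
# Hypothetical test log: one record per question put to the solution.
results = [
    {"question": "How do I reset my PIN?", "expected": "pin_reset", "actual": "pin_reset"},
    {"question": "card", "expected": "lost_card", "actual": "new_card"},
    {"question": "What is my balence?", "expected": "balance", "actual": "balance"},
    {"question": "Can I change it?", "expected": "pin_reset", "actual": "address_change"},
]


def accuracy(records):
    """Fraction of records where the actual answer matched the expected one."""
    correct = sum(1 for r in records if r["expected"] == r["actual"])
    return correct / len(records)


def failed_inputs(records):
    """The raw questions behind the failures, in their original form."""
    return [r["question"] for r in records if r["expected"] != r["actual"]]


print(f"Summary accuracy: {accuracy(results):.0%}")
for question in failed_inputs(results):
    print("Read this one:", repr(question))
```

In this invented log the failures are a one-word query and a question with no context, inputs that a human being would also struggle to interpret. The summary figure alone would not tell you that.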
In my next blog I will talk more about representative training data and how that data is actually used by machine learning algorithms.