Tuesday, 29 September 2015

Machine Learning : Training & Test Sets



In my last blog, I touched upon the importance of defining high quality training and test data when deploying a Machine Learning solution.

In this blog, I promised to dig deeper into this subject.  Before I start, I must point out that the views and opinions expressed here are entirely my own and do not necessarily represent IBM’s positions, strategies or opinions.


Getting the data right is crucial in any machine learning solution and today I’d like to communicate three key messages:

  1. Understand the difference between academic testing and real world testing.
  2. Ensure you have representative data.
  3. Look at the data … take some time to read through it … you’d be surprised how revealing it can be.

So what do I mean by understanding the difference between academic testing and real world testing?

In the academic world, researchers want to measure how effective different algorithms are.  This is normally done using a training set, a test set and a blind test set.  Basically we take a set of ground truth data comprising example input data together with the outputs we would expect the solution to generate.  We split that data into three and then use a third of the data to train the machine learning.  Machine learning algorithms can normally be tuned in some way so we use the second set of data as a Test Set.  This allows us to adjust the various training parameters, test and re-train to ensure that we have the optimum configuration.  Finally, once we’re happy that we have the optimum solution, the Blind Test set is used to formally evaluate the solution's performance.  The Blind Test data has been kept completely isolated from the other data sets so there is no chance of contamination in the process.
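
As a rough illustration of the three-way split described above, here is a minimal Python sketch.  The function name `three_way_split`, the fixed seed and the toy question/answer pairs are all assumptions made for this example; they do not come from any particular toolkit.

```python
import random


def three_way_split(examples, seed=0):
    """Shuffle ground-truth examples and split them into three equal-ish parts.

    Returns (training set, test set, blind test set).
    """
    shuffled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    third = len(shuffled) // 3
    train = shuffled[:third]               # used to train the solution
    test = shuffled[third:2 * third]       # used to tune and re-train
    blind = shuffled[2 * third:]           # kept isolated for the final evaluation
    return train, test, blind


# Toy ground truth: example inputs paired with the outputs we expect.
ground_truth = [(f"question {i}", f"answer {i}") for i in range(9)]
train, test, blind = three_way_split(ground_truth)
```

The important property is that no example appears in more than one of the three sets, so the blind test set cannot leak into training or tuning.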

This method is ideal for an academic evaluation; however, in practical applications of Machine Learning there are other considerations.  Imagine you are developing a Question Answering solution for a bank.  What is most important to you?  Deploying the most effective machine learning solution or deploying a solution that will always generate the correct answers?  The two are not necessarily the same thing.  Often we start projects with very little real world data and, by splitting that data into thirds, we immediately reduce the amount and quality of training data available.  Alternatively, if we simply use all the available data as training data, then we have no way of testing how the system will behave against previously unseen data.  The counter-argument to that is that if our available data is really that limited, then even breaking out a blind test set does not give us confidence that the solution will work against previously unseen data.

Unless we are working in a perfect environment where we know there is a huge set of statistically significant, representative data, I prefer a bootstrapping approach.  I like to build systems where the customer knows and understands that the system will work for data that is in the training set.  If we encounter previously unseen data, then we add it to the training data and continue.  In practical terms this means adopting a process along the following lines (for a QA system):

  1. Collect as many questions as possible … ideally from a live system.
  2. Train the machine learning solution using all data.
  3. Test the solution automatically using the same data and ensure it is generating the answers you expect.  Note: don’t assume that it will have learned to answer all questions correctly, as very few machine learning technologies do.
  4. Test with actual Users – real Users tend to misspell terms or enter different variants of questions.
  5. Identify any questions, i.e. previously unseen questions, that did not exist in your ground truth and add them to your ground truth.
  6. Re-train the solution and keep iterating until you are satisfied with the accuracy of the solution.
  7. Deploy to production and keep monitoring to ensure you pick up any previously unseen questions.
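
To make the loop above a little more concrete, here is a minimal Python sketch of steps 2 to 6.  Everything in it is hypothetical: `LookupModel` is a trivial stand-in for a real machine learning component (it simply memorises normalised questions), and the questions and answers are invented for illustration.

```python
def normalise(question):
    """Lower-case a question and collapse repeated whitespace."""
    return " ".join(question.lower().split())


class LookupModel:
    """Hypothetical stand-in for a trained QA model."""

    def __init__(self):
        self.answers = {}

    def train(self, ground_truth):
        self.answers = {normalise(q): a for q, a in ground_truth}

    def answer(self, question):
        # Returns None when the question has never been seen.
        return self.answers.get(normalise(question))


def self_test(model, ground_truth):
    """Step 3: list the training questions the model still gets wrong."""
    return [(q, a) for q, a in ground_truth if model.answer(q) != a]


def find_unseen(model, user_questions):
    """Step 5: list the user questions the model cannot answer."""
    return [q for q in user_questions if model.answer(q) is None]


ground_truth = [
    ("What is my balance?", "See the Accounts page."),
    ("How do I reset my PIN?", "Use the Security menu."),
]
model = LookupModel()
model.train(ground_truth)                    # step 2: train on all data
failures = self_test(model, ground_truth)    # step 3: test on the same data

user_questions = ["what is my  balance?", "How do I close my account?"]
unseen = find_unseen(model, user_questions)  # steps 4 and 5

# A person supplies the expected answer before the question joins the ground truth.
ground_truth.append((unseen[0], "Visit a branch or call us."))
model.train(ground_truth)                    # step 6: re-train and iterate
```

With a real machine learning model the self-test in step 3 would not normally come back empty, which is exactly why that step is worth automating.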

A key element of the process I outlined above is ensuring you have representative data.  This is vitally important in any machine learning application.  If the system has not been trained with representative data you cannot expect it to perform well.  Gathering representative data is often challenging; how do you collect data for an operational solution before deploying that operational solution?  There are approaches you could consider.  My preferred approach is to start small with a set of data developed internally.  Note that this data will not be representative, as you have developed it yourself.  However, you can use that data to build a prototype that you then test with your employees and business partners.  They will enter more realistic, though still not fully representative, data that will allow you to improve your prototype before field testing with end Users.  At that stage you will need to position the technology appropriately and ensure the Users understand that they are part of the development process.  Finally, once you are satisfied you can deploy to a production environment ... but keep monitoring!

When working with your training data, it’s really important to take the time to look at the actual data.  I personally like to read through data in its most raw form.  Often you will get summary reports saying that a solution is only 70% accurate or that certain groups of Users are unhappy.  Look at the data!  See exactly what is being asked of the solution and that will help you to understand the real issues.  You should expect to see ambiguous questions and inputs that would be difficult for a human being to interpret.  That doesn’t mean that you should accept the inaccuracy in the system – just that you may need to work on the User Interface or the processes for handling ambiguity or some other aspect of the solution.  You can only make wise decisions if you really understand the data so don’t be seduced by summary performance reports.

In my next blog I will talk more about representative training data and how that data is actually used by machine learning algorithms.

Friday, 28 August 2015

Applying Machine Learning To Real World Problems

As the Chief Architect for IBM Watson Tooling, I am passionate about applying Machine Learning to real world business problems.  Inventing, implementing and applying Machine Learning solutions has been my (professional) life for over 20 years.

There is a huge amount of excitement around Artificial Intelligence (AI) right now and understandably Machine Learning is getting a lot of attention.   The last thing anyone wants is for failed projects to trigger a new AI Winter so it's really important that we get this right.  I therefore thought it would be helpful to publish a series of articles on the practical application of Machine Learning.

This blog is pitched at a non-technical audience; however, I'm more than happy to spin off deep techie conversations if required.  As a Master Inventor and senior technical leader in IBM, I must stress that this is my personal blog and that the views and opinions expressed are entirely my own.  They do not necessarily represent IBM's positions, strategies or opinions.

Machine Learning is all about computers learning from real world experience.  For example, if you want to build a system that answers questions on your corporate web site, you start by giving the computer a set of example questions and the answers you would like in response.  The Machine Learning system takes this training data and learns how to answer Questions.  Similarly, if you want to train a system to recognise people by the sound of their voices you collect examples of their voice recordings and label each recording to build a training set.

At a high level, this is an attractive message as it leaves the audience with the impression that everything is easy.  Grab some data, run it through a Machine Learning algorithm and you have a run-time system that you can apply easily to your live, operational data.

The devil though is in the detail and there are some important points that are worth noting:
  1. Machine Learning can only be effective when provided with good quality training data that is representative of the live operational data.  It's not a shortcut!  Ensuring you have the right data is an important and time-consuming aspect of the project.  I estimate that 80-90% of the time I spend on analytics projects is spent getting the data right.
  2. Machine Learning systems come in many different forms.  There are neural networks, probabilistic classifiers, Markov models, fuzzy networks and rules based systems.  All of these different AI algorithms can be trained using Machine Learning.  Machine Learning simply describes how the algorithm is trained.  For example, I have a long history with rules based systems in defence applications.  Defence customers liked rules based systems as they could understand why a system took a certain action in response to an input.  However, I often hear rules based systems criticised because the rules are hard to define.  In my Defence systems, the rules were derived using Machine Learning.
  3. Be realistic about what you expect of a Machine Learning system.  Quite often there is a belief that if you throw a Machine Learning, or any AI, tool at a large pot of data it will discover something important.  For example, I have worked with law enforcement agencies where we were analysing huge amounts of data as part of criminal investigations.  Sometimes, the answer just didn't exist in the data!  In a Question Answering system, sometimes the Questions are so ambiguous that a human expert would struggle to understand them.  We shouldn't expect AI systems to solve unsolvable problems.
  4. Think hard about how you assess the accuracy of your system.  As with training data, it's important that any test data is representative of the live, operational data.  However, remember that these are statistical systems that can be skewed by the data.  Consider a Counter Fraud solution where the system has to identify cases of fraud.  If 75% of the test data cases are not fraudulent, then I can achieve an accuracy of 75% just by never declaring a case to be fraudulent.  Conversely, if I declare all cases to be fraudulent I catch every case of fraud but generate a huge number of false alarms in the process.  There's a whole raft of work on the science of measuring accuracy that I will discuss in a later blog; however, the key point to understand is that the test data can be skewed to alter the performance metrics.
  5. Also remember to factor in the cost impact of any decisions.  Consider a Customer Relationship Management (CRM) solution.  You may develop a Machine Learning system to predict the likelihood of a customer leaving and going to an alternative supplier.  A system may be 100% accurate in predicting that some customers are going to stay whilst failing to predict that other more valuable customers are going to leave.  In tuning the system it's important to consider the cost impact of a wrong decision.
  6. Don't assume that Machine Learning will easily outperform other analysis techniques.  One of my earliest experiences was in applying Machine Learning in Formula 1 where many of the applications were in control systems.  Control theory is a huge discipline with massive amounts of research and a whole academic and professional discipline behind it.  Many of the systems we were looking at had been thoroughly analysed and modelled using control theory.  These existing engineering approaches generally performed better.  However, within control theory there were (are) specific problems that may benefit from a Machine Learning approach.  Engineers working in control systems may use Machine Learning as one of the tools in their toolkit.
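
For readers who like to see the arithmetic, points 4 and 5 above can be sketched in a few lines of Python.  The test set of 100 cases (25 fraudulent, 75 not) mirrors the 75% example; the function names and the cost figures (1000 per missed fraud, 50 per false alarm) are purely hypothetical values chosen for illustration.

```python
def confusion_counts(actual, predicted):
    """Count outcomes, where True means 'fraud'.

    Returns (true positives, false positives, true negatives, false negatives).
    """
    tp = fp = tn = fn = 0
    for is_fraud, flagged in zip(actual, predicted):
        if flagged and is_fraud:
            tp += 1
        elif flagged:
            fp += 1      # false alarm
        elif is_fraud:
            fn += 1      # missed fraud
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(actual, predicted):
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return (tp + tn) / len(actual)


def recall(actual, predicted):
    """Fraction of the genuinely fraudulent cases that were flagged."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return tp / (tp + fn) if tp + fn else 0.0


def total_cost(actual, predicted, missed_cost, false_alarm_cost):
    """Point 5: weight each kind of wrong decision by what it costs."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return fn * missed_cost + fp * false_alarm_cost


actual = [True] * 25 + [False] * 75   # 75% of cases are not fraudulent
never_fraud = [False] * 100           # never declare fraud
always_fraud = [True] * 100           # declare everything to be fraud

print(accuracy(actual, never_fraud))          # 0.75, yet no fraud is caught
print(recall(actual, never_fraud))            # 0.0
print(recall(actual, always_fraud))           # 1.0
print(confusion_counts(actual, always_fraud)[1])  # 75 false alarms
print(total_cost(actual, never_fraud, 1000, 50))   # 25000
print(total_cost(actual, always_fraud, 1000, 50))  # 3750
```

Neither extreme is a useful system, and which one looks better depends entirely on the metric and the costs you choose, so it pays to agree on both before comparing solutions.
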
Understanding these basic concepts will help you to make good decisions in exploiting Machine Learning technology.  I've been working in this field for over 20 years and have seen many very successful projects.  The benefits of this technology are immense if you understand the basic principles and apply them correctly.

In my next blog I will talk more about training and test data and how to ensure you get the data right.