Insight of the Week: Supervised Machine Learning and the Bias/Variance Trade-Off

by Stefan Viorel Mihai, Research Assistant, LDTRC

Digital Twins embody the driving force behind the Fourth Industrial Revolution: the promise of bridging the physical world and its virtual counterpart in a way that enables full-duplex, real-time, reliable communication between the two. With the advent of Big Data, IIoT, Cyber-Physical Factories, and Artificial Intelligence, this no longer looks like a far-fetched idea; it has instead become an increasingly relevant objective for researchers. However, building such a complex system requires a strong grasp of the technologies involved and good foresight into the risks and issues that might pose a challenge along the way. In this context, this week’s meeting of the London Digital Twin Research Centre focused on discussing one of the most prominent challenges in Machine Learning: the Bias/Variance Trade-Off.

Among the many definitions of Machine Learning (ML), one that covers both the capabilities and the limitations of ML is given by Tom Mitchell:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

For example, in a classification problem (T), the classification accuracy (P) should increase with the amount of training data (E). Note that this is not always true in practice!

This definition successfully encompasses the three main components on the check-list of an ML project:

  • The task we want to achieve using Machine Learning, which must be clear from the start;
  • The adequate choice of training data, which must be relevant to the targeted task;
  • The right pick of a performance metric, which must be a good indicator of how well the algorithm performs the task we have assigned to it.

Thankfully, we already know what kind of problems can be solved by Machine Learning. In fact, there are different types of ML models that are well suited to accomplish a diverse array of tasks. The figure below gives a brief review of the three main types of ML, together with their characteristics and typical uses.

As already noted, however, the issue of Bias/Variance Trade-Off plagues all of Machine Learning, no matter the type of model or task it is meant to achieve. In this week’s meeting, we discussed how underfitting and overfitting affect ML models, exemplified on some popular and simple Supervised Learning algorithms. Additionally, we stressed the importance of choosing training data and a performance measure that are relevant to the task at hand.

Supervised Machine Learning

Above are some of the more popular Supervised Learning algorithms, and although most of them can achieve both Regression and Classification, some are better suited for one purpose than the other.

To visualize the effects of the Bias/Variance Trade-Off on the performance of ML algorithms, we looked at two models, Linear Regression and Logistic Regression, which illustrate the challenges that Machine Learning models typically face.

Linear Regression

In regression algorithms, the goal is to predict a continuous-valued output (e.g. house prices, temperature, stock prices). Given data described as a feature vector x, the model comes up with a function (called a hypothesis) that maps the often multi-dimensional feature vector to a single continuous value, which is an estimate of the target output, y.

For example, given data described as a 4-dimensional feature vector:

$$x = \begin{bmatrix} x_{1} & x_{2} & x_{3} & x_{4} \end{bmatrix}$$

…a trained Linear Regression algorithm will come up with a hypothesis like:

$$h_{trained} = b + w_{1} x_{1} + w_{2} x_{2} + w_{3} x_{3} + w_{4} x_{4} $$

…where $b$ and $w_{k}$ are the parameters of the model, also called the bias and the weights respectively, and $h_{trained}$ is the hypothesis of the trained Linear Regression model.

Training a Linear Regression model means finding the optimal parameters for which the hypothesis will output the closest estimate possible to the target output, y, given the feature vector x. In this process, a Cost function is used as a measure of how wrong the hypothesis’ estimates are in comparison with the target output, and the training objective is finding the bias and the weights that will minimize this Cost function for the given training data set.
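The training process described above can be sketched in a few lines of plain Python. This is only a minimal illustration under some assumptions that are not stated in the post: the Cost function is taken to be the (halved) mean squared error, the minimization is done with batch gradient descent, and the learning rate `lr`, the number of `steps`, and the toy data are all made up for the example.

```python
def hypothesis(b, w, x):
    """Linear hypothesis h(x) = b + w1*x1 + ... + wn*xn."""
    return b + sum(wk * xk for wk, xk in zip(w, x))


def cost(b, w, X, y):
    """Halved mean squared error between estimates and targets (assumed Cost)."""
    m = len(X)
    return sum((hypothesis(b, w, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def fit(X, y, lr=0.05, steps=5000):
    """Batch gradient descent on the cost above; lr and steps are illustrative."""
    m, n = len(X), len(X[0])
    b, w = 0.0, [0.0] * n
    for _ in range(steps):
        errors = [hypothesis(b, w, xi) - yi for xi, yi in zip(X, y)]
        grad_b = sum(errors) / m
        grad_w = [sum(e * xi[k] for e, xi in zip(errors, X)) / m for k in range(n)]
        b -= lr * grad_b
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
    return b, w


# Made-up, noise-free toy data generated from y = 1 + 2*x1 + 3*x2.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]

b, w = fit(X, y)
print("bias:", round(b, 3), "weights:", [round(wk, 3) for wk in w])
print("final cost:", cost(b, w, X, y))
```

Because the toy data contains no noise, the recovered bias and weights should approach the values used to generate it, and the cost should approach zero. With real data the cost would settle at some positive value instead.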

Above is an example of a basic Linear Regression problem: predicting house prices. For the sake of this example, the chosen feature vector is two-dimensional, containing only the size of the house (in units of 1,000 sq. ft.) and the number of bedrooms.

The goal of a Linear Regression model is to find the parameters of the hypothesis that best fit both features in the training set, given the corresponding target outputs. A trained model will return the following hypothesis:

It can be noted, in the figure on the right, that the Linear Regression model is very sensitive to outliers, which gives all the more reason for carefully considering the choice of training data for this model. Additionally, going back to Tom Mitchell’s definition of Machine Learning, another very important conclusion can be drawn:

If the experience E is unrelated to the task T, then the performance measure P will be an inaccurate representation of the model’s capabilities.

In other words, it can be noted that the model above is not trained to predict house prices for properties that have more than 5 bedrooms. As such, if the purpose of the model was to accomplish the task of estimating selling prices of very big houses (many bedrooms), then the model would most likely perform poorly because it was trained with inadequate data. Garbage-in, garbage-out!

Logistic Regression

In classification learning algorithms, the goal is to find the parameters of a hypothesis that can draw an optimal decision boundary between the (usually) two classes of observations. The training process is similar to that of Linear Regression, except that this time the hypothesis and Cost function take a different form.
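To make the "different form" concrete, here is a minimal sketch of what the hypothesis and Cost function usually look like for Logistic Regression: a sigmoid applied to the same linear combination used in Linear Regression, and the cross-entropy (log loss) as the Cost. The post does not spell these out, so treat the exact formulas, the function names, and the toy numbers below as assumptions for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function, mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(b, w, x):
    """Hypothesis: estimated probability that x belongs to class 1."""
    return sigmoid(b + sum(wk * xk for wk, xk in zip(w, x)))


def predict(b, w, x):
    """Class decision; the boundary is the set of points where b + w.x = 0."""
    return 1 if predict_proba(b, w, x) >= 0.5 else 0


def log_loss(b, w, X, y, eps=1e-12):
    """Mean cross-entropy cost; eps keeps log() away from zero."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(b, w, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


# Made-up toy observations and hand-picked (not trained) parameters.
X = [(1.0, 2.0), (-1.0, -0.5)]
y = [1, 0]
b, w = 0.0, [1.0, 1.0]

print([predict(b, w, xi) for xi in X])   # class decisions for both points
print(log_loss(b, w, X, y))              # cost for the hand-picked parameters
```

With all parameters at zero, every observation gets probability 0.5, so the cost equals ln 2; training lowers it by moving the decision boundary so that the two classes fall on opposite sides.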

Depending on the characteristics of the data to be classified, the decision boundary can take various forms, from a straight line to a very complex shape that might require sophisticated features, fine tuning of the model’s parameters, and well-chosen training data.

Above is an example of a classification problem that can be reasonably solved using a linear decision boundary. It can be noted that, although the classifier gets most of the training observations classified correctly, there are some, albeit only a few of them, that get misclassified.

This leads to the question: what is a good performance metric for a classifier?

From the figure above, one can quickly derive an accuracy measure, computed as the ratio of correctly classified examples to the total number of observations. For this model, the classification accuracy on the training set is 80.93%, which is a reasonable value for such a simple model. While not necessary for this problem in particular, it turns out that we can control the training set accuracy with the help of a regularization parameter that tells the model how strictly we want it to differentiate between the two classes or, in other words, how closely we want it to fit the training data. This is an incredibly useful feature, particularly in classifiers that involve more complex decision boundaries, like the one in the figure below.
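The accuracy measure itself is a one-liner. The sketch below uses made-up labels and predictions (not the data from the figure) purely to show the arithmetic.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up example: 10 observations, 2 of them misclassified.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]

print(accuracy(y_true, y_pred))  # 8 correct out of 10 -> 0.8
```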

By playing with the regularization parameter, we can obtain the following examples of decision boundaries:

We can note that the third model has the highest accuracy, but it also has a somewhat strange decision boundary that is likely to prove problematic if we use that model to classify new observations. So now we can ask: is the training set accuracy a good indicator of the performance of our classifier? The short answer is: no. The accuracy on the training set can only tell us how well the algorithm distinguishes between classes for observations that are in the training set, without giving any indication of its performance on never-seen-before feature vectors. More often than not, we are interested in an algorithm that can generalize to new observations with high accuracy. To get a feel for how well our algorithms can do that, the general rule of thumb is to calculate the model’s accuracy on another set of data, called the test set, that has not been part of the model’s training.
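A deliberately extreme example shows why training set accuracy alone can be misleading. The `Memorizer` below is a hypothetical toy classifier, not one of the models discussed in the meeting: it simply stores every training example and falls back to a default class for anything it has not seen. The data is made up.

```python
class Memorizer:
    """Toy overfitter: remembers training examples verbatim, learns no rule."""

    def __init__(self, default_label=0):
        self.default_label = default_label
        self.table = {}

    def fit(self, X, y):
        self.table = {tuple(x): label for x, label in zip(X, y)}
        return self

    def predict(self, X):
        # Unseen feature vectors all receive the default label.
        return [self.table.get(tuple(x), self.default_label) for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy data: the true label is 1 when x1 + x2 > 1, else 0.
X_train = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.3), (0.7, 0.9), (0.2, 0.5), (0.8, 0.6)]
y_train = [0, 1, 0, 1, 0, 1]
X_test = [(0.3, 0.1), (0.6, 0.7), (0.9, 0.4), (0.2, 0.3)]
y_test = [0, 1, 1, 0]

model = Memorizer().fit(X_train, y_train)
train_acc = accuracy(y_train, model.predict(X_train))
test_acc = accuracy(y_test, model.predict(X_test))
print("training accuracy:", train_acc)  # perfect on data it has memorized
print("test accuracy:", test_acc)       # no better than always guessing class 0
```

The training accuracy is perfect, yet on the held-out observations the model does no better than always answering with the default class. Only the test set reveals this.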

This is where the problem of the Bias/Variance Trade-Off becomes apparent. We say a model has High Bias if it underfits the training data (low training set accuracy) and performs even worse on the test data (low test set accuracy). Similarly, a model is said to have High Variance if it overfits the training data (high training set accuracy) but has poor results on the test data (very low test set accuracy). The challenge is finding the regularization parameter for which the model performs similarly well on both the training data and the test data.
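These definitions can be turned into a rough diagnostic. The sketch below is only a rule of thumb: the thresholds `min_acc` and `max_gap` are hypothetical values picked for illustration, and the accuracy figures for the three candidate models are made up rather than measured.

```python
def diagnose(train_acc, test_acc, min_acc=0.8, max_gap=0.1):
    """Label a model by comparing its training and test accuracy.

    min_acc and max_gap are illustrative thresholds, not recommended values.
    """
    if train_acc < min_acc:
        return "high bias (underfitting)"
    if train_acc - test_acc > max_gap:
        return "high variance (overfitting)"
    return "balanced"


# Made-up (training accuracy, test accuracy) pairs for three settings of
# the regularization parameter, from strongest to weakest regularization.
candidates = {
    "strong regularization": (0.62, 0.58),
    "moderate regularization": (0.88, 0.85),
    "weak regularization": (0.99, 0.70),
}

for name, (train_acc, test_acc) in candidates.items():
    print(f"{name}: {diagnose(train_acc, test_acc)}")
```

In this made-up scenario, the moderate setting is the one where training and test accuracy are both reasonably high and close to each other, which is the behaviour we are looking for.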

To choose the regularization parameter (as well as other parameters that might be of interest when fine-tuning the model), using a cross-validation set is highly encouraged. A useful test for whether the model is underfitting or overfitting the data is to compare the algorithm’s performance on the training set with its performance on the cross-validation set, then choose the parameters for which the two are similar. The following distribution of the data is recommended:

Conclusions

This week’s meeting focused on understanding the amazing capabilities of Machine Learning algorithms, presented in contrast with the most common challenges that all models face. We concluded that a Machine Learning algorithm is only as good as the data used to train, validate, and test it, and this important insight greatly influences the development of Digital Twins as powerful tools for streaming data analysis for predictive purposes.

See the meeting’s presentation at the following link: Machine Learning

Note: The figures in this post are adaptations of the examples given by Andrew Ng in his Machine Learning course.
