Google wants to use machine learning to make restaurant reservations for you. China wants to use it to help with healthcare. Facebook wants to use it to put an end to fake news. It’s easy to find flashy headlines about machine learning. However, a question remains – what actually is it?
Without an understanding of the core essentials of machine learning, applying the technology to solve real world problems is an exercise in futility. It’s like a blind man wearing glasses because he heard they help you see. One ought to understand the fundamentals of machine learning to both make good use of the technology and to avoid costly mistakes when acquiring it.
In this post, I’ll provide a concise introduction to machine learning (ML). If you’re looking for an overview of the AI landscape, please visit our other post here.
As we go along, I’ll call out technical definitions in bold in an attempt to demystify common ML jargon. Each section of this post will act like a domino in a chain, with the biggest question from the first section driving the content of the second section and so on until you have a solid grasp of the basics of machine learning. After this post, you’ll be better able to evaluate the ML at the core of artificial intelligence technology.
At its core, ML isn’t actually that complex. Most ML you’re likely to see in the real world can be understood intuitively by understanding classifiers.
What is a classifier?
At a high level, a classifier is a labeling machine. Feed it a discrete input (e.g., an image, a word, a sentence) and it outputs one of a set of known labels.
In the example above, the enigmatic “black box” represents our classifier. It takes as input a discrete object of interest (the image) and produces a label (“Cat”) from a fixed set of possible labels (the output space).
In the real world, most of the exciting ML models you might have read about are either more complex flavors of the classifier we describe in this blog post, or are composed by chaining simple classifiers together to perform more complex tasks (e.g., sentence translation, self-driving cars, chatbots). For the sake of this introductory post, we’re only going to examine a simple classifier that sorts images into two classes: cat or dog. This straightforward example will make things more intuitive without sacrificing technical rigor, and the same insights will apply to practical applications such as self-driving cars.
How are classifiers constructed?
In the previous section, we imagined a classifier as a black box: an input (the image) went in and a label (“Cat”) came out. Now, we’ll discuss how classifiers are made, which will lay the groundwork to explain how they work. The graphic below helps illustrate the key components of classifiers:
Calling out each of the three components:
- The training data consists of pairs of known inputs and outputs that look just like the inputs and outputs that the classifier will emulate. In our example, our data consists of pictures and their corresponding labels. Our (input, output) data looks like this: (Image of cat, “Cat”), (Image of dog, “Dog”), (Image of dog, “Dog”) and so on.
- The training algorithm uses this labeled data to produce a classifier that can emulate the task demonstrated in the data. In our example, this means that the resulting classifier can label an image as “Cat” or “Dog”.
- The classifier — defined here precisely — is a model with all of its parameters filled with actual values by the training algorithm. A simple way to think about parameters is to recall the equation of a line: y = m*x + b. In this equation, “m” and “b” are the parameters.
A model is an algorithm used by our classifier to make its predictions. For now, you can think of a model like an empty shell. It needs to be filled with information before it can make predictions. While a model has the capacity to perform complex predictions, it requires the right parameters to do so. Asking a model to make a prediction without providing parameters is like expecting an empty DVD player to play “Interstellar”. While the DVD player has the capacity to show anything on the screen, it lacks the instructions for showing Interstellar in particular.
To develop these parameters and fill this empty shell, we use a process known as training. Basically, we teach our model how to make predictions by using training data.
Here’s how that works at a high level: we construct an algorithm that takes as input labeled data and produces as output a classifier, or, more precisely, the filled in parameters that complete a model. Labeled data goes in (images labeled as “Cats”; images labeled as “Dogs”) and a set of numbers (the parameters of the model) comes out that defines how the model should predict whether a new image is a cat or a dog. The training algorithm fills the values of our parameters.
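To make the split between model, parameters and training algorithm concrete, here is a minimal sketch. It is a toy under stated assumptions, not how a real image classifier is built: each input is reduced to a single hypothetical number, the only parameter is a threshold, and the names `train`, `model` and `classifier` are invented for illustration.

```python
from statistics import mean


def train(labeled_data):
    """Toy training algorithm: labeled data goes in, parameters come out.

    Assumption for illustration: each input is one number, cats tend to have
    smaller values than dogs, and the single parameter is a threshold placed
    halfway between the two class averages.
    """
    cats = [x for x, label in labeled_data if label == "Cat"]
    dogs = [x for x, label in labeled_data if label == "Dog"]
    return {"threshold": (mean(cats) + mean(dogs)) / 2}


def model(x, params):
    """The 'empty shell': it cannot predict anything until params are supplied."""
    return "Cat" if x < params["threshold"] else "Dog"


# Hypothetical (input, output) training pairs.
training_data = [(1.0, "Cat"), (2.0, "Cat"), (8.0, "Dog"), (9.0, "Dog")]
params = train(training_data)


def classifier(x):
    """The classifier is the model with its parameters filled in by training."""
    return model(x, params)


print(params)
print(classifier(3.0), classifier(7.5))
```

The point of the sketch is the shape of the pipeline: `train` never makes predictions itself, it only produces the numbers that `model` needs.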
In the real world, how classifiers are constructed is massively important to the ultimate usefulness of the model. For example, a classifier which is fed examples of stop signs wouldn’t be able to detect traffic lights, just like a DVD player playing Interstellar isn’t showing Lord of the Rings. So, why can we train a classifier on street signs and expect it to work on unseen street signs, but not expect it to work on traffic lights? In the next section, we’ll go over how classifiers are actually trained — how we fill the values of our parameters — and gain insight into what types of unseen data we can expect a classifier to be accurate on.
How do we fill the values of our parameters?
It might help if we yank the lid off our black box in earnest and get into the step by step process of how our mysterious classifier labels images as either “Cat” or “Dog”. The following graphic represents the first few steps:
First, we have to convert our discrete input – for example, an image of a cat – into something we can do math on. For practically all machine learning applications, this is a vector. A vector, in turn, can be thought of as a point on a Cartesian grid.
These are usually very long vectors, which means these points are not in 2- or 3-dimensional space, but in something like 500-dimensional or 10 million-dimensional space. I recommend doing what every ML expert does, which is to visualize 2-dimensions and think “10 million” really hard.
Converting an image to a vector is a relatively straightforward process. Every pixel of the image is a dimension of the vector, and the value of each dimension is, for example, the grayscale value of that pixel.
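As a small illustration of the conversion just described, the sketch below flattens a grayscale image into a vector, one dimension per pixel. The 3x3 image and the function name `image_to_vector` are made up for the example; real images have far more pixels.

```python
def image_to_vector(image):
    """Flatten a grayscale image (a list of pixel rows) into one long vector."""
    return [pixel for row in image for pixel in row]


# Hypothetical 3x3 grayscale image: 0 is black, 255 is white.
tiny_image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]

vector = image_to_vector(tiny_image)
print(len(vector))  # 9 pixels, so a point in 9-dimensional space
print(vector)
```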
Given that each discrete input is converted to be some point in space, we just have to split this space in two: some of the space for “Cats” and some for “Dogs”. In the simplest case, we do this by drawing a line — or in the high-dimensional case a hyperplane — that separates the space. This is our decision boundary.
The job of the training algorithm is to define the parameters of this decision boundary. In high dimensional space, this decision boundary exists as a high dimensional plane. In two dimensions, the decision boundary exists as a one dimensional plane – which is simply a line. Just like before, when someone refers to a decision boundary or hyperplane it’s safe to visualize a line and think “lots of dimensions” really hard:
A line requires two parameters to be fit: recall again the equation of a line, y = m*x + b. Here, “m” and “b” are the parameters whose values our training algorithm sets. For higher dimensions, we have more parameters, but the idea remains the same.
Putting this all together, our black box of a classifier starts to look pretty straightforward. We convert our input to a vector, we plot the resulting vector, and then we measure which side of the decision boundary (line) it’s on. For bonus points, to get a measure of confidence we can measure how far away we are from the decision boundary — the further we are, the more confident we are. In our example, the further a point is from our decision boundary the cattier the cat or the doggier the dog that point represents.
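The prediction step can be sketched in two dimensions using the line y = m*x + b as the decision boundary. The choice that points above the line are “Cat” is an arbitrary assumption for this example, as are the parameter values and the function name `classify`.

```python
import math


def classify(point, m, b):
    """Label a 2-D point by which side of the line y = m*x + b it falls on.

    Returns the label and the point's distance from the line, which serves
    as a rough confidence: further from the boundary means more confident.
    """
    x, y = point
    # Signed perpendicular distance from the point to the line.
    signed_distance = (y - (m * x + b)) / math.sqrt(1 + m * m)
    # Assumption for illustration: the region above the line is "Cat".
    label = "Cat" if signed_distance > 0 else "Dog"
    return label, abs(signed_distance)


# Hypothetical parameters, as if produced by a training algorithm.
m, b = 1.0, 0.0

print(classify((1.0, 3.0), m, b))  # well above the line: a very catty cat
print(classify((2.0, 1.9), m, b))  # just below the line: a barely doggy dog
```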
In the real world, these parameters can define how a car decides whether a stoplight is displaying a go or stop signal, whether “ciao” means hello or goodbye, and whether a chatbot thinks “I’m just joshing you” is a joke or someone telling the bot their name. But it should also be clear that we’re dividing our space in our example into just a few classes. If we’re classifying between cats and dogs, how would we be expected to pick up on what a badger is? If we’re classifying whether something is a stop sign, how would the model decide whether it’s a traffic light? If we get a new image of a cat, we can expect it to fall into the portion of the space labelled “cat”, but as we move away to more and more different inputs, it becomes less and less clear that they’ll vectorize to the right place in the space.
How does training actually work?
Short answer: training is curve fitting.
Now that we know how the black box works, the training algorithm becomes much more intuitive. Training an ML system, at its core, can be viewed as an exercise in curve fitting. In fact, you may have done least squares regression back in school; it is a perfectly valid and widely used machine learning algorithm.
The line drawn in the plot below shows how we draw our decision boundary (each point is a training example):
Intuitively, the goal of a training algorithm is to construct a decision boundary that separates our data with as few mistakes as possible, with as many elements as far from the decision boundary as possible. For example, the boundary line in the plot above makes only one mistake — mislabeling a cat as a dog.
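One way to picture the "as few mistakes as possible" part of that goal is to score a handful of candidate lines against the training data and keep the best one. This is only a sketch: the data points and candidate lines are hypothetical, and real training algorithms search over continuous parameter values rather than a short hand-written list.

```python
def side(point, m, b):
    """Assumption for illustration: points above y = m*x + b are "Cat"."""
    x, y = point
    return "Cat" if y > m * x + b else "Dog"


def count_mistakes(data, m, b):
    """Number of training examples the line (m, b) labels incorrectly."""
    return sum(1 for point, label in data if side(point, m, b) != label)


# Hypothetical 2-D training examples: (point, label).
training_data = [
    ((1.0, 3.0), "Cat"), ((2.0, 4.0), "Cat"), ((3.0, 2.5), "Cat"),
    ((3.0, 1.0), "Dog"), ((4.0, 2.0), "Dog"), ((5.0, 3.0), "Dog"),
]

# Hypothetical candidate boundaries, each given as (m, b).
candidates = [(1.0, 0.0), (-1.0, 8.0), (0.0, 3.5)]

best = min(candidates, key=lambda p: count_mistakes(training_data, *p))
for m, b in candidates:
    print((m, b), count_mistakes(training_data, m, b))
print("best:", best)
```

With this data the winning line still mislabels one cat as a dog, which mirrors the single mistake described above.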
In the real world, much of the challenge of machine learning is to take the limited training data we have, and find the “right” curve: the curve that classifies the most unseen data correctly. It should be clear then that more data tends to produce a more accurate boundary, and that the right data (the data that’s closest to your unseen data) should produce a more reliable decision boundary. When we talk about “big data” or “clean data” or any of these ML buzzwords, what we mean is data that will help fit the right curve to separate, in our example, cats and dogs.
Where do we go from here?
This post has focused on the crucial technical underpinnings of machine learning, explaining classifiers, how classifiers make predictions and how you can train a classifier from training data. When people ask me practical questions about machine learning, the intuitions gained from these underpinnings ground my answers.
Practical questions these intuitions have helped me answer: why is it important to evaluate ML systems on unseen data? Why are neural nets such powerful classifiers? What type of data is valuable for machine learning? In upcoming blog posts, I’ll leverage the insights from this post to answer those and other questions. Other posts to look forward to include an explanation of AI axioms, how to evaluate AI systems, how to tell if AI is right for your task and more. If you’d like to be alerted when new posts go up, please subscribe.
If you still have questions about the points in this post and want to discuss further, please shoot me an email at [email protected]. Perhaps your question will turn into a future blog post!
 A more accurate statement here would be “a lot of tasks that don’t look like classifiers can nonetheless be understood with the same intuitions.” For instance, neural machine translation can be viewed as a sequence of classifiers that work like the below image:
Here, each word of the translated English sentence is predicted from (1) the foreign sentence and (2) the English sentence generated so far. Each of these decisions is essentially a classifier, with a huge output space consisting of every word in the English language. The details are complex, but for the most part the intuition holds.
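The "sequence of classifiers" view can be sketched as a loop that repeatedly asks a classifier for the next word. Everything here is hypothetical: `next_word_classifier` is a hard-coded stand-in for a trained model (it even ignores the foreign sentence), and the `<end>` marker is an assumed convention for stopping.

```python
def next_word_classifier(foreign_sentence, english_so_far):
    """Hypothetical stand-in for a trained classifier over English words.

    A real model would use both arguments; this toy lookup table only looks
    at the English words generated so far.
    """
    toy_table = {
        (): "the",
        ("the",): "cat",
        ("the", "cat"): "sleeps",
        ("the", "cat", "sleeps"): "<end>",
    }
    return toy_table[tuple(english_so_far)]


def translate(foreign_sentence, max_words=10):
    """Build the translation one word at a time, one classification per word."""
    english = []
    for _ in range(max_words):
        word = next_word_classifier(foreign_sentence, english)
        if word == "<end>":
            break
        english.append(word)
    return " ".join(english)


print(translate("le chat dort"))
```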
 Parameters, too, is a technical term. The parameters of the model are the learned numbers that determine how the model will perform its task.
 Capacity serves as a technical term as well — high capacity models (e.g., neural networks) have the ability to learn more complex tasks, but tend to be more difficult to train. High capacity models tend to have many parameters.
 For many applications, this is a higher-order tensor. But, for the purposes of intuition a tensor is just a vector with extra mental gymnastics — a vector is a rank 1 tensor, a matrix is a rank 2 tensor, and so on.
 For language, converting text into vectors requires an extra step. This is known as embedding words into a vector space (usually, around 100-1000 dimensions); the resulting vectors are called word embeddings. You may have heard of word2vec and GloVe, which are two popular methods for generating word embeddings, along with the corresponding dictionaries for mapping words to vectors.
 A hyperplane is an (n-1) dimensional space embedded into an n-dimensional space. Much like most things in high-dimensional space, most AI experts imagine either a 2-dimensional space and a 1-dimensional “hyperplane”, or a 3-dimensional space and a 2-dimensional “hyperplane” (in that case, just a plane).
 Expanding on this a bit: least squares regression will fit a line to a set of points. This line is actually along one more dimension than the embedding space. In the 2D space we’ve been using as an example, imagine now a third dimension coming out of the screen, where every positive example has a value of 1 and every negative example has a value of -1. The “line” we’re fitting is then the line between the 1 and -1 values; the decision boundary hyperplane is the intercept of the fitted line with the feature space.
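The idea in this note can be sketched one dimension lower so it fits in a few lines: the features live on a 1-D axis, the fitted line lives in 2-D (feature versus target), and the decision boundary is the single point where the line crosses zero. The data and the helper `least_squares_line` are assumptions for illustration, with cats as the +1 class and dogs as the -1 class.

```python
def least_squares_line(xs, zs):
    """Ordinary least squares fit of z = m*x + b (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    m = sxz / sxx
    b = mean_z - m * mean_x
    return m, b


# Hypothetical 1-D features with targets +1 (cat) and -1 (dog).
features = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
targets = [1, 1, 1, -1, -1, -1]

m, b = least_squares_line(features, targets)

# The decision boundary is where the fitted line crosses zero.
boundary = -b / m


def predict(x):
    """Positive side of the fitted line is "Cat", negative side is "Dog"."""
    return "Cat" if m * x + b > 0 else "Dog"


print(m, b, boundary)
print(predict(2.5), predict(6.0))
```

In the 2-D picture used in this note, the same recipe fits a plane instead of a line, and the boundary becomes the line where that plane crosses zero.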