
This blog post is based on a conference talk I gave at the
PyTexas and North Bay Python conferences.
The blog post is a little more detailed, but if you prefer watching video
to reading text you can watch the talk on YouTube.


Machine learning seems to be everywhere these days, but a lot of the information
about what it is and how it works can be somewhat opaque. On one end of the
spectrum there’s the “just run this code” approach, which is great if you’re
learning a new library for a familiar task, but can seem a bit like magic when
it’s demonstrating something you’ve not done before.
On the other end of the spectrum is the mathematical explanation. Mathematical
notation is a useful tool if you know it, but if not it can hide some simple
ideas behind unfamiliar language.

This post will try to walk the line between those extremes. We’ll explore the
ideas behind machine learning without much math or much code, so we can pick out
some of the landmarks that will help us navigate this brave new world.

What can machine learning do for us?

In order to understand how this thing works, first we need to know what it’s
aiming for.

Think about how we typically write software. As much as we like to use fancy
abstractions like objects and functions to organise our code and make it easier
to maintain, we’re ultimately writing a specific, linear sequence of steps that
the computer should perform to turn some input into some output. If our program
gets some input that we didn’t explicitly write instructions for, it’ll break.
If we’ve done our job well, it’ll break in a predictable and consistent way—like
a Web application responding with a 404 error when it gets a request for an
unknown URL—but it’ll still break.

The killer feature of machine learning is generalisation: the ability
to adapt to new kinds of input that we didn’t explicitly consider when we were
building the system.

I recently wrote about using machine learning to understand ingredients in
recipes. There’s no way I could consider every possible format for recipe
ingredients—the rules of the English language are too complex and too vague—so I
used machine learning to build a general system, instead of hand-coding a
specific system.

The same applies to other machine learning applications. Think about recognising
faces: it might be possible to build a specific system that could recognise my
face, but we need a system that can generalise if it’s going to be useful for
any face.

We’re aiming for generalisation, but most of the software we write consists
of very specific instructions. This seems like a hard problem to solve, but
there’s a good chance you’ve solved it before in another context.

High school science class

Remember your high school science teacher telling you that you should pay
attention because this will all come in useful one day? Today is that day.

Most high school science experiments aim to build a generalised system to make
predictions about the world, and they build those systems by following specific
instructions. Let’s walk through a high school science experiment, and draw
parallels to machine learning.

Step 1: Experimental design

The first stage of any experiment is to define our goal. What are we trying to
understand about the world?

We’re going to look at a simple physics experiment to determine the relationship
between the height a tennis ball is dropped from, and the height of the ball’s
first bounce. Once we understand the relationship, we should be able to
predict how high a tennis ball will bounce before it’s dropped.

A machine learning project starts the same way: we believe there’s a
relationship between some input value or values, and some output value or
values, but we don’t understand what it is. We want to build software that can
make reasonable predictions about the output values based on input values.

Step 2: Data collection

Now that we understand what we’re looking for, we need to take some empirical
observations of what happens in the real world when we bounce a tennis ball.
For some experiments we might get lucky, and find a dataset some other scientist
has collected that we can work from. Other times, we’re going to have to go out
and collect our own data—we’re going to have to bounce a tennis ball a whole
bunch of times.

Again, machine learning is similar: we’ll collect some data that’s relevant to
our problem. Sometimes it’ll already be there in the database of our Web
application, or in a public dataset on the Internet. Other times, we’ll have
to collect it
ourselves.
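
In code, a natural way to hold that data is as a list of (input, output) pairs;
the code later in this post assumes the observations are stored in this shape.
The numbers below are made up purely for illustration.

observations = [ 
    # (drop height, bounce height), both in metres. These numbers 
    # are made up for illustration. 
    (1.0, 0.76), 
    (2.0, 1.48), 
    (3.0, 2.30), 
    (4.0, 2.97), 
    (5.0, 3.79), 
    # ...one pair for each drop height we measure. 
]

One possible way to represent our observations in Python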

Step 3: Mathematical modelling

Once we’ve collected our data, we can take a look at it. Here’s a scatter plot
of the drop heights and bounce heights we’ve observed for our tennis ball.

[Chart: the height of a ball’s first bounce when it is dropped from different
heights, plotted as drop height against bounce height.]

It looks like the data falls in a straight line, so we can draw a trend line on
our chart to describe the pattern that we see.

[Chart: a line of best fit drawn through the scatter plot of drop heights and
bounce heights.]

While this trend line doesn’t look like much, it’s actually a powerful
mathematical model. We took measurements at 1 metre increments, but our line is
continuous—it fills in the gaps between our observations. In other words, our
line can generalise to drop heights that we didn’t explicitly include when we
were building the model.

How do we define a line?

There are an infinite number of straight lines we could draw, each of which is
defined by two parameters:

  1. A fixed point, which we typically define as the point where it crosses the
    vertical axis. This is often referred to as the intercept, because
    it’s the point where the line intercepts the axis.

  2. The angle of the line, which we typically define using the
    gradient. The gradient is how far up the line goes each time it
    goes across by 1 unit; in our example, that’s how much the bounce height
    increases each time the drop height is increased by 1 metre.

Here’s our chart again, with controls to vary the intercept and gradient:

[Interactive chart: the scatter plot with a line of best fit, and controls to
vary the gradient (shown at 0.75) and the intercept (shown at 0.0).]

Once we decide on values of these two parameters, we have everything we need to
make predictions using our line.

class StraightLine(object): 
    def __init__(self, intercept, gradient): 
        self.intercept = intercept 
        self.gradient = gradient 
 
    def __call__(self, x): 
        return self.intercept + x * self.gradient 
 
 
bounce_height_predictor = StraightLine( 
    intercept=0.0, 
    gradient=0.75, 
)

Our straight line model, implemented in Python

I’ve chosen to implement this using a StraightLine class that can represent
any straight line, and an instance assigned to bounce_height_predictor that
represents our specific straight line. This highlights an important difference
between a flexible model that could work in a large number of situations, and
the parameters we give that model to make it work in a specific situation.

How do we know it’s the right line?

Our ideal trend line will fall as close as possible to the observations that
we made—we want our mathematical model to agree with what we’ve observed in the
real world. We can dismiss some lines just by looking at them, but once they
start to get close to the observations it’s hard to pick out exactly the right
one.

We can find out how close our trend line is to our observations by measuring the
distances between the predictions made by the line and the observations we made.
To make it easy to compare different lines to each other, we can then make a
single error score from these measurements by squaring them (to make sure
they’re all positive numbers), and then taking the average.

def mean_squared_error(observations, predictor): 
    """ 
    Returns the mean squared error of the given predictor 
    function, compared to observations passed as a list of 
    (input, output) tuples. 
    """ 
 
    errors = [ 
        predictor(input) - observed_output 
        for input, observed_output 
        in observations 
    ] 
    return sum([err ** 2 for err in errors]) / len(errors) 
 
 
bounce_height_error = mean_squared_error( 
    observations, 
    bounce_height_predictor, 
)

Python code to calculate the model’s error

The differences between the observations and model predictions are shown on this
version of the chart. Notice how the error changes as you change the parameters
of the line.

[Interactive chart: the error measurements between each observation and the
line’s prediction, with the gradient at 0.75, the intercept at 0.0, and a mean
squared error of 0.0081.]

If we find the gradient and intercept that give us the minimum possible error,
we’ll know our line is a good fit for our data.

Automating the process

The learning part of machine learning often refers to finding the best
parameters we can for a flexible mathematical model, so that it fits a set of
observations we’ve made of what’s happened in the past, known as our training
data.

For our simple straight line model, there are only two parameters to find, and
it’s easy to visualise the results. We don’t really need a computer’s help to
find reasonable parameters for the model, but it provides a clear example of how
a computer might be used to fit a model to some data.

A modern neural network model might have many thousands of parameters, and is
capable of modelling much more complex relationships than this one. Even with a
more complex model, the basic process remains the same.

A typical process looks like this:

  1. Start with random parameter values.
  2. Calculate the difference between the training data and the model’s
    predictions using those parameter values.
  3. Make a small change to the parameter values, so that the error goes down.
  4. Repeat many times, until the error has stopped decreasing, has reached some
    target value, or we’ve performed a fixed number of repetitions.
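
Here’s a rough sketch of that loop for our straight line, reusing the
StraightLine class and mean_squared_error function from earlier. It tries a
small random nudge to each parameter and keeps it only if the error goes down;
real optimisers are much smarter, but the shape of the loop is the same. The
function name, step size and iteration count are illustrative choices, not part
of the original post.

import random 
 
def fit_straight_line(observations, iterations=10000, step=0.01): 
    """ 
    Fit a StraightLine to the observations by repeatedly trying 
    small random changes to the parameters and keeping any change 
    that reduces the mean squared error. 
    """ 
    # 1. Start with random parameter values. 
    intercept = random.uniform(-1.0, 1.0) 
    gradient = random.uniform(-1.0, 1.0) 
 
    # 2. Calculate the error using those parameter values. 
    best_error = mean_squared_error( 
        observations, StraightLine(intercept, gradient)) 
 
    for _ in range(iterations): 
        # 3. Make a small change to the parameter values... 
        new_intercept = intercept + random.uniform(-step, step) 
        new_gradient = gradient + random.uniform(-step, step) 
        new_error = mean_squared_error( 
            observations, StraightLine(new_intercept, new_gradient)) 
 
        # ...and keep it only if the error goes down. 
        if new_error < best_error: 
            intercept, gradient, best_error = ( 
                new_intercept, new_gradient, new_error) 
 
    # 4. Stop after a fixed number of repetitions. 
    return StraightLine(intercept, gradient)

A simple (if inefficient) sketch of fitting the line automatically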

There are various standard algorithms that can efficiently
find parameters that give the minimum error. One technique is
gradient descent, which uses calculus to work out how quickly the
error is changing at any given point, and uses that to decide if each parameter
should increase or decrease, and by how much. Machine learning libraries will
provide implementations of various optimisation algorithms, so while it’s
important to understand the concept of optimisation when you’re building machine
learning systems, don’t be put off by mentions of calculus if that isn’t your
strength.
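
As a concrete illustration, here’s roughly what gradient descent could look
like for our straight line and mean squared error. The two derivative
expressions come from differentiating the mean squared error with respect to
the intercept and the gradient; the learning rate and iteration count are
arbitrary choices for this sketch, and in practice you’d lean on a library’s
optimiser instead.

def gradient_descent(observations, learning_rate=0.01, iterations=10000): 
    """ 
    Fit a StraightLine to the observations using gradient descent 
    on the mean squared error. 
    """ 
    intercept, gradient = 0.0, 0.0 
    n = len(observations) 
 
    for _ in range(iterations): 
        line = StraightLine(intercept, gradient) 
        errors = [(line(x) - y, x) for x, y in observations] 
 
        # How quickly the error changes as each parameter changes 
        # (the derivatives of the mean squared error). 
        d_intercept = sum(2 * err for err, _ in errors) / n 
        d_gradient = sum(2 * err * x for err, x in errors) / n 
 
        # Move each parameter a small step "downhill". 
        intercept -= learning_rate * d_intercept 
        gradient -= learning_rate * d_gradient 
 
    return StraightLine(intercept, gradient)

A sketch of gradient descent for the straight line model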

Step 4: Model verification

The goal of our experiment was to understand the relationship between the drop
height and the bounce height well enough to make predictions about drop heights
we’ve never observed. So far, we know we have a mathematical model that fits
well with the examples we used to build the model. The true test is whether or
not the model will generalise to other examples.

Fortunately, we’ve already figured out how to determine if the model agrees with
a set of observations. To decide if it generalises well, we can compare the
model’s predictions to a different set of observations that we didn’t use to
train the model.

If the model generalises well, its predictions should agree reasonably closely
with these new observations.
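
In code, that check might look something like this, reusing the
mean_squared_error function and the fitted bounce_height_predictor from
earlier; the held-out values are made up for illustration.

# Observations we collected but deliberately kept out of the 
# fitting process (the values are made up for illustration). 
held_out_observations = [ 
    (6.0, 4.46), 
    (7.0, 5.31), 
    (8.0, 6.02), 
] 
 
validation_error = mean_squared_error( 
    held_out_observations, 
    bounce_height_predictor, 
) 
# If the model generalises well, this error should be roughly the 
# same size as the error we measured on the training data.

Checking the model against observations it wasn’t fitted to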

When the model doesn’t generalise

If the model doesn’t generalise, we’re usually facing one of two problems.

Over fitting means that our model has agreed so closely with the
examples we used to build it that it’s not capable of generalising to new
examples. When this happens, it’s often useful to re-train the model using more
examples.

Under fitting means that our model doesn’t agree closely enough with
the examples we used to build it, and so it’s not able to generalise either.
For example, if we tried to train a straight line model to fit points on a
curved line, it wouldn’t be able to get very close. When this happens, using a
more complex model can help.
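
As a quick illustration using the helpers from earlier (including the
illustrative fit_straight_line sketch), a straight line fitted to made-up
curved data never reaches a small error, no matter how long we let the fitting
run.

# Made-up observations that follow a curve (output = input squared), 
# which no straight line can match closely. 
curved_observations = [(float(x), float(x) ** 2) for x in range(10)] 
 
line = fit_straight_line(curved_observations) 
error = mean_squared_error(curved_observations, line) 
# However many iterations we run, the error stays large: the model 
# is too simple for the shape of the data, so it under fits.

A straight line under fitting curved data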

Picking the right model

For this simple example, where we have one input value and one output value, we
can look at the data on a scatter plot and make an educated guess at how we
should model it. We can see that the points lie in a straight line, so we use a
straight line as our model.

Most real world situations are too complex to visualise in a simple diagram, so
it might not be clear which model to use.

You might have to try out several models, and see which performs best.

Step 5: Making predictions

Once we have a model that generalises well enough for our purposes, we can use
it to make predictions.
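
For our tennis ball, that’s just a function call on the model we built earlier;
the drop height here is an arbitrary example.

# Predict the first bounce height for a 2.5 metre drop, a height we 
# never measured directly. With an intercept of 0.0 and a gradient 
# of 0.75 this gives 1.875 metres. 
predicted_bounce = bounce_height_predictor(2.5)

Making a prediction with the fitted model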

I like that we use the word prediction for the output of our model. It reminds
me that the output is an educated guess, and not a 100% accurate decision. How
accurate the model needs to be depends on the problem we’re using it to solve—a
model predicting the right move to make in a game probably has more room for
error than a model predicting if a self-driving car should apply the brakes.

What’s next?

Now that we’ve seen the structure of a basic machine learning system, how do we
get from here to implementing our own machine learning systems?

Learning resources

If you want to develop your own machine learning systems, there are three
resources I’d recommend to get started:

Other brands are available

This post has only covered supervised learning, which refers to
algorithms that learn from examples where we have both the input and the desired
output. This is often referred to as labelled data, because the
input values are labelled with the expected output. While this is a popular and
powerful technique, there are others that work differently.

Other techniques you may want to explore include:

  • Unsupervised learning, which refers to algorithms that learn from
    examples where we know the inputs but not the outputs. This kind of algorithm
    is useful for finding structure in data; for example, we can use unsupervised
    learning to find clusters of values.

  • Reinforcement learning, which refers to algorithms that learn from
    trial and error to maximise some reward. The latest versions of AlphaGo use
    reinforcement learning to learn how to play games without needing any
    examples of how humans play.
