0:52

Suppose you run a shipping service,

so, you know, users come and ask

you to help ship their package from

location A to location B and suppose

you run a website, where users

repeatedly come and they

tell you where they want

to send the package from, and

where they want to send it to

(so the origin and destination) and

your website offers to ship the package

for some asking price,

so I'll ship your package for $50,

I'll ship it for $20.

And based on the price

that you offer to the users,

the users sometimes chose to use a shipping service;

that's a positive example and

sometimes they go away and

they do not choose to

purchase your shipping service.

So let's say that we want

a learning algorithm to help us

to optimize what is the asking

price that we want to offer to our users.

And specifically, let's say we

come up with some sort of features

that capture properties of the users.

If we know anything about the demographics,

they capture, you know, the origin and

destination of the package, where they want to ship the package.

And what is the price

that we offer to them for shipping the package.

and what we want to do

is learn what is the

probability that they will

elect to ship the

package, using our

shipping service given these features, and

again just as a reminder these

features X also captures the price that we're asking for.

And so if we could

estimate the chance that they'll

agree to use our service

for any given price, then we

can try to pick

a price so that they

have a pretty high probability of

choosing our website while simultaneously

hopefully offering us a

fair return, offering us

a fair profit for shipping their package.

So if we can learn this property

of y equals 1 given

any price and given the other

features we could really

use this to choose appropriate

prices as new users come to us.

So in order to model

the probability of y equals 1,

what we can do is use

logistic regression or neural

network or some other algorithm like that.

But let's start with logistic regression.

2:57

Now if you have a

website that just runs continuously,

here's what an online learning algorithm would do.

I'm gonna write repeat forever.

This just means that our website

is going to, you know, keep on

staying up.

What happens on the website is

occasionally a user

will come and for

the user that comes we'll get

some x,y pair corresponding to

a customer or to a user on the website.

So the features x are, you

know, the origin and destination specified

by this user and the price

that we happened to offer to

them this time around, and

y is either one or

zero depending one whether or

not they chose to

use our shipping service.

Now once we get this {x,y}

pair, what an online

learning algorithm does is then

update the parameters theta

using just this example

x,y, and in particular

we would update my parameters theta

as Theta j get updated as Theta j

minus the learning rate alpha times

my usual gradient descent

rule for logistic regression.

So we do this for j

equals zero up to n,

and that's my close curly brace.

So, for other learning algorithms

instead of writing X-Y, right, I

was writing things like Xi,

Yi but

in this online learning setting

where actually discarding the notion

of there being a fixed training

set instead we have an algorithm.

Now what happens as we get

an example and then we

learn using that example like

so and then we throw that example away.

We discard that example and we

never use it again and

so that's why we just look at one example at a time.

We learn from that example.

We discard it.

Which is why, you know, we're

also doing away with this

notion of there being this

sort of fixed training set indexed by i.

And, if you really run

a major website where you

really have a continuous stream

of users coming, then this

sort of online learning algorithm

is actually a pretty reasonable algorithm.

Because of data is essentially

free if you have so

much data, that data

is essentially unlimited then there

is really may be no

need to look at a

training example more than once.

Of course if we had only

a small number of users then

rather than using an online learning

algorithm like this, you might

be better off saving away all

your data in a fixed training

set and then running some algorithm over that training set.

But if you really have a continuous

stream of data, then an

online learning algorithm can be very effective.

I should mention also that one

interesting effect of this sort

of online learning algorithm is

that it can adapt to changing user preferences.

5:51

And in particular, if over

time because of changes in

the economy maybe users

start to become more price

sensitive and willing to pay,

you know, less willing to pay high prices.

Or if they become less price sensitive and they're willing to pay higher prices.

Or if different things

become more important to users,

if you start to have new

types of users coming to your website.

This sort of online learning algorithm

can also adapt to changing

user preferences and kind

of keep track of what your

changing population of users

may be willing to pay for.

And it does that because if

your pool of users changes,

then these updates to your

parameters theta will just slowly adapt

your parameters to whatever your

latest pool of users looks like.

Here's another example of a

sort of application to which you might apply online learning.

this is an application in product

search in which we want to

apply learning algorithm to learn

to give good search listings to a user.

Let's say you run an online

store that sells phones - that

sells mobile phones or sells cell phones.

And you have a user interface

where a user can come to

your website and type in the

query like "Android phone 1080p camera".

So 1080p is a type

of a specification for a

video camera that you might

have on a phone, a cell phone, a mobile phone.

Suppose, suppose we have a hundred phones in our store.

And because of the way our

website is laid out, when

a user types in a query,

if it was a search query, we

would like to find a

choice of ten different phones to

show what to offer to the user.

What we'd like to do is have

a learning algorithm help us figure

out what are the ten phones

out of the 100 we

should return the user in response to

a user-search query like the one here.

Here's how we can go about the problem.

7:37

For each phone and given

a specific user query; we

can construct a feature vector

X. So the feature

vector X might capture different properties of the phone.

It might capture things like,

how similar the user search query is in the phones.

We capture things like how many

words in the user search

query match the name of

the phone, how many words

in the user search query match the description of the phone and so on.

So the features x capture

properties of the phone and

it captures things about how

similar or how well

the phone matches the user query along different dimensions.

What we like to do is

estimate the probability that a

user will click on the

link for a specific phone,

because we want to show

the user phones that they

are likely to want to

buy, want to show the user

phones that they have high

probability of clicking on in the web browser.

So I'm going to define y equals

one if the user clicks on

the link for a phone and

y equals zero otherwise and

what I would like to do is

learn the probability the user

will click on a specific

phone given, you know,

the features x, which capture properties

of the phone and how well the query matches the phone.

To give this problem a name

in the language of

people that run websites like

this, the problem of learning this is

actually called the problem of

learning the predicted click-through rate, the predicted CTR.

It just means learning the probability

that the user will click on

the specific link that you

offer them, so CTR is

an abbreviation for click through rate.

And if you can estimate the

predicted click-through rate for any

particular phone, what we

can do is use this to

show the user the ten phones

that are most likely to click on,

because out of the hundred phones,

we can compute this for

each of the 100 phones and

just select the 10 phones

that the user is most likely to click on,

and this will be a pretty reasonable

way to decide what ten results to show to the user.

Just to be clear, suppose that

every time a user does

a search, we return ten results

what that will do is it

will actually give us ten

x,y pairs, this actually

gives us ten training examples every

time a user comes to

our website because, because for

the ten phone that we chose

to show the user, for each

of those 10 phones we get

a feature vector X, and

for each of those 10 phones we

show the user we will also

get a value for y, we

will also observe the value

of y, depending on whether

or not we clicked on that

url or not and

so, one way to run a

website like this would be to

continuously show the user,

you know, your ten best guesses for

what other phones they might like

and so, each time a user

comes you would get ten

examples, ten x,y pairs,

and then use an online

learning algorithm to update the

parameters using essentially 10

steps of gradient descent on these

10 examples, and then

you can throw the data away, and

if you really have a continuous

stream of users coming to

your website, this would be

a pretty reasonable way to learn

parameters for your algorithm

so as to show the ten phones

to your users that may

be most promising and the most likely to click on.

So, this is a product search

problem or learning to rank

phones, learning to search for phones example.

So, I'll quickly mention a few others.

One is, if you have

a website and you're trying to

decide, you know, what special

offer to show the user,

this is very similar to phones,

or if you have a

website and you show different users different news articles.

So, if you're a news aggregator

website, then you can

again use a similar system to

select, to show to

the user, you know, what

are the news articles that they

are most likely to be interested

in and what are the news articles that they are most likely to click on.

Closely related to special offers, will we profit from recommendations.

And in fact, if you have

a collaborative filtering system, you

can even imagine a collaborative filtering

system giving you additional

features to feed into a

logistic regression classifier to try

to predict the click through

rate for different products that you might recommend to a user.

Of course, I should say that

any of these problems could also

have been formulated as a

standard machine learning problem, where you have a fixed training set.

Maybe, you can run your

website for a few days and

then save away a training set,

a fixed training set, and run

a learning algorithm on that.

But these are the actual

sorts of problems, where you do

see large companies get so

much data, that there's really

maybe no need to save away

a fixed training set, but instead

you can use an online learning algorithm to just learn continuously.

from the data that users are generating on your website.

12:05

So, that was the online

learning setting and as we

saw, the algorithm that we apply to

it is really very similar

to this schotastic gradient descent

algorithm, only instead of

scanning through a fixed

training set, we're instead getting

one example from a user,

learning from that example, then

discarding it and moving on.

And if you have a continuous

stream of data for some application,

this sort of algorithm may be

well worth considering for your application.

And of course, one advantage of

online learning is also that

if you have a changing pool

of users, or if the

things you're trying to predict are

slowly changing like your user

taste is slowly changing, the online

learning algorithm can slowly

adapt your learned hypothesis to

whatever the latest sets of

user behaviors are like as well.