0:00

The topic of learning is a very rich and fascinating one in game theory. We will only be able to get a taste of it here. We'll focus on two examples of learning in games: one method called fictitious play, and the other called no-regret learning, in particular an algorithm called regret matching. But the topic really is vast, and so let me say a few words before we actually look at the specific methods. First of all, we should recognize that

learning in game theory is fairly different from learning in certain other disciplines, for example as it's done in machine learning in AI and computer science, or in statistics and optimization. In those disciplines one usually has in mind a single actor acting in an environment. The environment is unknown to the agent and may be only partially observable, so it can be very difficult to figure out what an optimal strategy is; but there is a well-defined notion of optimal strategy, and the goal of learning is to learn something about the environment and how to act best in it. In the case of game theory, the environment consists entirely of other agents. So even as you're trying to learn and adapt, so are they. What ends up happening is that you can't really divorce the notion of learning from the notion of teaching, because as you adapt, you're influencing the behavior of the other agents. Informally, imagine a situation with two agents who

repeatedly drive towards each other, playing the game of chicken, perhaps like drag racing adolescents. What each of them wants, of course, is to zoom straight ahead and have the other swerve to the side of the road and yield the right of way. Of course, if they both do that, they collide, and that's a bad outcome. And so they test each other, and over time dare more or less. Now imagine that one driver, driver 1, is an extremely good modeler of the other driver, driver 2. Driver 1 learns very well: whatever driver 2's strategy is, driver 1 will figure it out over time and best respond to it. It would seem that you can't do any better than that. Well, imagine now that driver 2 is a bully. He doesn't model the first driver well at all, but just barrels ahead without regard for the circumstances, perhaps willing to take a few hits here and there to scare off the first driver. What's going to happen is this: the second driver, a terrible learner and a very bad modeler of the first driver, is going to keep going straight ahead, and the first driver, the wonderful learner and best responder, is going to learn to accommodate him. So the second driver was perhaps a bad learner, but a very good teacher. So, that's one

thing to keep in mind when you think about learning in games. The other is that learning is an overloaded term, and there are many things you might learn in the context of games. We'll be looking at a very particular slice: specifically, repeated games. When we speak about learning in a repeated game, we really mean strategies that, as play unfolds, draw interesting inferences or use accumulated experience in an interesting way. That is the notion of learning that we, and in fact most of the literature in game theory, consider. So with that in mind, here are two

examples. The first is perhaps the granddaddy of learning regimes in game theory, called fictitious play. In fact, it was not initially conceived, nor is it viewed today, as a realistic or effective learning method, but it does contain many of the elements that you see in more involved versions of learning. It was first presented as a heuristic for computing a Nash equilibrium of a game. It turns out not to be a very effective procedure for that, but it is an interesting, basic learning procedure. The way it works is

5:12

simple: each agent starts with some belief about the strategy of the other agent; each agent best responds to that belief; each agent then updates the belief based on what it observed in this iteration of the game; and the game repeats. As I said, this is a very general regime, and in fact it is the general regime of model-based learning: we maintain a model of the other agent, best respond to it, and update it over time. Fictitious play is the special case where the model is simply a count of the other agent's actions so far, and you take those counts, normalized, as the opponent's current mixed strategy. A little more formally, let w(a) be the number of times the opponent played action a in the past, with some nonzero initial values, and assume the opponent plays a with probability proportional to w(a). That's a very straightforward, simple procedure. There is something a little paradoxical going on here, because each agent (let's restrict attention to two agents) always plays a pure strategy, and yet each models the other as always playing a mixed strategy. Be that as it may, note that you do need to worry about edge cases such as tie-breaking: what happens if two actions were played an equal number of times in the past? Here's an example of how it

might work, in the context of matching pennies. In matching pennies, again, two players each choose heads or tails. If they both choose the same, the first player wins; if they choose differently, the second player wins. Let's assume these are the initial frequencies the players have in mind. Player 1's belief about player 2 is that 2 plays heads with a weight of 1.5 and tails with a weight of 2, and these are player 2's corresponding beliefs about player 1. Now it's round 1; what should they do? Player 1 wants to best respond to his beliefs. He believes this is the mixed strategy of player 2, and he wants to match, so he's going to play tails: that is the best response to the mixed strategy he ascribes to player 2. What about player 2? Player 2 has his own beliefs, and he wants to mismatch. Since he believes that player 1 will play heads with greater probability than tails, he too is going to play tails. And this stage is over. Let's move to the next stage. At this point, the players update their beliefs. Player 1 observed player 2 playing tails, so he increases the tails count from 2 to 3, and player 2 likewise increases his tails count for player 1. So what do they do now? Player 1 still wants to match player 2, and he believes, in fact even more strongly, that player 2 will play tails with greater probability, so he plays tails again. Player 2, on the other hand, now believes that player 1 will play tails with the greater probability. Player 2 wants to mismatch, and so he will now play heads. If you continue this calculation, you can persuade yourself that the play will proceed in this fashion. And this is how fictitious play takes place.
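The belief-updating loop just described can be sketched in a few lines of Python. The initial counts for player 1's beliefs (1.5 for heads, 2 for tails) come from the example; player 2's initial counts for player 1 are not fully stated in the transcript, so (2, 1.5) is an assumption chosen to reproduce the play described, and ties are arbitrarily broken in favor of heads.

```python
# Fictitious play in matching pennies: each player keeps counts of the
# opponent's past actions and best-responds to the implied mixed strategy.

def best_response(counts, matcher):
    """Best-respond to the empirical mixed strategy given by `counts`.

    A matcher copies the opponent's likelier action; a mismatcher plays
    the opposite. Ties go to heads ('H'), an arbitrary choice, since the
    lecture notes that tie-breaking must be specified."""
    likelier = 'H' if counts['H'] >= counts['T'] else 'T'
    if matcher:
        return likelier
    return 'T' if likelier == 'H' else 'H'

def fictitious_play(rounds, beliefs1, beliefs2):
    history = []
    for _ in range(rounds):
        a1 = best_response(beliefs1, matcher=True)   # player 1 matches
        a2 = best_response(beliefs2, matcher=False)  # player 2 mismatches
        history.append((a1, a2))
        beliefs1[a2] += 1  # player 1 updates his count of player 2's play
        beliefs2[a1] += 1  # and vice versa
    return history

play = fictitious_play(6, {'H': 1.5, 'T': 2.0}, {'H': 2.0, 'T': 1.5})
print(play)  # starts ('T', 'T'), then ('T', 'H'), as in the lecture
```

Running this reproduces the rounds walked through above: both players open with tails, then player 2 switches to heads in round 2.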

So, notice something interesting: the strategies don't converge. If you were to continue to play this out, you would see the T's and H's on both sides ebb and flow, but with a certain balance emerging over time. In fact, in this game the long-term averages converge: each of the agents plays tails and heads with equal frequency, 0.5. We call this the empirical frequency. Now notice that in matching pennies, (0.5, 0.5) is also the unique Nash equilibrium.
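This long-run balance can be checked numerically. The sketch below runs the same process for many rounds (again taking the example's initial counts, with player 2's initial counts an assumption) and measures how often each player chose heads:

```python
# Run fictitious play on matching pennies for many rounds and measure the
# empirical frequency of heads for each player. The play itself keeps
# oscillating in ever-longer runs, but the empirical frequencies approach
# the unique Nash equilibrium (0.5, 0.5).

def simulate(rounds):
    beliefs1 = {'H': 1.5, 'T': 2.0}  # player 1's counts of player 2's actions
    beliefs2 = {'H': 2.0, 'T': 1.5}  # player 2's counts of player 1 (assumed)
    heads1 = heads2 = 0
    for _ in range(rounds):
        # Player 1 matches the opponent's likelier action; player 2 mismatches.
        a1 = 'H' if beliefs1['H'] >= beliefs1['T'] else 'T'
        likelier = 'H' if beliefs2['H'] >= beliefs2['T'] else 'T'
        a2 = 'T' if likelier == 'H' else 'H'
        heads1 += a1 == 'H'
        heads2 += a2 == 'H'
        beliefs1[a2] += 1
        beliefs2[a1] += 1
    return heads1 / rounds, heads2 / rounds

f1, f2 = simulate(20000)
print(f1, f2)  # both close to 0.5
```

The runs of repeated actions grow longer over time, but their share of the total history shrinks, which is why the averages settle at one half.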

And the question is: is this an accident? The answer is no. Here's the theorem. The theorem says that if the empirical frequencies of the players' play converge in fictitious play, then they converge to a Nash equilibrium of the game. Now, they may not converge in general; that's why fictitious play is not an effective learning procedure in general. But there are a host of conditions under which, even if the play doesn't converge, the empirical frequencies do. Here are some sufficient conditions: if the game is zero-sum; if it is solvable by iterated elimination of strictly dominated strategies; if it is what's called a potential game, which we won't define here; or if it is a 2-by-n game, where one player has only two strategies and the other may have more, with what are called generic payoffs, which we also won't define here. The main thing to take away is that there are some sufficient conditions guaranteeing that the empirical frequencies of play in fictitious play will converge, even if the play itself does not. Let us now switch to a very different form of learning, a whole class of learning rules called no-regret learning, which differs in a fundamental way. First of all, the methods themselves will not be model-based: they will not explicitly model what the other agent is doing, but rather will adapt the strategy directly. That's one difference. But perhaps more fundamentally, in this case we don't start with a learning method; we start with a criterion for what we want the method to satisfy, namely the no-regret criterion. And so, what does this say?

We'll say that the regret of an agent at time t for not having played some strategy s is the difference between the payoff he would have gotten had he played s and the payoff he actually got at that time. That's a natural enough notion. We now say that a learning rule exhibits no regret if, in the limit, the agent's average regret with respect to any fixed strategy is guaranteed to be no greater than zero; in other words, as time goes on, the agent does at least as well as he would have done by playing any single strategy throughout. One learning rule that satisfies this criterion is

13:53

surprisingly simple, and it's called regret matching. The way it works is as follows: simply look at the regret you have experienced so far for each of your pure strategies, and pick each pure strategy with probability proportional to its regret. That is, if the cumulative regret for strategy s at time t is Rt(s), then the probability of playing s at the next time step is Rt(s) divided by the sum of the regrets over all pure strategies. A very simple rule, and it has surprisingly strong properties. First of all, it provably exhibits no regret. Furthermore, when the players use regret matching, the empirical distribution of play converges to a correlated equilibrium, at least for

finite games that are repeated. So, those are two examples of learning rules: one model-based, namely fictitious play, and one model-free, regret matching, an instance of the family of learning methods that exhibit no regret. As we said at the beginning, the topic of learning in games is a very rich one, but at least we now have a taste of it.
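As a sketch of regret matching in the matching-pennies setting: the standard formulation (due to Hart and Mas-Colell) forms the probabilities from the positive part of the cumulative regret and plays uniformly at random when no regret is positive; that refinement, and the +1/-1 payoffs, are assumptions made here for illustration.

```python
import random

# Regret matching: each player tracks, for every pure strategy s, the
# cumulative regret for not having played s, and chooses the next action
# with probability proportional to the positive part of that regret.

ACTIONS = ['H', 'T']

def payoff(a1, a2):
    # Matching pennies: player 1 wins (+1) on a match, player 2 on a mismatch.
    return (1, -1) if a1 == a2 else (-1, 1)

def choose(regret):
    # Probability proportional to positive regret; uniform if none is positive.
    pos = {a: max(r, 0.0) for a, r in regret.items()}
    total = sum(pos.values())
    if total == 0:
        return random.choice(ACTIONS)
    x = random.uniform(0, total)
    for a, p in pos.items():
        x -= p
        if x <= 0:
            return a
    return ACTIONS[-1]

def run(rounds, seed=0):
    random.seed(seed)
    regret1 = {a: 0.0 for a in ACTIONS}
    regret2 = {a: 0.0 for a in ACTIONS}
    for _ in range(rounds):
        c1, c2 = choose(regret1), choose(regret2)
        u1, u2 = payoff(c1, c2)
        for s in ACTIONS:
            # Regret for s: what s would have earned minus what was earned.
            regret1[s] += payoff(s, c2)[0] - u1
            regret2[s] += payoff(c1, s)[1] - u2
    # Average regret per round should vanish as the number of rounds grows.
    return max(regret1.values()) / rounds, max(regret2.values()) / rounds

a1, a2 = run(20000)
print(a1, a2)  # both small, illustrating the no-regret property
```

After many rounds, each player's worst-case average regret is close to zero, which is exactly the no-regret criterion described above.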