0:00

The topic of learning is a very rich and fascinating one in game theory; we will only be able to get a taste of it here. We'll be focusing on two examples of learning in games: one method called fictitious play, and another called no-regret learning, in particular an algorithm called regret matching. But the topic really is vast, so let me say a few words before we actually look at the specific methods.

First of all, we should recognize that learning in game theory is fairly different from learning in certain other disciplines, for example as it's done in machine learning in AI, in computer science, or in statistics and optimization. In those disciplines one usually has in mind a single actor acting in an environment. The environment is unknown to the agent, and it may be only partially observable, so it can be very difficult to figure out what an optimal strategy is; but there is a well-defined notion of optimal strategy, and the goal of learning is to learn something about the environment and how best to act in it. In game theory, the environment consists entirely of other agents. So even as you're trying to learn and adapt, so are they. What ends up happening is that you can't really divorce the notion of learning from the notion of teaching, because as you adapt, you're influencing the activities of the other agents.

Informally, imagine a situation where two agents repeatedly drive towards each other, playing the game of chicken, maybe drag racing by adolescents. What each of them wants, of course, is to zoom straight ahead and have the other swerve to the side of the road and give them the right of way. Of course, if they both do that, they collide, and that's a bad idea. And so they test each other, and over time dare more or less. Now imagine that driver 1 is an extremely good modeler of the other driver, driver 2, and learns very well: whatever driver 2's strategy is, driver 1 will learn it and best respond to it. Over time he will figure it out and be a great responder. It would seem you can't do any better than that. But imagine now that driver 2 is a bully, who doesn't really model the first driver very well but just barrels ahead, not caring about the circumstances, perhaps willing to take a few hits here and there to scare off the first driver. What's going to happen is that the second driver, who is a terrible learner and a very bad modeler of the first driver, is going to keep going straight ahead, and the first driver, the wonderful learner and best responder, is going to learn to accommodate. The second driver was perhaps a bad learner, but he was a very good teacher. That's one thing to keep in mind when you think about learning in games.

The other is that learning is an overloaded term, and there are many things you might learn in the context of games; we'll be looking at a very particular slice. Specifically, the context will be repeated games, and when we speak about learning in a repeated game, we'll really be speaking about strategies that, as they unfold, draw interesting inferences or use accumulated experience in an interesting way. That is the notion of learning that we, and in fact most of the literature in game theory, consider.

With that in mind, here are two examples. The first, and this is perhaps the granddaddy of learning regimes in game theory, is called fictitious play. In fact, it was not conceived initially, nor is it viewed today, as a realistic or effective learning method, but it does contain many of the elements that you see in more involved versions of learning. It was first presented as a heuristic for computing the Nash equilibrium in games. It turns out not to be a very effective procedure for that, but it is an interesting, basic learning procedure. The way it works is

Â 5:12

simply this: each agent starts with some belief about the strategy of the other agent and best responds to it. The agent then updates their beliefs based on what they observed in this iteration of the game, and the game repeats. As I said, this is a very general regime; in fact it is the general regime of model-based learning: we have a model of the other agent, which we best respond to and update over time. Fictitious play is the special case where the model is simply a count of the other agent's actions so far, and you take those counts as their current mixed strategy. A little more formally, let's assume that w(a) is the number of times the opponent played action a in the past, with some non-zero initial values. You then assess the opponent as playing each action a with probability proportional to w(a), and best respond to that mixed strategy. It's a very straightforward, simple procedure. There is something a little paradoxical going on, because every agent (let's speak of two agents here) is always playing a pure strategy, and yet each is modeling the other as always playing a mixed strategy. Be that as it may, we should note that you need to worry about edge cases such as tie-breaking: what happens if two actions have been played an equal number of times in the past?

Here's an example of how it might work, in the context of matching pennies. In matching pennies, again, two players each choose heads or tails; if they both choose the same, the first player wins, and if they choose differently, the second player wins. Let's assume these are the initial frequencies the players have in mind. Player 1's belief about player 2 is that player 2 plays heads with a frequency of 1.5 and tails with a frequency of 2, and these are player 2's beliefs about player 1. Now it's round 1; what should they do? Player 1 wants to best respond to his beliefs. He believes that this is player 2's distribution, and he wants to match, so he's going to play tails; that is the best response to the mixed strategy he ascribes to player 2. What about player 2? Player 2 has these beliefs, and he wants to mismatch. Since he believes that player 1 will play heads with greater probability than tails, he too is going to play tails. And this stage is over.

Let's move now to the next stage. At this point, these are the updated beliefs of the players. Player 1 observed player 2 playing tails, so he increases the 2 to a 3, and player 2 likewise updates his beliefs about what player 1 will do. So what do they do? Player 1 still wants to match player 2, and he still believes, in fact even more strongly, that player 2 will play tails with greater probability, so he's going to play tails again. On the other hand, player 2 now believes that these are the probabilities, that is, that player 1 will play tails with greater probability. Player 2 wants to mismatch, and so player 2 will now play heads. If you continue this calculation, you can persuade yourself that the play will proceed in this fashion. And this is how fictitious play takes place.

Notice something interesting: the strategies don't converge, and if you were to continue to play this out, you would see that the T's and H's on both sides ebb and flow. But there is a certain balance taking place over time, and in fact in this game the play converges on average: if you look at the long-term average of the strategies, each of the agents plays tails and heads with equal probability, .5. We call this the empirical frequency. Now notice that in matching pennies, (.5, .5) is also the unique Nash equilibrium.
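The loop just described can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the match/mismatch logic encodes the matching pennies payoffs, the initial counts follow the numbers in the example (player 2's initial counts of 2 heads and 1.5 tails are inferred from the transcript's description), and the tie-breaking rule is my own choice.

```python
# Fictitious play in matching pennies: each player keeps counts of the
# opponent's past actions, treats the normalized counts as the opponent's
# mixed strategy, and best responds to it.
H, T = 0, 1

def best_response(counts, matcher):
    """Best respond to the mixed strategy implied by the opponent's counts.

    matcher=True is player 1 (wins on a match); False is player 2 (wins on
    a mismatch). Ties are broken toward heads, one of the edge cases the
    lecture says you need to worry about (none actually arise here).
    """
    p_heads = counts[H] / (counts[H] + counts[T])
    if matcher:
        return H if p_heads >= 0.5 else T
    return T if p_heads >= 0.5 else H

beliefs_about_2 = [1.5, 2.0]  # player 1's counts of player 2's H and T
beliefs_about_1 = [2.0, 1.5]  # player 2's counts of player 1's H and T

plays_1 = [0, 0]  # player 1's actual actions, to measure empirical frequency
history = []
for _ in range(10_000):
    a1 = best_response(beliefs_about_2, matcher=True)
    a2 = best_response(beliefs_about_1, matcher=False)
    history.append((a1, a2))
    beliefs_about_2[a2] += 1  # each player observes the other's action
    beliefs_about_1[a1] += 1
    plays_1[a1] += 1

freq_heads = plays_1[H] / sum(plays_1)
print(f"player 1's empirical frequency of heads: {freq_heads:.3f}")
```

Running this reproduces the first two rounds walked through above (both play tails, then player 1 plays tails while player 2 switches to heads), and the empirical frequency of heads settles near .5 even though the round-by-round play keeps cycling.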

And the question is: is this an accident? The answer is no, and here is the theorem. The theorem says that if the empirical frequencies of the players' play converge in fictitious play, then they have to converge to a Nash equilibrium of the game. Now, they may not converge in general; that's why fictitious play is not an effective learning procedure in general. But there are a host of conditions under which, even if the play doesn't converge, the empirical frequency does. Here are some conditions that are sufficient: if the game is zero-sum; if it is solvable by iterated elimination of strictly dominated strategies; if it is something called a potential game, which we won't define here; or if it is a 2-by-n game, that is, one player has only two strategies while the other may have more, with what are called generic payoffs, which we also won't define here. The main thing to take away is that there are some sufficient conditions guaranteeing that the empirical frequency of play in fictitious play will converge, even if the play itself will not.

Let us now switch to a very different form of learning, a whole class of learning rules that go under the name of no-regret learning. It differs in a fundamental way. First of all, the methods themselves will not be model-based: they will not explicitly model what the other agent is doing, but rather adapt the strategy directly. That's one difference. But perhaps more fundamentally, in this case we won't start with a learning method; we'll start with a criterion for what we want the method to satisfy, namely the no-regret criterion. So what does this say? We'll say that the regret of an agent at time t, for not having played some pure strategy s, is the difference between the payoff he would have gotten, had he played strategy s, and the payoff he actually got at the time.
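In symbols (the notation here is mine, with the sign convention that positive regret means s would have done better):

```latex
% Regret at time t for not having played pure strategy s:
%   \alpha^t    -- the payoff the agent actually received at time t
%   \alpha^t(s) -- the payoff he would have received had he played s
R^t(s) = \alpha^t(s) - \alpha^t
```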

That's a natural enough notion. We will now say that a learning rule exhibits no regret if, in the limit, the agent is guaranteed to experience no positive regret against any fixed pure strategy; in other words, as time goes on, the average regret for not having played any given strategy s becomes non-positive. It turns out that there is a learning rule which satisfies this criterion and which is,

Â 13:53

surprisingly, simple: it's called regret matching. The way it works is as follows. Simply look at the regret you have experienced so far for each of your pure strategies, and pick a pure strategy in proportion to its regret. Again, if the cumulative regret for strategy s at time t is R^t(s), then the probability that you play s at the next time step is R^t(s) divided by the sum of the regrets across all pure strategies: you take each strategy's regret relative to the sum of all the regrets. So, a very simple rule, and it has surprisingly strong properties. First of all, it provably exhibits no regret. Furthermore, when you use regret matching, the empirical distribution of play actually converges to a correlated equilibrium, at least for finite games that are repeated.

So, those are two examples of learning rules: one model-based, namely fictitious play, and one model-free, regret matching, which is one of a family of learning methods that exhibit no regret. As we said in the beginning, the topic of learning in games is a very rich one, but at least we have a taste of it now.
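To make the regret-matching rule concrete, here is a minimal Python sketch for repeated matching pennies. This is an illustration under my own assumptions, not code from the lecture: as is standard, I use only the positive part of the cumulative regret when forming the mixed strategy (mixing uniformly when no regret is positive, so the probabilities are always well defined), and the payoff encoding, round count, and random seed are my choices.

```python
import random

# Regret matching in repeated matching pennies. Each player accumulates,
# for every pure strategy s, the regret for not having played s, and then
# plays each strategy with probability proportional to its positive
# cumulative regret.
H, T = 0, 1

def payoff(player, a1, a2):
    """Matching pennies: player 1 wins on a match, player 2 on a mismatch."""
    win = (a1 == a2) if player == 1 else (a1 != a2)
    return 1 if win else -1

def regret_matching_strategy(cum_regret):
    """Mix in proportion to positive cumulative regret; uniform if none."""
    pos = [max(r, 0.0) for r in cum_regret]
    total = sum(pos)
    if total <= 0:
        return [0.5, 0.5]
    return [p / total for p in pos]

def sample(dist):
    return H if random.random() < dist[H] else T

random.seed(0)
regret_1 = [0.0, 0.0]  # player 1's cumulative regrets for H and T
regret_2 = [0.0, 0.0]
plays_1 = [0, 0]
for _ in range(20_000):
    a1 = sample(regret_matching_strategy(regret_1))
    a2 = sample(regret_matching_strategy(regret_2))
    for s in (H, T):  # regret: counterfactual payoff minus realized payoff
        regret_1[s] += payoff(1, s, a2) - payoff(1, a1, a2)
        regret_2[s] += payoff(2, a1, s) - payoff(2, a1, a2)
    plays_1[a1] += 1

print(f"player 1's empirical frequency of heads: {plays_1[H] / sum(plays_1):.3f}")
```

Consistent with the convergence property stated above, the empirical frequency of play ends up near the unique equilibrium frequency of .5 for each action, even though each round's choice is random.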
