0:57
First, we've got kind of a challenging problem here if we want to get
very high utilization in the WAN, because there are natural fluctuations over time.
So, we're going to leverage diversity in the services.
Some services need
a certain amount of bandwidth at a certain moment in time, and they're inflexible.
And some other services you can use to kind of
fill in whatever room is left over.
>> For example, you might have latency-sensitive queries.
If I make a Google search and
some datacenter does not have the information to fulfill that query,
it might make a backend query to some other facility.
And that's a latency-sensitive query.
>> Yes, and you may have backup traffic that
you want to move on a timescale of hours to days.
>> Right.
>> Another theme is that we're going to use a software-defined networking approach,
where we gather information about the state of the network,
make a centralized decision about the flow of traffic,
and then push those decisions down to lower levels to actually implement them.
But bringing all that information together in one place, you've
got a relatively complex decision to make.
And you do that traditionally with an optimization technique like
linear programming, which is a way to take a set of constraints on required amounts
of data flow over parts of a network and come up with an optimal solution.
And of course, you can use linear programming for
many other optimization applications.
2:42
But if you just take that big generic hammer and
apply it to a situation where we need to make relatively quick decisions,
it turns out that it's rather slow.
>> Right, and part of the complexity comes from the multitude of services,
the different priorities.
If you had just one service, you could run a network flow algorithm, and
that would be much faster.
But that's not the case.
>> So we'll have to do something to get us
something faster even if it's perhaps not guaranteed to be exactly optimal.
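To make that concrete, here is a small sketch of the kind of linear program being described: bandwidth is allocated to flows over candidate paths, subject to link capacities, maximizing priority-weighted throughput. This is only an illustration with made-up links, demands, and priorities, solved with scipy's generic LP solver; it is not Google's actual traffic engineering algorithm.

```python
from scipy.optimize import linprog

capacity = {("A", "B"): 10.0, ("B", "C"): 10.0, ("A", "C"): 5.0}  # link capacities, Gbps

# Each flow: priority weight, demand, and candidate paths (as lists of links).
flows = {
    "interactive": (10.0, 8.0,  [[("A", "C")], [("A", "B"), ("B", "C")]]),
    "bulk_copy":   (1.0,  10.0, [[("A", "C")], [("A", "B"), ("B", "C")]]),
}

# One variable per (flow, path): how much bandwidth that path carries.
variables = [(f, i) for f, (_, _, paths) in flows.items() for i in range(len(paths))]

# Objective: maximize sum(priority * allocation) -> minimize the negative.
c = [-flows[f][0] for f, _ in variables]

A_ub, b_ub = [], []
# Capacity constraint for every link.
for link, cap in capacity.items():
    A_ub.append([1.0 if link in flows[f][2][i] else 0.0 for f, i in variables])
    b_ub.append(cap)
# Demand constraint for every flow (never allocate more than it asks for).
for f, (_, demand, _) in flows.items():
    A_ub.append([1.0 if vf == f else 0.0 for vf, _ in variables])
    b_ub.append(demand)

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(variables))
for (f, i), gbps in zip(variables, result.x):
    print(f"{f} via {flows[f][2][i]}: {gbps:.1f} Gbps")
```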
Fourth, we're going to have dynamic reallocation of bandwidth.
So, as the demands on the network change over time,
we're going to have to make continual decisions about what traffic is highest
priority to move across which links at a given moment.
Okay? So,
that's part of the challenge with linear programming: we have to make quick
decisions.
>> So these are online algorithms.
But they're not online in the same way as things inside the data center might be.
For example, Google runs its traffic engineering 500 or so times a day.
So, it's not as fine-grained as things you might need inside a data center.
Traffic between these facilities is relatively stable, it seems.
>> And finally, a commonality in the architecture here is that we're
going to have to implement enforcement of the flow rates
somehow when the traffic enters the network.
And we'll do that at the edge, rather than at every hop along the path.
>> Correct.
7:12
And this is done based on the priority the service has and the bandwidth that it gets
assigned.
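As a toy illustration of the assignment side of that (our own simplification, not the actual algorithm), you can think of a link's bandwidth being handed out in priority order: inflexible high-priority services are satisfied first, and elastic low-priority traffic fills whatever room is left over.

```python
# Toy priority-ordered bandwidth allocation; service names and numbers are invented.
def allocate(capacity_gbps, requests):
    """requests: list of (service_name, priority, demand_gbps); higher priority wins."""
    allocations, remaining = {}, capacity_gbps
    for name, _prio, demand in sorted(requests, key=lambda r: -r[1]):
        granted = min(demand, remaining)   # give the service what it asked for, if possible
        allocations[name] = granted
        remaining -= granted
    return allocations

print(allocate(100.0, [("user_queries", 3, 40.0),
                       ("remote_storage", 2, 50.0),
                       ("bulk_backup", 1, 80.0)]))
# bulk_backup only gets the 10 Gbps left over after the higher-priority services.
```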
Now that we've looked at an overview of the architecture,
let's look at some of the design choices that make this work.
For one, there's the failsafe choice: the design has a fallback option.
They also keep the BGP forwarding state available.
Because each of their switches allows them to have multiple forwarding
tables at the same time, they can afford to keep the BGP forwarding tables,
in addition to the ones provided by traffic engineering, as a backup.
Now, if the traffic engineering scheme does not work,
they can discard those routing tables and use the BGP forwarding state instead.
So the BGP routing tables serve as a big red switch.
>> So if you look at this big red switch feature of the design,
it's pretty interesting.
Google went to a great deal of effort in many aspects of the design.
There are many layers of reliability in this rather nontrivial
distributed system.
And one of them is the ability to switch completely from a very new
kind of traffic routing to the much more traditional, tried-and-true,
hardened approach of BGP routing.
So that big red switch was a way of getting a greater
degree of confidence that you can actually run the system.
And if some unexpected problem came up in the new design, you just switch back to
the old one.
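Just to sketch the idea in code (the class and table names here are hypothetical, not B4's actual implementation), the switch can hold both forwarding tables at once, and the big red switch is simply a flag that selects which one is consulted.

```python
# Hypothetical sketch of a switch holding a TE table and a BGP fallback table.
class Switch:
    def __init__(self, te_table, bgp_table):
        self.tables = {"te": te_table, "bgp": bgp_table}
        self.active = "te"            # normally forward using traffic engineering

    def big_red_switch(self):
        """Fall back to plain BGP forwarding if the TE system misbehaves."""
        self.active = "bgp"

    def next_hop(self, destination_prefix):
        return self.tables[self.active].get(destination_prefix)

sw = Switch(te_table={"10.1.0.0/16": "tunnel-7"},
            bgp_table={"10.1.0.0/16": "port-3"})
print(sw.next_hop("10.1.0.0/16"))     # tunnel-7, via TE
sw.big_red_switch()
print(sw.next_hop("10.1.0.0/16"))     # port-3, via the BGP fallback
```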
>> We should note that this fallback was in the context of a very early
stage of the architecture.
At that point, it made sense not to rely on this new combination of several different pieces.
Among these new pieces was custom hardware.
So Google made their own custom cheap and simple switches to operate this network.
Part of the rationale for this was the timeline.
In 2011, there were not many software defined networking switches on the market.
So Google built their own using cheap commodity equipment.
This should be a callback to our earlier lesson on topology,
where we looked at Clos networks.
Here, what you're looking at is a two-stage Clos topology,
built from several leaf switches.
With this approach, they built 128-port, ten-gigabit switches in 2011.
There are several interesting things about this hardware architecture.
For one, they're forgoing the expensive,
proprietary equipment that WAN operators traditionally use.
But with that, the big forwarding tables and deep buffers go away too.
So, they have to use edge-based rate limiting,
because they no longer have deep buffers.
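Edge-based rate limiting itself is a standard technique; here's a generic token-bucket sketch, with invented parameters, just to illustrate what policing a flow to its assigned rate at the edge looks like. This is not Google's implementation.

```python
import time

# Generic token-bucket rate limiter (illustrative only).
class TokenBucket:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True               # within the assigned rate: forward
        return False                  # over the rate: drop or mark at the edge

limiter = TokenBucket(rate_bytes_per_s=1_250_000, burst_bytes=15_000)  # ~10 Mbit/s
print(limiter.allow(1500))
```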
Also, given that the number of locations in this architecture is small,
with twelve data centers being connected,
they don't necessarily need the big forwarding tables.
So they can make do with small forwarding tables.
Also, keeping these switches simple makes them highly available.
These switches also run an embedded processor running Linux,
which is what interfaces with the OpenFlow controller.
Most of the smarts then reside in the software.
The switches themselves are simple; the intelligence is in the software, which is
the OpenFlow control logic at each site, replicated for fault tolerance using Paxos.
Further, for scalability of this system, they use a hierarchy of controllers.
The software solution achieves nearly 100% utilization and
solves the traffic engineering problem in 0.3 seconds.
This is achievable only because of the several tricks they use for
scaling the system up.
The hierarchy of controllers is one of them.
This is what this looks like.
So, at the top level we have the global controller,
which is talking to an SDN gateway.
The gateway can be thought of as a super controller that talks to
the controllers at all these different data center sites.
Each site might itself have multiple controllers,
because of the scale of these networks.
This hierarchy of controllers simplifies things from the global perspective.
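A purely structural sketch of that hierarchy might look like the following; the class names are our own shorthand for the roles just described (global controller, SDN gateway, per-site controllers), not the real component names or interfaces.

```python
# Illustrative controller hierarchy: global controller -> gateway -> site controllers.
class SiteController:
    def __init__(self, site, switches):
        self.site, self.switches = site, switches
    def install(self, rules):
        for sw in self.switches:
            print(f"{self.site}/{sw}: installing {rules}")

class SdnGateway:
    """Presents every site to the global controller through one interface."""
    def __init__(self, site_controllers):
        self.sites = {c.site: c for c in site_controllers}
    def push(self, site, rules):
        self.sites[site].install(rules)

class GlobalController:
    def __init__(self, gateway):
        self.gateway = gateway
    def run_te(self, decisions):
        # decisions: {site: rules}, computed centrally and pushed down the hierarchy.
        for site, rules in decisions.items():
            self.gateway.push(site, rules)

gw = SdnGateway([SiteController("dc1", ["sw1", "sw2"]),
                 SiteController("dc2", ["sw1"])])
GlobalController(gw).run_te({"dc1": {"fg17": "tunnel-3"}, "dc2": {"fg17": "tunnel-9"}})
```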
Other scaling tricks in use are the aggregation of both flows and links.
As we discussed earlier, traffic engineering at this global scale is not
done at the level of individual flows but of flow groups.
That also helps scaling.
Further, each pair of sites is not connected by just one link.
These are massive capacity links that are formed from a trunk of several parallel
high capacity links.
All of these are aggregated and
exposed to the traffic engineering layer as one logical link.
It is up to the individual site to partition, multiplex, and
demultiplex traffic across the set of links.
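To illustrate the aggregation with made-up data and simplified keys: flows can be bucketed into flow groups by source site, destination site, and QoS class, and parallel physical links between two sites can be summed into one logical trunk that the traffic engineering layer sees.

```python
from collections import defaultdict

def flow_groups(flows):
    """flows: list of dicts with src_site, dst_site, qos, demand_gbps."""
    groups = defaultdict(float)
    for f in flows:
        groups[(f["src_site"], f["dst_site"], f["qos"])] += f["demand_gbps"]
    return dict(groups)

def logical_links(physical_links):
    """physical_links: list of (site_a, site_b, capacity_gbps) parallel links."""
    trunks = defaultdict(float)
    for a, b, cap in physical_links:
        trunks[(a, b)] += cap          # expose one trunk per site pair to the TE layer
    return dict(trunks)

print(flow_groups([{"src_site": "dc1", "dst_site": "dc2", "qos": "bulk", "demand_gbps": 3},
                   {"src_site": "dc1", "dst_site": "dc2", "qos": "bulk", "demand_gbps": 5}]))
print(logical_links([("dc1", "dc2", 40), ("dc1", "dc2", 40), ("dc1", "dc2", 40)]))
```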
>> Now, Microsoft has also publicly disclosed a design for
optimizing wide-area traffic flow in their WAN.
And although there are differences in goals and
design in many aspects, there are also similarities,
some of which we mentioned, as well as distinct features.
One particularly interesting feature we'd like to mention in this
design is the way it makes changes to the traffic flow without causing congestion.
So if you look at this example here,
let's say we're in the network state on the left.
And we have these two flows, the blue one and
the green one, that are each using, in this simple example,
100% of the bandwidth of the links that they flow across.
Suppose we wanted to move from the scenario on the left to the scenario on the right.
If you do that in any naive way, because you can't control
exactly when every packet will flow across every link,
there's going to be some period of time where, due to timing uncertainty,
you have both of the flows sharing one of the links to some extent.
And you can think of different ways to do this.
But you're always going to run into some link that may get
used by both flows at the same time.
So, if you take a look at this design for SWAN, there's an approach to
making those updates while holding back a certain amount of spare capacity,
so that you can avoid congestion.
So that's an interesting approach that takes the optimization one step further,
in that you're providing a pretty strong guarantee on lack
of congestion even while the network's data flows are changing.
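Here's a hedged sketch of that idea with invented numbers: before applying one step of a reconfiguration, check that every link could absorb the worst case in which a moving flow transiently appears on both its old and its new path, against capacities that deliberately hold some fraction in reserve. SWAN's actual algorithm computes a whole sequence of congestion-free steps; this only shows the per-step check.

```python
def step_is_congestion_free(capacity, old_paths, new_paths, rates, slack=0.1):
    """capacity: {link: gbps}; old/new_paths: {flow: [links]}; rates: {flow: gbps}."""
    # Only (1 - slack) of each link is usable; the rest is held as scratch capacity.
    usable = {link: cap * (1.0 - slack) for link, cap in capacity.items()}
    worst = {link: 0.0 for link in capacity}
    for flow, rate in rates.items():
        # During an asynchronous switchover, packets of this flow may appear on
        # its old links and its new links at the same time.
        for link in set(old_paths[flow]) | set(new_paths[flow]):
            worst[link] += rate
    return all(worst[link] <= usable[link] for link in capacity)

capacity = {"AB": 10, "BC": 10, "AC": 10}
old = {"blue": ["AC"], "green": ["AB", "BC"]}
new = {"blue": ["AB", "BC"], "green": ["AC"]}
print(step_is_congestion_free(capacity, old, new, {"blue": 9, "green": 9}))  # False: needs staging
print(step_is_congestion_free(capacity, old, new, {"blue": 4, "green": 4}))  # True
```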