Let's take a slightly closer look at how something
like a web search query might work.
So I make a search query,
it hits a server in a data center.
This server might query several other servers
which further might communicate with other servers, and so on.
These responses are then collated,
and the final search response is sent to me.
This kind of traffic pattern is referred to as
scatter-gather, or partition-aggregate.
So you're scattering the request,
and then you're gathering the results.
For one query, there might be a large number
of server to server interactions within the data center.
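To make this pattern concrete, here is a minimal sketch of partition-aggregate in Python using asyncio. The query_worker function and the fixed fan-out of 40 are hypothetical stand-ins for the RPCs a real search service would issue; this shows only the shape of the scatter and the gather, not any particular system's implementation.

```python
import asyncio
import random

# Hypothetical stand-in for an RPC to one worker server; a real service
# would send a network request here instead of sleeping.
async def query_worker(worker_id: int, query: str) -> str:
    await asyncio.sleep(random.uniform(0.001, 0.010))  # simulated work
    return f"partial result for '{query}' from worker {worker_id}"

async def scatter_gather(query: str, num_workers: int = 40) -> list[str]:
    # Scatter: send the query to every worker in parallel.
    tasks = [query_worker(w, query) for w in range(num_workers)]
    # Gather: wait for all the partial results, then aggregate them.
    partial_results = await asyncio.gather(*tasks)
    return sorted(partial_results)  # stand-in for ranking / merging

if __name__ == "__main__":
    results = asyncio.run(scatter_gather("data center networks"))
    print(f"aggregated {len(results)} partial results")
```

In a real deployment each worker may itself scatter the request further, which is how one user query turns into the multi-level workflow we look at next.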
If you think this picture is complicated,
take a look at this.
This is what Bing's query workflow
for producing the first page of results
from a search query looks like.
So the query comes in here at the top.
It's split into multiple requests,
which further might be split into sub-requests and so on.
So you can see the scatter pattern in play here.
A server makes several different requests
to fulfill the search query.
And then you can also see the gather pattern
closer to the end of this graph,
where the results
from multiple different subqueries
are gathered and processed together.
And in the end, a response is produced,
which is sent to the user.
The processing of a query can be a multi-stage process,
and each step can involve as many as
40 different sub-requests.
And this kind of scatter-gather pattern
is not exclusive to search.
Facebook, for example, also sees many
internal requests in processing a user request for a page.
Loading one of Facebook's popular pages causes an average
of 521 distinct requests on the backend.
For the same page, at the 91st percentile,
1,740 items need to be fetched.
So really, inside these web applications,
a single user request can trigger
a large number of internal requests.
Further, these data centers also run
big data processing tools, such as Hadoop, Spark,
and Dryad, as well as databases,
which all work to process information
and make it available to these web applications.
These data processing frameworks
can move massive amounts of data around.
So what does the actual measured traffic
in data centers look like
when they're running these applications?
The short answer, unfortunately,
even though it seems like a cop-out, is: it depends.
It depends on the applications you're running,
on the scale you're running them at,
and on the design of the network as well as the design
of the applications.
But nevertheless, let's take a look
at some of the published data.
One thing that's unambiguously true
is that traffic volume inside data centers is growing rapidly,
and it makes up the majority of the traffic these servers see.
The majority of the traffic is not
to and from the Internet but is, rather,
inside the data center.
Here what you're looking at is data from Google
showing the traffic generated by the servers
in their data centers over a period of time,
a bit more than six years, which is on the X-axis.
On the Y-axis is the aggregate traffic volume.
The absolute numbers are not available,
but over a six-year period
the data volume grows by 50 times.
Google has noted that this traffic is doubling every year.
Google's paper is not quite clear about
whether this is data center internal traffic only,
but Facebook has mentioned that machine-to-machine traffic
is several orders of magnitude larger
than what goes out to the Internet.
So really, most of the traffic in these facilities is going
to be within the facility, as opposed to going
to and from the Internet,
and it's growing quite rapidly.
So what does this traffic look like?
One question we might want to have answered
is about locality.
Do machines communicate mostly with neighboring machines,
or is traffic spread more uniformly
throughout the data center?
So here we have some data from Facebook
that goes some way towards addressing this question.
Let's focus on this part of the table,
where all of Facebook's data center traffic
is partitioned by locality.
Within rack, within a cluster, within a data center,
or across data centers.
As you can see roughly 13% of the traffic is within a rack.
So these are just machines within the rack
talking to each other.
A rack might host some tens of machines;
40 machines, for example, seems to be quite common.
Then we see that 58% of the traffic
stays within a cluster but not within a rack.
So this traffic is across racks within a cluster.
Further, 12% of the traffic stays within the data center,
but is not cluster-local.
So this is traffic
between multiple clusters in the data center.
Also interesting is that 18% of the traffic
is between data centers.
This is actually larger than the rack-local traffic.
So locality in this workload is not really high.
Also worth noting is that Hadoop
is the single largest driver of traffic
in Facebook data centers.
More data on rack locality comes from Google.
What you're looking at here
is data from 12 blocks of servers.
Blocks are groups of racks,
so a block might have a few hundred servers.
This is a smaller granularity than a cluster,
but a larger granularity than a rack.
Here, what you're looking at in the figure on the right
is the traffic from a block that is leaving for other blocks,
so that is non-local traffic.
For each of these 12 blocks in the figure,
you see that most of the traffic is non-local,
so most of the traffic goes to other blocks.
Now there are 12 blocks here.
If traffic was uniformly distributed,
you would see 1/12 of the traffic being local,
and 11/12, that is roughly 90%, being non-local,
which is exactly what this graph shows.
Part of this definitely stems
from how Google organizes storage.
The paper notes that, for greater availability,
they spread data across different fault domains.
So for example, a block might have all its power supply
from one source.
If that power supply fails,
you lose everything in that block.
So if data is spread well across multiple blocks,
the service might still be available.
So this kind of organization is good for availability,
but is bad for locality.
Another set of measurements comes from Benson et al.,
who evaluated three university clusters,
two private enterprise networks,
and five commercial cloud networks.
The paper obscures who the cloud provider is,
but each of these cloud data centers hosts 10,000 or more servers
running applications including web search,
and one of the authors works at Microsoft.
The university and private data centers
have a few hundred to 2,000 servers each,
while the cloud data centers here have 10,000 to 15,000 servers.
Cloud data centers one through three
run many applications, including web, mail, etcetera,
but clouds four and five run more MapReduce-style workloads.
One thing that's worth noticing is that
the amount of rack locality here is much larger:
70% or so for the cloud data centers,
which is very different from what we saw earlier
in the Google and Facebook measurements.
There are many possible reasons for these differences.
For one, the workloads might be different.
Even MapReduce jobs are not all the same.
It's entirely possible that Google and Facebook
run large MapReduce tasks which do not fit in a rack,
while this mystery data center runs smaller tasks
that do fit in a rack, and hence
traffic is mostly rack-local.
There might also just be different ways
of organizing storage and compute.
There's also a five-year gap
between the publishing of these measurements.
Perhaps things are just different now:
application sizes might have grown substantially,
or people have changed how they do these things.
Having looked at locality, let's turn our attention
to flow level characteristics.
How many flows does a server see concurrently?
Facebook's measurements report, and I quote here,
that Hadoop nodes have approximately
25 concurrent connections on average.
In contrast, for a 1,500-server cluster
running Hadoop-style workloads,
the numbers are much smaller:
a server corresponds with two servers within its rack
and four servers outside its rack.
So a server talks to six other servers in the median.
Contrast that with the 25 in Facebook's workloads.
The only things we can conclude from this, perhaps,
are that (a) there are differences across applications,
so web servers, caches, and Hadoop
all have different numbers of concurrent flows,
and (b) even Hadoop workloads
are not all created the same:
Facebook's Hadoop workloads seem to be different from this
mystery data center's workload.
Another thing we are interested in finding out
is the arrival rate for flows.
How soon do new flows arrive in the system?
Facebook's measurements put the median
flow inter-arrival time at a server
at approximately 2 milliseconds.
The mystery cluster has inter-arrival times
in the tens of milliseconds,
so less than a tenth of Facebook's rate.
Regardless of the difference in these two measurements,
one higher order bit that's worth mentioning here
is that at the cluster scale, the flow inter-arrival times
would be in the microseconds,
if you had a 1,000-server cluster, for example.
So, really, new flows enter the system
at a very high frequency, as a whole.
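As a rough back-of-the-envelope check, using the 2-millisecond per-server figure and the 1,000-server cluster assumed above:

\[
\text{per-server arrival rate} \approx \frac{1}{2\ \text{ms}} = 500\ \text{flows/s},
\qquad
\text{cluster arrival rate} \approx 1000 \times 500\ \text{flows/s} = 5 \times 10^{5}\ \text{flows/s},
\]
\[
\text{so the cluster-wide inter-arrival time is roughly} \ \frac{1}{5 \times 10^{5}\ \text{/s}} = 2\ \mu\text{s}.
\]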
What about flow sizes?
Most Hadoop flows, it seems,
are quite small in Facebook's measurement.
The median flow is smaller than a kilobyte,
and less than 5% of flows exceed a megabyte
or last longer than 100 seconds.
For caching workloads, most flows are long-lived,
but they're bursty internally.
So you can think of these as, perhaps,
long-lived TCP connections,
where you do not want to incur the cost of a TCP handshake,
and the slow ramp-up of the rate due to slow start, every time,
so you keep the TCP connection alive,
but data is only exchanged sporadically.
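As a rough illustration of that idea, here is a minimal sketch of a client that opens one TCP connection and reuses it for sporadic requests, instead of reconnecting each time; the host, port, and request format are hypothetical placeholders, not any real cache's protocol.

```python
import socket
import time

# Hypothetical cache endpoint; a real system would speak its own protocol.
HOST, PORT = "cache.example.internal", 11211

def run_client(requests: list[bytes], idle_seconds: float = 1.0) -> None:
    # Connect once, up front: the three-way handshake and the slow-start
    # ramp-up are paid only one time, not per request.
    with socket.create_connection((HOST, PORT)) as conn:
        for req in requests:
            conn.sendall(req)        # short burst of data
            reply = conn.recv(4096)  # read (part of) the response
            print(f"sent {len(req)} bytes, received {len(reply)} bytes")
            time.sleep(idle_seconds)  # connection sits idle, but stays open

if __name__ == "__main__":
    run_client([b"get key:%d\r\n" % i for i in range(5)])
```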
Also of interest are heavy hitters.
So, is there a small fraction of flows
that moves most of the bytes?
For this set of measurements, the answer seems to be no.
Heavy-hitters are roughly the same size as the median flow,
and are not persistent.
So flows which might be moving a large fraction
of the bytes now do not continue to be
the heavy hitters in the next time interval.
These measurements do not differ a great deal
from the measurements from the 1,500-server cluster
also running Hadoop.
There, more than 80% of the flows
last less than 10 seconds, but
more than 50% of the bytes are in flows
that are also small or short, lasting less than 25 seconds.
Some part of this, in particular the heavy hitters
being roughly similar in size to the median flow,
surely stems from application-level load balancing.
For example, for caching, if you're spreading the load
across these different caches,
you will see roughly similar flow sizes.
So where do we go from here?
What does data center traffic look like?
We don't have a definitive answer.
And right now not a whole lot of data is available.
There are, however, some conclusions we can draw
from the nature of data center applications,
as well as the points of agreement
of these different data sets.
We'll look at these in detail, but briefly:
One, internal data center traffic
is large and increasing.
Two, we have very tight deadlines for network I/O.
That is, we are targeting small latencies
for applications, like web search in particular.
Three, because we have a large number of flows,
we'll see congestion in the network,
and problems like TCP incast.
Four, we need isolation across multiple applications
that are running in these data centers.
Five, centralized control at the flow level
might be difficult
because of the large flow rates we are seeing.
So let's talk about each of these in more detail.
One thing that's amply clear from these measurements
and from the growth of these applications
is that the need for bandwidth in these data centers
is rapidly growing.
As the applications scale, we need to provide
higher and higher capacity networks,
in a cheap, scalable, and fault-tolerant manner.
We want efficient network topologies
and efficient routing to achieve high capacity.
Apart from moving large volumes of data around the data center,
we also want to do this at very small latencies,
particularly for applications like search,
where the deadlines for responses can be very small,
tens of milliseconds.
Just looking at this Bing query workflow tires me out.
Let's take a small candy break.
Here, I have a bunch of candy I've collected.
[Rattling]
And I'll drop this washed and cleaned stone into it.
It's roughly the same size and weight
as the individual pieces of candy.
Now, if I were to pick a piece from this box randomly,
I'm highly likely to end up with candy.
No teeth broken.
So that worked.
But if I grab a handful of candy now,
let's see what happens.
So as we discussed,
the deadlines for network I/O in the data center
can be quite small.
Now suppose that all of your web servers are really nice,
and their response times are quite small.
For most requests, in fact for 99% of the requests,
they issue a response in 10 milliseconds or less.
But for 1% of the requests,
they take an entire second to respond.
This is an example I'm stealing from Google's Jeff Dean.
Now if you're making just one request,
the odds that your response will take 1 second,
or be slower than that, are just 1%.
But if you were to make a hundred requests
and wait for them all to finish,
the odds are 63% that you'll have to wait a second or more.
Just like choosing a handful of candy from that box,
you're likely to end up with one of those servers
that responds slowly.
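The arithmetic behind that 63% figure is simple: each request is fast with probability 0.99, so

\[
P(\text{all 100 responses are fast}) = 0.99^{100} \approx 0.366,
\qquad
P(\text{at least one takes a second or more}) = 1 - 0.99^{100} \approx 0.63.
\]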
Now, given what you've seen
about the nature of these applications,
they make lots of small requests,
so this kind of thing can happen quite often.
63% is very poor odds;
you do not want your service times to be this large.
And as these services expand, the problem only gets worse.
Thus, what we need is to reduce variability in latency,
and also to tolerate some of that variability, perhaps,
at higher layers.
For example, the application might make redundant requests
and just take the one that returns fastest,
because there will be some variation in latency inherent
to anything that's networked.
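Here is a minimal sketch of that redundant-request idea, sometimes called hedged requests, again in Python with asyncio; fetch_replica and the replica names are hypothetical placeholders for calls to real backend servers.

```python
import asyncio
import random

# Hypothetical call to one replica of a backend service: most replies are
# fast, but about 1% are stragglers that take a full second.
async def fetch_replica(replica: str, key: str) -> str:
    delay = 1.0 if random.random() < 0.01 else 0.010
    await asyncio.sleep(delay)
    return f"value of {key} from {replica}"

async def hedged_get(key: str, replicas: list[str]) -> str:
    # Issue the same request to two replicas and take whichever answers
    # first; cancel the slower one rather than waiting for it.
    tasks = [asyncio.create_task(fetch_replica(r, key)) for r in replicas[:2]]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

if __name__ == "__main__":
    print(asyncio.run(hedged_get("user:42", ["replica-a", "replica-b"])))
```

A production system would typically wait a short while, say the 95th-percentile latency, before sending the backup request, so that the extra load stays small.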
With the large number of requests,
we'll also see a large number of flows that share
the network's capacity,
leading perhaps to congestion, and a problem called TCP incast.
TCP incast is something we'll look at in more detail
later in the course, but essentially,
when a large number of flows try to share
the same network buffer,
they overwhelm that buffer, network throughput drops,
and latency increases.
There are various application layer fixes to this,
but ultimately, all of these complicate
the application's logic.
This is a problem that we should attempt to solve in the network,
and we'll look at solutions that do that.
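As one example of the kind of application-layer workaround people use, and of the complexity it adds, here is a hedged sketch that simply caps how many requests are outstanding at once with a semaphore, so that responses do not all converge on the receiver's switch buffer at the same time; the function names and sizes are hypothetical.

```python
import asyncio
import random

# Hypothetical per-server fetch; in a real system this would be an RPC whose
# response shares a bottleneck switch buffer with all the other responses.
async def fetch_chunk(server_id: int) -> bytes:
    await asyncio.sleep(random.uniform(0.001, 0.005))
    return bytes(1024)  # pretend each server returns 1 KB

async def gather_with_cap(num_servers: int, max_in_flight: int = 8) -> int:
    # Limit the fan-in: at most max_in_flight requests are outstanding, which
    # keeps the burst of responses arriving at the receiver small.
    sem = asyncio.Semaphore(max_in_flight)

    async def limited_fetch(server_id: int) -> bytes:
        async with sem:
            return await fetch_chunk(server_id)

    chunks = await asyncio.gather(*(limited_fetch(s) for s in range(num_servers)))
    return sum(len(c) for c in chunks)

if __name__ == "__main__":
    total = asyncio.run(gather_with_cap(100))
    print(f"received {total} bytes")
```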
Further, with the variety of applications
that might be running together in these data centers,
particularly if you're a cloud provider, like Amazon,
you also need to look at isolating these applications
from each other.
So if the applications that are latency sensitive
are stuck in buffers behind applications
that are just moving bulk data,
we will see very poor quality of service.
So we need some way of isolating
these applications from each other.
Applications with different objectives
are, after all, sharing the network.
Lastly, given that data centers see very high flow arrival rates,
and most of the flows are short, that is, not persistent,
centralized control at the flow level
could be very difficult to scale,
particularly to large deployments
of tens of thousands of servers.
So perhaps, we'll need distributed control,
possibly with some centralized tinkering.
The implications we have discussed here
will factor into our discussion of all the topics
we see in the course,
from routing and congestion control
to software-defined networking.