Let's first start with partitions: what are they?

Simply put, the data within an RDD is split into many partitions, and

partitions are very rigid things.

Most importantly, they never span multiple machines; this is super important.

Data in the same partition is always on the same machine.

Another point is that each machine in the cluster contains at least one partition.

Sometimes more, sometimes exactly one.

The number of partitions is configurable, by the way.

We'll learn about when and how the number of partitions can be changed by the user.

Though by default, when a job starts the number of partitions is

equal to the total number of cores on all executor nodes.

That's the default.

So for example, if all of the machines in your cluster have four cores and

you have six worker nodes, then the default number of partitions you

start with is 4 × 6 = 24.

And importantly, Spark comes with two out-of-the-box kinds of partitioning,

which make sense for different sorts of applications.

These are hash partitioning and

range partitioning, we'll talk about these in more detail in the next few slides.

However, it's important to note that all of this talk about customizing

partitioning is only possible when working with Pair RDDs.

Since, as you'll see, partitioning is done based on keys,

we need a Pair RDD to do that. So

let's start with hash partitioning.

To illustrate this first kind of partitioning let's return to our

groupByKey example that we saw in the previous session.

As we saw in the last session, hash partitioning is used by default in

this example. But what is it?

How does it work?

What does it look like?

The way it works is like this.

Since groupByKey knows it has to move all of the data around the cluster,

its goal is to do that as fairly as possible.

So the first thing it does is it computes the partition p for

every tuple in the pair RDD.

So we start by getting the key's hash code, and

then we modulo that with the default number of partitions.

So we said it was 24 in the last slide.

And whatever the answer of this computation is,

that is the partition that that key goes into.

Then when we actually do the hash partitioning,

the tuples in the same partition are sent to the machine hosting that partition.

So again the key intuition here is that hash partitioning tries to spread around

the data as evenly as possible over all of the partitions based on the keys.
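To make that concrete, here's a minimal sketch of the computation just described, in plain Scala outside of Spark. The name `hashPartition` and the handling of negative hash codes are illustrative, not Spark's actual API, though Spark's own HashPartitioner behaves similarly:

```scala
// A sketch of hash partitioning, outside Spark. `hashPartition` is an
// illustrative helper, not Spark API.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  // On the JVM, hashCode (and thus the modulo) can be negative, so shift
  // negative results into the valid range [0, numPartitions)
  if (raw < 0) raw + numPartitions else raw
}

val numPartitions = 24 // the default from the last slide: 4 cores x 6 nodes

// Tuples whose keys hash to the same partition are sent to the same machine
val pairs = List(("cat", 1), ("dog", 2), ("cat", 3))
pairs.foreach { case (k, _) =>
  println(s"key $k -> partition ${hashPartition(k, numPartitions)}")
}
```

Equal keys always produce equal hash codes, so both `("cat", 1)` and `("cat", 3)` land in the same partition, which is exactly what groupByKey needs.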

The other kind of partitioning is called range partitioning.

This is important when some kind of order is defined on the keys, so

examples would include integers or chars or strings, for

example. And if we think about our previous examples, where

we were working with pair RDDs that had keys that were integers,

we could hypothetically range partition these keys.

For these kinds of RDDs with keys that can have an ordering,

range partitioning could be an efficient choice for partitioning your data.

Intuitively though, that means when you're using a range partitioner,

keys are partitioned according to two things.

They're partitioned according to some kind of order, so like I said, a numerical

ordering or a lexicographical ordering, as well as a set of sorted ranges for the keys.

So then you have groups of possible ranges that the keys can fall within.

And that means that tuples with keys in the same range will

appear on the same nodes.
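As a sketch of that idea (again plain Scala, not Spark's actual RangePartitioner; `bounds` and `rangePartition` are made-up names), a set of upper bounds splits the key space into contiguous ranges:

```scala
// A sketch of range partitioning: three upper bounds split the integer key
// space into four contiguous ranges. Names here are illustrative, not Spark's.
def rangePartition(key: Int, bounds: Seq[Int]): Int =
  bounds.indexWhere(key <= _) match {
    case -1 => bounds.length // key above every bound: the last partition
    case i  => i
  }

// Four partitions for keys roughly in 1 to 800, say
val bounds = Seq(200, 400, 600)

println(rangePartition(150, bounds)) // partition 0: keys up to 200
println(rangePartition(450, bounds)) // partition 2: keys in 401 to 600
println(rangePartition(700, bounds)) // partition 3: keys above 600
```

Because keys in the same range land in the same partition, tuples with nearby keys end up on the same node.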

Let's look at a more concrete example of each kind of partitioning, starting

with hash partitioning. So consider we have a Pair RDD with these keys.

So we don't care about the values for

now, but these are the keys that we have in our pair RDD.

Let's say our desired number of partitions is 4.

Also, just to keep things simple, let's assume that the hashCode function

is just the identity function for this example.

And let's remember that to compute the partition, what we do is we call

hashCode on the key and then take it modulo the number of partitions.
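The slide's actual keys aren't reproduced in this transcript, so the integer keys below are hypothetical, but the mechanics match the example: with hashCode as the identity function and 4 desired partitions, the partition is simply the key modulo 4:

```scala
// Hypothetical integer keys standing in for the ones on the slide
val keys = List(100, 201, 302, 403, 504)
val desiredPartitions = 4

keys.foreach { k =>
  // hashCode on an Int is the Int itself, so this really is the identity
  val p = k.hashCode % desiredPartitions
  println(s"key $k -> partition $p")
}
```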