Hello. During this video,
I will explain how the NameNode works and what
other helper services run alongside the NameNode service.
It will be the last video of this lesson where we
discuss technical details of HDFS architecture.
Last but not least, I definitely hope we will see each other
again for the remaining topics.
Otherwise, I would have nothing to do
but miss you from this side of the screen.
NameNode is the service responsible for keeping the hierarchy of folders and files.
The NameNode stores all of this data in memory.
RAM read and write speeds are
faster by orders of magnitude compared to a disk.
But nothing comes for free: with that great speed
comes a great responsibility to provision the master server properly.
An example will make this clearer.
Imagine a customer asking for
distributed storage for the research purposes of their laboratory.
He or she would like to analyze one year of data from
the Large Hadron Collider, which is approximately 10 petabytes of data.
In HDFS, all data is usually stored with a replication factor of three.
So, you would need to buy approximately 15,000 two-terabyte hard drives.
On average, 15 of these hard drives will die every day,
so you should request at least 1,000 extra hard drives for several months of research.
This storage covers the capacity for data nodes.
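The sizing arithmetic above can be sketched in a few lines. The figures are the lecture's; the 70-day research window used for the spare estimate is my own assumption:

```python
# Rough cluster sizing for the lecture's example (illustrative, not a real planner).
DATA_TB = 10_000          # 10 petabytes of raw data, in terabytes
REPLICATION = 3           # default HDFS replication factor
DRIVE_TB = 2              # capacity of a single hard drive
FAILURES_PER_DAY = 15     # the lecture's failure estimate for this fleet

stored_tb = DATA_TB * REPLICATION       # 30,000 TB actually written to disk
drives = stored_tb // DRIVE_TB          # 15,000 drives to hold all replicas
spares = FAILURES_PER_DAY * 70          # assumption: ~70 days of research

print(drives, spares)                   # 15000 1050
```

The spare count lands close to the "at least 1,000 extra hard drives" quoted above.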
The next question is the following: storing all meta-information in memory,
how much RAM will the file hierarchy consume?
On average, the size of a namespace object,
such as a folder, a file, or a block,
is around 150 bytes.
If you use a block size of 128 megabytes,
this dataset will consume at least 35 gigabytes of RAM on the NameNode.
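Here is a back-of-the-envelope check of that 35-gigabyte figure. The "three objects per block" multiplier (roughly, a file entry plus a block entry plus replica bookkeeping) is my assumption chosen to reproduce the lecture's number:

```python
DATA_BYTES = 10 * 2**50       # 10 PB of raw data
BLOCK_BYTES = 128 * 2**20     # 128 MB block size
OBJECT_BYTES = 150            # average size of one namespace object
OBJECTS_PER_BLOCK = 3         # assumption: file + block + replica objects

blocks = DATA_BYTES // BLOCK_BYTES                        # ~83.9 million blocks
ram_gb = blocks * OBJECT_BYTES * OBJECTS_PER_BLOCK / 2**30
print(blocks, round(ram_gb, 1))                           # 83886080 35.2
```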
If your researchers would like to process several years of data from the Hadron Collider,
you should provision your NameNode's RAM capacity accordingly.
The more files you have in a distributed storage,
the more load you have on a NameNode.
And this load doesn't depend on the file size,
as you have approximately the same amount of meta information stored in RAM.
This problem is so well known in the Hadoop community
that it even has a special name.
Any ideas? Right you are:
the small files problem.
And our next lesson will be totally devoted to
different file formats useful for overcoming this issue.
128 megabytes was once chosen as the default block size. Any ideas why?
When you read a block of data from a hard drive,
first you need to locate this block on the disk.
Quite naturally, this operation is called a seek.
With a reading speed of three and a half gigabytes per second,
you will be able to read 128 megabytes in 30 to 40 milliseconds.
Typical drive seek time is less than one percent of the aforementioned number.
It is exactly the reason for having 128 megabyte block size.
It is a trade-off: the block is large enough that seeking adds
less than one percent overhead when reading a random block from a hard drive,
yet small enough that
hard drive space can be utilized evenly
across the cluster.
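The one-percent rule is easy to verify with the lecture's numbers (treating one gigabyte as 1024 megabytes):

```python
BLOCK_MB = 128
READ_MB_PER_S = 3.5 * 1024        # 3.5 GB/s sequential read speed

read_ms = BLOCK_MB / READ_MB_PER_S * 1000   # time to read one block
seek_budget_ms = 0.01 * read_ms             # seek should cost < 1% of that

print(round(read_ms, 1), round(seek_budget_ms, 2))  # 35.7 0.36
```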
Let us move back to the NameNode architecture.
You already know that NameNode stores all the hierarchy in RAM.
NameNode server is a single point of failure.
In case this service goes down,
the whole HDFS storage becomes unavailable,
even for read-only operations.
So, there are a number of technical tricks to make
the NameNode's state durable and to speed up the NameNode recovery process.
NameNode uses a write-ahead log,
or WAL for short,
to persist meta-information modifications.
This log is called the edit log.
It can be replicated to different hard drives.
It is also usually replicated to an NFS storage.
With NFS storage, you will be able to tolerate full NameNode crash.
However, the edit log is not enough to reproduce the NameNode's state.
You also need a snapshot of the memory at some point in
time, from which you can replay the transactions stored in the edit log.
This snapshot is called the fsimage.
You usually have a high load on a NameNode,
so the edit log grows quite fast.
It is a normal scenario when replaying one week of
transactions from the edit log takes several hours to boot a NameNode.
As you can guess, this is not acceptable for a high-demand service.
For this reason, Hadoop developers invented secondary NameNode.
The secondary NameNode, or better to say,
the checkpoint NameNode, compacts the edit log by creating a new fsimage.
The new fsimage is made from the old fsimage by applying all the transactions stored in the edit log.
You can see the whole process in the slide.
It is a robust and fully asynchronous process.
You should also keep in mind that the secondary NameNode consumes
the same amount of RAM to build the new fsimage,
so please provision the nodes of your Hadoop cluster appropriately.
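To make the interplay of the edit log, the fsimage, and the checkpoint concrete, here is a toy sketch of the idea. This is plain Python, nothing like the real HDFS code; all names are mine:

```python
import json

class TinyNameNode:
    """Toy model of NameNode durability: RAM state + write-ahead edit log."""

    def __init__(self):
        self.namespace = {}    # in-memory metadata: path -> attributes
        self.edit_log = []     # write-ahead log of applied transactions

    def create_file(self, path, size):
        self.edit_log.append(("create", path, size))  # persist the intent first
        self.namespace[path] = {"size": size}         # then mutate RAM state

    def checkpoint(self):
        # What a checkpoint (secondary) NameNode does: fold the edit log
        # into a fresh fsimage snapshot, after which the log can be truncated.
        fsimage = json.dumps(self.namespace)
        self.edit_log.clear()
        return fsimage

    @staticmethod
    def recover(fsimage, edit_log):
        # Boot sequence: load the snapshot, then replay newer transactions.
        nn = TinyNameNode()
        nn.namespace = json.loads(fsimage)
        for op, path, size in edit_log:
            if op == "create":
                nn.namespace[path] = {"size": size}
        return nn
```

Recovery time grows with the length of the edit log, which is exactly why frequent checkpoints keep NameNode restarts fast.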
By the way, secondary NameNode was considered a badly named service.
There are a lot of horror stories that you can find on the internet.
A lot of people think of the secondary NameNode as a backup NameNode.
Remember: it is not.
The latter, a true backup node, was only released in 2009.
Please see the corresponding JIRA for a complete discussion.
In the end, I would like to give you one more example,
so you will be able to configure HDFS storage for your own purposes.
Think about the following question:
is there a big difference between having one two-terabyte hard drive per
node and two hard drives per node, one terabyte each?
For instance, suppose you have a
Samsung 940 Pro SSD with a read speed of three and a half gigabytes per second.
Let us evaluate how long it takes to read
10 petabytes of data from a drive with a similar reading speed:
approximately 35 days to read all of this data from one disk.
But if you have 5000 drives,
then you will be able to read all of this data in parallel within 10 minutes.
So, the number of drives in your cluster
has a linear relation to the speed of data processing.
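The scan-time numbers work out as follows, using the lecture's 3.5 GB/s per-drive read speed:

```python
DATA_GB = 10 * 1024**2        # 10 PB expressed in gigabytes
SPEED_GB_S = 3.5              # sequential read speed of one drive
DRIVES = 5000

single_drive_days = DATA_GB / SPEED_GB_S / 86400          # one drive, serially
parallel_minutes = DATA_GB / (SPEED_GB_S * DRIVES) / 60   # all drives at once

print(round(single_drive_days), round(parallel_minutes))  # 35 10
```

This is why two one-terabyte drives per node beat a single two-terabyte drive: the data is the same, but the aggregate read bandwidth doubles.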
Overall, these numbers will be a good reference for you when you
decide to install your own cluster for research and development purposes.
Summing up, you can now explain how the
NameNode stores the meta-information hierarchy and how it achieves durability.
You can estimate the resources necessary for
a distributed storage that addresses your customer's needs.
Those needs include the required amount of space and the speed of data processing;
the resources include the corresponding number of drives and the minimal amount of RAM for a NameNode.
You can estimate how much RAM you need to keep the metadata of all HDFS blocks in memory.
And finally, you should be able to explain differences between NameNode,
secondary NameNode, checkpoint NameNode, and backup node.