0:00
[SOUND] Hortonworks is probably the granddaddy
of the cloud computing application distros.
What it provides is a very flexible, very dynamic and very up to date environment.
You can download it, as I said, to your laptop or
to a cloud, and build your own cloud, that type of thing.
And you could either ask them to help you support that cloud, or
you can just work with what they give you as open source.
Either will work for you.
Hortonworks is certainly one of the distributions that you
should look at, and on the whole I've found it very easy to work with.
0:47
So, what do they provide?
They have developed over the years a connected data strategy.
They have two main thrusts.
One's called HDP, and I'll talk about that in a minute.
But really, it's software based around Apache Hadoop, it's open source.
It provides distributed storage, processing of your large sets of data, and
runs on commodity hardware, that's its focus.
Hadoop really has been instrumental in allowing businesses to quickly gain
insight from masses of data.
And Hortonworks has capitalized on that in building up a reputation.
And it actually acts as a sort of resource to many different industries,
companies like Yahoo.
1:39
HDF is a newer wrinkle on top of the connected data strategy.
What it's trying to do is address real-time collection, curation, and
analysis of data as it's being delivered.
And it can come from any device, source, or system, either on premises or in the cloud.
But the notion is to process these streams very effectively,
and then integrate those streams, the data coming in on those streams,
with HDP, your framework for storing large amounts of data.
So, this is what we're going to talk about for a second and then we'll see how other
companies relate to this, but first, what does Hortonworks provide?
This is their diagram of tools and
as you can see, it's very complicated, it's very flexible.
And what they've done is to build layers of tools, so
you don't just get Hadoop by itself.
What you're getting is operating systems to manage it, you're getting distributed
file systems, you're getting all of the things to manage life cycles for the data.
You're getting workflow software, you're getting security with this system.
And you're getting operations, so you can actually manage how this thing is
running, how the whole scheduling of your Hadoop jobs goes, all sorts of issues.
And there are additional tools hidden behind the scenes that keep everything
consistent and working in the way that you would reckon they ought to work.
So just to give a sense of some of those tools.
It starts off with providing you an environment,
a toolkit, Zeppelin, which is a notebook.
Now, electronic notebooks are kind of recent, what you do is in that notebook
you keep notes of what experiments or what computations you've done.
You can actually write code into those notebooks.
The system that Hortonworks uses is called Zeppelin, and it's a very,
very new sort of notebook system; you might have heard of other ones.
I think there's a Jupyter Notebook out there as well, but
Zeppelin is the one that comes with the package.
It allows you to write all sorts of interfaces or
you can just ignore Zeppelin, you can go straight to the Tools.
The tools, well, there are all sorts of data access tools.
There's batch like MapReduce, there's scripting like Pig, there's SQL like Hive.
There's beyond SQL, HBase and Accumulo, and
now coming up, a new item there, Phoenix.
There's Storm which is used for streaming, there's search in the form of Solr.
In-memory Spark, and then,
there's some other pieces that are more focused on specific platforms,
like, Microsoft platforms, Azure platforms, or whatever.
4:45
All of these use HDFS, which is a distributed file system that
resides under all of this; it allows you to grow your data,
to store it safely, and to be able to retrieve it.
What we're going to be showing in the next few slides,
is just how all these things fit together, how you can mix and
match these different tools to build your application.
From the top, Zeppelin, it's open,
it's a web based notebook, it allows you to have interactive data analytics.
So you can actually put scripts into Zeppelin, and
those scripts will execute and create little data analytics on,
well, a comparatively small amount of data, on your screen, and
it will do that sort of interactively in real time.
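Just to give a flavour, here is a rough, hypothetical sketch of what a single Zeppelin paragraph might look like, assuming a fairly recent Zeppelin with the %pyspark interpreter and Spark 2.x, where the spark session and the z display context are pre-defined; the file path and column names are made up for illustration.

```python
# (In Zeppelin, the paragraph would begin with the %pyspark interpreter directive.)
# Hypothetical paragraph: load a small CSV and chart it interactively.
# The path and column names are placeholders, not from the lecture.
df = spark.read.csv("/tmp/sample_sales.csv", header=True, inferSchema=True)

# Make the data visible to other paragraphs (e.g. a %sql paragraph).
df.createOrReplaceTempView("sales")

# z.show() renders the result as an interactive table / bar chart in the notebook.
z.show(df.groupBy("region").sum("amount"))
```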
Beneath that,
another HDP tool that's packaged with the system is Ambari.
It's used quite a bit; LinkedIn uses it.
And what it is, is an open source management platform for provisioning, managing,
monitoring and securing Hadoop clusters, so
that you can actually see how Hadoop clusters are behaving.
Look at how Zeppelin works: it takes a lot of different languages;
you can write Zeppelin scripts in Python, Spark or Scala, or a lot of other ones.
What it provides is a web
6:20
display of the results of your computation, and you can see that you can
put visualizations into the notebook, you can put bar graphs.
You can just list tables; these are all the mechanisms it gives you.
If you look above those diagrams, what you'll see is some scripting code
that gets involved to actually produce those particular diagrams
with a particular data set, using some of the other tools.
In Zeppelin, you can have different views of what's going on under the lid.
So you can take the Tez view, and that would tell you about your cluster resource usage.
So, if you want to know do you have a network bottleneck or
is the file system a bottleneck or do you have spare cycles on your machine?
7:16
Tez will actually allow you to see what you can do
to accelerate individual jobs and do some optimization.
There's a Hive view, which allows the user to write and execute SQL.
Hive is interactive and it's built into Zeppelin, so
that Zeppelin can script it and interact with you as well.
It shows the history, and allows you to see all sorts of
views, source views, JDBC, ODBC, CLI.
It allows you graphical views of the query.
So it's there to help you really sort of get on and
produce the products that you want.
Pig is a scripting tool.
Now, we will be talking about MapReduce.
The Pig View actually allows you to combine all sorts of different tools,
like MapReduce, across multiple uses of that tool.
And you can use it for iteration with MapReduce,
you can use it for saving and loading datasets.
All sorts of nice sort of additional tools, but
the Pig View allows you to see this and to script this, and to manage this.
Under that is a scheduler, they allow you to see the capacity scheduler from YARN,
and we'll talk a little bit about that in a second.
But when you set up your workload management, you enable multi-tenant,
multi-workload processing.
How do you schedule?
Well, you obviously want to use the full capacity of the system you've got.
This view provisions cluster resources, allows you to see them,
allows you to create them, and allows you to manage them.
That's another useful tool in this notebook.
And last, Files View, that allows you to manage and
browse all of the files that you've placed into HDFS.
It can be very, very useful when you've got lots of files.
So moving on from there, if we want to talk about data access in Hortonworks,
it's worth mentioning all of the steps that have gone into producing this distribution.
We will be talking about MapReduce and Pig and all these other good things.
And MapReduce will be based on Hadoop.
But underneath all of this, there's been lots of work by
the Hortonworks people, to simplify the management,
and to provide a very flexible management system, and that's cool.
YARN is, they like to call it a data operating system.
So YARN is, if you like, underlying all of these applications,
and enables these applications to talk to each other.
You might sort of argue, well, what Office is to Word,
YARN is, sort of, to MapReduce.
Underneath all of this, the glue that allows you to build very, very interesting
flows of data from one application to another, that would be YARN.
And it would be doing the scheduling and allocation.
10:35
So it takes, or it has, APIs for Hadoop MapReduce.
You don't have to change your Hadoop, it's the same API.
But it will run those MapReduce jobs in batch for
structured or unstructured data.
Pig, we've mentioned a bit, but
it's really sort of a scripting language for building data pipelines.
So you can sort of imagine it's almost like a shell script of doing
MapReduce jobs, and doing other sorts of processing on it.
You can describe this with a script, and then run it whenever you want.
So YARN supports Pig, and it supports the dataflows that occur in Pig,
using the other tools.
For SQL, Hive has this interactive SQL query capability.
And you can make queries of petabytes of data in Hadoop.
And again, YARN is the underlying fabric that connects these things together,
so you can take your output from Hive and put it somewhere else,
or so that you can feed input into Hive from HDFS.
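To make that slightly more concrete, here is a minimal, hedged sketch of issuing an interactive Hive query from Python; it assumes the third-party PyHive client is installed and HiveServer2 is listening on its default port, and the host, user, and table name are invented for the example.

```python
# Minimal sketch: run a HiveQL query over data in HDFS via HiveServer2.
# Assumes the PyHive package and a running HiveServer2; the host, port,
# user, and the web_logs table are placeholders.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="student")
cursor = conn.cursor()

# HiveQL looks like ordinary SQL, even though the table lives in HDFS.
cursor.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page LIMIT 10")
for page, hits in cursor.fetchall():
    print(page, hits)

cursor.close()
conn.close()
```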
There's a NoSQL, or beyond SQL, or whatever you want, not just SQL.
HBase and Accumulo are two non-relational NoSQL databases.
They're written on top of HDFS, so they work with huge files, but they allow
you to make almost SQL-like queries, even though the data is absolutely huge.
In the case of Accumulo, there's additional column and
cell information that you can store, so that you can actually provide
access controls to some of the data in your database.
Next, we have Storm. Storm is distributed, real-time processing of
large-volume, high-velocity data.
It's really sort of pushing the limits.
But this allows you to take, say, lots of mouse clicks and process them, and
feed them into a database.
Or it might be that it takes Olympic results or
results from car racing, or whatever it is, and
stream all this stuff in real time, and record all of that data.
It will take images, it will take all sorts of data formats, and
you can process them.
13:09
Spark is in-memory processing, and it processes huge amounts of data,
either sort of streaming data or sort of just stationary data.
Spark allows you to do both.
It's got a nice, easy language and lots of libraries to do that.
And again, the coordination between Spark and HDFS, and
all the other parts of the system is all done by YARN.
Underlying all of this, beneath YARN, the data resides in your HDFS,
and we'll be discussing that in a subsequent lecture.
Just reviewing quickly: MapReduce, why would you use MapReduce?
What's the purpose of having it in the distribution?
13:57
Developers can write applications very fast, they can use any language they like,
and what you use is an API to actually do your MapReduce.
That allows it to manage a scalable number of computations from a single stream.
It allows you to do really sort of fast parallel processing.
And MapReduce, well, it knocks off weeks of what perhaps used to take months to do.
You can get things done in a few days that are really remarkable.
It allows you recovery, it supports recovery of the system.
So if one of the processors dies, you don't have to worry about it.
The MapReduce system will actually try and recover and continue, without you
even noticing that any of the data's changed, or anything else has happened.
And it also minimizes the way that data has to move around in the system,
so that you get the maximum performance out of all the systems.
You're not just waiting for memory copies or things.
15:07
This really offers a very,
sort of, economical way of performing your computations.
So that's what you get from MapReduce, and it's in the form of Hadoop in Hortonworks.
And, again, If you've learnt Hadoop from Apache,
this Hadoop on Hortonworks is exactly the same.
The interface is all the same.
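To make that a bit more tangible, here is the classic word-count example written as a pair of Hadoop Streaming scripts in Python; this is only a sketch, the input and output paths are up to you, and it assumes you submit it with the streaming jar that ships with your Hadoop installation.

```python
#!/usr/bin/env python
# mapper.py -- reads text lines on stdin, emits "word<TAB>1" pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums the counts for each word. Hadoop sorts by key before
# the reduce phase, so all pairs for a given word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

If one node dies halfway through, the framework simply reruns those map or reduce tasks elsewhere, which is the recovery behaviour described above.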
Pig, well we described a bit about that, but some key factors.
It's extensible, so you can
add custom functions to Pig to make it do all sorts of interesting things.
It's easily programmed, and tasks involving data transformations
can be simplified and coded as dataflow sequences.
This is really cool.
And then it's self-optimizing: there are actually tools that will go through and
try to optimize your Pig expressions, because they may not be written in the best way,
but there is self-optimizing code that will actually reorganize them so they are efficient.
Hive, that's the database, that's the interactive database we'd talked about.
First of all, it's using SQL so it's familiar.
It's fast, it will still give you results very quickly.
You can run it from the command line.
It's scalable and extensible, so you can have 10,000
disks with data on them and still be using Hive.
And it's compatible with all of the other tools and
systems that we've been talking about.
NoSQL: HBase, that's a big semi-relational,
or beyond-relational in a sense, database; it's fault tolerant.
It replicates itself.
16:51
It can provide you atomic updates.
It can provide you strong consistency at the row-level operations.
It's highly available, so it's really cool for sub-interactive services.
It provides automatic sharding and load balancing.
It's fast, it gives you near real-time lookups,
in-memory caching, and server-side processing.
So, it's rather cool.
And the main thing is it's usable.
And the data model accommodates wide ranges of use cases.
It's reasonably understandable, you don't need a PhD to understand it.
It exports metrics via File or Ganglia plugins.
And then it has an easy Java API to use.
It's coupled with Thrift, REST APIs, so
that you can actually build web interfaces.
Or you can have data transformations on the information you're providing, to get
it into the right format for putting it into HBase, so that again is pretty cool.
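As a small illustration of that Thrift route, here is a hedged sketch using the third-party happybase Python client; it assumes the HBase Thrift server is running locally, and the table name, row key, and column family are invented for the example.

```python
# Minimal sketch: put and get one row in HBase over its Thrift gateway.
# Assumes a running HBase Thrift server; 'clicks' and 'cf' are placeholders.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("clicks")

# Atomic put of a single row: row key plus column-family:qualifier -> value.
table.put(b"user42#2016-01-01", {b"cf:page": b"/home", b"cf:millis": b"137"})

# Near real-time lookup by row key.
row = table.row(b"user42#2016-01-01")
print(row[b"cf:page"])

connection.close()
```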
18:15
It has integrity and availability; there is master fail-over, and ZooKeeper,
which is this consistency software we will talk about later, provides that.
It has write-ahead logs for recovery.
It's scalable, and it really performs very well,
using relative encoding to compress similar consecutive keys, so
you find the keys stored near to each other in the big data store.
It speeds up long scans if you're trying to look through lots and lots of data.
And it caches recently scanned data if you want to iterate
over some data rows or columns.
And then it is really good at sort of managing data,
grouping columns within a single file, and so on.
So we'll be looking at some of those things in the future.
Really sort of now edging into the new features of Hortonworks.
You've got Storm.
This is a streaming system, and it's been benchmarked:
it can process a million hundred-byte messages per second per node.
It's scalable, it's fault-tolerant, it's reliable.
And above all it's easy to operate so you can actually use Storm
very simply to build the customized applications you want.
Solr is a search system.
It uses standards-based interfaces, like XML, JSON and
HTTP, provides statistics, and is linearly scalable.
19:57
So that gives you an entree into doing your own
search systems within the world of the data that you've collected.
Last, as I say, we've got Spark, which is this in-memory programming core.
They actually call it Spark Core.
It allows you to program with Scala, Java, Python, R.
It has the APIs to link those to libraries, and
those libraries are quite extensive now,
which makes it very exciting: you get MLlib, you get Spark SQL,
Spark Streaming, GraphX, and ML Pipelines for data science workflows.
We'll be coming back to that Spark system later on.
But for now, this is all maintained and
running on YARN in Hortonworks, and you can mix and match between all of them.
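As a quick, hedged taste of what that looks like from Python, here is a small Spark sketch; the HDFS path is a placeholder, and on HDP it would typically be submitted to YARN with spark-submit.

```python
# Small PySpark sketch: count events per key from a text file in HDFS.
# The input path is a placeholder.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("event-count")
sc = SparkContext(conf=conf)

lines = sc.textFile("hdfs:///data/events.txt")
counts = (lines
          .map(lambda line: (line.split(",")[0], 1))  # key is the first field
          .reduceByKey(lambda a, b: a + b))           # aggregated in memory

for key, n in counts.take(10):
    print(key, n)

sc.stop()
```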
So now let's move on.
We've talked about HDP, but now
we want to look at what Hortonworks is doing with data flow.
Well, they have a lot of offerings.
They have something called NiFi.
NiFi is a way of managing data logistics and event processing.
They have Kafka, which is a messaging system.
Systems can break or go down, but
you are not going to lose the messages in there.
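For a flavour of what that messaging looks like in code, here is a hedged sketch using the third-party kafka-python client; the broker address and topic name are made up, and the durability mentioned above comes from Kafka persisting and replicating the topic on the brokers.

```python
# Minimal sketch: publish and read back one message with kafka-python.
# Broker address and topic name are placeholders.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", b'{"device": 7, "temp": 21.4}')
producer.flush()  # ensure the message has actually been handed to the broker

consumer = KafkaConsumer("sensor-readings",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
    break
```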
We have Storm again, I've talked about Storm before, but
it's integrated into this platform as well.
And it provides your linkage between HDP and HDF, so
you can go backwards and forwards.
You can stream data into your big data stores, or
you can stream data out of your big data stores and analyse it and pipe it out.
Or you can integrate, in any which way, those systems together.
What HDF does is real-time data collection,
real-time curation and analysis, and delivery of data to and from any device.
21:52
And it doesn't matter whether it's on premises or in the cloud, it all works.
It's all open source.
Moving right on from there, there's a picture of NiFi and the way it organizes
things, and that's probably very small, but we've got a lecture on it later on.
If you're interested in NiFi, you're looking at the,
well, I won't say the bleeding edge, but this is really new stuff.
And as I say, we're going to have a lecture by Resr on NiFi.
And I think that'll be really sort of interesting for you.
But what it does,
22:37
You have sort of black-box FlowFile processors.
They allow you to do different sorts of manipulations.
You have buffers and connections that you can make.
And you can decide how much buffering you want in there.
It has flow controls, schedulers for the buffering, and
process groups that you can schedule.
And you can have sub-nets connecting those processes together.
So, it's a fairly flexible system, but as I say, it's really new.
Very exciting, and there are lots of things that NiFi can do; it's
still sort of bleeding edge, and people are working on using those capabilities.
So what you've seen today is an introduction to
all of these distributions.
Now, we'll move on to Cloudera, which adds, well,
which is a different approach from Hortonworks.
So, Cloudera, when you're ready.
[MUSIC]