Hello. In the previous video,
you learned how to write queries on Hive tables.
You can use SQL queries to read data from a table,
and save it into a new one.
Now, you will extend your knowledge by learning
more ways to read and write data across different sources.
At this point, you can save the geoIP data frame into Hive with a SQL query.
To do this, you should create a temporary view from the data frame,
and execute a CREATE TABLE AS SELECT query on it.
After that, you will have a new temporary view and a new table in the web database.
But this table creation is verified only when the query is executed,
and that will be sad if your program fails at the very end of the data processing.
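As a minimal sketch in PySpark, this approach could look like the following, assuming a SparkSession named spark, a data frame named geoip_df, and the table name web.geoip (all names are illustrative):

    # Register the data frame as a temporary view (names are illustrative)
    geoip_df.createOrReplaceTempView("geoip_view")
    # Create a Hive table from the view with CREATE TABLE ... AS SELECT
    spark.sql("CREATE TABLE web.geoip AS SELECT * FROM geoip_view")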
You can also save data into Hive with a Spark API method.
And in case of a syntax error,
your program will fail at the very beginning,
and this will save you a lot of time and nerves.
Data frames have a special property, write,
for saving data to different destinations.
You can save data into a Hive table with the saveAsTable method.
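A sketch of the call, reusing the same hypothetical names as before:

    # Save the data frame as a Hive table (fails if the table already exists)
    geoip_df.write.saveAsTable("web.geoip")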
Let's try this. Unfortunately, the saveAsTable method fails.
Let's check why.
The table geoIP already exists.
You created it with the SQL query earlier.
Let's save our data frame into a new table.
Now this works, but what should you do if, for example,
you have new results that are more precise than the previous ones,
and you want to rewrite them?
The result of the saveAsTable method,
in case the data already exists,
depends on the mode parameter.
The default mode is error,
and Spark throws an exception each time the data already exists in the destination.
If you want to clean up the previous results,
and write new ones,
you can use the overwrite mode,
and the append mode will add the data frame as new rows to the existing data.
Let's check the behavior of saveAsTable in different modes.
Firstly, you overwrite the existing table in the overwrite mode,
and you check the number of rows in the table.
Now, you should append geoIP data to the existing table,
and check the number of rows again.
As you can see, the number of rows in the table has doubled.
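A minimal sketch of this sequence, assuming a hypothetical table name web.geoip_new:

    # Replace any existing data in the table
    geoip_df.write.mode("overwrite").saveAsTable("web.geoip_new")
    print(spark.table("web.geoip_new").count())   # original number of rows
    # Append the same rows once more
    geoip_df.write.mode("append").saveAsTable("web.geoip_new")
    print(spark.table("web.geoip_new").count())   # twice as many rows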
The data frame write property allows storing data frames not only to Hive,
but to files as well.
For example, we can save our data to a file with the save command.
Spark creates a geoIP output folder in our home directory,
and writes the data there.
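A sketch of the call, with geoip_out as an illustrative output path:

    # Write the data frame to a folder; the default format is Parquet
    geoip_df.write.save("geoip_out")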
I suggest you look at what is inside.
The first three rows taken from the files look like some kind of mess,
but you can find elements of the data there.
Here's an IP address, for example.
It means that the data is stored in a binary format.
By default, Spark works with binary Parquet files,
which are designed for high-performance reading and writing.
If you want to store the data in a more human readable form,
you can save it in CSV format for example.
The data will be organized as text.
The columns in each row will be separated by commas.
Another popular textual data format is JSON.
You can store our data as JSON strings,
and each row will contain column names and values.
This overhead helps us parse the data without knowing the schema of the table in advance.
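As a sketch, the generic save command also accepts a format, so the text formats can be written like this (paths are illustrative):

    # Save as comma-separated text files
    geoip_df.write.format("csv").save("geoip_csv")
    # Save as JSON strings, one object per row
    geoip_df.write.format("json").save("geoip_json")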
If you try to save a data frame into a folder that already exists,
you'll get an exception.
Any ideas what you should do if you want to rewrite the output or append new data to it?
As with saveAsTable,
you can rewrite the output folder by setting mode equals overwrite,
and the append mode will add the data to the existing one.
Also, each data format has its own explicit save function:
parquet saves into Parquet files,
csv saves into CSV, and json saves into JSON.
You can choose whichever is more convenient for you.
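Sketches of the explicit functions, combined with a mode (paths are illustrative):

    # Each format has its own shortcut method on the writer
    geoip_df.write.mode("overwrite").parquet("geoip_parquet")
    geoip_df.write.mode("overwrite").csv("geoip_csv")
    geoip_df.write.mode("append").json("geoip_json")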
Spark SQL allows you to read data from folders and tables through the Spark session's read property.
The spark.read.table method will create
a data frame from the whole table that is stored on disk.
Now you can check its first rows.
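A minimal sketch, reusing the hypothetical table name from before:

    # Load the whole Hive table as a data frame
    geoip_table_df = spark.read.table("web.geoip")
    geoip_table_df.show(3)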
To read a folder with JSON objects,
you can use the read.json method.
The columns read from the JSON files will be permuted,
because the columns in JSON don't have any order.
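A sketch, using the illustrative path from the earlier write:

    # Read the JSON files back; column order is not preserved
    geoip_json_df = spark.read.json("geoip_json")
    geoip_json_df.show(3)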
When you read the data from the CSV files with the read.csv command,
it returns the right order of columns,
but without the column names.
You should provide a schema of the data frame,
as you did when creating the data frame manually.
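A sketch of reading CSV with an explicit schema; the field names and types here are only illustrative, since the real geoIP schema is defined by your data:

    from pyspark.sql.types import StructType, StructField, StringType

    # Illustrative schema; use the actual column names and types of your data
    schema = StructType([
        StructField("ip", StringType()),
        StructField("country", StringType()),
        StructField("city", StringType()),
    ])
    geoip_csv_df = spark.read.csv("geoip_csv", schema=schema)
    geoip_csv_df.show(3)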
Now the columns are correct,
and you can read data from the Parquet files as well.
There are no issues with the column names or their order,
because each of the saved Parquet files
stores its own schema.
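A sketch, again with an illustrative path:

    # Parquet files carry their own schema, so nothing extra is needed
    geoip_parquet_df = spark.read.parquet("geoip_parquet")
    geoip_parquet_df.show(3)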
As you remember, each table in Hive is saved in HDFS as a separate directory.
So, saving data into a table and saving data into files are pretty similar.
But now, I will show you something totally different,
saving data into any external database.
It's really awesome, because it allows you to
import and export data from anywhere in your company.
Each relational database allows you to read and write data
through a special protocol called JDBC.
JDBC means Java Database Connectivity.
You just need to download the JAR file with
a client library for this database and set a proper connection string.
The connection string contains the type of the database,
its address, the database name,
the username, and the password.
Now you can write your data frame into the external database via JDBC,
and read its data back via JDBC as a new data frame.
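A minimal sketch for a PostgreSQL database; the host, database, table, and credentials here are all made up, and the PostgreSQL JDBC driver JAR must be available to Spark:

    # Hypothetical connection string: database type, address, and database name
    url = "jdbc:postgresql://db.example.com:5432/webdb"
    properties = {"user": "demo_user", "password": "demo_password",
                  "driver": "org.postgresql.Driver"}

    # Export the data frame into the external database
    geoip_df.write.jdbc(url=url, table="geoip", mode="overwrite", properties=properties)

    # Import it back as a new data frame
    external_df = spark.read.jdbc(url=url, table="geoip", properties=properties)
    external_df.show(3)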
It is very useful at the end of data processing,
when you have created a report,
or a machine learning prediction,
and need to make it available to a business system in your company.
Now you can read and write data in Hive with Spark API methods,
read and write data in directories,
as well as export and import data frames to and from external relational databases.