All right, let me quickly show you how to use Glue crawlers to have the crawler go into your data set and infer the schema.

We're going to start with the Registry of Open Data on AWS, which you can find at registry.opendata.aws. This page is part of what we call the AWS open datasets, and Morgan will talk about that in a dedicated video in Week 4. For now, the most relevant part for us is the New York City Taxi and Limousine Commission (TLC) trip record data. If you like reading the details, you can see that AWS maintains these datasets for customers who want to do analytics with data collected by taxis and limousines in New York City.

This is all stored in a bucket called nyc-tlc, so I'm going to my terminal and I'm going to list the contents of this bucket. It's a public bucket, so: aws s3 ls against the nyc-tlc S3 endpoint. You will see that this public bucket has two prefixes, misc (as in miscellaneous) and trip data. So let's run aws s3 ls on trip data. You will see that it contains lots of CSV files, and you can tell that those CSV files are split by year and month, but they're all in the same root prefix of the bucket. If we ran a crawler against this bucket, it would scan every single file in it.

To avoid that, and to make the demo faster, I will create a crawler and I want it to crawl only one specific file: yellow_tripdata_2018-06.csv. What I'm going to do is create a bucket in my account called tlc-ref-demo. So I made this bucket in my account, and now I'm going to copy only one of those files into my bucket. I can do that with aws s3 cp, copying yellow_tripdata_2018-06.csv, which is this file here, from the nyc-tlc trip data prefix into my bucket, tlc-ref-demo.

Now notice something very interesting: when you do aws s3 cp using one S3 bucket as the source and another S3 bucket as the destination, the copy goes from bucket to bucket. That's why the transfer is this fast; it doesn't pass through my computer. It goes straight from an AWS public bucket to a bucket that belongs to my account, the bucket I just created. OK, the transfer is complete. Now I do aws s3 ls on my bucket, and the file is there. You can also see the file in the S3 management console: if you go to S3, there is a bucket called tlc-ref-demo, and there is only one file inside it, the file I just copied.

Now I want you to notice the size of this file: 733 megabytes. Let's also sneak a peek at the structure of this file. I downloaded the file to my computer as well, to the folder large/tmp/tlc, and I'm going to head this file. head is a Linux command that displays only the top lines of a file. This is how the file is structured, and these are its first lines. It's a CSV file, so you can see that we have values separated by commas, and here are the names of each of these columns.
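If you want to follow along in your own terminal, here is a rough sketch of the commands from this part of the demo. Treat the names as placeholders: tlc-ref-demo is just the bucket I created for this demo (bucket names are globally unique, so pick your own), and the nyc-tlc paths reflect how the public dataset was laid out when this was recorded.

```bash
# List the public NYC TLC bucket (no credentials needed for a public bucket)
aws s3 ls "s3://nyc-tlc/" --no-sign-request
aws s3 ls "s3://nyc-tlc/trip data/" --no-sign-request

# Create a bucket in your own account and copy a single month of data,
# bucket to bucket (the data never passes through your machine)
aws s3 mb s3://tlc-ref-demo
aws s3 cp "s3://nyc-tlc/trip data/yellow_tripdata_2018-06.csv" \
          "s3://tlc-ref-demo/yellow_tripdata_2018-06.csv"

# Confirm the copy, then download a local copy to peek at the contents
aws s3 ls s3://tlc-ref-demo/
aws s3 cp s3://tlc-ref-demo/yellow_tripdata_2018-06.csv /tmp/yellow_tripdata_2018-06.csv
head -n 5 /tmp/yellow_tripdata_2018-06.csv   # header row plus the first records
wc -l /tmp/yellow_tripdata_2018-06.csv       # roughly 8.7 million lines
```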
Now, remember when I said that structured data is not about the file format, but about the file content? Let's say that in one of these timestamp values you had a comma by mistake. That would make this an unstructured file, because you would have something that confuses any parser loading it: the parser would treat that character as a delimiter, since the delimiter here is the comma. So file format is one thing and data structure is another; one is not necessarily tied to the other, although some file types are friendlier to structured data. This file is structured because it has already been cleaned for the AWS open datasets.

Here we have VendorID, so this is VendorID 1, then tpep_pickup_datetime; I would need to check the documentation for this dataset to see what tpep means. Then we have passenger_count and trip_distance. Counting one, two, three, four, five, the fifth field is passenger_count, so this value here means one passenger. If you want to see the number of lines in this file, we can run the wc -l command to count them, and we see that this file has 8.7 million lines. So it's a long file, and it's not compressed, so it occupies 733 megabytes.

Now let's go back to Glue and create a crawler that will read this file, infer its schema, and create a Glue table for us. Let's go to Services > AWS Glue, and then on the sidebar I locate Crawlers and click the Add crawler button. Here is the crawler name: I'm going to call it TLC-Crawler. Then I choose a data store for my crawler, and the data store is going to be s3://tlc-ref-demo, the bucket I just created. If I wanted to use the original public bucket, I would need to append /trip data/ and the specific file, but if you just specify a bucket, the crawler will locate all the files in it and infer the schema for them. So this is my path, Next. I could have the same crawler crawl multiple data stores, which is not the case here.

Next I specify the IAM role the Glue crawler will assume so it has GetObject access to that S3 bucket. I have the option of creating a new role or using a role that already exists in my account. Let's create an IAM role; I'll call it CrawlerRole, Next. Then comes the frequency at which I want this crawler to run; for this demo I choose Run on demand, which means I will invoke the crawler manually. And here I add a database: this is a database that will be created in the Glue Data Catalog, and I'm going to call it tlc. I could also add a prefix to the tables the crawler creates as its output, but I will leave that empty and hit Next. Finally, I can review the crawler: it will use this IAM role, it will run on demand, and its output will go into a database called tlc in my Glue Data Catalog. Finish.

Now the crawler has been created, and because it runs on demand it needs to be started manually, so I select it and click Run crawler. It is attempting to run, and now the crawler is running. What this crawler does is go to the S3 bucket I specified, perform a GetObject on that 733-megabyte file, read the first lines, see VendorID, tpep_pickup_datetime and the rest of the structure inside that CSV file, and create a table in the Glue Data Catalog using the CSV SerDe. It detects that the field separator is the comma and creates a table with those columns.
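For reference, here is a sketch of how the same crawler setup could be done with the AWS CLI instead of the console. It assumes you already have an IAM role (called CrawlerRole here) that Glue can assume and that has read access to the bucket; the other names mirror the demo.

```bash
# Create the catalog database the crawler will write into
# (the console wizard creates this for you when you add a database)
aws glue create-database --database-input '{"Name": "tlc"}'

# Create the crawler: point it at the demo bucket and at the tlc database
aws glue create-crawler \
  --name TLC-Crawler \
  --role CrawlerRole \
  --database-name tlc \
  --targets '{"S3Targets": [{"Path": "s3://tlc-ref-demo/"}]}'

# Run it on demand (same as clicking Run crawler in the console)
aws glue start-crawler --name TLC-Crawler

# Check its state and, once it finishes, inspect the table it created
aws glue get-crawler --name TLC-Crawler --query 'Crawler.State'
aws glue get-tables --database-name tlc
```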
When we go back to the crawler, you can see that it added one table to our Glue Data Catalog. If we click on the crawler, we can see its configuration. And if we go to Databases in the Glue catalog, you will see that there is one database created, and if you click through to the tables in that database, you can see which tables it contains. And here is the most interesting part: the output of what the crawler did. The crawler scanned the file and inferred the schema for me, with the specific data types corresponding to each of the fields inside that CSV file. If I open the CSV file with vi on my computer, you can see it is a long file; as we saw with wc, it has 8.7 million lines, so it is basically endless. What the crawler did was read the first couple of thousand lines to infer the schema.

So here is the table, with its column names and data types, and here is the SerDe that was used, the CSV SerDe, a very simple one. The main benefit of the crawler shows up when you have a custom file: an access log, a file that is not CSV, a JSON file, or a columnar file format. And this table is a table I can query with Amazon Athena. You can also edit the schema, for example by removing some fields if you don't want them in the Athena table; the file on S3 remains untouched, but the table in the Glue metastore will no longer have that specific column. You can also add columns. The crawler is also responsible for scanning the files from time to time and altering the table schema if the file schema changes; when you configure the crawler, you can tell it what to do in that case, whether it should just notify you or change the table schema for you automatically.

And that's it. This is how you create a Glue crawler, point it at your data set, and have it infer the schema for you.
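As a quick follow-up to the Athena remark above, here is a rough sketch of querying the crawled table from the command line. The table name tlc_ref_demo is an assumption on my part (crawlers typically derive the table name from the S3 path they crawled, so check your catalog), and Athena needs an S3 location of your choosing for its query results.

```bash
# Submit a query against the table the crawler created
# (tlc_ref_demo is an assumed table name; verify it in the Glue Data Catalog)
QUERY_ID=$(aws athena start-query-execution \
  --query-string 'SELECT vendorid, passenger_count, trip_distance FROM tlc_ref_demo LIMIT 10' \
  --query-execution-context Database=tlc \
  --result-configuration OutputLocation=s3://tlc-ref-demo/athena-results/ \
  --output text --query 'QueryExecutionId')

# Poll until the state is SUCCEEDED, then fetch the results
aws athena get-query-execution --query-execution-id "$QUERY_ID" --query 'QueryExecution.Status.State'
aws athena get-query-results --query-execution-id "$QUERY_ID"
```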