Welcome to another BigQuery demo. Here we're going to look at some advanced BigQuery functions. Specifically, some geographic information system functions, or GIS functions. Wherever you have latitude and longitude data, there are some really neat built-in GIS functions and mapping capabilities inside BigQuery, and some really cool plugins, as you're going to see. So first we need a really cool data set. I'm going to be using the BigQuery public data set on London bike sharing. This is millions of bike rides happening around the city of London: where they're coming from, where they're going to. The city of London has fixed bike commuting stations that people can rent bikes from and then ride between places. So first off, what are the tables that we're going to be looking at? You've got your cycle stations, which are just fixed locations around London. That has basic information about where you can rent the bikes, and a really important part here, as we're going to cover later, is that it has a latitude and longitude. Which means we can put it on a map, which is awesome. And then we need some activity. Who's renting the bikes? How long are they spending on the bikes? How fast are they going? We'll actually show you how to impute that in a pretty easy straight-line fashion. That's cycle hire. So our end goal is to create a map, and the first problem is to find the fastest bike commuters [LAUGH] inside of London. Those people who had the fastest average kilometers per hour. Because we're in the UK, we're going to use kilometers per hour. Going from a starting station to an ending station, fastest average speed, for rides that are more than 30 minutes. So people who are really getting a workout in. So I'm going to copy this first, seemingly long, query, and I'll break down every part of it for you before we run it, so it all makes sense. So paste it in there into BigQuery, and we have a couple different tables.
And again, I like to explore the tables first. So we have the stations themselves. So I'm going to pull the stations up. And how many stations do we have? In the details, we have 778 stations. Assuming that each row represents an individual station, that's a very small amount of data. Here's what the data actually looks like. Each station's got an ID, and, really important for GIS, it's got a latitude and longitude. These data types, as you can see for latitude and longitude, are floats. They're not GIS data types yet. Just because you have latitude and longitude does not mean it's a geographic point yet. I'm going to show you a very easy function called ST_GEOGPOINT, geography point. It's going to turn those floats into a geographic point that you can then put on a map. So you've got lat and long, which is great, and that's pretty much all we need from here, because then we can get the name of the station, which is useful for a map label as well. So, how do I take this raw data and get it into a good GIS format? That's the stations; the activity is in the cycle hire rentals table. There you get, as you might expect for a transactional table, how long the bike was rented for (the duration) and a primary key, the rental ID. You can count on those to get how many rentals per station. The same bike can be rented more than once, which is interesting for bike maintenance. And then there's where the ride started and where it ended. But honestly, you don't need the station names from this table, because you already have the station ID, which joins back to the stations table with a lot more information; the latitude and longitude for the station is not in here. So we can mostly ignore the ending station name here. But we do have when the bike was rented and when the rental ended, with the end date there. So let's take a look at this query. I call this staging, because I like to get my data into a good format; it's the pre-processing before we actually go. So what do we actually do? The WITH clause says, hey, give this sub-query a name.
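As a quick sketch of that float-to-point conversion (the table and column names here follow the public London bicycles dataset, but double-check them in your own project):

```sql
-- Turn raw latitude/longitude floats into a GEOGRAPHY point.
-- Note the argument order: ST_GEOGPOINT takes (longitude, latitude).
SELECT
  id,
  name,  -- handy later as a map label
  ST_GEOGPOINT(longitude, latitude) AS station_point
FROM
  `bigquery-public-data.london_bicycles.cycle_stations`;
```

The longitude-first argument order trips a lot of people up, since we usually say "lat, long" out loud.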
Everything below here between these two parenthesis lines, line 1 to line 34, is a named sub-query. Technically you could stuff all of this into the FROM clause of the query below it, but it wouldn't be very readable. What are we actually doing? I'm basically saying, select as a STRUCT. A struct is kind of like a pre-joined table; it's just a simple container, and I use it for readability. It basically says: all of these fields, I want you to prepend, or prefix, with the word starting. So it'll be starting dot name or starting dot point. Because you're going to have a lot of similar-sounding column names, since you have a starting station and an ending station, I prefer to use a struct just for that reason. So we've got a lot of information about where the ride came from, like the station name, and we've used ST_GEOGPOINT to turn the longitude and the latitude into an actual geographic point, which BigQuery natively supports. There's some other information about the station too, like how many bike docks it has, and when the station was installed. Lots of other good information in there as well. So I'm actually going to throw just a quick LIMIT on this query. If you look at the cycle hire table, it's a very, very large table; this is 24 million records. So let's throw a limit on there to see what some of this data actually looks like. Running this, we're doing two joins actually, because we can do kind of like a self join to get at both the starting and the ending station. Here's what it's actually going to look like in the end. And again, you could store this as a table if you want; I'm going to talk a little bit more about why you may or may not want to do that. With this here, starting dot name is the station where the bike started from, that's how many bikes are at that dock, that's when the station was installed. Where did it go?
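Condensed, the staging pattern described here looks roughly like this. This is my reconstruction from the narration, not the lab's exact query: the aliases (`staging`, `start_stn`, `end_stn`, `h`) are mine, while the table and column names follow the public dataset.

```sql
-- Join the rentals to the stations table twice (once for the start,
-- once for the end) and pack each side into a STRUCT so columns read
-- as starting.name, ending.point, bike.duration, and so on.
WITH staging AS (
  SELECT
    STRUCT(
      start_stn.name,
      ST_GEOGPOINT(start_stn.longitude, start_stn.latitude) AS point,
      start_stn.docks_count,
      start_stn.install_date
    ) AS starting,
    STRUCT(
      end_stn.name,
      ST_GEOGPOINT(end_stn.longitude, end_stn.latitude) AS point,
      end_stn.docks_count,
      end_stn.install_date
    ) AS ending,
    STRUCT(
      h.rental_id,
      h.bike_id,
      h.duration  -- ride length in seconds
    ) AS bike
  FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS start_stn
    ON h.start_station_id = start_stn.id
  JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS end_stn
    ON h.end_station_id = end_stn.id
)
SELECT * FROM staging
LIMIT 10;  -- peek at the staged rows without scanning everything
```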
It went from New Springs Gardens Walk to Waterloo Place, St James's, point to point. And everything that starts with bike dot whatever, that's the bike struct, and it has this information. Here's the line. If you actually scroll down in the query, you can get some really interesting data with geographic information system functions, GIS functions. You can basically say, give me two points and I'll give you the straight-line distance between them: that's ST_DISTANCE. Also, give me two points and I'll draw a line string for you, if you're going to visualize this later on a map. So ST_DISTANCE is going to output its distance in meters, and ST_MAKELINE basically says, hey, if you're a GIS visualizer, here's a line string. So you can see that the bike distance is 2,184 meters between those two stations I just mentioned, and this is the point-to-point line to actually draw that, as you're going to see when we dump this out into a GIS visualizer a little bit later. Again, we're looking for average speed. So we've got the bike duration, which I believe is in seconds, 1,980 seconds, and we have the distance it covered, so we can get the average speed from there. And that's exactly what the latter half of this query does. So we have all that information to stage the data; I dumped it in a WITH clause so you can see what it's doing. Yes, you could dump it into a table instead, but you'd lose a benefit if your users are filtering with WHERE clauses. The WHERE clause here, as I note later on down inside of the query, allows you to do what's called automatic predicate push down. It's a really cool, almost magical part of BigQuery. Yes, you can store all this data inside of a permanent materialized table, but if your users are constantly filtering for a very small sliver of that data later on, like we're filtering for durations greater than 30 minutes, then what BigQuery can do is take this WHERE clause filter.
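The two GIS functions just mentioned can be sketched like this, reading from the staged sub-query the narration calls `staging` (a name from earlier in the demo, so this snippet isn't standalone):

```sql
-- Straight-line distance in meters, plus a drawable line string,
-- between each ride's starting and ending station points.
SELECT
  starting.name AS starting_station,
  ending.name AS ending_station,
  ST_DISTANCE(starting.point, ending.point) AS distance_meters,
  ST_MAKELINE(starting.point, ending.point) AS trip_line
FROM staging;  -- the WITH sub-query defined above
```

`trip_line` comes back as a GEOGRAPHY line string, which is exactly what a map tool like BigQuery Geo Viz knows how to draw.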
And before it actually executes, as part of the query execution plan, it can push that filter into the staged sub-query. You're not gonna see this on the UI or anything like that, but it filters the data before it gets processed. So again, the argument against materializing this out to a permanent table is that your users might continuously be filtering on it, so let them benefit from predicate push down. But honestly, it really just depends on your use case. Once we have all that raw data, the actual query that does the average speed is pretty easy. We want to return some basic dimensional information: the starting station, the ending station. The distance in kilometers, going from meters to kilometers, rounded to just two decimal places. This is the line the trip made, just unioning together, mashing together, all those lines; that's essentially how to do a distinct on the trip line for our use case. And the total trips that were made between those two stations, that's the count there. And then this is literally just a very simple average of the bike distance over the duration. So distance in meters over seconds is going to give you your speed. We actually want that in kilometers per hour, so you take meters divided by a thousand, and then seconds divided by 60 to get minutes, divided by 60 again to get hours. And that's how you get kilometers per hour. Where bikes have at least 30-minute durations, and the station pairs have at least a hundred trips between them. And ordering by the fastest average pace first, limit 100. So let's go ahead and run that. I should have been giving this explanation while the query was running. [LAUGH] Because it's churning through all this data and creating all these geographic information system points, it's probably going to take, I think when I ran it last, 30 to 60 seconds. So while that's running, that is going to be the average speed. And boom, here are the results.
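Pieced together from the narration, the latter half looks roughly like this. It's a reconstruction, not the lab's exact query, and it reads from the `staging` sub-query named earlier; the aliases are mine.

```sql
-- Average speed in km/h per station pair:
-- (meters / 1000) / (seconds / 60 / 60) = kilometers per hour.
SELECT
  starting.name AS starting_station,
  ending.name AS ending_station,
  ROUND(ST_DISTANCE(starting.point, ending.point) / 1000, 2) AS distance_km,
  ST_UNION_AGG(ST_MAKELINE(starting.point, ending.point)) AS trip_line,
  COUNT(*) AS total_trips,
  ROUND(AVG(
    (ST_DISTANCE(starting.point, ending.point) / 1000)  -- km
    / (bike.duration / 60 / 60)                         -- hours
  ), 2) AS avg_kmph
FROM staging
WHERE bike.duration > 30 * 60          -- rides longer than 30 minutes
GROUP BY starting_station, ending_station, distance_km
HAVING total_trips >= 100              -- well-traveled station pairs only
ORDER BY avg_kmph DESC
LIMIT 100;
```

`ST_UNION_AGG` is the "unioning together, mashing together all those lines" step: it collapses the identical per-ride lines for a station pair into a single geography, which acts like a distinct on the trip line.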
We don't even have to wait long for it to finish. It actually looks like between Finlay Street in Fulham and King Edward Street at St Paul's, the distance is nine kilometers, and folks on a bike share, and again, this is the average, this isn't even the max, are putting out about 10 miles an hour, or 16.6 kilometers per hour, across 103 trips in this data set. So once you have the distance, the trip line, these geographic points, isn't it better to visualize this type of point-to-point information on a map? You can use this open tool, BigQuery Geo Viz, to plot your BigQuery map data. Let's see, this actually finished in 22 seconds, and I'm going to save these results somewhere. Let's say I'm going to just dump them to a BigQuery table. Let's see, do I have a cool one? Who knows, business? [LAUGH] And then I'll call this just gis_demo or something like that. And boom, dump them to a table. It's on the EU server, and I haven't created a dataset in the EU. So let me create a dataset in the EU, because that's where this data wants to live. And let's see, create a data set. First of all, let me turn off the query cache, and hopefully that'll take, so why don't we run that whole query again. We'll make gis_demo the data set ID, perfect. And then we'll say it's in the European Union, because that's where this London data exists, and we'll let BigQuery decide the processing location. And then, let's see, did it serve from cache? No, because I've disabled the cache. I'll wait 22 seconds for that to actually come back, and then we'll save it as a table. Once you have it as a table, one of the cool things you can do inside of this BigQuery Geo Viz is just literally say, hey, here's my project. So let's get our project ready while we have this. I know it's already done, it's pretty fast. And let's save the results as a BigQuery table. I have the gis_demo data set; we'll just call this table demo for speed, and then we'll dump it into there.
And in my project, that's fine, gis_demo, here it is. Here's the demo table. I'm just going to query this, just so I have that query: give me everything from that table. So for this given project, here's how BigQuery Geo Viz works. It's just kind of an appspot project, publicly available. You paste in your project ID, and then, much like inside of BigQuery, you just dump a query into the query editor that you see here. I'm going to bring this a little bit more on screen. I've got to make sure that I'm opening this in the same incognito window, so let's open it here, and I'm going to authorize it. Yes, as my Qwiklabs account that I already have here. Allow it to view and manage the data. Now I can paste it as the student account, because previously it was trying to do it from my Google email address. And boom, dump that into there, and now let's get some GIS data in here, shall we? So, what does this magical wizard thing look like? I'm going to run this, processing location auto-select, sure, go nuts. Whoa, look at that, that's awesome. It automatically recognized, hey, you've got some lines inside of here, and it immediately zoomed in on London, which is super cool. So we've got this. Now let's look at the data: the geometry column is trip_line. Excellent, I want to style it though. I want the larger lines to represent the faster average speed trips. So I think inside of my demo, what do we have? Yeah, down here, it's the stroke weight, which is the weight of the line. We want it to be a linear function of average kilometers per hour. So let me just try to mimic that. We want the stroke weight, which is the width of the line, to be data-driven, sure. And it's going to be a linear function of the average kilometers per hour, so the faster trips get the thicker lines. And then we want this to be, well, I honestly just messed around with this, to basically say these are all the different ranges that you can have.
Let me just zoom out on this. I played with this a little bit earlier and found domain values that work; this is just so the bigger lines show up a little bit better. Honestly, a lot of this is half art, half science, so I'm just copying what I did earlier. And you can see, as we zoom in on the map, you find that largest line: that is the highest average speed. So I'm going to click on that. And this is exactly what we saw. It started at Finlay Street and went to King Edward Street. The distance is nine kilometers, there are 103 trips that made that journey, and the average speed across all of them was 16.6, so we've come full circle. And again, if you reduce this or increase this, it just changes the weights of those lines as well: how much of a disparity, or deviation rather, there is between the sizing of them. As you can see, not only was that one pretty fast, but there were two other commutes there that were pretty fast as well. And you can make all sorts of assumptions and interpretations about why this one was the fastest route. Maybe people were just saying, hey, we don't want to go around town and explore a lot. Because you can imagine, not all these bikes are just going straight from station to station. Maybe the ones in the inner city of London are tourists that take a long time. Or maybe they're going super fast, but they cover a much, much longer actual distance, because the distance that we're measuring here is just literally the straight-line distance between the stations. So maybe one assumption is, if you're riding from Fulham to wherever the ride is ending up, most likely you're just going point-to-point; you're not using the bike as a touring route. Again, these are just assumptions until you actually go and take a look at the data, and maybe interview the folks that are riding from station to station there.
And see that maybe the majority of them are actually bike commuters instead of tourists. So that's a basic recap: you just went from a raw data set, converted it with GIS functions, and then ultimately visualized it in BigQuery Geo Viz. Nice work. In our demo you saw me use ST_DISTANCE, but that's just one of the many GIS functions supported by BigQuery. Here's an example where we take New York City bike station locations against zip code boundaries, to see how many stations are inside of each zip code polygon, with ST_DWITHIN. A major takeaway is that all lat and long values should be converted with ST_GEOGPOINT into a proper GEOGRAPHY value, displayed as WKT, or well-known text, which is much more efficient to store and query. In the demo we covered ST_MAKELINE; we can also make polygon areas as well, as you see here with these three points in the triangle. Here's an example which queries which points are within 150 kilometers of another given point, and then draws those lines to match against that intersection. Again, these are not driving time estimates, just simple straight lines. You can get pretty advanced with GIS, like seeing if locations intersect with ST_INTERSECTS and all the other functions that are listed here. If this interests you, I'd really encourage you to read up on the GIS documentation and other examples from BigQuery, and see how you can make some pretty awesome GIS insights and maps with your data.
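The New York example just described might be sketched like this. The dataset and column names here are my best guesses at the relevant public tables, not the slide's exact query, so verify them before running:

```sql
-- Count Citi Bike stations falling inside each zip code polygon.
-- ST_DWITHIN with distance 0 means "on or inside the boundary";
-- ST_CONTAINS(zip.zip_code_geom, point) would express the same idea.
SELECT
  zip.zip_code,
  COUNT(station.name) AS station_count
FROM `bigquery-public-data.new_york_citibike.citibike_stations` AS station
JOIN `bigquery-public-data.geo_us_boundaries.zip_codes` AS zip
  ON ST_DWITHIN(
       zip.zip_code_geom,
       ST_GEOGPOINT(station.longitude, station.latitude),
       0)  -- 0 meters: the station must lie within the polygon
GROUP BY zip.zip_code
ORDER BY station_count DESC;
```

Note the same pattern as the London demo: raw float columns go through ST_GEOGPOINT first, and only then can the spatial predicates operate on them.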