Okay, so two methods for dealing with dirty data. First, clean it with SQL. Second, as I've alluded to quite a lot, there's a UI tool called Cloud Dataprep, and that's going to be the basis of two labs for this module. So, cleaning it with SQL: if your data needs to conform to business rules, you can do a lot of that directly within SQL in BigQuery. You can say this field is allowed to be NULL, meaning it's allowed to have an empty value, or it's a required field. If you want people to provide both a first name and a last name, you make both of those fields required, and you can proactively check whether those values are NULL after the fact. You can apply conditional logic like IF, or in SQL a CASE WHEN, or IFNULL, do this operation. There's a lot of conditional logic you can apply to test for data validation. And since BigQuery is an analytics warehouse, you can also require those primary keys and those constraints in the upstream systems, before the data even gets into BigQuery. So if you're the BigQuery admin or the data analysis guru of your group, you can push those constraints onto the source systems. [LAUGH] If someone's giving you bad transactional data, don't clean up their mess, tell them to take this class [LAUGH] and to clean up and validate their data before it even makes it into your systems.

Okay, so confirming your data is accurate. Maybe it matches the data types, it follows all the rules about whether it's allowed to be NULL or required, but you need to make sure it actually makes sense. Take the item price, for example: item price times quantity ordered should always equal the subtotal. You can set up a test or a check for that inside of SQL, or you can look up values against an objective reference dataset. Say you have the 50 states of the United States: you could have a lookup table and ask, is the state that's in this column within the list that I have, using the IN operator. Using a subquery or a join, which we're going to cover a lot more in the second part of this course, you can actually nest that statement directly within the IN operator, and we'll cover that.

Okay, completeness. Completeness is a hard one: it's exploring the data to see if anything's not there. So checking for skew, as we mentioned before, looking for those null values and handling them: if it's null, do this; if this value isn't present, fall back on this value; if that value's not present, fall back on the next one. That's the COALESCE function. You can also enrich your dataset with other datasets, mashing them up together to form that complete picture. Say you have IRS tax filings for 2014 and 2015 and you want to bring both of those together into one consolidated table, you can do that through a UNION. Or say you already had the 2014 and 2015 data UNIONed together and you wanted to join in organizational details like the name of the charity or the address, you can enrich your dataset by doing a JOIN. We'll do more of that in the Cloud Dataprep lab.

Cleaning it, so it's got to be consistent. As we mentioned, one fact in one place, and depending on what kind of dirty data you're dealing with, there's a whole host of functions, as you've seen before.
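Before we get into those specific functions, here's a rough sketch in BigQuery standard SQL that ties the validity, accuracy, and completeness checks together in one query. Treat it as an illustration only: the table and column names (orders_2014, orders_2015, item_price, quantity, subtotal, the us_states reference table, and so on) are hypothetical, not from the course datasets.

```sql
-- A minimal sketch, assuming hypothetical tables and columns.
WITH orders AS (
  -- Completeness: UNION two years of the same schema into one consolidated table
  SELECT * FROM `my_project.sales.orders_2014`
  UNION ALL
  SELECT * FROM `my_project.sales.orders_2015`
)
SELECT
  order_id,
  -- Completeness: COALESCE falls back to the next field when a value is missing
  COALESCE(customer_name, customer_email, 'unknown') AS customer,
  -- Validity: conditional logic (CASE WHEN) flags bad rows instead of silently keeping them
  CASE
    WHEN item_price IS NULL OR quantity IS NULL THEN 'missing values'
    WHEN item_price * quantity != subtotal      THEN 'subtotal mismatch'  -- accuracy check
    ELSE 'ok'
  END AS row_status,
  -- Accuracy: check the state against an objective reference table (IN operator + subquery)
  state IN (SELECT state_code FROM `my_project.reference.us_states`) AS is_valid_state
FROM orders;
```

The WITH clause just keeps the UNION of the two years in one place, so each check only has to be written once and can be rerun as new data arrives.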
Parsing out dates, cleaning up a lot of dates, [LAUGH] is what data analysts spend a lot of their time doing, myself included. Substring parsing, doing that LEFT, RIGHT, MID kind of work, is all handled through SUBSTR in standard SQL, and then replacing values as well. So again, the earlier you do this, before you start your really complex analysis, the cleaner you make your dataset, the better off you're going to be. And it doesn't mean you have to replace your raw data table: all of these steps are going to be parts of a recipe that continues on downstream, so this logic is repeatable when new dirty data comes into your system.

Last but not least, uniform. You want to make sure that all your data types are similar and that you're comparing apples to apples, especially when it comes to visualizing your results. This is where, in Data Studio, you're going to correctly label the units for the charts and graphs that you're creating. And throughout the rest of your code, particularly remembering that Mars spacecraft burning up example, if you're using the English system or the metric system, document [LAUGH] that heavily using comments in your code.

Okay, here's an interesting example. We've got some weather data that we're pulling from NOAA, and we're basically saying, give us all the weather data where the US state is not NULL, meaning that it is present, and limit to the top 10 results. This is me throwing another kind of trick question at you. The state has to be present, so the question is: why does the query still show these blank state values when it's clearly filtered on IS NOT NULL, right? [LAUGH] And the correct answer is, it's because it's not NULL. It's not NULL because it's blank, a valid blank value. You can't see it normally; if you could highlight it, it would be an empty string, or maybe someone hit the space bar once. A valid NULL value in BigQuery looks like what you see in the latitude and longitude columns there: it will actually say N-U-L-L. And if someone in their [LAUGH] wrong mind wanted to store a string value as N-U-L-L, you have my permission to yell at them, because that would just drive a data analyst absolutely nuts. But you would still probably be able to determine whether it's a valid NULL, a string stored as N-U-L-L, or a blank [LAUGH] with enough SQL analysis. If you see that one, just let me know.
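If you want to see that distinction for yourself, here's a hedged sketch of telling a real NULL apart from a blank string or a literal "NULL" string. It assumes the public NOAA GSOD stations table with name and state columns; adjust the table and column names if your schema differs.

```sql
-- Illustration only: distinguishing NULL vs. empty string vs. the text 'NULL'.
SELECT
  name,
  state,
  CASE
    WHEN state IS NULL         THEN 'real NULL'               -- shown as NULL in the results grid
    WHEN TRIM(state) = ''      THEN 'blank or whitespace'     -- the sneaky case that passes IS NOT NULL
    WHEN UPPER(state) = 'NULL' THEN 'string stored as NULL'   -- you have my permission to yell [LAUGH]
    ELSE 'has a value'
  END AS state_check
FROM
  `bigquery-public-data.noaa_gsod.stations`
LIMIT 10;
```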