Now, one of the most critical parts of data analysis happens way before you build your first visualization or your machine learning models. And while not necessarily the most glamorous step, data preparation and transformation can't be ignored. In this module we're going to cover the elements of what makes a data set good, and then look at two different ways to process, clean, and prepare your data. One, as you might expect, is through SQL in BigQuery, and the other is through a graphical user interface in Cloud Dataprep.

Most data quality lectures begin with a statement like this: garbage in, garbage out. What that means in our specific case is, if you want to build amazing machine learning models, you can't feed them garbage. You've got to feed them really good data to begin with. If your end goal is to build awesome ML models after you've mastered this course, and the data engineering course after it, then this is your real first step. You have to have extremely good, clean data before something like an ML model, or even just insights from SQL, can operate at that caliber, because ML specifically relies on consistent patterns in data. If you start feeding it garbage data, it's potentially going to overfit your models and do a whole bunch of other bad things you don't want it to do. So let's talk about some of the things we can do to prevent that.

High quality data sets follow five strict rules. They're valid: they conform to business rules. They're accurate: they match some true, objective value. They're complete: you can see the whole picture, and you're not being fooled by just a subset of the data. They're consistent: you can derive insights you can actually rely on. And they're uniform. There's a fun example we'll go through for each of these.

First stop, validity. Data follows constraints on uniqueness. So what do these three images have in common? They're all unique identifiers, right? The telephone represents a unique number where I can reach you and you can reach me. If somebody else had my same number, I'd be worried; I don't even know how that would work, you would call and two people would answer. It just wouldn't work, right? In the same way that that sounds ludicrous, having multiple identical physical addresses, or people sharing the same license plate, presents a massive problem. And it's the same for your data. Later in the course we'll talk about joining data together, and earlier we saw the case of the charitable organization whose EIN had multiple filings within the same year. That presents complications, like we talked about, when you're summing things and making assumptions that every record is an individual one. If they're not unique, you can run into some serious complications, especially when we talk about joins later, where you can unintentionally cross join your data between tables.
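Here's a minimal sketch of that uniqueness check in BigQuery Standard SQL. The table and column names (`irs_990_filings`, `ein`, `tax_year`) are hypothetical stand-ins for the charitable-filings example, not the course's exact schema:

```sql
-- Hypothetical table/columns: flag any EIN that filed more than
-- once in the same tax year, which violates the uniqueness rule.
SELECT
  ein,
  tax_year,
  COUNT(*) AS filing_count
FROM
  `project.dataset.irs_990_filings`
GROUP BY
  ein,
  tax_year
HAVING
  COUNT(*) > 1
ORDER BY
  filing_count DESC;
```

Any rows this returns are exactly the duplicates that would silently inflate your sums or fan out your joins.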
Second up, valid data corresponds to constraints of ranges. So here we go: which values are out of range? Say you've got some dice, and you've got the rolls here with their occurrences. If you're staring at number seven, that could potentially be right. Say you're rolling one die: is it possible to get a seven? No, it's not possible; a seven is out of range.

Alternatively, since I did mention dice, say you're rolling two dice. What would be the out-of-range value here? That would be roll number five, with a value of one: if you roll two dice, you can't get one as the total.

The meta point here is to know ahead of time if your data smells funny. Or if, after the exploratory analysis you perform, you find that, hey, this should always be X, or this should always be Y, and it falls within one of these ranges, you can set those expectations up as a test, much like they do in software development. You can do that within SQL, and you can do that within one of the other tools I'm going to show you, Cloud Dataprep (there's a quick SQL sketch of this after this passage). For any of these values that make your data set weird, or give you pause, you can have automatic notifiers that basically say: hey, this is dirty data right here, I'm going to go ahead and clean it up and transform it. In this particular case, you could say: if the value is one, I want to exclude that roll entirely. Or maybe you completely invalidate the results, because there's something obviously wrong with the dice. There are many things you can do, but setting up that validation in an automated and repeatable way is one of the end goals here.

Moving on, data has to match a source of truth, right? That's what makes it accurate. Say you're in the United States and you're presented with this list: Washington, Oregon, California, Hot Dog, Florida, and Maine. And you're like, wait a minute, I don't remember a US state called Hot Dog. It doesn't match what we know as a source of truth. It's a string, so it matches in terms of data type, but against a known lookup source it doesn't match. And again, there are debates, as you're going to see in your Cloud Dataprep lab, over how many US states there are. Are there 50, as is commonly known, or are you also including US territories, or post offices that potentially exist in other countries? You'll see that come up again soon.

What about this? Now we're going to talk about completeness. Thanks in large part to the ubiquity of the image on the right, you'll be able to discern that it's actually Big Ben. But here you're looking at a subset, or sample, of your data and trying to make insights or assumptions about the entirety of your data. Going back to the last slide, you could say: my data set is a bunch of clocks and lamps, and they look funny, and they're low resolution. When in reality, the greater picture of your data set, because pixels are just numbers, right, is much, much more complex. It's actually: no, this is a specific location, this is Big Ben, this is London. It's funny how we can immediately notice something that's wrong with a picture; a picture's worth a thousand words. But with a data set, if someone hands you a CSV file or a bunch of JSON, unless we've practiced to, we generally don't ask: is this all the data? Is this a subset? Am I missing fields, or rows, or columns? Is this complete, or do I need to do additional work to make sure I'm collecting everything I need to collect? That's especially true when you're matching up data from a variety of different silos. So don't be afraid to ask those questions: is my data complete, do I have everything, are all the rows accounted for?
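To make that range test concrete, here's a minimal BigQuery Standard SQL sketch. The `dice_rolls` table and its columns are hypothetical:

```sql
-- Hypothetical table: with two dice, any total outside 2..12
-- is out of range and should be flagged or excluded.
SELECT
  roll_id,
  roll_value
FROM
  `project.dataset.dice_rolls`
WHERE
  roll_value NOT BETWEEN 2 AND 12;
```

You could wrap a check like this in a scheduled query so it runs automatically, which is the repeatable-validation idea from above.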
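And the Hot Dog problem becomes a lookup against a trusted table. Again, the table and column names here are hypothetical, a sketch rather than the lab's exact schema:

```sql
-- Hypothetical tables: any state value with no match in the
-- trusted lookup table fails the source-of-truth check.
SELECT
  r.state_name
FROM
  `project.dataset.records` AS r
LEFT JOIN
  `project.dataset.us_states` AS s
  ON r.state_name = s.state_name
WHERE
  s.state_name IS NULL;  -- 'Hot Dog' would show up here
```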
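For completeness, a quick profile of missing values is a reasonable first pass. This sketch assumes hypothetical column names:

```sql
-- Hypothetical columns: compare total rows against how many
-- values are missing per field before trusting the data set.
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(state_name IS NULL) AS missing_state,
  COUNTIF(filing_date IS NULL) AS missing_filing_date
FROM
  `project.dataset.records`;
```

It won't tell you about rows that never arrived, but it will tell you whether the rows you do have are fully populated.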
Last but not least, we have consistency, and here's the dilemma: you potentially have two owners for one house. You have two different tables: a Houses table with the address 123 ABC St and an Owner ID of 12, and an Owners table with an Owner ID of 15 whose address is 123 ABC St. In database 101 we call this a referential integrity issue. So who actually owns that house, whose name is on the deed? Having consistent data, what we call one fact in one place, gives you harmony across many different systems, and especially when it comes to joining this data later, you don't run into issues (there's a quick SQL sketch of catching this at the end of this section).

I lied, that wasn't the last one. Uniformity is the last one, because this is the example I really like to share. Just before the year 2000, in 1999, NASA unfortunately lost the $125 million Mars Climate Orbiter as it burned up in the Martian atmosphere, because one team was using English units like feet and yards, while another team, working on a different part of the system, was using the metric system. It seemed like an innocuous problem at first: it's just English versus metric, who's right? But at the end of the day, when you're building a system as complex as this, you need to make sure everyone is using a uniform system of measurement.
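Going back to the house example, here's a minimal sketch of a referential integrity check. The `houses` and `owners` tables and their columns are hypothetical:

```sql
-- Hypothetical tables: surface addresses where the two tables
-- disagree about the owner (one fact should live in one place).
SELECT
  h.house_address,
  h.owner_id AS owner_id_in_houses,
  o.owner_id AS owner_id_in_owners
FROM
  `project.dataset.houses` AS h
JOIN
  `project.dataset.owners` AS o
  ON h.house_address = o.owner_address
WHERE
  h.owner_id != o.owner_id;
```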
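And the uniformity fix, in data terms, is to normalize everything to one system of measurement before anything downstream consumes it. A hypothetical sketch:

```sql
-- Hypothetical measurements table with a units column: convert
-- everything to meters so downstream math uses a single system.
SELECT
  measurement_id,
  CASE units
    WHEN 'm'  THEN value
    WHEN 'ft' THEN value * 0.3048  -- 1 foot = 0.3048 meters
    WHEN 'yd' THEN value * 0.9144  -- 1 yard = 0.9144 meters
  END AS value_meters
FROM
  `project.dataset.measurements`;
```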