[MUSIC] Welcome back. This lesson will focus on data quality. If you've noticed, we've been using the same data set for many of these lessons. It's the sales superstore data set. The thing about that data set is that it's nice and clean. Everything has been nicely pre-cleaned for you. So you don't have to worry about that stuff. However, in the real world most of the data that you gathered does not come to you already cleaned. The fact of the matter is that the job of doing data analytics includes getting the data in good shape. You need to get to that level of data quality, so that you can do further analysis. I'm going to show you how to do that. After this lesson, you will be able to address data that come to you in bad shape or aren't clean. And apply techniques using Tableau to improve the quality of the data quickly. Data cleaning is a very important skill to have. Quality data cleaning is actually fundamental to your work as a data analyst. Often the data you must analyze will be incomplete, poorly or improperly coded, not in the right format or from a variety of different sources. Your job is somehow to be able to mash the data together and process to make it coherent and correct. In the real world, you're not going to be given clean data, it's just not going to happen. So you have to learn how to be a able to clean it. And there is no way for anyone to explain every possible scenario and I'm not going to pretend it here. But suffice it to say, you should never assumed that you had clean data given to you unless you have verified it yourself. Now if you remember, I went over some techniques on how to address outliers and anomalies in course two. Outliers are really cool, most of the time. Unless you really need to get that cool visualization done, and these outliers are messing it up. But they're not always coding errors, they might in fact, be important areas of study in and of themselves. However in this lesson, I'm taking a more practical approach on how to address data that you're stuck with. I have to deal with data quality issues on a daily basis, and you will too. And this is just sort of one quick trick to be able think through how you can address data quality issues. And just remember, it's not okay to just throw up your hands and say, it's too hard for me or the data are garbage. You're stuck with the data and you have to investigate to see if you could fix it. So for this lesson, we're going to go through an exercise in cleaning the data. And were going to go with a point of view that we don't have a lot of time to do it. And we just want to get something in good enough shape to have a nice visualization and so I love to do geographical data. So I'm going to scrape a somewhat random list from Wikipedia. So if you go to Wikipedia a lot, there are plenty of tables that show a bunch of data which is very interesting. And you can just copy and paste the tables. So we're going to go to wikipedia.org/wiki, that's W-I-K-I/list_of_tallest, T-A-L-L-E-S-T _building. And that should be the one on your screen there. So what I'm going to do is I'm going to scroll down a little bit just the second here. I'm going to scroll down, and I am going to select this table here. Make sure I grab everything. I'm going to right-click and select Copy. As I'm doing here on the screen, you can see that. So I'm using the Chrome web browser, but it's probably the same with Microsoft Edge or Firefox. So now we're going to go to Tableau. So I'm booting up Tableau as you can see here, and this is really cool. I think I've showed you this before, but if you hit File and then you go to Paste. You can just paste the data directly in to the data or in to the Tableau. So let's look at this, wow. Look at this, but it's interesting. Looks like we have some issues here, and this is unfortunate. I want you to notice that the column names have F1, F2, etc. Instead of what we should have which is rank or building or whatever then if we scroll down to 12, scroll down to 12. Look at that. You see, we have a much bigger problem. You notice that Patronas Towers, so it's between 11 and 13 there. Patronas Towers appears to show up in the rank column instead of the building column. That is problematic, because you have to either manually nudge this or do something else. So we're going to do that something else in a moment. So at this point you could either throw up your hands and say that the data are just in very bad shape. And there's nothing you could do, or you could try to find a solution very quickly. As you might guess by this, I actually have a solution which is pretty interesting and is actually a pretty simple fix. And in fact, most of the fixes that you can do are actually quite simple as well. So it's not as if this is a unique fix to this particular data set. It's often the case where there is data when you import it looks like it's completely in bad shape, but a little clean up usually fixes the problem. So this is very illustrative of that. So what we're going to be doing is, we're going to go back to Wikipedia and look at how the data displayed in the table where we pulled from. Let's see here one, two, three, four and five I'll look at that. Look, I've just circled here 11 and this number 11 shows up for both Patronas 1 and Patronas 2. These are two different building. So when I scroll further down, I see that the Central Plaza and Hong Kong and the Federation Tower in Moscow are also the same height so they have the same rank. Then if we go further down even more, I mean there's a tone of them. But we can see here that the JW Marriott and Dubai and also Keangnam Hanoi Landmark Tower and the Pinnacle and China, and a bunch of other buildings are all having issues. Look at this here at 59, I've circled that right there. And what's crazy there is that you have all of those other buildings there, in China and UAE and in the United States. So these are all odd, it has the same issue. So what are we going to do about it? What are we going to do about this? This is a problem, right? I'll tell you what we're going to do about this. We're going to fiddle with this thing up here, and just watch what I'm doing. You see here I have my mouse over the rank, so I'm just going to click on one of these arrows and I wonder what it does. And I just change the sort order by clicking on that. And look at that, it has put the numbers next to each of the buildings so that actually solves the problem. I think we'll have to see, there's actually no guarantee it's going to fix this, so I'll have to see. [LAUGH] I actually have not done it before, I wanted to sort of make it as pure as possible in terms of my description. But we're going to do what we did before, we're going to copy and paste this new version of this table which seemed to have fixed the issue with the rows in Tableau. But I'm going to paste it again, Just like I'm doing here, and then we'll see if it actually fixes the problem. So you can scroll down into data set, where I placed the data, and it put an 11 on each. And it looks as though it fixed it very nicely, so we should all be all set. This is really great, I'm really excited about this. The interesting thing here is that you notice that the row labels are still not really in good shape. And I think that this is something I'm just going to have to fix manually, which is not unusual at all when we're cleaning data. We'll just rename those fields. So let's do this, we'll just double-click on it and just call it Rank. And then we'll call this Building, City, State, or actually it should be Country, I'll just. Feet, Meters, Floors, and finally the Year that it was constructed. So if you notice actually the years constructed and has footnotes which is kind of annoying. And so we're going to do a little trick here to be able to take care of this. And we can remove this very easily and what we're going to do is we are going to click on the down arrow here, as I'm showing you. Go to Custom Split, click on that. Then we are going to use a separator here. We are going to use a open bracket, which is exactly how the footnote is described. You see it on the screen there, you can replicate it here. And then we're going to split off the first one and then look at that. I have just split, I have created two variables there is Year Built which is original. Then there is Year Built-Split1 which takes the values from the year, and splits it at that open bracket. Fix the issue with the year. See how easy it was actually? I'm going to do that with Meters as well, and you can just follow along here, and I again fixed the issue with the Meters. So now here's the thing, and I should address in right now. I want to build a map, and the issue here is that we want to deal with geography. If you can look back at the Data Source here as you can see here, you can see that the city and country are labeled as texted. I want to change those, so I'm going to click on the ABC. And then I'm going to assign it a geographic role which is a City, just like that. Just click on that and viola it is a Geographic Role. I'm going to do that same thing for Country. Change that to a Country or Region. And so now we can create a new worksheet. And what we're going to do is we're going to drag the longitude over to the columns and the latitude over to the row. So we're going to need a filled map, it's no good having a blank map. So let's drag Country over and we're going to drop it over the Color. No we're going to drop it over Size I think, we'll do Color for now I'm giving it away. And then I'm going to click on actually to Show Me and then the field map, which allows me to see all of these countries in some detail. So you can see here, we have Vietnam, and we have Saudi Arabia, China. We have China which has the most, Australia again. And here's little UAE there, did you see that? And then I'm going to zoom actually, and really hover over UAE. Because it actually has a lot of tall buildings there, mainly in Dubai. But it doesn't actually tell me that in the map now, does it? It just tells me that there is a really tall building in the country of UAE. So I think I'm going to do cities instead because it doesnt really tell me anything here. So what I'm going to do, I'm going to add City to our map, and just follow along here. City will get dragged over to the Size, and here I got this note which says, Filled maps are not available in the geographic city role. Which makes sense because of course, cities are one small part of a country. So that's cool, but it does show you exactly where it is, right? So there all of these cities, and you can see some of these cities that are well known, that are around there. But look over here on the right-hand side. Think we have another issue coming up. But it looks like all these other Chinese cities are there I think, right? And Dubai and all of these places, but I have a feeling we have an issue. Because we do see those five unknowns and we're going to get to those in just a second here. So what are we going to do with that five unknowns? This is a big deal, right? Because we're supposed to be able to show everything in this. And what you're going to do here is we click on it, and that's all we do we just clicked on that five unknowns. We clicked on Edit Locations, and then look at this. Here it is, we have five ambiguous or unrecognized cities. Now it's kind of funny because it doesn't recognize Hong Kong, which almost everyone in the world knows. The other cities, they're probably in China. And so Tableau's having trouble recognizing them. So what we're going to do is we're going to actually work on these. So we're going to click on each of these and see what we could do with it. So let's go ahead and do that right now. So what I'm going to do here is just enter latitude and longitude. So this is actually an exercise about and lesson about data quality not about how to fix mapping. So I'm not going to go through this in detail except to say that you could see what I'm doing on the screen, and it says Enter Latitude and Longitude. And if you have that, you have to look it up. So what you do is you go to Wikipedia or Google, and then say for little town of Wuxi, W-U-X-I. You Google it, there Wuxi latitude and longitude, I can't spell I guess. And you can then find the coordinates and fill in the information, so you do that for each of them. You find, look at this, you find the coordinates that you can see right here on my mouse and you just type those in to Tableau. And then it will determine that's where the city is located. We'll fix each of this particular cities and where they're located, and they will get input into this map as well. The other thing that I want to do before I go is I want to maybe put the building names there. And you saw a little bit of what I was doing, which was dragging the building to the label. So I drag the building to the label so that here we have a pretty decent way to show our map. We had to do a little bit of cleaning, but at least we got it into decent shape. Now there are many more interesting things you could do with this map. This is not a beautiful visualization. It does not conform to all of the rules that we've been talking about for good visualizations. But in terms of insuring that we can clean the data, this is a good example of how we can clean messy data, very quickly. And come up with something that we wouldn't have otherwise. So this concludes my demonstration of using Tableau to clean up data. After this lesson, you should be able to address data that comes to you in bad shape or is not clean. And apply techniques using Tableau the improved quality of the data quickly. We'll see you next time.