Our first lesson is about working with different types of data. We need to talk about data types because regression and many other predictive models accept only numerical variables, whether as independent or dependent variables. But you immediately realize that real-world data sets include many other data types, so what do we do? Well, in many cases, we have a strategy for converting these non-numerical variables into a numerical format that is meaningful for your model. But first, let's look at what the different types are. I'm going to organize the types by their format and by their content. I know it's a little bit nuanced, but it is a complex subject.
By format, we can recognize that some data fields are numbers. They're integer or decimal values, such as GPA, sales, or even a cell phone number; they're all digits. Other fields are string or text values, such as gender in the form of female or male, a company name, or text messages. And other fields are clearly date and time variables.
You may also classify your data by the content it contains. First, there's the numerical type of data. These are integer or decimal values that represent a count or a measurement. Not all numbers represent a count or a measurement, though. For example, a ZIP code is a number, but it's not a count or a measurement. The same goes for Social Security numbers and customer IDs. Then there's the categorical type. These are discrete values that fall into a predefined number of categories. Within the categorical variables, there are different types. One type is called ordinal, which means the values are naturally ordered. For example, letter grades from A to F are ordered. Skill levels, say low, medium, and high, are also ordered.
Then there's the nominal type, which is the opposite of ordinal: the values have no natural order, such as race or hair color. And there's another special category, which is called binary. A binary variable has two values, which could be Yes/No, True/False, or 0/1. Now, you notice that categorical variables can be presented as strings or numbers. The key is the nature of these variables, not their format. Lastly, there are text variables. Text variables are non-numerical data; for example, they could be addresses, names, text messages, and email bodies. You also realize now that not all string variables are text in nature. Some string variables are categorical variables; they just represent different categories. So that's where some of the complexity comes from. If you're confused already, don't worry. We're going to talk about the different types in more detail, along with our strategy for dealing with them.
So let's first talk about categorical variables. Categorical variables are discrete values, and I want to emphasize that they can be expressed as strings or numbers. Let me give you some examples. First, hair color may be entered as a string variable, such as black, brown, blond, red, white, and so on. Sometimes the values are indexed or encoded, so maybe 1 represents black, 2 represents brown, 3 represents blond, and so on. So the same information can be entered either as text or as numbers. What type of categorical variable is hair color? It's nominal, because there is no natural order in hair color. Then there are letter grades. Letter grades may be entered as A, B, C, D, and F, or as 4, 3, 2, 1, with 4 representing A, 3 representing B, and so on. Again, there are two formats for letter grades. And because letter grades are naturally ordered, they can be viewed as ordinal variables. Then there's customer satisfaction. In customer satisfaction surveys, the answers range from very dissatisfied to very satisfied.
They may also take on numerical values from 1 to 5. So again, the same information can be entered as text or as numbers. This last one is also ordinal.
Let's also talk about string variables a little bit. As I mentioned earlier, some string variables are really categorical or binary variables, such as Female/Male, True/False, or even satisfaction levels. Other string variables are free-form text, such as product reviews, emails, and company names, and we often discard this second kind. However, there are also text mining tools that can sometimes help you extract useful information from such free-form text.
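
To make the hair color and letter grade examples concrete, here is a minimal Python sketch of how the same categorical information might be stored either as strings or as numbers. The function names, the sample values, and the code 0 for the grade F are assumptions made for illustration; they do not come from any particular library.

```python
# Nominal variable: the order of this list is arbitrary, so the integer
# codes below act only as labels (1 = black, 2 = brown, 3 = blond, ...).
HAIR_COLORS = ["black", "brown", "blond", "red", "white"]

# Ordinal variable: the numbers preserve the natural order of the grades.
# The value 0 for "F" is an assumption; the lesson only spells out 4 to 1.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def index_encode(values, categories):
    """Map each category label to a 1-based integer code."""
    codes = {label: i for i, label in enumerate(categories, start=1)}
    return [codes[v] for v in values]


def index_decode(codes, categories):
    """Map 1-based integer codes back to their category labels."""
    return [categories[c - 1] for c in codes]


# Hypothetical sample data.
hair = ["brown", "black", "blond", "brown"]
grades = ["B", "A", "D"]

hair_codes = index_encode(hair, HAIR_COLORS)
grade_codes = [GRADE_POINTS[g] for g in grades]

print(hair_codes)   # [2, 1, 3, 2]
print(grade_codes)  # [3, 4, 1]

# The encoding loses no information: the codes decode back to the strings.
print(index_decode(hair_codes, HAIR_COLORS) == hair)  # True
```

Comparing the grade codes is meaningful, since 4 is greater than 3 just as A is better than B. Comparing the hair color codes is not: 2 being greater than 1 says nothing about brown versus black, which is what being nominal means.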