This lecture is about creating new variables. Once you loaded the data set into R, sometimes you'll find that some of the variables that you're looking for are not actually in the raw data that you loaded in. So, you may need to transform the data a little bit to get the value you'd like. Usually, you'll add those values back into the data frame. This is particularly important when you're doing something like prediction, where you wanna create new variables but you might use as predictors. Common variables that people create are missingness indicators. So, pointing out where you might have missing data, that's one really common variable to create. Also cutting up quantitative variables, creating variables that are factor versions of a quantitative variable corresponding to particular values that may be of interest. Or applying transformations to deal with data that have a strange distributions. So, we're gonna be using this Baltimore restaurant data set as an example. So again, you could go to their URL and then click over here on this export button to get the URL for a particular version of the data set. In this case, I'm using the CSV version. Again, I see if the data directory exists. Then I download the file from the internet, and I load it into R using read.csv. Now, I've got a data set rest data which is the restaurant data that was created from this data set. And so, one thing that I'm gonna talk about first before I get into analyzing a data set is creating sequences. Sequences are often used to index different operations that you're going to be doing on data, and so it's good to be able to know how to create them. The command that you use in R is seq. And so, usually what you do is you tell seq the minimum value and the maximum value. And then there's two ways in which you usually specify how many values to generate. One is to use the by command, so by=2 means it starts at 1 and then creates new values, increasing each new value by 2. Another way to do it is to specify the length of the vector. And so what that means is, it'll start at 1 and end at 10, and it will create exactly three values. A third way is to say, suppose you have a vector x and it has five values in it. And suppose you wanna create an index, so that you can loop over those five values. You might use this seq(along = x), which will basically create a vector of the same length as x, but with consecutive indices that you can use for looping or accessing subsets of data sets. So, one kind of variable that you might wanna create is a variable that indicates which subset another variable comes from. So for example, here we have these restaurant data. We actually have the neighborhoods that the restaurants are in. And so, I might wanna find all the restaurants that are in two neighborhoods near me, Roland Park and Homeland. And so, if I use this percent in percent command, it will allow me to find only the restaurants that are in those two neighborhoods. It will return true if you're in that neighborhood and false if you're not. And then what I'm doing is, I'm appending these nearMe variable onto the restaurant data set. So, this allows me to now subset that data set only by the restaurants that are near me, which is kind of a nice, clean way of subsetting the data set, as opposed to always having to use this longer percent, ampersand command. Another thing you might wanna do is create binary variables. So for example, we might wanna find the cases where we know the zip code is wrong. So again, I'm assigning to the data frame the variable zipWrong. And here I'm using the ifelse command. So for the ifelse command, I first send it a condition. In this case, I'm gonna send it the condition is the zip code less than zero? So if the zip code is less than zero, we think that that is zip code must be wrong. And so, the first thing that I return is true if the condition holds true. So in other words, all the cases where the zip code is less than zero, I'll get a true. And then, this is what the function returns if the condition is false. So if the condition is false, in other words if the zip code is positive, then I get a false. So at the end of the day, what I can do is then make a table of whether the zip code is wrong and what if the zip code is less than zero. And you can see that in all the cases where the zip code are greater than zero, I get false. And in the one case where the zip code is less than zero, I get true. And so that might be a value that I wanna filter out of my data set. This makes an easy way to filter those values out. You also might wanna make categorical variables out of quantitative variables. So for example, we might wanna break the zip codes up into consecutive numbers. So here, I'm gonna create a variable that I'm calling zipGroups. And so, what I'm gonna use is this cut command. Gonna apply it to quantitative variable to zipCode. And I'm gonna tell the cut command that I want to break that up according to some values, a list of value that I give it. In this case, I'm gonna break it up according to the quintiles to that zip code. So, what that will then return as a factor variables. So zip groups, is a factor variable where we'll break the variable zip code up into the zero quantums of the 25th percentile, the 25th, so the fiftieth percentile. The 50th, the 75th, and the 75th to the 100th percentile. And so for example, there are about 375 of the values that land between the 25th percentile and the 50th percentile. So, then what i can do is i can make a table that shows you which of the zip codes fall into which of these different clusters. And so you can say, the low values fall into the first cluster and then the medium values fall into the next cluster, and so forth. So, what this does is it breaks the quantity of variable up into a categorical variable. An easier way to that, I had to specify all the breaks with the quantile function in the previous version of cut. If you go into the Hmisc package and you use the cut2 function, you can actually specify I want to break the zip code up into four different groups and I want to break them up according to the quantiles. And so when I do that, it will actually find out the quantiles for me and break it up into four groups according to the quantiles, which is kind of a nice way to do it if you don't wanna have to set the breaks in advance. So, the cut2 function in the Hmisc package. Another thing you might wanna do is you might wanna create factor variables. So for example, the zip code variable is an Attinger variable when you load it into r, but you might wanna turn it into a factor variable. In other words, it's not clear that incrementing the zip code by one changes quantitatively, the variable. So, you might wanna turn that into a factor variable and the way you do that is with the factor command. So, it takes an input of this integer variable, and it turns it into a factor variable which I'm calling ZCF. So, if I look at the first ten values of ZCF, it shows me these values and they look just like they did before. They look like the values that are the but then it tells me how many different zip code levels there are. There are 32 different zip code levels, and if I look at a class of this variable it's a factor variable. So, a couple of other things that you might wanna know about factor variables. So here, what I'm gonna do is I'm going to create just a dummy vector, so that we can see some other properties of factor Variables. I'm gonna create a vector of yeses and nos, and so I'm gonna do that randomly. I'm gonna generate a size ten vector of yeses and nos. And I'm gonna turn that into a factor variable. And I'm gonna, by default, what it's gonna do is it's gonna treat the lowest value alphabetically as the first value, the factor variable. But supposed I wanna treat the second, the yes values as the lowest value. Then I can tell it to take the levels and create them in this order. And so for example, what I can do is I can re-level that variable and make the reference plus be equal to the yes value. Another thing that you can do is you can use the as.numeric command to to change that factor variable back into a numeric variable. If you wanna use it in certain kinds of models, this can be a useful thing to do. And one thing that it'll do is, it'll start with sort of the lowest value and call that one and then the next lowest value and call that two and so forth. So, cutting produces factor variables. We saw before that I wanted to create this zipGroups variable, where I cut the zipCodes up into four separate groups by their quantiles. And so, if you make a table of them, you'll get to see this table and the class of this object zip groups is now a factor variable. You can actually use the mutate function to create a new version of the variable and simultaneously added to a data set. This is part of the flyer package. So, what I'm gonna do here is I'm gonna create a new data frame. This new data frame is, I'm gonna apply the mutate function to the old data frame. And I'm going to add a new variable zipGroups, which is equal to a function of one of the variables in the original rest data data frame. And so, what happens now is the new data frame will be the old data frame rest data with the new variables added. So now, I can make a table of the zipGroups and I can see that it appears in this new data frame rest suite. Another thing that you're probably gonna do with quantitive data is explore to analysis, maybe apply some different transformations. So, here is the list of the most common transformations taking absolute values at square roots. You might take the ceiling on the floor, if you wanna run values up or down to the nearest integer. You can also use the round function to round to as many digits as you'd like, or the signif function, to round to as many significant digits as you'd like. All the standards for calculator functions, like cosine and sine, and in r. As well as logs, which are commonly used when you have skewed data, or data with a lot of outliers. And then you can also exponentiate data, as well. There's a lot more about this, at these two links that I'd provided you here at the bottom, and you can have these lecture notes, as well as this stat methods page. Even more further notes are available from the plyr tutorial, as well as the lecture notes that focus on transformation from Andy Jaffe as well.