0:00

tapply is useful because it splits a vector into little pieces, applies a summary statistic or

function to those pieces, and then brings the pieces back together again.

split is not a loop function, but it's a very handy

function that can be used in conjunction with functions like lapply or sapply,

so I just want to mention it here.

split is like tapply, but without applying the summary statistic.

What it does is take a vector or

other object x and a factor variable f,

which again identifies the levels of a group.

0:40

It then splits the object x into the

number of groups identified by the factor f.

So for example, if f has three levels identifying three

groups, then the split function will split x into three groups.

Once you've got those groups split apart,

you can use lapply or sapply to apply a function to the individual groups.

1:06

So here is a simple example, similar to what I had before with tapply.

I've simulated 10 normal random variables

with mean zero, 10 uniforms, and 10 normals with mean one,

and I've created my factor variable here.

Now I'm just going to split the vector into three parts,

because the factor variable has three levels.

You can see that when I split the x vector,

I get a list back. The first element is 10 normals, the second element

is 10 uniforms, and the third element, which

gets a little cut off here, is 10 normals again.

So that's what the split function does.
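As a rough sketch of the example just described, the following R code builds the vector and the factor and then splits them. The seed value and the name `pieces` are assumptions added here so the example is reproducible; they are not from the lecture.

```r
# Assumed seed, only so the simulated numbers are reproducible
set.seed(1)

# 10 normals (mean 0), 10 uniforms, 10 normals (mean 1)
x <- c(rnorm(10), runif(10), rnorm(10, 1))

# Factor with 3 levels, each repeated 10 times
f <- gl(3, 10)

# split() returns a list with one element per level of f
pieces <- split(x, f)
str(pieces)
```

Each element of `pieces` holds the 10 values of x that belong to the corresponding level of f.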

split always returns a list back,

so if you want to do something with this list, you can use lapply or sapply.

It is common to use

the lapply function in conjunction with the split function:

the idea is that you split something and then lapply a function over the pieces.

Now, in this case, the use of lapply and split is not necessary, because

you can use the tapply function, which will do exactly the same thing.

2:12

It's not any more efficient or any worse to do it this

way, but the tapply function is a little more compact.
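A minimal sketch of that comparison, assuming the same simulated x and f as in the earlier example (the seed and the names `via_split` and `via_tapply` are illustrative assumptions):

```r
set.seed(1)  # assumed seed for reproducibility
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)

# Split first, then apply mean() to each piece: returns a list
via_split <- lapply(split(x, f), mean)

# tapply does the split and the summary in one call: returns a named array
via_tapply <- tapply(x, f, mean)

via_split
via_tapply
```

The group means are the same either way; only the container differs (a list from lapply, a one-dimensional array from tapply).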

But the nice thing about using the split function is

that it can be used to split much more complicated types of objects.

So for example, here I've got a data frame.

I'm loading the datasets package and

looking at the airquality data frame from that package.

You can see that these are the first six rows of the

data frame. I think there are about a hundred-some rows in total in this data frame.

You can see there are measurements on

ozone, solar radiation, wind, and temperature, and

then the month and the day within that month.

2:50

One thing I might want to do is calculate, for

example, the mean of ozone, solar radiation,

wind, and temperature within each month.

In each month there are about 30 observations,

and I want to calculate the mean within each month.

All right, so how do I do this?

Well, I'd like to split the data frame into monthly pieces.

Once I've split the data frame into separate months, I can just calculate the

column means, using either apply or colMeans, on those variables.

So that's what I've done here.

I've split the airquality data frame,

and the factor I'm going to use to split is the Month variable.

Technically speaking, the Month variable in the data frame is not

a factor variable, but it can be converted into one,

because it only takes the values 5, 6, 7, 8, and 9,

basically because the measurements were only taken in

the warmer months of the year.

So here I've split the airquality data frame according

to the Month variable, and then I'm going to apply

an anonymous function. What the anonymous function does is

take the column means of just ozone, solar radiation, and wind.

I'm not going to take the mean of temperature here;

I'm just going to take the column means of

those three variables for each of the monthly data frames.

So here you can see the results.

You can't see them all, but you can see most of them.

lapply is returning a list, where each element of the list is

a vector of length three: the mean for ozone,

the mean for solar radiation, and the mean for wind within that month.

You can see that

for most of the months the ozone value is

NA. That's because when I take the mean of that

column, there are missing values in it,

and I can't take the mean if there are missing values.

So the result, when I take the mean, is that I just get a missing value back.
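The split-then-lapply step described above can be sketched roughly as follows. The column names are those of the built-in airquality data frame; the names `s`, `d`, and `monthly_means` are illustrative assumptions.

```r
library(datasets)

# One data frame per month; Month is coerced to a factor by split()
s <- split(airquality, airquality$Month)

# Column means of three variables within each monthly data frame
monthly_means <- lapply(s, function(d) {
  colMeans(d[, c("Ozone", "Solar.R", "Wind")])
})

monthly_means
```

Because colMeans does not remove missing values by default, any column containing an NA yields NA for that month.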

4:46

Before I fix the missing value problem, I can also call sapply here.

The idea is that sapply, instead of

returning a list, will simplify the result, because each element

of the returned list is a vector of length 3.

They're all the same length,

so what it will do is put all these numbers into a matrix

with three rows and, in this case, five columns.

So here you can see the monthly means

for each of the three variables in a much more

compact format: it's a matrix instead of a list.

Of course, I've still got NAs for a lot of them, because of the missing values

in the original data.

So one thing I can do is pass the na.rm argument to

colMeans, which removes the missing values

from each column before calculating the mean.

Now when I call sapply on the split list, I get the

means of the observed values for each of

the three variables for each of the five months.
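A short sketch of both sapply calls, with and without removing missing values. The helper names `vars`, `with_na`, and `no_na` are assumptions introduced for illustration.

```r
library(datasets)

s <- split(airquality, airquality$Month)
vars <- c("Ozone", "Solar.R", "Wind")

# sapply simplifies the list of length-3 vectors into a 3 x 5 matrix
with_na <- sapply(s, function(d) colMeans(d[, vars]))

# Passing na.rm = TRUE drops missing values before averaging
no_na <- sapply(s, function(d) colMeans(d[, vars], na.rm = TRUE))

with_na
no_na
```

The rows are the three variables and the columns are the five months.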

So split is a very handy function for splitting arbitrary

objects according to the levels of a

factor and then applying any type of function

to the elements of the resulting list.

Here I split a data frame, but you can split

lists and other kinds of things too.

The last thing I want to talk about is splitting on more than one level.

In the past couple of examples

I've only had a single factor variable,

and I've

split the object, whether a vector or a data frame,

according to the levels of that single factor.

But you might have more than one factor.

For example, you might have a variable for

gender, so it has male and female,

and you might have another variable

that has, for example, race.

So you might want to look at

the combinations of the levels within those factors.

Here I've got f1, which is a factor with two levels,

and I've simulated

a normal random variable with 10 observations.

6:38

I've got a factor with two levels, each repeated

five times, and then I've got another factor with five levels,

each repeated two times.

So those are my two grouping variables,

and I want to look at the combination of the two.

I can use the interaction function to combine all the levels

of the first one with all the levels of the second one.

Because there are two levels in the first

factor and five levels in the second factor,

there are a total of 10 different

levels that you can have when you combine the two together.

When I call the

interaction function, I get another factor that concatenates the

levels of one with the other, and you can see that

it prints out that there are a total of ten levels.
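A minimal sketch of the two factors and their interaction. The name f2 for the second factor and the name `combined` are assumptions; the transcript only names f1 explicitly.

```r
# Two levels, each repeated five times
f1 <- gl(2, 5)

# Five levels, each repeated two times (name f2 is assumed)
f2 <- gl(5, 2)

# Combine the levels of both factors into a single factor
combined <- interaction(f1, f2)

combined
nlevels(combined)
```

The resulting factor has 2 x 5 = 10 levels, with names formed by joining a level of f1 and a level of f2 with a dot.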

7:24

So now I can split my numeric vector x according to the two factors.

One thing to note: when

I use the split function, I don't have to use the interaction function.

I can just pass it a list with the two factors and it will

automatically call the interaction function for me

and create that 10-level interaction factor.

So I can just pass the list of these two

factors, and you can see that it returns

a list named by

the 10 different interaction factor levels,

with the elements of

the numeric vector that fall within those 10 levels.

Now of course, although there are 10 levels between

the two factors, we don't observe every single combination.

So there are some empty levels here, and you

can see that some of these levels have nothing in them.

They have zero elements, whereas other levels have numbers in them.
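Splitting on both factors at once might look like the sketch below. The seed and the names f2 and `groups` are assumptions added for illustration.

```r
set.seed(1)  # assumed seed for reproducibility
x <- rnorm(10)
f1 <- gl(2, 5)
f2 <- gl(5, 2)

# Passing a list of factors splits on their interaction
groups <- split(x, list(f1, f2))

# Number of elements in each of the 10 combinations
lengths(groups)
```

With these particular factors, only some of the 10 combinations actually occur, so several list elements are empty numeric vectors.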

Well, first, I could take this list

and lapply or sapply a function over it, if I wanted to.

But sometimes it's handy not to have to keep these empty levels,

so the split function has an argument called drop.

8:31

If you specify drop = TRUE, it will drop

the empty levels that are created by the splitting.

This can be very handy when you're combining

multiple different factors.

If you're only using a single factor, that argument usually doesn't

do anything, because you'll typically use all the levels.

But if you have multiple factors, then typically you're going to have empty

levels, just because you don't observe

every single combination of the two factors.
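As a sketch of the drop argument, assuming the same x, f1, and f2 as in the two-factor example (the seed and the names `kept` and `group_means` are illustrative):

```r
set.seed(1)  # assumed seed for reproducibility
x <- rnorm(10)
f1 <- gl(2, 5)
f2 <- gl(5, 2)

# drop = TRUE discards combinations of levels that never occur
kept <- split(x, list(f1, f2), drop = TRUE)
names(kept)

# Every remaining group is non-empty, so summaries are well defined
group_means <- sapply(kept, mean)
group_means
```

Dropping the empty groups avoids results such as NaN that would come from taking the mean of a zero-length vector.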
