0:03

A factor is a special type of vector that is used to represent categorical data. There are two types of factor, unordered and ordered. You can think of an unordered factor as storing data whose labels are categorical but have no ordering, for example male and female.

0:22

Or you can have ordered factors, which might represent things that are ranked. They have an order, but they're not numerical. For example, in many universities you'll have assistant professors, associate professors, and full professors. Those are categorical, but they're ordered.
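A minimal sketch of the two kinds of factor in R. The variable names and the short rank labels are made up for illustration and are not from the lecture.

```r
# Unordered factor: labels with no ranking among them.
sex <- factor(c("male", "female", "female"))
is.ordered(sex)                 # FALSE

# Ordered factor: the levels are listed from lowest to highest rank.
ranks <- c("assistant", "associate", "full")
prof <- factor(c("associate", "full", "assistant"),
               levels = ranks, ordered = TRUE)
is.ordered(prof)                # TRUE
prof[1] < prof[2]               # TRUE: associate ranks below full
```

Comparisons such as `<` are only meaningful for the ordered factor.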

0:39

You can think of a factor as an integer vector where each integer has a label. For example, you can think of it as a vector of ones, twos, and threes, where one represents a high value, two represents a medium value, and three represents a low value. So you might have a variable whose values are high, medium, and low, and underneath, R represents it with the numbers one, two, and three.
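The integer-plus-label idea can be sketched as follows. The data are hypothetical, and the levels are spelled out explicitly so that high is coded 1, medium 2, and low 3 as described; left to its default, R would sort the labels alphabetically instead.

```r
# Hypothetical ratings with an explicit level order.
rating <- factor(c("high", "low", "medium", "high"),
                 levels = c("high", "medium", "low"))

as.integer(rating)   # underlying integer codes: 1 3 2 1
levels(rating)       # labels attached to codes 1, 2, 3
```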

1:04

Factors are important because they're treated specially by modeling functions like lm and glm, which we'll talk about later. These are functions for fitting linear and generalized linear models.
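As a rough illustration of that special treatment, here is a small sketch with lm. The data are invented for the example and carry no meaning beyond showing how the factor enters the model.

```r
# Toy data invented for illustration: a numeric response and a two-level factor.
group <- factor(c("a", "a", "b", "b"))
y <- c(1.0, 1.2, 2.9, 3.1)

fit <- lm(y ~ group)

# lm expands the factor into an indicator for the non-baseline level "b",
# so the coefficients are an intercept plus one term named "groupb".
names(coef(fit))
```

The intercept here is the mean of the baseline group "a", and the "groupb" coefficient is the difference between the two group means.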

1:15

Factors with labels are, generally speaking, better than simple integer vectors because factors are what's called self-describing. Having a variable that has the values male and female is more descriptive than having a variable that just has ones and twos. For example, in many data sets you'll find a variable that's coded as one and two, and it's not easy to know whether it's really a numeric variable that only takes the values one and two, or a categorical variable whose categories happen to be coded that way. That's not something that's recorded in the data set, so it's hard to tell. If you use a factor variable, then the coding for the labels is built into the variable, and it's much easier to understand.
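A small sketch of turning bare codes into a self-describing factor. The codes, and the choice that 1 means male and 2 means female, are assumptions made only for this example.

```r
# Hypothetical raw codes, as they might arrive in a data set.
codes <- c(1, 2, 2, 1)

# Attach labels to the codes; the mapping is an assumption for this example.
sex <- factor(codes, levels = c(1, 2), labels = c("male", "female"))

sex          # prints the labels rather than bare 1s and 2s
table(sex)   # counts per label
```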

1:57

Factors can be created with the factor function, and the input to the factor function is typically a character vector. Here I'm just creating a simple factor which has two levels, and the levels are yes and no. So x is a factor, and you can see that it prints out a little bit differently from a character vector, in the sense that it prints out the values, yes, yes, no, yes, no, and then it has a separate attribute which is called the levels. The levels of this factor are no and yes, so there are only two levels.
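The slide being described is not part of this transcript. Based on the values read out, the code is probably close to this sketch:

```r
# Create a factor from a character vector with two distinct values.
x <- factor(c("yes", "yes", "no", "yes", "no"))

x           # prints the values, then a "Levels: no yes" line
levels(x)   # "no" "yes"
```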

I can call table on this factor and it will give me a frequency count of how many of each level there are. So for example, it'll tell me there are two nos and three yeses.
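A sketch of the table call, assuming x is the five-element yes/no factor just described:

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))

counts <- table(x)   # frequency of each level
counts               # no: 2, yes: 3
```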

Now, the unclass function strips out the class for a vector. So for example, if I call unclass on x, it brings it down to an integer vector, and you can see that underneath, the factor is represented as 2, 2, 1, 2, 1. So yes is coded as two and no is coded as one. Now, it's not really essential for you to know this, because you can just treat the factor as a vector of yeses and nos, but sometimes it's useful to know how factors are represented by R underneath. And so you see, it's really an integer vector with the levels attribute of no and yes.
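The same thing as a sketch, again assuming x is the yes/no factor from before (the name u is introduced here only for illustration):

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))

u <- unclass(x)      # drop the "factor" class, keep the data and attributes
u                    # integer codes 2 2 1 2 1, plus the levels attribute
is.integer(u)        # TRUE
attr(u, "levels")    # "no" "yes"
```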

3:19

The order of the levels in a factor can be set using the levels argument to factor. Sometimes this is important because in modeling functions, when you include a factor variable, it's important to know what the baseline level is. The baseline level is just the first level in the factor, and by default R determines it using alphabetical order. So for example, if I create a factor variable with the elements yes and no, then because no comes before yes in the alphabet, no will be the baseline level and yes will be the second level.

Now, this may not be something that you want. You might want, for example, yes to be the baseline level and no to be the second level, and in that case you have to explicitly tell R that yes is going to be the first level. You can do that using the levels argument to the factor function. So now when I print out the x object, you see that the elements are still the same, still yes, yes, no, yes, no, but the levels attribute is reversed, because yes is the first level and no is the second level.
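A sketch contrasting the default level order with an explicit one. The name x_default is introduced here only for the comparison.

```r
values <- c("yes", "yes", "no", "yes", "no")

# Default: levels are sorted alphabetically, so "no" is the baseline.
x_default <- factor(values)
levels(x_default)    # "no" "yes"

# Explicit: the levels argument makes "yes" the first (baseline) level.
x <- factor(values, levels = c("yes", "no"))
x                    # same elements, but "Levels: yes no"
levels(x)            # "yes" "no"
as.integer(x)        # codes follow the new order: 1 1 2 1 2
```

Only the level order and the underlying codes change; the elements themselves read the same in both factors.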
