0:04

Hey everyone, this lecture is going to be about building an actual R package. We've already seen what is in an R package and what the different components of the package are. So in this lecture I'd like to do a little demo and show exactly how to put all of the pieces together, and in particular how to do it in RStudio, which actually makes it pretty convenient.

The package we're going to build today is going to be a very small package. It's only going to be two functions. The functions in the package are going to build a prediction model for a high-dimensional data set using just the top ten predictors. If you want a little background on the rationale for this model, I'll put a link up. You can look at the link right here.

0:46

And you can read more about it. So that's the package that we're going to build. It's going to be called the top ten package. We'll do that in RStudio, and you can see how everything is done.

1:08

Okay, so this is the default screen for RStudio. The first thing we're going to want to do when we start on an R project, sorry, an R package project, is to start a new project. I'm going to open up the project menu here and say New Project, and we want to start this project in a new directory, so I'll click on that. Then what we want to do is build an R package. For the name of the package, you can choose whatever name you want, but I'm just going to call it the top ten package, because that's the prediction model that we're building. It's going to be in a subdirectory of my home directory, which is fine. So I'll just hit Create Project.

1:44

That puts me in the directory here where the package files are going to live. You can see these are the package files here, and you can see it's put a bunch of default files in there. There is a DESCRIPTION file, a NAMESPACE file,

1:59

and then, if you look in the R directory, there is a code file, but you'll notice that there is actually nothing in it. So we'll have to fill that up in just a second.

Let's navigate this file directory a little bit more. In the top-level directory there's the DESCRIPTION file we talked about. If you open it up, you'll see it has prefilled a bunch of the fields for you. It put the package name in there for me. But we're going to have to fill in the title, the author, the maintainer, the description, et cetera. So why don't we do that right now. The title of this package is going to be, we'll call it, Building a Prediction Model from the Top 10 Features. Version 1.0 is fine. The date is fine. Who wrote it? That's going to be me, let's say.

2:56

The maintainer's going to be me, but the important thing, of course, about a maintainer is that there has to be an email address there. So it'll be me again. And then the description is just a little bit longer than the title, so I'll just say, you know, build functions for prediction models from selecting.

3:47

Something like that. It doesn't have to be too long. And then the license: we'll just choose a nice open source license like the GNU GPL. So we'll just call that GPL version 3, and that's our DESCRIPTION file.
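For reference, the fields filled in above can be sketched in R by writing them out with `write.dcf()`, which produces the same `Field: value` layout that a DESCRIPTION file uses. The package name `topten`, the author name, and the email address below are placeholders assumed for illustration; the lecture does not spell them out.

```r
# Sketch of the DESCRIPTION fields from the demo.
# Assumed placeholders: package name "topten", author name, email address.
fields <- c(
  Package     = "topten",
  Type        = "Package",
  Title       = "Building a Prediction Model from the Top 10 Features",
  Version     = "1.0",
  Date        = as.character(Sys.Date()),
  Author      = "Your Name",
  Maintainer  = "Your Name <you@example.com>",
  Description = "Functions for building prediction models by selecting the top 10 features.",
  License     = "GPL-3"
)

# Write to a temporary file so nothing in a real project is overwritten.
path <- file.path(tempdir(), "DESCRIPTION")
write.dcf(rbind(fields), file = path)

# Print the file as it would appear in the package directory.
cat(readLines(path), sep = "\n")
```

The Maintainer field is the one that needs the `Name <email>` form, which matches the point made above about the email address.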

4:04

And I'll save it by hitting Command-S, and then I'll just close it up because we don't need it anymore. I think at this point there really aren't any other files that we can fill out, so we might as well just start coding away. Okay, so the first thing we're going to want to do is build our top ten function. Let's call it top ten.

4:28

So x is going to be the matrix of predictors and y is going to be the vector of responses, and p is the number of columns, so there are going to be p predictors. I think it doesn't really make sense to look at the top ten predictors if there aren't at least ten predictors, so we're going to check to see if p is at least ten. And if it's not, we'll just stop.
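A minimal sketch of this input check, written as a standalone helper. The helper name `check_predictors` and the error message are assumptions made for illustration; in the demo the check sits at the top of the top ten function itself.

```r
# Hypothetical helper: stop unless x has at least ten predictor columns.
check_predictors <- function(x) {
  p <- ncol(x)  # number of predictors = number of columns
  if (p < 10) {
    stop("there are fewer than 10 predictors")
  }
  invisible(p)
}

# A matrix with only five columns triggers the error, which is caught here
# so that the message can be printed instead of halting the script.
x_small <- matrix(rnorm(50), nrow = 10, ncol = 5)
result <- tryCatch(check_predictors(x_small),
                   error = function(e) conditionMessage(e))
print(result)
```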

So the way this model works is basically, for each predictor in your matrix of predictors, we fit a univariate regression model of the response on that individual predictor. So there are going to be all these regression models that we fit, and for each predictor there is going to be a p-value associated with that predictor, depending on how strong the association is. So for every predictor we're going to have a p-value, and then what we're going to do is sort the p-values from smallest to largest and take the ten smallest p-values, which indicate the predictors with the strongest associations. We'll take those top ten predictors, and then we'll fit a separate regression model with those ten predictors in it, and that will be the final prediction model.
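To make the first step concrete, here is a sketch of pulling the p-value out of a single univariate regression, using simulated data (the data and variable names are made up for illustration). In the coefficient table returned by `summary()`, row 2 is the slope for the predictor and column 4 is its p-value.

```r
set.seed(1)
n <- 100
x1 <- rnorm(n)
y <- 2 * x1 + rnorm(n)  # simulated response that truly depends on x1

fit <- lm(y ~ x1)
coef_table <- summary(fit)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)
pvalue <- coef_table[2, 4]               # slope row, p-value column
print(pvalue)
```

Repeating this for every column of the predictor matrix gives the vector of p-values that gets sorted.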

All right, so we've already checked to see if there are at least ten predictors. We'll initialize our vector of p-values here. This is going to be a vector of zeros. And then we'll loop through each of the predictors

5:58

and fit univariate regression models. We fit lm of y on column i of x. And then we want to get the p-values out, so we need the summary of that model fit.

6:23

And it's the fourth column here. And so we are going to accumulate all of those p-values. Once we've done that, we want to sort the p-values, so we'll look at the ordering of the p-values. By default, order gives the indices from the smallest to the largest.

6:59

So now we create a new data set called x10 with the top ten predictors from the original data set, fit this final linear model, and then grab the coefficients from the model. So that's what the top ten function returns.
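Putting the steps together, the function described above might look like the following sketch. The name `top10` is inferred from the `top10.Rd` file that appears later in the demo, and the error message is an assumption.

```r
top10 <- function(x, y) {
  p <- ncol(x)
  if (p < 10) stop("there are fewer than 10 predictors")

  # One univariate regression per predictor; keep each slope's p-value.
  pvalues <- numeric(p)
  for (i in seq_len(p)) {
    fit <- lm(y ~ x[, i])
    summ <- summary(fit)
    pvalues[i] <- summ$coefficients[2, 4]
  }

  # Indices of the ten smallest p-values (order sorts smallest to largest).
  ord <- order(pvalues)
  ord <- ord[1:10]

  # Final model on the ten selected predictors; return its coefficients.
  x10 <- x[, ord]
  fit <- lm(y ~ x10)
  coef(fit)
}

# Simulated data, made up for illustration.
set.seed(10)
x <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
y <- rnorm(100)
b <- top10(x, y)
print(length(b))  # intercept plus ten slopes
```

Note that the returned vector holds only the coefficients, so a caller who wants to predict on new data has to know which ten columns were selected.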

7:20

So we've got our top ten function, which fits the model: it goes through all the p-values and picks the top ten. I'm going to write one more function that takes the coefficients from the final fitted model and some additional input vectors and gives you a prediction for the new values. So it takes the coefficients and some data and gives you a predicted response for each one. Let's call that predict10,

8:05

and then we're going to do a matrix multiplication of the new data and the coefficients from the model. So I'll do X times b as the matrix multiplication, and then I want to drop the dimension so that it gives me a vector, so I'll call drop. So this just returns a vector of predicted responses.
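A sketch of this prediction function. The column of ones added with `cbind()` is an assumption not spelled out in this part of the lecture: the coefficients returned by `lm()` start with an intercept, so the new data needs a matching leading column for the matrix product to line up. The new data is also assumed to hold the same ten selected predictors, in the same order as the coefficients.

```r
predict10 <- function(X, b) {
  X <- cbind(1, X)  # assumed: leading column of ones for the intercept
  drop(X %*% b)     # matrix product, then drop the n x 1 matrix to a vector
}

# Made-up coefficients and data, just to show the shapes involved.
b <- c(0.5, rep(1, 10))              # intercept plus ten slopes
X <- matrix(1, nrow = 2, ncol = 10)  # two new observations
print(predict10(X, b))
```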

8:26

Okay, so we've got our code for the package. It's pretty straightforward, just these two functions. But in order to have a package, of course, we need a bunch of other stuff: we need some documentation, and we need to be able to specify the NAMESPACE file. The way that we're going to do documentation in this package is we're going to use the roxygen2 package. What's nice about this format is that it allows you to put all the documentation in the code file itself. Then what the roxygen2 package does is extract the documentation that you put in the code file and make the man pages, formatting it in the appropriate way. So you don't have to worry about

9:01

writing out separate documentation files. There are two nice things about this. One is that it keeps you focused on one file, so you don't have to constantly switch back and forth. The other is that since the documentation is physically close to the code, there's a better chance that the documentation will stay up to date, because you'll be able to see if there are any discrepancies between the documentation and the code itself. So let's go ahead and do that.

9:29

So the first function I want to document here is the top ten function. The first thing you want to do is type a hash and then an apostrophe, and I'll give it a title, so I'll call it Building a Model with Top Ten Features. You can see that RStudio nicely inserts those extra hashes for you. And I'll give a little description here: this function develops a prediction algorithm based on the top ten features in x that are most predictive of y. Okay, seems reasonable.

10:31

There's another parameter here called y, and this is a vector of length n representing the response. Okay, and I will say the return value is a vector of coefficients from the final fitted model with

12:03

Now, what's important about this function? First of all, it's going to be a function that we want to export, so we want to make sure that we export it in the NAMESPACE file. So we've got to put the export directive here. Furthermore, because this function uses the lm function, which comes from the stats package, we need to import that function so that we can use it. So we need to import, from the stats package, the lm function. That's going to go into the NAMESPACE file. All right, so that's our documentation for the top ten function.
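The documentation described above might look like the following roxygen2 comment block. The wording of the `x` parameter is an assumption, since that part of the recording is missing, and `coef` is added to the import line as an assumption as well, because the lecture only names `lm`. The function name `top10` is inferred from the `top10.Rd` file seen later, and the body is a compact variant of the loop that uses `vapply()`.

```r
# Assumptions: the text of @param x, and importing coef alongside lm.
#' Building a Model with Top Ten Features
#'
#' This function develops a prediction algorithm based on the top ten
#' features in x that are most predictive of y.
#'
#' @param x an n by p matrix of n observations and p predictors
#' @param y a vector of length n representing the response
#' @return a vector of coefficients from the final fitted model
#' @importFrom stats lm coef
#' @export
top10 <- function(x, y) {
  if (ncol(x) < 10) stop("there are fewer than 10 predictors")
  # p-value of the slope from each univariate regression
  pvalues <- vapply(seq_len(ncol(x)), function(i) {
    summary(lm(y ~ x[, i]))$coefficients[2, 4]
  }, numeric(1))
  x10 <- x[, order(pvalues)[1:10]]  # ten predictors with the smallest p-values
  coef(lm(y ~ x10))
}
```

The `@export` and `@importFrom` tags are what end up as directives in the NAMESPACE file; the remaining tags become sections of the help page.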

Now let's go down here and look at the predict10 function. We've got to do the same thing here. Start a little documentation with the title, prediction with the top ten features.

13:49

top ten function. All right. So then, of course, we also want to export this function, so we add the export directive. This function doesn't import anything special, so we don't need an import directive.
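A matching sketch for the predict10 documentation. The title and parameter wording are assumptions, since the recording skips most of this part, and the intercept column in the body is assumed as well.

```r
# Assumed wording throughout; the recording does not show this block in full.
#' Prediction with Top Ten Features
#'
#' This function takes the coefficients produced by the top10 function and
#' makes a prediction for each row of new data.
#'
#' @param X an n by 10 matrix holding the ten selected predictors
#' @param b a vector of coefficients obtained from the top10 function
#' @return a numeric vector of predicted values
#' @export
predict10 <- function(X, b) {
  X <- cbind(1, X)  # assumed: column of ones to match the intercept in b
  drop(X %*% b)
}
```

Only base R functions are used in the body, which is why no `@importFrom` tag is needed here.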

14:08

Okay, so we've written our R code. We've written some documentation in the R code, and so now it's time to process this information and start building our package. So let's see how we do that in RStudio.

14:22

In RStudio we can go over here to the Build tab. You'll notice that this Build tab only exists if you've created a project that's specifically an R package. So let's try to build our package and then load it into R. Okay? You can see in the top window here that the R session has restarted, and then the top ten package has been loaded.

One of the things you'll notice, if you go over here to the files and into the man directory, is that there's a default documentation file that just describes the package. But you'll notice that the documentation files for our functions have not been created by the roxygen2 package. So what we actually have to do is go in here and configure our build tools.

15:11

What we want to do is generate documentation with Roxygen. We want to create the Rd files for the manual, and create the NAMESPACE file from the documentation that we wrote. And then we'll have it do that when we build and reload. Okay?
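The same step can also be run from the R console. A minimal sketch, assuming the roxygen2 package is installed and that `pkg_dir` points at the package directory; the wrapper name `build_docs` is made up for illustration, while `roxygen2::roxygenise()` is the function that regenerates the Rd files and the NAMESPACE file from the comments.

```r
# Hypothetical wrapper around roxygen2::roxygenise().
build_docs <- function(pkg_dir = ".") {
  if (!requireNamespace("roxygen2", quietly = TRUE)) {
    message("roxygen2 is not installed; run install.packages(\"roxygen2\") first")
    return(invisible(FALSE))
  }
  if (!file.exists(file.path(pkg_dir, "DESCRIPTION"))) {
    message("no DESCRIPTION file in ", pkg_dir, "; not a package directory")
    return(invisible(FALSE))
  }
  roxygen2::roxygenise(pkg_dir)  # writes man/*.Rd and NAMESPACE
  invisible(TRUE)
}
```

Calling `build_docs()` from the package's top-level directory would then play the role of the Roxygen option in the build tools.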

15:37

All right, so let's build and reload our package again. You'll notice now that the two documentation files for our code have been written. If you look in the top10.Rd file, you can see all this information has been extracted from the documentation that we wrote in the R file, so this is all of the stuff that we just wrote. And for the predict10 function, again, there's another set of documentation here.

16:03

So let's take a look at our top ten package here. I can do library, it's already been loaded, but let's do library help top ten, oops, and you can see here's the DESCRIPTION file that we wrote out and a bit of information, and these are the exported functions over here. I can print out the code for top ten, and you'll see that's the code that we wrote. If I do question mark top ten, you'll see that the documentation file that has been built is right here, so it's formatted nicely. We can look at the documentation for predict10.

16:47

And so our package is most of the way there. If you want to see if it passes R CMD check, it's very simple. You just click the Check button here, and it will build the package and then run R CMD check on it. And so you see all the tests are going by. Okay, so you can see that R CMD check finished and it looks like we passed all of the tests. You see everything is OK over here, so that is a good sign.
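Outside RStudio, the same check can be started from R itself. A sketch assuming the package source lives in `pkg_dir`; the wrapper name `check_package` is hypothetical. It shells out to `R CMD build` and then `R CMD check`, the same build-then-check sequence described above.

```r
# Hypothetical wrapper: build the source tarball, then run R CMD check on it.
check_package <- function(pkg_dir = ".") {
  desc_path <- file.path(pkg_dir, "DESCRIPTION")
  if (!file.exists(desc_path)) {
    message("no DESCRIPTION file in ", pkg_dir)
    return(invisible(NA_integer_))
  }

  # R CMD build names the tarball <Package>_<Version>.tar.gz
  desc <- read.dcf(desc_path, fields = c("Package", "Version"))
  tarball <- paste0(desc[1, "Package"], "_", desc[1, "Version"], ".tar.gz")

  r_bin <- file.path(R.home("bin"), "R")
  system2(r_bin, c("CMD", "build", shQuote(pkg_dir)))   # tarball lands in the working directory
  status <- system2(r_bin, c("CMD", "check", tarball))  # exit status of the check
  invisible(status)
}
```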

17:32

So this is a very simple R package. We just wrote two functions, but you can see that in RStudio there are a lot of handy tools for putting a package together and for building documentation files. So I encourage you to try it out and write some functions, maybe take some of the homework that you've done already and build an R package out of the functions that you've written. It's not too hard, and particularly when you use the tools in RStudio, you can go very quickly.
