0:04

Hey everyone, this lecture is going to be about building an actual R package.

So we've already seen what is in an R

package and what the different components of the package are.

So I thought in this lecture I'd do a little demo,

and I'm going to show exactly how to put all of the pieces together,

in particular how to do it in RStudio, which actually

makes it pretty convenient. So the package we're going to build today

is going to be a very small package.

It's only going to have two functions.

And the functions in the package

are going to build a prediction model for a high-dimensional data set

using just the top ten predictors.

So if you want a little bit of background on

the rationale for this model, I'll put a link up.

You can look at the link right here.

0:46

And you can read more about it.

So that's the package that we're going to build.

It's going to be called the top ten package.

And so, we'll do that in RStudio,

and you can see how everything is done.

1:08

Okay, so this is the default screen

for RStudio. The first thing we're going to want to do

when we start an R project, sorry, an R package project,

is to start a new project. I'm going to

open up the project menu here and say New Project,

and we want to start this project in a new directory,

so I'll click on that. Then what we want to do

is build an R package, and for the name of the R package,

you can create whatever name you want, but

I'm just going to call it the top ten package,

because that's the prediction model that we are building, and it's going

to be in a subdirectory of my home directory, which is fine.

So I'll just hit Create Project.

1:44

That puts me in the directory here where

the package files are going to live, and you can see

these are the package files here, and you can see

it's put a bunch of default files

in there. There is a DESCRIPTION file, a NAMESPACE file,

1:59

and then, if you look in the R directory, there

is a code

file, but you'll notice that there is actually nothing in it.

So, we'll have to fill that up in just a second.

So let's just navigate this file directory a little bit more.

In the top-level directory, you see there's the DESCRIPTION file we talked about.

If you open it up, you'll see it's pre-filled a bunch of the fields for you.

It put the package

name in there for me.

But we're going to have to fill in

the title, the author, the maintainer, the description, et cetera.

So why don't we do that right now.

So the title of this package is going to be, excuse me,

we'll just call it "Building

a Prediction

Model from the Top 10 Features".

And version 1.0 is fine.

The date is fine.

Who wrote it? That's going to be me, let's say.

2:56

The maintainer's going to be me, but the important thing, of course,

about a maintainer is that there has to be an email address there.

So it'll be me again.

[NOISE]

And then the description is just a little bit longer

than the title, so I'll just say, you know, build

functions for [SOUND] prediction models from selecting.

3:47

Something like that.

It doesn't have to be too long.

And then the license, we'll just choose a nice open source license like the GNU GPL.

So, we'll just call that GPL version 3, and that's our DESCRIPTION file.

4:04

And I'll save it by hitting Command-S.

And then I'll just close it up because we don't need it anymore.
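
For reference, here is a rough sketch of what the filled-in DESCRIPTION could look like, written and read back from R with write.dcf and read.dcf. The package name spelling `topten`, the author and maintainer placeholders, and the exact description wording are assumptions, since they aren't spelled out in the recording.

```r
# Sketch only: field values marked "placeholder" are NOT from the lecture.
desc_fields <- data.frame(
  Package     = "topten",   # assumed spelling of the "top ten" package name
  Type        = "Package",
  Title       = "Building a Prediction Model from the Top 10 Features",
  Version     = "1.0",
  Date        = format(Sys.Date()),
  Author      = "Your Name",                    # placeholder
  Maintainer  = "Your Name <you@example.com>",  # placeholder; email is required
  Description = "Functions for building prediction models from the top 10 features.",
  License     = "GPL-3",
  stringsAsFactors = FALSE
)

# Write the fields in DESCRIPTION (DCF) format to a temporary location.
desc_path <- file.path(tempdir(), "DESCRIPTION")
write.dcf(desc_fields, file = desc_path)

# Read the file back to confirm the fields round-trip.
desc <- read.dcf(desc_path)
print(desc[1, c("Package", "Version", "License")])
```

The one field worth double-checking is Maintainer, since that's the one that has to carry an email address.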

So, I think at this point there really aren't any other files that

we can fill out, so we might as well just start coding away.

Okay, so the first thing we're going to want to do is build our top ten function.

Let's call it top ten.

4:28

So x is going to be the matrix of predictors and y is going to

be the vector of responses, and p is the

number of columns, so there are going to be p predictors.

I think it doesn't really make sense to look

at the top ten predictors if there aren't at

least ten predictors, so we're going to

check to see if p is at least ten.

And if it's not, we'll just stop.

>> [SOUND] So the way this model works is basically, for each predictor

in your matrix of predictors, we fit a univariate

regression model of the response on that individual predictor.

So there are going to be

all these regression models that we fit, and for each predictor

there is going to be a p value associated with

that given predictor, depending on how strong the association is.

So for every predictor we're going to have a p value, and then

what we're going to do is sort the p values from

smallest to largest, and then we'll take the ten smallest

p values, which indicate

the predictors with the strongest associations.

So we'll take those top ten predictors,

and then we'll fit a separate regression model

with those ten predictors in it, and that

will be the final prediction model.

>> Alright so we're, first, so we've already

checked to see if there's at least ten predictors.

And we'll initialize our vector of p values here.

This is going to be an empty vector of zeros.

And then we'll loop through each of the predictors

5:58

and fit univariate regression models.

We fit lm of y on column i of x.

And then we want to get the p

values out, so we need the summary of that model fit.

6:23

And it's the fourth column here.

And so we are going to accumulate all of those p values.
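
To see where that p value lives, here is a small sketch of one pass through the loop on simulated data. The variable names and the simulated data are made up for illustration; the coefficient table returned by summary has the p value in its fourth column, and row 2 is the row for the predictor (row 1 is the intercept).

```r
# Simulated data for illustration only (not from the lecture).
set.seed(1)
n  <- 50
x1 <- rnorm(n)
y  <- 2 * x1 + rnorm(n)

# Fit a univariate regression of the response on one predictor.
fit <- lm(y ~ x1)

# The coefficient table has columns:
# Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients

# Row 2 is the predictor, column 4 is its p value.
pval <- coef_table[2, 4]
print(colnames(coef_table))
```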

And then once we've done that,

we want to sort the p values.

We'll look at the ordering of the p values.

The order function, by default, will

give the indices from the smallest to the largest.

6:59

And then we create a new data set called x10

with the top ten predictors from the original data set,

and fit this final linear model, [SOUND]

and then grab the coefficients from the model, and

that's what the top ten function returns.
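
Putting the steps just described together, a function along these lines would do the job. This is a sketch reconstructed from the narration, not necessarily the code typed on screen: the name `topten`, the error message, the intermediate variable names other than x10, and the simulated data in the usage lines are all assumptions.

```r
# Sketch of the "top ten" model-building function (name assumed).
topten <- function(x, y) {
  p <- ncol(x)
  # It makes no sense to pick the top ten of fewer than ten predictors.
  if (p < 10) {
    stop("there are fewer than 10 predictors")
  }
  # One p value per predictor, initialized to zeros.
  pvalues <- numeric(p)
  for (i in seq_len(p)) {
    fit <- lm(y ~ x[, i])
    # Row 2 = the predictor, column 4 = its p value.
    pvalues[i] <- summary(fit)$coefficients[2, 4]
  }
  # order() gives indices from smallest to largest p value; keep the first ten.
  ord <- order(pvalues)[1:10]
  x10 <- x[, ord]
  # Final model using only the ten selected predictors.
  fit <- lm(y ~ x10)
  coef(fit)
}

# Usage on simulated data: 100 observations, 20 predictors,
# where only the first predictor is truly related to y.
set.seed(1)
n <- 100
p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- 3 * x[, 1] + rnorm(n)
b <- topten(x, y)
print(length(b))
```

The returned vector has eleven entries: an intercept plus one coefficient for each of the ten selected predictors.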

7:20

So we've got our top ten function, which fits the model;

it goes through all the p values and picks the top ten.

So I'm going to write one more function that takes the coefficients from the

final fitted model, and it takes some additional input vectors, and

gives you a prediction for the new values. So it takes the

coefficients and some data and gives you

a predicted response for each one.

So let's call that predict10,

8:05

and then we're going to do a matrix multiplication of the

new data and the coefficients from the model.

So X times b is the matrix multiplication, and then I want to

drop the dimension so that it gives me a vector, so I'll call drop.

So this just returns a vector of predicted responses.
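
A minimal sketch of that prediction function might look like this. The argument names X and b follow the narration; the step that adds a column of ones for the intercept is an assumption (the coefficient vector from the fitted model includes an intercept, so the new data needs a matching column), and the tiny two-predictor example is made up to keep the arithmetic easy to follow.

```r
# Sketch of the prediction function (argument order assumed).
predict10 <- function(X, b) {
  # Assumption: prepend a column of ones so the intercept in b is used.
  X <- cbind(1, X)
  # %*% returns an n x 1 matrix; drop() turns it into a plain vector.
  drop(X %*% b)
}

# Toy usage with 2 predictors instead of 10, to keep the numbers readable.
X <- matrix(c(1, 2,    # predictor 1 for observations 1 and 2
              3, 4),   # predictor 2 for observations 1 and 2
            nrow = 2)
b <- c(0.5, 1, 2)      # intercept, slope 1, slope 2
pred <- predict10(X, b)
print(pred)
```

In the package itself, b would be the eleven coefficients returned by the model-building function and X would hold the same ten predictors, in the same order, for the new observations.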

8:26

Okay, so we've got our code for the package; it's pretty straightforward.

Just these two functions, but in order to have

a package, of course, we need a bunch

of other stuff: we need some documentation, and we

need to be able to specify the NAMESPACE file.

So the way that we're going to do documentation in

this package is we're going to use the roxygen2 package.

And what's nice about this format is

that it allows you to put all the documentation in the code file itself.

And then what the roxygen2 package does

is it strips out the documentation that you put

in the code file and it makes the

man pages; it formats it in the appropriate way.

So, you don't have to worry about

9:01

writing out separate documentation files.

There are two nice things about this. One is that it keeps you focused on one file.

You don't have to constantly switch back and forth.

And the other thing is that since the documentation is

physically close to the code, there's a better chance that the

documentation will stay up to date because you'll be able to see

if there are any discrepancies between the documentation and the code itself.

So, let's go ahead and do that.

9:29

So the first function I want to document here is

the top ten function, and so

the first thing you want to do is give it a little hash

and then the apostrophe there, and I'll give

it a title, so I'll call it building a model

with top ten features, and you can see that

RStudio nicely inserts those extra hashes for you.

And I'll give a little description here, so this function develops

a prediction algorithm based on the top

ten features in x that are most predictive of y.

Okay seems reasonable.

10:31

There's another parameter here called y, [SOUND]

and this is a vector of length n representing

the response. Okay, and I will say the return

value is a vector of

coefficients from the final fitted model with

12:03

Now what's actually important about this function?

First of all, it's going to be the function that we want to export.

So we want to make sure that we export it in the NAMESPACE file.

So we've got to put the export directive here.

And then furthermore, because this function uses

the lm function, which comes from the

stats package, we need to import that function so that we can use it.

So we need to import, from the stats package, the lm function.

So that's going to go into the NAMESPACE file.
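
Here is roughly what the documented function could look like with the roxygen2 comments in place. Treat it as a sketch: the function name `topten` is an assumed spelling, the wording for the x parameter is an assumption because that part isn't audible in the recording, and the return description stops where the recording does. The title, description, the y parameter, and the export and importFrom directives follow what was said.

```r
# Sketch only. The @param x wording below is assumed, not from the lecture.

#' Building a Model with Top Ten Features
#'
#' This function develops a prediction algorithm based on the top
#' ten features in x that are most predictive of y.
#'
#' @param x a matrix of predictors, one column per predictor
#' @param y a vector of length n representing the response
#' @return a vector of coefficients from the final fitted model
#' @export
#' @importFrom stats lm
topten <- function(x, y) {
  p <- ncol(x)
  if (p < 10) {
    stop("there are fewer than 10 predictors")
  }
  pvalues <- numeric(p)
  for (i in seq_len(p)) {
    fit <- lm(y ~ x[, i])
    pvalues[i] <- summary(fit)$coefficients[2, 4]
  }
  ord <- order(pvalues)[1:10]
  x10 <- x[, ord]
  fit <- lm(y ~ x10)
  coef(fit)
}
```

Because the roxygen2 lines are ordinary R comments, the file still sources and runs as plain R; the tags only take effect when roxygen2 processes the package.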

Alright, so that's our documentation for the top ten function.

Now let's go down here and look at the predict10 function.

So we've probably got to do the same thing here.

Start a little documentation with the prediction of the top ten features.

13:49

top ten function.

Alright.

So, then, of course, we also want to export this function.

So we add the export directive.

This function doesn't import anything special.

So, we don't need an import directive.
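
Taken together, the directives for the two functions would presumably end up as three lines in the NAMESPACE file: two exports and one import. The sketch below writes those lines to a temporary file and parses them back, just to show that NAMESPACE directives use ordinary R call syntax. The function name spellings `topten` and `predict10` are assumptions, and a file generated by roxygen2 would also carry a header comment.

```r
# Sketch of the expected NAMESPACE contents (function names assumed).
ns_lines <- c(
  "export(predict10)",
  "export(topten)",
  "importFrom(stats,lm)"
)

# Write to a temporary location rather than into a real package.
ns_path <- file.path(tempdir(), "NAMESPACE")
writeLines(ns_lines, ns_path)

# The directives are syntactically R calls, so parse() can read them.
directives <- parse(ns_path)
directive_names <- vapply(directives,
                          function(e) as.character(e[[1]]),
                          character(1))
print(directive_names)
```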

14:08

Okay, so we've written our R code.

We've written some documentation in the R code, and so now it's going to

be time to process

this information and start building our package.

So, let's see how we do that in RStudio.

14:22

So in RStudio we can go over here to the Build tab, and you'll notice that this

Build tab only exists if

you've created a project that's specifically an R package, okay.

So let's try to build our package and then load it into R.

Okay?

So you can see that at the top, in the top window here,

the R session has restarted, and then the top ten package has been loaded.

Okay?

So, one of the things you'll notice, if you go over here

to the left, in the files, let's go in the man

directory here, and you see that there's a default

documentation file here that just describes the package.

But you notice that the documentation files have

not been created by the roxygen2 package.

And so what we actually have to do is

go into here and configure our build tools.

15:11

And then what we want to do is generate documentation

with roxygen, and we want to create the Rd files for the manual.

Let's create the NAMESPACE file from the documentation that we wrote.

And then we'll do it when we build and reload.

'Kay?

15:37

Alright, so let's build and reload our package again.

And you'll notice now that the two

documentation files here for our code have been written,

so if you look in the top10.Rd file

you can see all this information has been extracted

from the documentation that we wrote in the R file, so this is all of the

stuff that we just wrote, and for the

predict10 function, again, another set of documentation here.

16:03

So, let's take a look at our top ten package here.

I can do library, it's already been loaded, but let's do library, help equals top

ten, oops, and you can see here's the description file that we wrote

out as a bit of information, and these are the exported functions over here. If

I want, I can print out the code for top ten.

And you'll see that's the code that we wrote.

If I do question mark top ten, you'll see that

the documentation file that has been built is right here.

So it's formatted nicely here.

We can look at the documentation for predict10.

16:47

And so our package is pretty much most of the way there.

If you want to see if it passes R CMD check, it's very simple.

You just click the Check button here and it will run R CMD

check; it will build the package and then run R CMD check on it.

And so you see all the tests are going by.

[BLANK_AUDIO]

Okay, so you can see that R CMD check finished

and it looks like we passed all of the tests.

You see everything is okay over here so that is a good sign.

17:32

So this is a very simple R package; we just wrote

two functions, but you can see that

in RStudio there are a lot of handy tools

for putting a package together and for building documentation files.

And so I encourage you to try it

out and write some functions, maybe take some

of the homework that you've done already and build

an R package out of the functions that you've written.

It's not too hard, and particularly when you use

the tools in RStudio, you can go very quickly.