0:40

The candidate explanatory variables include gender, race, alcohol, marijuana,

cocaine, or inhalant use.

Availability of cigarettes in the home, whether or not either parent was on public

assistance, any experience with being expelled from school.

Age, alcohol problems, deviance, violence, depression, self esteem,

parental presence, activities with parents, family and

school connectedness, and grade point average.

Following the LIBNAME statement and DATA step, which I am using to call in this

data set called tree Add Health, I will include PROC HPFOREST.

Next, I name my response, or target variable, TREG1.

And indicate with a forward slash and the level option that it is a categorical

variable by including the word nominal following the equal sign.

Categorical, quantitative, and even ordinal variables for

my random forest need to be included in separate input statements.

Here I include my categorical explanatory variables,

following the word input, and end the statement with a forward slash,

level option, and the word nominal following the equal sign.

1:54

Then a second input statement for my quantitative explanatory variables,

indicating that they are on an interval scale.

As always, all statements are ended with a semicolon.

Finally I end my program with a run statement, so

let's run the program and take a look at the output.

We can see in the model information section that variables to try is equal to

5, indicating that a random selection of five explanatory variables was

selected to test each possible split for each node in each tree within the forest.

By default, SAS will grow 100 trees and

select 60% of the sample when performing the bagging process.

That is the inbag fraction.
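
To make these two sampling steps concrete, here is a minimal Python sketch (not SAS, and not HPFOREST's internal code). It assumes the in-bag rows are drawn without replacement, and the helper names `draw_inbag` and `pick_candidate_vars` are made up for illustration.

```python
import random

def draw_inbag(n_obs, inbag_fraction=0.6, rng=None):
    """Split row indices into an in-bag sample and the out-of-bag remainder.

    Assumption: rows are sampled without replacement.
    """
    rng = rng or random.Random()
    n_inbag = round(n_obs * inbag_fraction)
    inbag = set(rng.sample(range(n_obs), n_inbag))
    oob = [i for i in range(n_obs) if i not in inbag]
    return sorted(inbag), oob

def pick_candidate_vars(variables, vars_to_try=5, rng=None):
    """Randomly choose the variables that one split is allowed to test."""
    rng = rng or random.Random()
    return rng.sample(variables, vars_to_try)

rng = random.Random(0)
inbag, oob = draw_inbag(6500, rng=rng)
print(len(inbag), len(oob))
```

With 6,500 usable observations and an inbag fraction of 0.6, each tree in this sketch would be grown on 3,900 rows, leaving 2,600 rows out of bag for that tree.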

The prune fraction specifies the fraction of training observations that

are available for pruning a split.

The value can be any number from 0 to 1,

although a number close to 1 would leave little to grow the tree.

The default value is 0.

In other words, the default value is not to prune.

Leaf size specifies the smallest number of training observations

that a new branch can have.

The default value is 1.

3:07

The split criterion used in HPFOREST is the Gini index.
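
For reference, the textbook Gini impurity and the reduction a split achieves can be sketched in a few lines of Python. This is the standard definition, offered as an illustration; the exact computation inside HPFOREST may differ, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_reduction(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# A node with two smokers (1) and two non-smokers (0), split perfectly.
parent = [1, 1, 0, 0]
print(gini(parent), gini_reduction(parent, [1, 1], [0, 0]))
```

A split is more attractive the larger its reduction; a perfectly mixed two-class node has impurity 0.5, and a pure node has impurity 0.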

In terms of missing data, if the value of our target or

response variable is missing, the observation is excluded from the model.

If the value of an explanatory variable is missing,

PROC HPFOREST uses the missing value as a legitimate value by default.

Notice, too, that the number of observations read from my data set was

6,504 while the number of observations used was 6,500.

Within the baseline fit statistics output,

you can see that the misclassification rate of the random forest is displayed.

Here we see that the forest misclassified 19.8% of the sample,

suggesting that the forest correctly classified 80.2% of the sample.

Now I'll show the first ten and last ten observations of the fit statistics table.

PROC HPFOREST computes fit statistics for

a sequence of forests that have an increasing number of trees.

As the number of trees increases, the fit statistics usually improve, that is,

decrease at first and then they level off and fluctuate in a small range.

Forest models provide an alternative estimate of average square error and

misclassification rate, called the out of bag or OOB estimate.

The OOB estimate is a convenient substitute for

an estimate that is based on test data and

is a less biased estimate of how the model will perform on future data.

We end up with near perfect prediction in the training samples as the number of

trees grown gets closer to 100.

When those same models are tested on the out of bag sample,

the misclassification rate is around 16%.
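
The idea behind the out of bag estimate can be sketched in Python as follows. This is an illustrative sketch under simplifying assumptions, not HPFOREST's implementation: each tree is represented only by its list of predictions and the set of rows it was grown on, and the function name is hypothetical.

```python
from collections import Counter

def oob_misclassification(y, tree_predictions, inbag_sets):
    """Misclassification rate using, for each row, only the trees that never saw it.

    y: true labels, one per row.
    tree_predictions: for each tree, a list of predicted labels for every row.
    inbag_sets: for each tree, the set of row indices used to grow it.
    """
    errors = 0
    scored = 0
    for i, truth in enumerate(y):
        # Collect votes only from trees for which row i was out of bag.
        votes = Counter(
            preds[i]
            for preds, inbag in zip(tree_predictions, inbag_sets)
            if i not in inbag
        )
        if not votes:
            continue  # row was in-bag for every tree, so it cannot be scored
        scored += 1
        # Majority vote; ties go to whichever label was seen first.
        majority = votes.most_common(1)[0][0]
        if majority != truth:
            errors += 1
    return errors / scored if scored else float("nan")
```

Because every row is scored only by trees that did not train on it, the resulting rate behaves more like a test-set estimate than the near perfect training fit does.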

The final table in our output represents arguably the largest contribution of

random forests,

specifically, the variable importance rankings.

The number of rules column shows the number of splitting rules

that use a variable.

Each measure is computed twice, once on training data and

once on the out of bag data.

As with fit statistics, the out of bag estimates are less biased.

The rows are sorted by the out of bag Gini measure or OOB Gini measure.

The variables are listed from highest importance to lowest importance

in predicting regular smoking.
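
As a rough illustration of how such a ranking table can be assembled, the Python sketch below counts the splitting rules that use each variable and sums the Gini reduction those rules achieve. Summing reductions is an assumption made for this sketch; HPFOREST's exact importance formula may differ, and the variable names and numbers in the example are made up.

```python
from collections import defaultdict

def rank_variables(split_rules):
    """Rank variables by total Gini reduction, highest first.

    split_rules: (variable_name, gini_reduction) pairs, one per splitting
    rule, pooled across every tree in the forest.
    Returns (variable, number_of_rules, total_reduction) tuples.
    """
    n_rules = defaultdict(int)
    total = defaultdict(float)
    for var, reduction in split_rules:
        n_rules[var] += 1
        total[var] += reduction
    rows = [(var, n_rules[var], total[var]) for var in total]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical rules and reductions, for illustration only.
rules = [("marijuana", 0.30), ("gpa", 0.10), ("marijuana", 0.20), ("gpa", 0.05)]
for var, count, importance in rank_variables(rules):
    print(var, count, round(importance, 3))
```

Computing the same table from out of bag rows instead of training rows would give the less biased version of the ranking.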

In this way, random forests are sometimes used as a data reduction technique,

where variables are chosen in terms of their importance to be

included in regression and other types of future statistical models.

Here we see that some of the most important variables in predicting regular

smoking include marijuana use, alcohol use, race,

cigarette availability in the home, cocaine use, deviant behavior, etc.

To summarize, like decision trees, random forests are a type of data mining

algorithm that can select from among a large number of variables,

those that are most important in determining the target or

response variable to be explained.

Also, like decision trees,

the target variable in a random forest can be categorical or quantitative.

And the group of explanatory variables can be categorical or

quantitative, or any combination.

Unlike decision trees, however,

the results of random forests generalize well to new data

since the strongest signals are able to emerge through the growing of many trees.

Further, small changes in the data have little impact on the results of random forests.

In my opinion, the main weakness of random forests is simply that results

are somewhat less satisfying, since no trees are actually interpreted.

Instead, the forest of trees is used to rank the importance of variables

in predicting the target.

Thus, we get a sense of the most important predictive variables,

but not their relationship to one another.
