So, in this section, we'll show how to account for the uncertainty in the estimates we get from simple logistic regression, namely the slope and the intercept. The learning objectives of this section: after reviewing it, you should be able to create 95 percent confidence intervals for the intercept and slope from a simple logistic regression model, and convert these to 95 percent confidence intervals for odds and odds ratios. Then, in terms of group comparisons, you should be able to estimate p-values for testing the null hypothesis that the true slope, or log odds ratio, is equal to zero, and hence that the true population-level odds ratio is one, and also be able to interpret these. So, in the previous sections, we showed the results from several simple logistic regression models. For example, when relating the response to treatment, ART therapy, to baseline CD4 counts in our sample of 1,000 HIV-positive individuals from a citywide population, the resulting logistic regression was: the log odds of response was equal to an intercept of negative 1.67, plus a slope of 0.58, times X_1, where X_1 was equal to one for subjects with a baseline CD4 count of less than 250, and zero for subjects with a baseline CD4 count greater than or equal to 250. This was estimated from individual-level data using a computer package, but whether we put these data into STATA, R, SAS, SPSS, et cetera, we should always get the same results for the same data. So, what is the common algorithm used to estimate this equation? Well, for logistic regression, this approach is called maximum likelihood. The estimates for the intercept and slope are the values that make the observed data most likely among all possible choices for the intercept and slope. What this means is that in our sample of 1,000 persons, there was a certain proportion, and hence odds, of responding to therapy in each of the two CD4 count groups.
The choices that the computer makes for Beta naught and Beta one are the choices that make this observed sample, that is, the counts of those outcomes in each of the CD4 count groups, most likely among all possible choices for Beta naught and Beta one. So, the computer essentially iterates through choices for Beta naught and Beta one, evaluates how likely the resulting sample data are under those choices, and then chooses the values of Beta naught and Beta one that make our sample results most likely among all possibilities. In general, this mode of estimation, as with everything we've done, must be done with a computer. The values chosen for Beta naught hat and Beta one hat are just estimates based on a single sample. Were we to have a different random sample of 1,000 subjects from the same HIV-positive population, the resulting estimates would likely be different because of sampling variability. As such, both regression coefficients, the intercept and the slope, have an associated standard error that can be used to make statements about the true relationship between the log odds of the outcome occurring and X_1 based on a single sample. The method of maximum likelihood that yields the estimates of the intercept and slope also gives rise to the estimated standard errors for these two quantities. These standard errors will allow for the computation of 95 percent confidence intervals and p-values for these two quantities. The random sampling behavior of the regression slope and intercept is approximately normal in large samples, because these are log odds and log odds ratio estimates. Adjustments can be made in smaller samples, and these will be handled by the computer. So, generally speaking, it's once again business as usual for getting 95 percent confidence intervals and doing hypothesis tests, with one caveat: like the confidence intervals we saw for odds ratios and other ratios in the first term, these are done on the log scale.
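To make this idea concrete, here is a minimal sketch of that search in Python. The group counts below are assumptions for illustration, chosen to be consistent with the fitted equation; they are not given in the lecture. A real package maximizes the likelihood with a much smarter algorithm than a brute-force grid search, but the principle is the same: try candidate values of Beta naught and Beta one, and keep the pair that makes the observed data most likely.

```python
import math

# Illustrative counts (assumed, chosen to be consistent with the fitted
# equation log odds = -1.67 + 0.58 * x1): out of 500 subjects with baseline
# CD4 >= 250 (x1 = 0), 79 responded; out of 500 with CD4 < 250 (x1 = 1),
# 126 responded.
groups = [(0, 500, 79), (1, 500, 126)]  # (x1, n, number responding)

def log_likelihood(b0, b1):
    """Binomial log-likelihood of the observed counts under (b0, b1)."""
    ll = 0.0
    for x, n, y in groups:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # model probability of response
        ll += y * math.log(p) + (n - y) * math.log(1.0 - p)
    return ll

# Crude grid search: evaluate every candidate (b0, b1) pair on a 0.01 grid
# and keep the one that makes the observed data most likely.
best = max(
    ((b0 / 100.0, b1 / 100.0) for b0 in range(-300, 0) for b1 in range(0, 200)),
    key=lambda b: log_likelihood(*b),
)
print(best)  # (-1.67, 0.58)
```

With counts like these, the grid search lands on the same intercept and slope the lecture reports, because those are the values under which these observed counts are most probable.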
Then, the results are exponentiated back to the ratio scale. So, again, the drill is as follows. The central limit theorem tells us that if we were to take all possible random samples of 1,000 persons, for example, from the HIV-positive population, estimate the intercept and slope of our logistic regression model from each sample, and plot a histogram of those estimates across all possible random samples, the histogram would be roughly normally distributed and centered at the true value of our intercept or our slope. We're only going to get one result under this curve: we're only going to observe one sample, and we're going to get one estimate of our intercept and our slope. But because this is a theoretical normal curve, most of the estimates we could get will fall within plus or minus two standard errors of the true value. So, if we take any one of our estimates and add and subtract two standard errors, 95 percent of the time that interval will include the true value of our parameter, whether it be the intercept or the slope. When doing group comparisons, the slope compares the outcome between two groups who differ by one unit in X_1. The null would be a slope of zero, which corresponds to an odds ratio of one for the outcome between the two groups. What we do is assume the null to be true: we assume that the true log odds ratio is zero. Then, we figure out how far our estimate is from zero in terms of standard errors. If it's more than two standard errors away, then our result was less than five percent likely to have occurred just by chance if the null were true, and we would be directed to reject the null. If we are closer than two standard errors, then the probability of observing our result or something more extreme is greater than five percent, and we would fail to reject the null based on that criterion.
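Here is a small simulation sketch of that idea, assuming population response probabilities taken from the fitted equation (an assumption made purely for illustration). With a binary X_1, the slope estimate from each sample is simply the sample log odds ratio, so we can re-estimate it over and over from fresh random samples and watch the estimates cluster around the truth.

```python
import math
import random

random.seed(0)

# Assumed population response probabilities, consistent with the fitted
# equation: p0 for CD4 >= 250 (log odds -1.67), p1 for CD4 < 250 (log odds -1.09).
p0 = 1 / (1 + math.exp(1.67))
p1 = 1 / (1 + math.exp(1.67 - 0.58))

true_slope = math.log(p1 / (1 - p1)) - math.log(p0 / (1 - p0))  # 0.58

def sample_slope(n_per_group=500):
    """Draw one random sample of 1,000 and return the estimated slope (log odds ratio)."""
    y0 = sum(random.random() < p0 for _ in range(n_per_group))
    y1 = sum(random.random() < p1 for _ in range(n_per_group))
    return math.log(y1 / (n_per_group - y1)) - math.log(y0 / (n_per_group - y0))

estimates = [sample_slope() for _ in range(2000)]

# Across repeated samples the estimates center on the true slope...
mean_est = sum(estimates) / len(estimates)

# ...and roughly 95% of "estimate +/- 2 SE" intervals cover the truth.
se = 0.16  # standard error reported for this sample size
coverage = sum(abs(b - true_slope) < 2 * se for b in estimates) / len(estimates)
print(round(mean_est, 2), round(coverage, 2))
```

A histogram of `estimates` would look roughly normal and centered at 0.58, which is exactly the behavior the central limit theorem promises.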
So, it's business as usual. Again, our regression result relating response to treatment to baseline CD4 counts in our sample of 1,000 was: the log odds of response equals an intercept of negative 1.67, plus a slope of 0.58, times X_1, where X_1 is an indicator of whether the baseline CD4 count is less than 250 or not. The slope was 0.58 as estimated by the computer, and the corresponding standard error estimate is 0.16. Similarly, the estimated intercept was negative 1.67, and the corresponding standard error estimate is 0.12. So, let's go ahead and construct a 95 percent confidence interval for the slope, and then for the exponentiated slope, the odds ratio. Our slope estimate was 0.58 and the estimated standard error was 0.16, so it's business as usual: we take our estimated slope plus or minus two estimated standard errors, 0.58 plus or minus two times 0.16, giving a 95 percent confidence interval for the true slope, or true log odds ratio, relating response to CD4 count of 0.26 to 0.90. Notice that this does not include the null value for slopes, or log odds ratios, which is zero, so we already know the result is statistically significant at the five percent level. We wouldn't want to present the estimated log odds ratio and its confidence interval as the final answer, though; we would leave the audience hanging if we didn't convert it to the odds ratio scale. To convert to the odds ratio scale, we already know that e to the 0.58 power is 1.78; that's our estimated odds ratio. To get the 95 percent confidence interval on the odds ratio scale, we simply take the estimated endpoints on the slope scale and exponentiate each of them. So, e to the 0.26 gives the lower bound of 1.30 on the odds ratio scale, and e to the 0.90 gives the upper bound of 2.46. Notice that, just as our interval for the log odds ratio did not include zero, when we exponentiate it we do not get an interval that includes the null value of one, which is e to the zero.
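The arithmetic above can be sketched in a few lines of Python, using the slope and standard error reported in this section:

```python
import math

slope, se = 0.58, 0.16  # estimate and standard error from the fitted model

# 95% CI on the log odds ratio (slope) scale: estimate +/- 2 SE
lo, hi = slope - 2 * se, slope + 2 * se   # 0.26 to 0.90

# Exponentiate the estimate and the endpoints to get the odds ratio scale
odds_ratio = math.exp(slope)               # e**0.58, about 1.786
or_lo, or_hi = math.exp(lo), math.exp(hi)  # about 1.30 to 2.46

print(round(lo, 2), round(hi, 2))
print(round(or_lo, 2), round(or_hi, 2))
```

Note that we exponentiate the endpoints themselves; we do not compute "odds ratio plus or minus two standard errors," because the normality is on the log scale.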
So, again, our result is statistically significant. If we wanted a p-value, we could test the null hypothesis that the slope is equal to zero versus the alternative that it's not equal to zero. That's exactly the same as saying the null is that the odds ratio is equal to one versus the alternative that the odds ratio is not equal to one. We assume the null is true; we're doing this on the slope scale, so we assume a null of zero and calculate how far our estimated slope is from it in units of standard error. Our slope estimate was 0.58, and dividing it by its standard error of 0.16, we get 3.6: our result is 3.6 standard errors above zero. If we look this up, or use the computer to translate this into a p-value, as you can probably already guess, the p-value is clearly less than 0.05; in fact it's very small, less than 0.001. That syncs up with the decision we'd make from either of our confidence intervals, as they did not include their respective null values. You could use a similar approach to get a 95 percent confidence interval for the true population-level intercept. We don't need a p-value there, because the intercept does not compare two groups; it's only an estimate of a log odds for a single group. You can then turn that into a 95 percent confidence interval for the exponentiated version, the odds of response in the reference group. So, in summary: this research used simple logistic regression to estimate the association between treatment response and baseline CD4 counts in a population of HIV-positive individuals, using data on a random sample of 1,000. A statistically significant association was found, with a p-value less than 0.001. The results estimate that individuals with lower CD4 counts at the time of treatment, less than 250, have 78 percent greater odds of responding to ART as compared to individuals with higher CD4 counts, with a 95 percent confidence interval of 30 percent to 146 percent greater odds. That corresponds to the confidence interval we had before, 1.30 to 2.46.
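The distance-in-standard-errors calculation and the resulting two-sided p-value can be sketched as follows, using the large-sample normal approximation (via the complementary error function, so no external libraries are needed):

```python
import math

slope, se = 0.58, 0.16

# Distance of the estimate from the null value (slope = 0) in standard errors
z = slope / se                            # about 3.6

# Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))
print(round(z, 2), p_value)               # p well below 0.001
```

As the lecture notes, the computer would handle small-sample adjustments; this normal-based calculation is the large-sample version.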
So now, let's look at a situation where our predictor is continuous. Let's look at the logistic regression we had relating obesity to HDL cholesterol level. That resulting logistic regression was: the log odds of obesity equals an intercept of 1.2, plus a slope of negative 0.034, times X_1, where X_1 is HDL in milligrams per deciliter. The estimated slope is negative 0.034 with an estimated standard error of 0.002, and the estimated intercept is 1.2, with a standard error estimate of 0.11. So, if we did a confidence interval for the slope, which compares the log odds of obesity for two groups who differ by one unit in HDL cholesterol level, we'd simply take our estimated slope of negative 0.034 and add and subtract two estimated standard errors, and we get a confidence interval that goes from negative 0.038 to negative 0.030. All values are negative; this does not include zero, so it's statistically significant. If we exponentiate this result, the resulting odds ratio of obesity comparing two groups who differ by one milligram per deciliter in HDL is 0.967. To get the 95 percent confidence interval on the odds ratio scale, we simply exponentiate the endpoints on the slope scale, which gives us a confidence interval of 0.963 to 0.970; it does not include the null value of one. If we wanted a p-value for this comparison, we would start by assuming the null on the slope scale, that the true log odds ratio, the true slope, is zero. We assume this is the truth and calculate how far our estimate of negative 0.034 is in terms of standard errors: it is 17 standard errors below what we'd expect under the null, so clearly the resulting p-value is less than 0.05; in fact it's very small, well less than 0.001. One could use a similar approach to get a 95 percent confidence interval for the true population-level intercept, and then its exponentiated version, the odds of obesity in persons with an HDL cholesterol level of zero, but of course that's not relevant scientifically.
We could also show, and I'll demonstrate this in the additional examples, though you can probably already think through how to do it, how to get a confidence interval for an odds ratio that compares the odds of obesity for two groups who differ by some value other than one milligram per deciliter, for example, the comparison of 100 versus 80 that we looked at before. So, in summary, this research used simple logistic regression to estimate the association between obesity and HDL cholesterol levels using data on adults from the 2013-2014 NHANES survey. A statistically significant association was found. The results estimate that each additional milligram per deciliter of HDL cholesterol is associated with a 3.3 percent reduction in the odds of obesity: the odds ratio was 0.967, and now we can give the confidence interval of 0.963 to 0.970. So, in summary, the construction of confidence intervals for logistic regression slopes and intercepts is business as usual: take the estimate and add and subtract two estimated standard errors in large samples. In smaller samples, and the computer will handle this without blinking, the 95 percent confidence intervals and p-values are based on exact computations, but again, this will be dealt with by the computer. The interpretation of the CIs and p-values is the same regardless of the sample size they're based on. Confidence intervals for slopes are confidence intervals for log odds ratios, and as we've seen, their endpoints can be exponentiated to get a confidence interval for the odds ratio. Confidence intervals for intercepts are confidence intervals for the log odds that Y equals one for a specific group, the group with X_1 equal to zero. That's not always relevant when X_1 is continuous, but if we get a confidence interval for the intercept, those results can be exponentiated to get a confidence interval for the odds.
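As a preview of that additional example, one standard approach, sketched below under the assumption that this is the method the additional examples use, is to scale both the slope and its standard error by the difference in X before exponentiating, since the standard error of a constant multiple of the slope is that constant times the slope's standard error:

```python
import math

slope, se = -0.034, 0.002  # log odds ratio per 1 mg/dL of HDL, with its SE
diff = 100 - 80            # comparing HDL of 100 versus 80 mg/dL

# On the slope (log odds ratio) scale, a 20-unit difference scales both
# the estimate and its standard error by 20.
est = diff * slope                 # -0.68
se_diff = diff * se                # 0.04
lo, hi = est - 2 * se_diff, est + 2 * se_diff

# Exponentiate to the odds ratio scale
odds_ratio = math.exp(est)                   # about 0.51
or_lo, or_hi = math.exp(lo), math.exp(hi)    # about 0.47 to 0.55
print(round(odds_ratio, 2), round(or_lo, 2), round(or_hi, 2))
```

So a 20 mg/dL higher HDL corresponds to roughly half the odds of obesity, with the whole interval well below one.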
In the next section, we'll show how to use the results from logistic regression equations to estimate the proportion, or probability, of individuals having the outcome given a group's value of X. Again, just as I said with linear regression, in a lot of cases the intercept is not a scientifically useful quantity on its own, especially when the predictor X is continuous, and as such, it doesn't always make sense to compute a confidence interval for the intercept and then exponentiate those results to get a confidence interval for the odds when X equals zero. But nevertheless, even when it's not scientifically meaningful on its own, the intercept is a crucial part of our equation. And again, as with linear regression, in a lot of cases when we estimate the proportion or probability of an outcome given an X value, we want to put confidence limits on that proportion or probability. Since that estimate is a function of the intercept plus some multiple of the slope, its standard error is going to be some function of the standard error of the intercept and the standard error of the slope. While it's a complicated function, and we would defer to the computer for such computations and confidence intervals, rest assured that the standard error of the intercept plays an important role.
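As a sketch of the point estimate only, the confidence limits being the part we defer to the computer, here is how the equation from this section converts a log odds into an estimated probability, taking the obesity model's intercept as 1.2:

```python
import math

b0, b1 = 1.2, -0.034  # fitted obesity-versus-HDL equation from this section

def predicted_probability(x):
    """Estimated probability of obesity at HDL level x (mg/dL)."""
    log_odds = b0 + b1 * x
    # Invert the log odds: p = exp(log odds) / (1 + exp(log odds))
    return 1.0 / (1.0 + math.exp(-log_odds))

# For example, at an HDL level of 50 mg/dL:
print(round(predicted_probability(50), 3))
```

A confidence interval for this probability would require the standard errors of both the intercept and the slope, plus their covariance, which is exactly why the lecture leaves that computation to the computer.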