So we've described some examples, some very basic examples of random variables.

So what we need is a mathematics of random variables in order to use them.

And we have a mathematics of probability. And we've acknowledged we're at least

willing to think of certain kinds of variables as if they're random.

We'd like to put those two ideas together.

So we need functions that connect the rules of probability to random variables. For

discrete random variables, the functions we are talking about are the so-called

probability mass functions. A probability mass function is simply a

function that takes each value that the random variable can take and maps it to its

associated probability. So for a die, for example, p of one

would be one-sixth. And it turns out quite a few functions

satisfy the definition of being a probability mass function.

In fact, you only have to satisfy two rules if you'd like to be a probability

mass function. The first rule is that the function has to be

greater than or equal to zero for every argument x, where x ranges over the collection of

possible values that the random variable can take.

And the second rule is that if you sum the function over all possible values, then you get

one. This is exactly analogous to our

probability statement that the probability of the whole sample space has to be one.

But here we've put it in the terms of a probability mass function.
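
Just as a sketch, here is how you might check those two rules mechanically. This is written in Python rather than R, and the helper name is_valid_pmf and the dictionary representation are assumptions made for illustration, not anything from the slides. It is applied to a fair six-sided die.

```python
from fractions import Fraction


def is_valid_pmf(pmf):
    """Check the two PMF rules for a finite discrete distribution.

    `pmf` is a dict mapping each possible value x to p(x).
    (Hypothetical helper, not something from the lecture.)
    """
    nonnegative = all(p >= 0 for p in pmf.values())  # rule 1: no negative probabilities
    sums_to_one = sum(pmf.values()) == 1             # rule 2: sum over all x is one
    return nonnegative and sums_to_one


# Fair die: p(x) = 1/6 for x = 1, ..., 6. Fractions keep the sum exact.
die = {x: Fraction(1, 6) for x in range(1, 7)}

print(is_valid_pmf(die))                                     # True
print(is_valid_pmf({0: Fraction(1, 2), 1: Fraction(1, 3)}))  # False (sums to 5/6)
```

Exact fractions are used here only so that the "sums to one" check does not trip over floating-point rounding.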

I want to talk a little bit about this notation.

Notice here I have this small x, and when we defined random variables, two pages

previously, we used the capital X. This is very common and maybe slightly

unfortunate notation, but it is used everywhere, so you might as well get used

to it instead of fighting it: we typically use an uppercase letter

to represent the random variable as a conceptual entity.

So if we say capital X, we're talking about a die roll that we could have.

When we use a lower case x, or a lower case y, or a lower case letter of any

sort, we tend to be talking about realized values of the random variable.

So the lowercase x should be something that you are able to plug a number

into, whereas capital X is a conceptual random variable.

It's a conceptual flip of a coin; it's a conceptual roll of a die.

Lowercase x is one or two or three or zero. Okay, it's slightly unfortunate

notation, and it takes a little bit of getting used to.

But I think everyone who works in statistics or probability has gotten

used to it, and everyone does it, so you might as well do it too.

Let's go over an example of constructing a probability mass function.

Let's take the simplest possible example, a coin flip.

So let's let X be the result of a coin flip, where zero represents tails and one

represents heads. Let's assume

that the coin is fair. So we want a function that maps zero to

one-half and one to one-half, and there are infinitely many ways you could write down

that function. Well, we're going to pick one.

Here we write it as p(x) = (1/2)^x (1/2)^(1 - x), that is, one-half raised to the power of x

times one-half raised to the power of one minus x.

And notice that if you plug in x equals zero, you get one-half, and if you plug in

x equals one, you also get one-half. Now let's go to a slightly more

complicated example where we assume that the coin is potentially biased, i.e.,

that it's not fair. So let's let theta be the probability of a

head, in this case expressed as a proportion

between zero and one. So, just as an example, imagine if theta

was 0.3 instead of a half; then we would think that the probability of a head was 0.3

and the probability of a tail was 0.7. But let's leave it as theta for right now.

So we want a function that says the probability of a zero is one minus theta

and the probability of a one is theta. And we see that our function here, p(x) = theta^x (1 - theta)^(1 - x), that is, theta

to the x times one minus theta to the one minus x, exactly satisfies these properties.
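
If you want to convince yourself, the function is easy to type in and try. This is a minimal sketch in Python; the name coin_pmf is made up for illustration, and theta = 0.3 is just the example value mentioned above.

```python
def coin_pmf(x, theta):
    """p(x) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (tail) or 1 (head)")
    return theta ** x * (1 - theta) ** (1 - x)


theta = 0.3  # example value only; in practice theta is unknown

print(coin_pmf(1, theta))  # probability of a head, i.e. theta
print(coin_pmf(0, theta))  # probability of a tail, i.e. 1 - theta
print(coin_pmf(0, 0.5), coin_pmf(1, 0.5))  # fair coin: one-half for both
```

Plugging in x = 1 switches off the (1 - theta) factor, and plugging in x = 0 switches off the theta factor, which is all the formula is doing.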

This is another common notation in the field of statistics: Greek letters

like theta represent the things we don't know but would like to know.

So imagine you had a coin and you didn't know whether or not it was fair; we

would represent that unknown probability of a head as theta.

So I want to give you a sense of where we are going.

In this case the probability mass function is the entity that governs the

population of coin flips. And so, if we want to know theta, we

are going to collect data to estimate it, and then evaluate the uncertainty in that

estimate. And the way we are going to evaluate the

uncertainty in that estimate is by using this probability distribution.

So all the probability distributions we are going to talk about are conceptual

models of populations and they are the entities that are going to tie our data to

the population. So at any rate, right now this may sound a

little heavy, and we'll discuss this in much more detail throughout the entire

class, but the one rule I want you to remember right now is that unknown things

that we want to know, like in this case, what would be the probability of a head,

are generally denoted by Greek letters. These are usually called parameters.

I also want to note one other thing. Why is it that, among all the possible ways

we could have written out this probability mass function, we chose theta to the

x times one minus theta to the one minus x? There are lots of different ways we could

have done this. You can try and figure some of them out

yourself. Well, it turns out, and we'll discuss this

at length, that in probability, multiplying is very useful.

And so we want probability mass functions that make

multiplication very easy. If we take things and raise them to

powers, then multiplying becomes easy, because exponents of a common base add. And that's a general rule.

You'll see later on why this is the case.

But at any rate, this is why we choose this particular form

of the probability mass function when you could write it so many different ways.
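
As a rough preview of why the power form multiplies so nicely, here is a small Python sketch. It assumes the flips are independent so that their probabilities multiply, which has not been covered at this point; the data and the value of theta are made up for illustration.

```python
import math


def coin_pmf(x, theta):
    """p(x) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta ** x * (1 - theta) ** (1 - x)


flips = [1, 0, 1, 1, 0]  # made-up data: 1 = head, 0 = tail
theta = 0.3              # illustrative value only

# Multiply the PMF across flips (assumes the flips are independent).
product = math.prod(coin_pmf(x, theta) for x in flips)

# Because each factor is a power of theta or (1 - theta), the exponents
# just add up: theta^(number of heads) * (1 - theta)^(number of tails).
heads = sum(flips)
tails = len(flips) - heads
collapsed = theta ** heads * (1 - theta) ** tails

print(math.isclose(product, collapsed))  # True
```

The whole product collapses to a function of the number of heads and the number of tails, which is the kind of simplification the power form buys you.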

But I want to say, people have thought about this a lot, and this is

definitely the most useful way to write out this particular probability mass

function. So consider again the unfair coin.

Our probability mass function satisfies p of zero equals one minus theta and p of

one equals theta. Let's just go through the exercise to

prove to ourselves that this is in fact a probability mass function.

It's greater than zero because it's one minus theta for zero and theta for one, and

in this case theta is between zero and one.

So it's going to be greater than zero for x equal to zero and one.

And then the sum of the probabilities, the probability of zero plus the probability

of one, is in this case one minus theta plus theta, which is one.

So it satisfies the two rules that probability mass functions have to

satisfy. So that covers the principal entity that

we're going to use to model discrete random variables, probability mass

functions. Now we need to cover the principal

entity that we're going to use to model continuous random variables, which are

called probability density functions. Probability density functions are

abbreviated PDF, by the way. That stands for probability density function, not

portable document format, which is what lots of people think of when they hear PDF, but in

statistics no one thinks of PDFs that way. I want you to remember one very important

rule, and I put it in italics to make sure everyone remembers it. By the end

of the course this will be second nature to you, but if you haven't seen it before, it

might seem a little odd. The way that probability density

functions work is that areas under probability density functions correspond

to probabilities for the random variable. And there's definitely one undisputed king

of all PDFs, and that is the so-called bell curve.

So if you've ever wondered what a bell curve is, since you hear it talked about a lot,

the so-called normal density function, you might wonder what in the world a bell

curve is accomplishing. Well,

areas under bell curves correspond to probabilities.

So if you're modeling something as if the population it belongs to follows a bell

curve, then you are saying that the probabilities associated with that random

variable are governed by areas under that bell curve.

That's just one example of a PDF. There are lots of different kinds of

PDFs. So just like probability mass functions

have to follow two rules, probability density functions have to follow two rules

to be valid.

They have to be non-negative everywhere, including on all the possible values that the random variable can take

(that set is usually called the support), and their integral has to be one.

I would also make a small point here. We define probability density functions as

if they operate on the whole real line.

So even if your random variable can only take values, say, between zero and two, like

we talked about earlier with the pencil experiment,

we define the probability density function as zero below

zero and zero above two, so that there's no associated probability there. But we've defined

the density on the whole real line so that we can take its integral from minus

infinity to plus infinity. And I think in this class we tend to be a

little bit fuzzy about this: sometimes we will integrate from minus infinity to plus infinity, and

other times we will just write out zero to two, discarding all the area where the

function is zero, and I hope from the context it will be clear what we are

doing. This final property, property two here,

that the integral over the whole real line of the probability density function has to

be one, is simply again saying that the random variable has to take some value,

that it has to be in some interval on the whole real line.

Let's go through a specific example of a PDF, and let's put it in a context.

So let's assume that the time in years from diagnosis until death for a specific kind

of cancer follows a density that looks like this:

f of x equals e to the negative x over five, divided by five, for x greater than

zero. The greater than zero is contextually

clear, because you can't have negative time from diagnosis and the person is

presumably alive at the time of diagnosis.

This is a very restricted example of a density that's commonly used in these

sorts of analyses of things like survival times.

It's called the exponential density function.

And again here you see that we have f(x) written as e to the negative x over five,

over five, for x bigger than zero, and zero otherwise. Like I talked about on the

previous slide, we often just ditch that zero and talk about f(x) being the

kernel of the function, and then either explicitly write, or sometimes

fudge a little bit, that x has to be greater than zero, if it is clear from the

context that that has to be the case. In this case it would be clear from the

context. Is this a valid density?

Could we model survival time after diagnosis with this density?

Well, first of all we know that the function is positive, because e raised to

any power is always positive. Then let's just check whether or not it

integrates to one. We want the integral from minus

infinity to plus infinity, but like we said, all of the meat of the distribution

starts at zero and goes to infinity, so let's just take the integral from zero to

infinity of f(x) dx. In this case the antiderivative is negative e to the

negative x over five, which when evaluated from zero to infinity yields one.
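
Written out, that check looks like the following. It is just the spoken calculation put into symbols, using the density f(x) = e^{-x/5}/5 for x > 0 and zero otherwise.

```latex
\int_{-\infty}^{\infty} f(x)\,dx
  = \int_{0}^{\infty} \frac{1}{5}\, e^{-x/5}\,dx
  = \Bigl[ -e^{-x/5} \Bigr]_{0}^{\infty}
  = 0 - (-1)
  = 1 .
```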

Let's go through an example of using this probability density function

to assign probabilities. Imagine we were to model this

population as if it followed this specific

exponential probability distribution. And imagine someone asked us the

question, "What's the probability that a randomly selected person from this

population survives more than six years?" So if X is the

conceptual value that a random person takes,

we want to know, what's the probability that X is greater than or equal to six?

That is represented by this probability statement.

Remember again the golden rule for probability density functions: areas

under the curve correspond to probabilities.

So if we want the probability that X is greater than six,

we want the integral from six to infinity of the probability density

function, and you can go through the calculus here to see that it works out to

be about 30%. In the statistical programming language R

you can do this automatically; it just evaluates the integral for you, and you just write

pexp(6, 1/5, lower.tail = FALSE), for the fact that we want the

probability of six or larger. One-fifth is the rate, the reciprocal of the parameter five

that you see in the exponential distribution.

lower.tail = FALSE means that we want the probability of being larger than six

rather than the probability of being smaller than six.
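
For anyone who wants to see the calculation outside of R, here is a hedged Python sketch of the same upper-tail probability, done two ways: from the antiderivative, and by crudely adding up area under the density. The function names, the mean = 5 parameterization, and the cutoff standing in for infinity are all choices made for this illustration.

```python
import math


def exp_density(x, mean=5.0):
    """f(x) = exp(-x / mean) / mean for x > 0, and 0 otherwise."""
    return math.exp(-x / mean) / mean if x > 0 else 0.0


def upper_tail_closed_form(q, mean=5.0):
    """P(X > q) = exp(-q / mean), from the antiderivative -exp(-x / mean)."""
    return math.exp(-q / mean)


def upper_tail_numeric(q, mean=5.0, upper=200.0, steps=200_000):
    """Midpoint-rule approximation of the integral of f from q to `upper`.

    `upper` stands in for infinity; the area beyond it is negligible here.
    """
    width = (upper - q) / steps
    total = sum(exp_density(q + (i + 0.5) * width, mean) for i in range(steps))
    return total * width


print(round(upper_tail_closed_form(6), 3))  # 0.301
print(round(upper_tail_numeric(6), 3))      # 0.301
```

Both routes give about 30%, matching the number quoted above.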

So lower.tail = TRUE will give you six or smaller, and lower.tail = FALSE

will give you larger than six, which here is the same as six or larger. I want to elaborate on that point, by the

way. For a continuous random variable, the

probability that it takes any specific value is in fact zero.

Now that seems strange, but it's true. So remember areas under probability

density functions correspond to probabilities.

So what's the area of a line? It's zero.

Now, you might say that doesn't make any sense at all.

Specific values have to have probabilities, because we see specific values when we

actually observe variables. The point is that our

probability density function is a model, and it is defined for continuous random

variables. Continuous means measured to infinite

precision. And when we observe things, we never

measure them to infinite precision; we only ever measure them to finite precision.

And probability density functions are perfectly happy with taking, say, the

probability that X is between 5.99 and 6.01 and assigning a perfectly valid probability to

that. But the probability that it is exactly six is

zero. Because remember, exactly six means 6.0

followed by an infinite trail of 0s, or 5.99 followed by an infinite trail of 9s.

Either way, that's the idea behind what probability density functions are getting

at. They're modeling truly continuous random

variables. So just remember that when we observe

data, we of course measure them with finite

precision, but our continuous

model is exactly that: it's a model. We find it far more useful in many

circumstances to model random variables as if they were truly continuous

than to account for all the potential specific values they could take.
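
To make the "exactly six has probability zero" idea concrete, here is a small Python sketch using the same exponential density with mean five. The function name prob_between is made up for illustration. It computes the probability of an interval around six and then shrinks the interval.

```python
import math


def prob_between(a, b, mean=5.0):
    """P(a < X < b) for the exponential density with the given mean, 0 <= a <= b."""
    return math.exp(-a / mean) - math.exp(-b / mean)


# Shrink the interval around six; the probability shrinks with it.
for half_width in (0.01, 0.001, 0.0001, 0.0):
    p = prob_between(6 - half_width, 6 + half_width)
    print(half_width, p)
```

The interval from 5.99 to 6.01 gets a perfectly valid probability of roughly 0.0012, and as the interval narrows down to the single point six, the probability goes to zero.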

So, in this specific example, we will only measure how long a person survives to

the year, maybe to the month, maybe to the day,

maybe to the hour, to the minute, to the second, but probably not much

further than that. And so we're only going to measure to

finite precision. Nonetheless, it is still much more

useful to model that as if it were continuous, because we don't want to have

to assign probabilities to every single value.

We want to assign a general function. And that's why

continuous random variables are so intrinsically useful.

So the belabored point I'm trying to make

with this, by the way, is that whether you write the probability of X being greater than

or equal to six, or the probability of X being strictly

greater than six, in this case doesn't change the calculation whatsoever.

You get 0.301 either way. And so it also doesn't make a difference

in the pexp call for our example.

It doesn't matter. Whether you specify lower tail or upper

tail, whether or not you're thinking of that as including six, it

doesn't care about that. However, for discrete random variables, it

makes a big difference, right? Because specific values

have actual probabilities assigned to them; a die, for instance, can take the value one, two,

three, four, five, or six. So in R, if you are using these probability

functions, pexp for probabilities from the exponential distribution, pbinom

for probabilities from the binomial distribution, ppois

for probabilities from the Poisson distribution, pgamma for probabilities from

the gamma distribution, they all follow that rule pretty neatly.

If it's a discrete random variable, you have to be careful about whether or not

it's including the six. For a continuous random variable, you can

be very sloppy about it. So here I'm just depicting the area that

we're calculating. This grey area covers the survival times from

six to infinity. This is simply the integral that we're

actually calculating, and I'll put the R code to generate exactly this figure in

the files for the course.
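
Separately from that figure code, here is a short Python sketch of the discrete-versus-continuous point made above. It assumes a fair die for the discrete case and the same mean-five exponential for the continuous case; all variable names are made up for illustration.

```python
import math
from fractions import Fraction

# Continuous: exponential survival time with mean 5 years.
# A single point has zero area, so P(X >= 6) and P(X > 6) are both exp(-6/5).
p_continuous = math.exp(-6 / 5)

# Discrete: a fair die, where each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
p_at_least_6 = sum(p for face, p in die.items() if face >= 6)
p_more_than_6 = sum(p for face, p in die.items() if face > 6)

print(round(p_continuous, 3))  # 0.301
print(p_at_least_6)            # 1/6
print(p_more_than_6)           # 0
```

For the die, including or excluding six changes the answer from one-sixth to zero, whereas for the survival time it makes no difference at all.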