a value of a particular data point for a particular sample.

So for example, we might use the letter C to represent the color

of one of the symbols in our data set,

so in this case this upper right pink symbol is denoted with a capital C.

If we have more than one value that we need to denote,

we usually do that with subscripts.

If you want to denote the values for each of the different samples, in this case,

you would look at the three different symbols that we have in our sample.

Each one gets a subscript, so you get C1, C2 and C3.

That's how you represent the values we've measured for those data.

And then the next thing that you would want to do is calculate

an estimate that you can use to infer something about the population.

So we don't know the whole population, but we can get an estimate of the parameter

in the population by calculating a similar function on our sample.

So in this case, say we want to estimate the fraction of pink symbols,

what we would do is just count the fraction of pink symbols in our sample.

And so when we do that, we find that two thirds of the sample values are pink.

And so that population parameter, theta,

remember, represents the fraction in the real population.

Our estimate of that is theta hat.

And we almost always use hats over Greek symbols to represent the estimate

that we get from the sample that we've taken.

And so we use that to infer back to the bigger population.
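
As a rough sketch of that calculation in Python, assume a made-up sample of three symbols where two are pink. The non-pink color and the names `c` and `sample_proportion` are only for illustration, they are not from the lecture slides.

```python
from fractions import Fraction

# Hypothetical sample: c[0], c[1], c[2] play the role of C1, C2, C3.
# The colors are assumed; only "two of the three are pink" comes from the lecture.
c = ["pink", "blue", "pink"]

def sample_proportion(values, target):
    """Fraction of sample values equal to target (the estimate, theta hat)."""
    matches = sum(v == target for v in values)
    return Fraction(matches, len(values))

theta_hat = sample_proportion(c, "pink")
print(theta_hat)  # 2/3
```

The population value theta stays unknown here. The code only ever sees the sample, which is why the result is written with a hat.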

We'll talk more about that later. So, just to summarize:

Data points are usually represented by letters.

When we're talking about hypothetical values of the data,

they're usually capital letters, and when we're talking about

concrete values of specific data points, they're usually lowercase.

If we have more than one value of a particular variable, we use subscripts.

So there's C1, C2, C3.

And sometimes we write X for more than one variable.

We do this to make the math notation easier.

It can be very confusing and frustrating, but

sometimes X sub 1, X with a subscript 1, will represent one variable and

X sub 2, X with a subscript 2, will represent a second variable.

In that case, what we need to do is add another subscript to be able to indicate

which person we measured those different values on, so X11 might be the count for

gene one on person one.

And then you would get X21, which is the count for gene 2 on person 1,

and so forth.

And so sometimes you have multiple subscripts to annotate

the different variables as well as the different samples in your sampled set.
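
One way to picture the double subscripts is as a small table, sketched here in Python. The counts are made up, and the helper `x` is a hypothetical name. The first subscript picks the gene and the second picks the person, matching X11 and X21 above.

```python
# Made-up counts: each row is a gene (first subscript),
# each column is a person (second subscript).
counts = [
    [5, 8, 2],  # gene 1 measured on persons 1, 2, 3
    [0, 3, 7],  # gene 2 measured on persons 1, 2, 3
]

def x(gene, person):
    """Return X_{gene, person} using the 1-based subscripts from the lecture."""
    return counts[gene - 1][person - 1]

x11 = x(1, 1)  # count for gene 1 on person 1
x21 = x(2, 1)  # count for gene 2 on person 1
print(x11, x21)
```

Python lists start counting at 0, so the helper subtracts 1 to keep the subscripts the same as in the math notation.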

And so, when we want to talk about quantities in the whole population,

we use Greek letters, so usually here we use

theta to represent the proportion of pink symbols in the population.

Here's a more concrete example.

Suppose you wanted to measure the heights of everybody in the US, and

you wanted to look at the average height of a person in the United States.

You could call that value theta. Obviously it's expensive to measure

the heights of everyone in the US, so you would take a sub-sample of the people.

So suppose you took a random sample of 1,000 people, and

you took their average height.

That would be an estimate of the population parameter, and

you would denote that with a hat, so

theta hat would be the estimated average height in the population.
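
Here is a minimal simulation of that idea in Python. Everything numeric is an assumption for illustration: the population is simulated rather than real, its size (100,000) and the height distribution (mean 170 cm, standard deviation 10 cm) are made up, and only the sample size of 1,000 comes from the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated stand-in for the full population (assumed values, not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]

# theta: the population parameter, which in practice we could not afford to measure.
theta = statistics.mean(population)

# theta hat: the same calculation done on a random sample of 1,000 people.
sample = random.sample(population, 1000)
theta_hat = statistics.mean(sample)

print(f"theta     = {theta:.2f}")
print(f"theta_hat = {theta_hat:.2f}")
```

In a real study only the second half would be possible. The simulation computes theta as well just to show that theta hat lands close to it.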

In regression models, which we'll talk about a lot later, we usually treat

the variable Y as the outcome and the X variables as the covariates,

or the variables that you're trying to predict the outcome from.

So the two most common letters used to represent variables in statistics are Y

for the outcome and X for the covariate variables.

And so that's how you represent data with mathematical notation.