So, let's talk a little bit about something that will

form the basis for what we do through both terms of this course.

Looking at samples as imperfect representations of some larger population.

So, upon completion of this lecture section,

you should be able to explain the difference between

a population and a sample so far as the terms are used in research.

Give examples of populations and of a corresponding sample from a population.

Explain that the characteristics of a randomly selected data sample should

imperfectly mimic the characteristics of the population from which the sample was taken.

Also explain how non-random samples may differ

systematically from the populations from which they were taken.

So, in general, a population is the entire group about which we want some information.

For example, the population of

all 18-year-old male college students in the United States in a given year.

A sample is a subset, or part, of a larger group,

the population, from which information is collected to learn about the larger group.

So, in order to study the population of

all 18-year-old male college students in the United States in the year 2017 for example,

we might take a sample of

25 18-year-old male college students in the United States in the year 2017.

So, a small subset of that larger group.

So, one of the concepts we'll spend a fair amount of time on in this quarter

is rectifying the fact that estimates based on samples are based on

imperfect realizations of the larger process

or population that we can't observe directly.

So, one of the things we'll have to contend with is

the error in the estimate that comes from

the fact that we don't observe everyone in the population just those in our sample.

So, for studies it is optimal if the sample which

provides the data is representative of the population under study.

Certainly, that's not always possible, but for this term,

we will make this assumption unless otherwise specified.

So, one way of getting

a representative sample is something called simple random sampling.

This is a sampling scheme defined for samples of

a given size, which we'll call n, whether it be 20 people,

100 people, or n equals 1,000 people.

n is just the number of subjects in our sample.

So, simple random sampling is a scheme in which

every possible subset of a given size,

size n, from a population is equally likely to be selected.

Generally speaking, you can think of it this way:

if you're able to enumerate the entire population, you put

every member of the population's information onto slips of paper in a hat,

shake the hat up, and pull out n slips of paper.

Certainly we can get the computer to do this,

but that's the idea of a simple random sample.
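To make the hat idea concrete, here is a minimal Python sketch of getting the computer to draw the slips. The population of 1,000 ID numbers, the sample size of 20, and the function name are all made up for illustration; the only thing it relies on is that the standard library's random.sample draws without replacement, so that every subset of size n is equally likely.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement, like pulling n slips from a hat."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(population, n)


# Hypothetical enumerated population: ID numbers stand in for the slips of paper.
population_ids = list(range(1, 1001))

sample_ids = simple_random_sample(population_ids, n=20, seed=1)
print(sample_ids)
```

Note that this only works if the population can be enumerated in the first place, which is exactly the list of IDs the sketch assumes we have.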

So, the idea is that with research we generally want to study

the population; we want to learn truths about a population,

but it's only practical to estimate these truths from

an imperfect sample of observations from the population.

The idea, though, is that if we have a representative sample, then the distribution of

characteristics in the sample will mimic, if not perfectly,

the characteristics of the population.

So, for example, suppose we had a population whose age and

sex distribution was as follows: 20 percent were males less than 30 years old,

15 percent were males 30 or older,

26 percent were females less than 30, and

39 percent were females 30 or older, or whatever population we have.

Then when we take the sample, we'd expect its distribution

to be approximately the same in terms of these characteristics.

So, approximately 20 percent of the sample should

be males less than 30, and approximately 15 percent of the sample should be males 30 or older.

It may not be exactly that distribution,

but the idea is it should be close if our sample is

representative, and if it's randomly sampled,

it should be representative. And certainly,

the same goes for the other groups as well:

approximately 26 percent females less than 30 years old and

39 percent females 30 or older.
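Here is a small Python sketch of that idea, under some assumptions: the four percentages are the ones from the example, but the population size of 10,000, the sample size of 500, and the seed are arbitrary choices made only for illustration. It builds the population, draws a simple random sample, and prints the sample shares next to the population shares.

```python
import random
from collections import Counter

# Age and sex distribution from the example.
population_shares = {
    "male, under 30": 0.20,
    "male, 30 or older": 0.15,
    "female, under 30": 0.26,
    "female, 30 or older": 0.39,
}

POPULATION_SIZE = 10_000  # hypothetical size, chosen for illustration
SAMPLE_SIZE = 500         # hypothetical size, chosen for illustration

# Build one record (just a group label) per member of the population.
population = []
for group, share in population_shares.items():
    population.extend([group] * round(share * POPULATION_SIZE))

rng = random.Random(2017)
sample = rng.sample(population, SAMPLE_SIZE)  # simple random sample

counts = Counter(sample)
sample_shares = {group: counts[group] / SAMPLE_SIZE for group in population_shares}

for group, share in population_shares.items():
    print(f"{group:20s} population {share:.2f}   sample {sample_shares[group]:.2f}")
```

The sample shares should land close to, but usually not exactly on, the population shares, and changing the seed will move them around a little.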

So, for example, if researchers wanted to

learn about the pulmonary health of a clinical population of men,

they may only be able to examine a certain number.

In this case, they were able to sample 113 men from

this population and measure the systolic blood pressure of each man.

So, the huge population is all men in

this clinical group and what the researchers were able to get was

a subset of 113 from that larger population.

In another study, researchers wanted to characterize the risk of mother-to-infant

HIV transmission amongst HIV-positive pregnant women.

They wanted to follow the children for

up to 18 months after birth to see who developed HIV.

The researchers wanted to study the population of

births to HIV positive women.

They certainly couldn't study all such women,

but they got a representative sample of 183 births from this population.

What they found was that in this sample 22 percent of

the children developed HIV within 18 months.

So, the idea is that 22 percent is a good, if not perfect, estimate of

the true proportion of children who would develop HIV

in this entire population of children born to HIV positive women.

So, as we go through this course,

we'll talk about ultimately rectifying the fact that

our estimate is based on an imperfect subset of a larger group and

how we can incorporate our uncertainty about

this estimate into a statement about the true underlying quantity, in this case the

proportion of transmissions in the population from which the sample is taken.
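To get a feel for why the estimate carries uncertainty, here is a rough Python simulation. It assumes, purely for illustration, that the true population proportion is exactly 0.22 (in reality we never know this) and that the population is large enough that each sampled birth can be treated as an independent draw. It then repeats the study 1,000 times with 183 births each and looks at how much the sample proportion moves from one sample to the next.

```python
import random

TRUE_PROPORTION = 0.22  # hypothetical "truth", assumed only for this simulation
SAMPLE_SIZE = 183       # sample size from the example
NUM_SAMPLES = 1000      # number of imagined repeat studies (arbitrary)

rng = random.Random(18)


def sample_proportion(rng, n, p):
    """Proportion of n independent draws that turn out to be events."""
    events = sum(1 for _ in range(n) if rng.random() < p)
    return events / n


estimates = [
    sample_proportion(rng, SAMPLE_SIZE, TRUE_PROPORTION)
    for _ in range(NUM_SAMPLES)
]

mean_estimate = sum(estimates) / len(estimates)
print(f"smallest estimate: {min(estimates):.3f}")
print(f"largest estimate:  {max(estimates):.3f}")
print(f"average estimate:  {mean_estimate:.3f}")
```

Each simulated study gives a somewhat different answer even though the underlying truth never changes, and that spread is the uncertainty we'll need to account for.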

Suppose researchers wanted to study geographic variation in

lung cancer cases and potential factors associated with lung cancer,

such as biological sex, environmental exposures,

and access to health care,

using data from a single year for a single state.

So, we can think about what their population is here.

It could be that the population of interest is

just that one year from that one state.

But in many cases, even data sets that seem definitive, in the sense that they encompass

everyone in a particular group,

are intended to be a representative sample of some larger process.

Researchers may be looking at factors associated with lung cancer in

the US, and they may use the results from

a single state to hopefully be a representative sample of that process.

So, other types of sampling are frequently necessary.

It's actually very hard to get a simple random sample

because enumerating the population of interest is a challenge.

Certainly, in many situations,

that's not possible especially for populations we wish to study in public health.

So, other types of sampling may be necessary but may also result in samples

whose elements do not reflect the makeup of the population of interest.

So, we will call systematic differences between

our sample characteristics and those of the population bias.

So, such non-random samples can occur when we

try to sample voters in the United States, not those who are registered,

but those who actually vote.

It's very difficult to get a representative sample of

people who actually vote in a US presidential election.

Or suppose we wanted to study intravenous drug users in Chennai, India.

There's certainly no master list of such drug users, so

we would have to think very deeply about how to choose a sample from that population.

Similarly, we might want to study patients with

a certain disease where there's no registry for the disease,

or look at homeless persons in Baltimore,

or look at men who have sex with men in Malawi.

In each case, it's going to be difficult to obtain a random sample.

As I said, we may still very much want to study these populations,

but we may have to caveat our results with the fact that our sample may

end up with characteristics systematically different from those of the population.

So, for example, suppose our population looks like this; maybe this

is IV drug users in a certain city.

If males are less likely to be open about

their drug use, we may end up with a sample that's predominantly female.

We may end up with 40 percent females who are less than 30,

50 percent females who are 30 or older,

and only 10 percent males.

I'm just being hypothetical here:

10 percent males split across the two age groups.

So, this sample will systematically differ in

its composition from the population we hope to study.
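Here is a hedged Python sketch of how that kind of bias can arise. It reuses the age and sex distribution from the earlier example as the population, and it invents participation probabilities (10 percent for males, 60 percent for females) purely for illustration; none of these rates come from real data, and the resulting shares are not meant to reproduce the hypothetical 40/50/10 split exactly.

```python
import random
from collections import Counter

# Population distribution reused from the earlier example.
population_shares = {
    "male, under 30": 0.20,
    "male, 30 or older": 0.15,
    "female, under 30": 0.26,
    "female, 30 or older": 0.39,
}

# Hypothetical chance that a member of each group agrees to take part.
participation = {
    "male, under 30": 0.10,
    "male, 30 or older": 0.10,
    "female, under 30": 0.60,
    "female, 30 or older": 0.60,
}

POPULATION_SIZE = 10_000  # hypothetical size, chosen for illustration

population = []
for group, share in population_shares.items():
    population.extend([group] * round(share * POPULATION_SIZE))

rng = random.Random(7)

# Non-random sample: whoever is willing to take part ends up in the sample.
volunteers = [person for person in population if rng.random() < participation[person]]

counts = Counter(volunteers)
sample_shares = {group: counts[group] / len(volunteers) for group in population_shares}
male_share = sample_shares["male, under 30"] + sample_shares["male, 30 or older"]

for group, share in population_shares.items():
    print(f"{group:20s} population {share:.2f}   sample {sample_shares[group]:.2f}")
print(f"males overall: population 0.35, sample {male_share:.2f}")
```

Unlike the random sample, taking a bigger sample here does not fix the problem: the male share stays too low no matter how many people we recruit, because the difference is systematic rather than due to chance.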

So, let's think about this,

what kind of sampling strategies can be

employed that may or may not result in a random sample?

A lot of times, what is done in the United States and other countries

when pollsters want to predict the outcome of an election

is random digit phone dialing to reach folks.

Certainly in many countries it's become more common for

people to have telephones than it was 30 or 40 years ago, and perhaps

the bias from people not

having access to a phone is not as large now as it was in other years,

but certainly there are also biases

in who actually has a listed phone number that can be reached, et cetera.

So, it's not clear when they do random digit

dialing whether they're getting a representative sample or not.

How might we study populations that can

be considered to be marginalized in some situations,

like intravenous drug users,

or homeless persons, or men who have sex with men?

Well, some strategies involve convenience sampling:

sampling from those who show up to a certain clinic,

participate in a needle exchange, or have stayed at a homeless shelter.

Another approach that is becoming more common

is called respondent-driven sampling.

Here, we start with a few persons or subjects from

the population and ask them to recruit their colleagues and associates as well.
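As a toy illustration of just the recruitment chain, here is a Python sketch. Everything in it is hypothetical: the contact network, the single starting subject "A", the limit of two recruits per person, and the function name. Real respondent-driven sampling studies also involve things like recruitment coupons and statistical adjustments that this sketch does not attempt.

```python
import random
from collections import deque


def respondent_driven_sample(contacts, seeds, target_size, max_recruits=2, seed=None):
    """Grow a sample by letting each participant recruit a few of their contacts."""
    rng = random.Random(seed)
    recruited = list(seeds)  # everyone in the sample, in order of recruitment
    seen = set(seeds)
    queue = deque(seeds)     # participants who have not yet been asked to recruit

    while queue and len(recruited) < target_size:
        recruiter = queue.popleft()
        candidates = [c for c in contacts.get(recruiter, []) if c not in seen]
        rng.shuffle(candidates)
        for person in candidates[:max_recruits]:
            if len(recruited) >= target_size:
                break
            seen.add(person)
            recruited.append(person)
            queue.append(person)
    return recruited


# Hypothetical contact network: who knows whom.
contacts = {
    "A": ["B", "C", "D"],
    "B": ["A", "E"],
    "C": ["A", "F", "G"],
    "D": ["A"],
    "E": ["B", "H"],
    "F": ["C"],
    "G": ["C", "H"],
    "H": ["E", "G"],
}

sample = respondent_driven_sample(contacts, seeds=["A"], target_size=5, seed=3)
print(sample)
```

Notice that who ends up in the sample depends heavily on who the first few subjects know, which is one reason the sampling procedure has to be kept in mind when interpreting results.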

Again, these are very creative ways of dealing with situations where

it's hard to get a representative random sample,

but nevertheless where the population is of interest to be studied.

So, generally speaking, with regard to public health and medical research,

not all elements of a population can be studied.

As such, a sample is taken from the population of interest.

Random sampling is the best strategy for getting a sample

whose characteristics imperfectly mimic the population.

What we'll be dealing with in this class as I said before,

is how to deal with the fact that our estimates are

based on an imperfect realization of the population we want to

study, and how to account for the uncertainty in our estimates that comes

from the fact that they're based on this imperfect subset.

Random sampling as we've discussed is not always feasible,

and other approaches can be used,

and the sampling procedure needs to be considered when

applying the results from the sample to the larger population.