This lecture is about data.

So, we're going to be talking a lot about Statistics and Statistics for Genomics.

And Statistics is the science of learning generalizable knowledge from a set

of data.

So, first we're going to look at the definition of data and

see if we can unpack it a little bit.

So, the definition from Wikipedia is that data are values of qualitative or

quantitative variables measured on a set of items.

And so, the set of items could be the population that you care about, or

more generally when you're actually making the measurements,

it's the set of items that you have in your sample.

So it's the set of

people that you've collected in order to measure genomic data on them.

And so, then the next thing that you would do is you might take measurements on

them and so those are the variables.

So the measurements might be, in a genomic study, measures of transcription or

gene expression, measures of variants in those people, that is, DNA variability, or

you might measure epigenetic quantities and so forth.

So those are the variables that you've measured on that set of items,

the set of people in your collection.

And the values might be either qualitative or quantitative.

And so, qualitative variables are variables that take only a fixed set of

values and don't necessarily fall on a quantitative scale.

So you might think about something like what batch a set of samples was run in

as a qualitative variable.

Whereas, a quantitative variable might be something like gene expression level where

there's a continuous set of values that the gene expression can take.

So those are the two kinds of variables that you often see in a genomic study.
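
To make that definition concrete, here is a minimal sketch of data as values of variables measured on a set of items. The sample IDs, batch labels, expression values, and the helper `variable_kind` are all made up for illustration; they are not from any real study.

```python
# Hypothetical toy data: three samples (the items), each with two measured variables.
samples = [
    {"sample_id": "S1", "batch": "A", "expression": 7.2},
    {"sample_id": "S2", "batch": "B", "expression": 3.9},
    {"sample_id": "S3", "batch": "A", "expression": 5.4},
]


def variable_kind(values):
    """Call a variable quantitative if every value is a number, else qualitative."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if is_number else "qualitative"


# Classify each measured variable across the set of items.
kinds = {
    name: variable_kind([s[name] for s in samples])
    for name in ("batch", "expression")
}
print(kinds)  # {'batch': 'qualitative', 'expression': 'quantitative'}
```

This rule of thumb is deliberately crude: a batch recorded as 1, 2, 3 would look quantitative here even though it is really a qualitative label, so in practice you decide the kind of variable from what was measured, not from how it was stored.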

And so, there are a couple of levels of data that you see in Genomics and so,

the two that are most common that we will deal with are raw data and processed data.

Now raw data is sort of relative to the person that you're talking to, so

some people think of raw data as one thing and

some people think of raw data as another thing.

And so I'm going to explain one example of a processing pipeline here,

and what is raw data and what's processed data.

So we're going to look at an Illumina sequencing machine and

we're going to look at the data, how they're generated and

what are the different levels the data might be measured at.

So this is an example of a processing pipeline from Illumina.

So what happens is you have these sequences and

they get attached to a slide.

And then there's a PCR process that amplifies those sequences into

little clusters where each cluster has all the same sequence in it.

Then what the sequencing machine does is through a set of chemical steps,

it attaches A, C, T or G, each with a different color and

it happens one slice of the sequence at a time.

So at each slice of the sequence it takes an image, and

all the sequences that are colored with a blue dot are, say, a C and

all the ones colored with a yellow dot are an A and so forth.

And so then you can take those image files and

process them into how likely it is that in each little dot on the slide,

you have an A, C, T, or G at every single slice through the data.

And so there's a likelihood associated with each call,

so there's variability associated with that.

So it means that it's not guaranteed that it's going to

be an A at any given spot.

So another example of the raw data might be how likely each spot is to be

an A, C, T or G.

You might finally process that into the actual sequence files,

which are typically what we think of as raw data because that's usually

what comes out of the sequencing core.
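
As a rough sketch of that last step, going from per-slice likelihoods to a called sequence, you could pick the most likely base at each slice and keep a quality score alongside it. This is not how the actual Illumina software works, which is far more involved; the functions `call_base` and `call_read`, the probabilities, and the cap at 60 are assumptions for illustration. The Phred scale itself, Q = -10 * log10(P(error)), is the standard way sequence files report this uncertainty.

```python
import math

BASES = "ACGT"


def call_base(probs):
    """Return (base, phred_quality) for one slice, given P(A), P(C), P(G), P(T)."""
    best = max(range(4), key=lambda i: probs[i])
    p_error = 1.0 - probs[best]
    # Phred scale: Q = -10 * log10(P(error)).
    # Assumed cap at 60 so a zero error probability does not blow up the log.
    quality = 60 if p_error <= 1e-6 else round(-10 * math.log10(p_error))
    return BASES[best], quality


def call_read(slice_probs):
    """Call every slice of one cluster and return (sequence, list of qualities)."""
    calls = [call_base(p) for p in slice_probs]
    sequence = "".join(base for base, _ in calls)
    qualities = [q for _, q in calls]
    return sequence, qualities


# Hypothetical probabilities for one cluster over three slices.
cluster = [
    (0.97, 0.01, 0.01, 0.01),  # confidently an A
    (0.05, 0.90, 0.03, 0.02),  # probably a C
    (0.25, 0.25, 0.10, 0.40),  # weakly a T
]
sequence, qualities = call_read(cluster)
print(sequence, qualities)  # ACT [15, 10, 2]
```

The point is that the sequence `ACT` on its own hides the fact that the third call is barely better than a guess, which is why the quality scores travel with the reads.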

And so, those we will then summarize even further into counts of reads.

And so the two types of data that you'll be dealing with in the class

are the raw data and the processed data.

Typically for this class, the raw data will be the sequence read files that were the end of

the pipeline that I just showed you.

The processed data will be read counts or variant calls or

something like that, a processed version of those data.
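
Here is a minimal sketch of what summarizing reads into read counts can look like. It assumes the reads have already been aligned and each one has been assigned to a gene (or to nothing); the read IDs, gene names, and the function `count_reads` are hypothetical.

```python
from collections import Counter

# Hypothetical aligned reads as (read_id, gene) pairs; None means no gene was hit.
aligned_reads = [
    ("read1", "GENE_A"),
    ("read2", "GENE_B"),
    ("read3", "GENE_A"),
    ("read4", None),
    ("read5", "GENE_A"),
]


def count_reads(assignments):
    """Collapse per-read gene assignments into one count per gene."""
    return Counter(gene for _, gene in assignments if gene is not None)


counts = count_reads(aligned_reads)
print(dict(counts))  # {'GENE_A': 3, 'GENE_B': 1}
```

Because `count_reads` takes any iterable, the reads could be streamed from a file one at a time, so only the small table of counts ever needs to sit in memory.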

So the goal here is to compress the data down into the useful parts, and

the real goal is to be interactive.

Ideally when you are doing data analysis, you don't want to be dealing with these

huge files that make computing very hard and expensive.

So there's this great quote by Robert Gentleman that says,

what you want to do is you want to make big data as small as possible,

as quickly as possible, so that you can interact with it.

You can make plots, you can look at the data carefully.

And so, we're going to be talking a lot about

doing that sort of interactive analysis.

This is typically the end of a long processing pipeline.

And so a lot of the Statistics we're going to be learning is about trying to

detect if there were any problems in the upstream processing, or if there are any

sort of extra variables we should be paying attention to. And so I like this

phrase, it's kind of said tongue-in-cheek, but it's: your data aren't that big,

you don't need Hadoop.

And what does that mean?

Typically when you're processing the really big files, you need complicated big

data type software like Hadoop, which allows massively parallel computation.

But if you're actually analyzing data with Statistics, you typically only want to be

analyzing it at the level of summarized data, so that you can do it interactively.

So that's a little bit about what the types of data are and

what types of data you will see out there.