In previous lectures,

we considered a system consisting of

only one locus with two or more alleles.

So we considered Hardy–Weinberg equilibrium, which determines

the distribution of genotypes in a panmictic population.

However, loci are located on

chromosomes, and loci which are very close together

are linked. Therefore, frequently,

the genotypes at two such loci

are statistically dependent.

In one of the previous lectures, Michel talked about the biological basis

of linkage disequilibrium and about haplotype blocks.

These blocks form the mosaic of which our genomes are composed.

Now, we are going to take this topic

and give it a little bit more mathematical treatment.

For that we're going to consider a relatively simple system which

consists of only two loci, presumably linked together.

At each locus there will be two possible alleles,

say capital 'A' and small 'a' for the first

locus, and capital 'B' and small 'b' for the second locus.

Before we continue, I need to introduce

one definition which will become very handy later on.

We say that two alleles coming from two different loci

are in cis when they are located on the same chromosome.

When two alleles are located on different chromosomes,

we say that these alleles are in trans.

So now, let's think of just two locations

in the genome, and start with a simple system where the locus at the first location

is polymorphic, with two alleles,

while the locus at the second location

is monomorphic in the population considered.

So we only see capital 'B' alleles.

However, at a certain time point,

a mutation may occur, giving rise to initially only one copy of

a chromosome which has capital 'A' at the first location and small 'b',

the mutant allele, at the second location.

After some time, if this allele is in a way lucky,

it may propagate in the population because of random factors such as drift.

And then, after a certain time,

we can see in the population a collection of three possible chromosomes:

capital 'A' with capital 'B', and small 'a'

with capital 'B', which are the original chromosomes,

and the haplotype which arose through

mutation, consisting of capital 'A' at one location and small 'b' at the other.

If these loci are relatively far apart,

or a lot of time has passed,

then there may be recombination between the two loci,

and after some time

you may see in the population

all four haplotypes, carrying every combination of

capital and small letters at the two locations.

And if we follow this system for an infinitely long time,

we expect that it will come to

an equilibrium point where the probability of a specific haplotype is given

by the product of the probabilities of

the alleles of which this haplotype is composed.
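
As a small illustration of this equilibrium rule, here is a minimal Python sketch. The function name `equilibrium_haplotype_freqs` and the allele frequencies 0.6 and 0.3 are invented for the example and do not come from the lecture.

```python
def equilibrium_haplotype_freqs(p_A, p_B):
    """Expected haplotype frequencies at equilibrium.

    p_A is the frequency of capital 'A' at the first locus,
    p_B is the frequency of capital 'B' at the second locus.
    Each haplotype frequency is the product of its allele frequencies.
    """
    return {
        "AB": p_A * p_B,
        "Ab": p_A * (1 - p_B),
        "aB": (1 - p_A) * p_B,
        "ab": (1 - p_A) * (1 - p_B),
    }


# Hypothetical allele frequencies, chosen only for illustration.
expected = equilibrium_haplotype_freqs(0.6, 0.3)
for haplotype, freq in expected.items():
    print(haplotype, round(freq, 4))
```

With these made-up frequencies, the expected frequency of the haplotype carrying capital 'A' and capital 'B' is 0.6 times 0.3, that is 0.18, and the four frequencies sum to one.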

However, if these loci are very closely linked,

this process takes a lot of time. Therefore,

if you go to some database, for example HapMap, and pick two loci,

two SNPs, which are relatively close to each other,

most likely you will see a deviation from linkage equilibrium.

This is the phenomenon which we call linkage disequilibrium.

So, how can we quantify linkage disequilibrium?

The idea is very simple.

We can contrast what we would observe under equilibrium

with what we see in the investigated population or system.

Under equilibrium, we expect that the probabilities of alleles

in cis are equal to the probabilities of alleles in trans,

and this gives rise to the metric of

linkage disequilibrium which is known as D. Mathematically,

if you code capital 'A' at the first locus as one

and small 'a' as zero,

and do the same for the second locus, coding capital 'B' as one and small 'b' as zero,

you can make a two-by-two table of

the observed haplotype distribution, and you can see that

the definition of D is equivalent to the mathematical quantity known as covariance.
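
The equivalence between D and a covariance can be checked numerically. The sketch below assumes the standard textbook definition, D = P(AB) - P(A)P(B), which is not spelled out in the transcript, and it reads the cis/trans statement as P(AB)P(ab) = P(Ab)P(aB) at equilibrium. The haplotype frequencies and all function names are hypothetical.

```python
# Hypothetical haplotype frequencies (they must sum to one).
haplotypes = {"AB": 0.5, "Ab": 0.1, "aB": 0.1, "ab": 0.3}

# 0/1 coding from the text: capital allele = 1, small allele = 0.
CODING = {"AB": (1, 1), "Ab": (1, 0), "aB": (0, 1), "ab": (0, 0)}


def d_from_allele_freqs(hap):
    """Assumed textbook definition: D = P(AB) - P(A) * P(B)."""
    p_A = hap["AB"] + hap["Ab"]
    p_B = hap["AB"] + hap["aB"]
    return hap["AB"] - p_A * p_B


def d_cis_minus_trans(hap):
    """Cis product minus trans product: P(AB)P(ab) - P(Ab)P(aB)."""
    return hap["AB"] * hap["ab"] - hap["Ab"] * hap["aB"]


def covariance_of_codes(hap):
    """Covariance E[XY] - E[X]E[Y] of the 0/1 codes at the two loci."""
    mean_x = sum(f * CODING[h][0] for h, f in hap.items())
    mean_y = sum(f * CODING[h][1] for h, f in hap.items())
    mean_xy = sum(f * CODING[h][0] * CODING[h][1] for h, f in hap.items())
    return mean_xy - mean_x * mean_y


d_values = (
    d_from_allele_freqs(haplotypes),
    d_cis_minus_trans(haplotypes),
    covariance_of_codes(haplotypes),
)
print(d_values)
```

For this made-up table, all three calculations give approximately 0.14, up to floating-point rounding.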

However, D is not the most convenient measure of linkage disequilibrium, because

its maximal and minimal possible values are

strictly determined by the frequencies of the specific alleles.

Most of the time,

it is desirable that your measure is scaled between fixed boundaries,

for example, zero and one,

or minus one and one.

Therefore, the most commonly used measures of linkage disequilibrium

are based on D, but with some extra scaling component.

For example, one of the measures of linkage disequilibrium which is

commonly used in statistical genetics is r^2.

This is basically the square of the coefficient of correlation.

Again, I remind you that D is a covariance, so knowing the frequencies of the alleles,

we can scale it into the coefficient of correlation,

which ranges between minus one and one,

and we can square it.

Then the squared correlation coefficient,

or coefficient of determination, as a metric of linkage disequilibrium

ranges between zero and one,

with zero meaning no linkage disequilibrium and

one meaning very strong or perfect linkage disequilibrium.
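
The scaling of D into a squared correlation can be sketched as follows. This assumes the usual formula, r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)), where the denominator is the product of the variances of the two 0/1 coded loci. The function name `r_squared` and both haplotype tables are hypothetical.

```python
def r_squared(hap):
    """Squared correlation between two biallelic loci.

    hap maps the haplotypes "AB", "Ab", "aB", "ab" to their frequencies.
    """
    p_A = hap["AB"] + hap["Ab"]
    p_B = hap["AB"] + hap["aB"]
    d = hap["AB"] - p_A * p_B  # D, the covariance of the 0/1 codes
    # Product of the variances of the two 0/1 coded loci.
    variance_product = p_A * (1 - p_A) * p_B * (1 - p_B)
    if variance_product == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / variance_product


# Hypothetical tables, for illustration only.
partial = {"AB": 0.5, "Ab": 0.1, "aB": 0.1, "ab": 0.3}
perfect = {"AB": 0.6, "Ab": 0.0, "aB": 0.0, "ab": 0.4}

print(round(r_squared(partial), 3))
print(round(r_squared(perfect), 3))
```

The first table gives an r^2 of roughly 0.34. In the second table only two haplotypes exist, so each allele at one locus fully predicts the allele at the other, and r^2 is one.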

Another important metric of linkage disequilibrium which is also

widely used is Lewontin's D' (D prime).

What we do with D in this context is

look up the maximum possible value of the coefficient D given

the specific allele frequencies, and

divide D by this maximum. Then, again,

we get a scaled metric of linkage disequilibrium.
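
A minimal sketch of this scaling is given below. It assumes the usual bounds on D: when D is positive, the maximum is the smaller of p_A (1 - p_B) and (1 - p_A) p_B; when D is negative, it is the smaller of p_A p_B and (1 - p_A)(1 - p_B). The function name `d_prime` and the example frequencies are hypothetical. The example mimics the situation described earlier, where only three of the four haplotypes are present.

```python
def d_prime(hap):
    """Lewontin's D': D divided by its maximum possible magnitude.

    The sign of D is kept, so the result lies between -1 and 1.
    """
    p_A = hap["AB"] + hap["Ab"]
    p_B = hap["AB"] + hap["aB"]
    d = hap["AB"] - p_A * p_B
    if d == 0:
        return 0.0
    if d > 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    if d_max == 0:
        raise ValueError("D' is undefined when a locus is monomorphic")
    return d / d_max


# Hypothetical population in which the 'ab' haplotype has not yet
# been created by recombination.
three_haplotypes = {"AB": 0.5, "Ab": 0.1, "aB": 0.4, "ab": 0.0}
print(round(d_prime(three_haplotypes), 3))
```

Here D is about -0.04, which is also the most negative value possible for these allele frequencies, so D' comes out at approximately -1. Many programs report the absolute value of D' instead of keeping the sign.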

Now we have defined

two major metrics of linkage disequilibrium which are in use.

Let's think about what could be the reasons for

linkage disequilibrium, slightly extending

the topic which was already covered by Michel.

And of course, first of all you expect

strong linkage disequilibrium between loci which are closely linked together.

So, they're physically very, very close.
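
Why close linkage preserves disequilibrium can be sketched with a standard textbook recursion that is not derived in this lecture: under random mating, with no selection, mutation or drift, D shrinks by a factor of (1 - c) per generation, where c is the recombination fraction between the two loci. The function name, the starting value of D, and the values of c below are all hypothetical.

```python
def d_after_generations(d0, c, generations):
    """D after some generations, assuming D_t = (1 - c)**t * D_0.

    d0 is the starting value of D and c is the recombination
    fraction between the two loci (0.5 for unlinked loci).
    """
    return d0 * (1.0 - c) ** generations


d0 = 0.2  # hypothetical starting disequilibrium
for c in (0.5, 0.01, 0.0001):
    remaining = d_after_generations(d0, c, 100)
    print(f"c = {c}: D after 100 generations = {remaining:.6f}")
```

Under this assumed model, unlinked loci lose essentially all disequilibrium within a few generations, while loci with a recombination fraction of 0.0001 keep about 99% of it after 100 generations.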

However, sometimes you see in your data that linkage disequilibrium

is high for loci which are far apart. What could be the possible explanations?

Well, one explanation is trivial in a way, but it is something to always keep in mind:

there may be a mapping error.

So the polymorphisms which you consider to be far apart

may actually be very close together.

Other cases where you will see linkage disequilibrium that is too high for the distance

are due to factors similar to those which cause deviation from Hardy–Weinberg equilibrium.

These are specific types of genetic structure.

For example, if in your sample you are

investigating a mixture of two genetic populations,

you're going to find abundant linkage disequilibrium

even between loci which are very, very far apart.

At the same time, if you look at the specific genetic populations within your mixed sample,

you'll see that linkage disequilibrium follows the regular rules:

it's high for closely linked loci and falls very quickly to very low levels with distance.
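
The effect of mixing populations can be illustrated with a small sketch. It assumes two hypothetical populations that are each in perfect linkage equilibrium for a pair of loci but differ in allele frequencies, pooled in equal shares. All names and numbers are invented for the example.

```python
def equilibrium_haplotypes(p_A, p_B):
    """Haplotype frequencies of a population in linkage equilibrium."""
    return {
        "AB": p_A * p_B,
        "Ab": p_A * (1 - p_B),
        "aB": (1 - p_A) * p_B,
        "ab": (1 - p_A) * (1 - p_B),
    }


def ld_d(hap):
    """D = P(AB) - P(A) * P(B)."""
    p_A = hap["AB"] + hap["Ab"]
    p_B = hap["AB"] + hap["aB"]
    return hap["AB"] - p_A * p_B


def mix(hap1, hap2, share1):
    """Pool two populations; share1 is the proportion taken from hap1."""
    return {h: share1 * hap1[h] + (1 - share1) * hap2[h] for h in hap1}


# Two hypothetical populations with very different allele frequencies.
pop1 = equilibrium_haplotypes(0.9, 0.9)
pop2 = equilibrium_haplotypes(0.1, 0.1)
mixed = mix(pop1, pop2, 0.5)

print("D in population 1:", round(ld_d(pop1), 6))
print("D in population 2:", round(ld_d(pop2), 6))
print("D in the mixture: ", round(ld_d(mixed), 6))
```

Within each population D is zero, up to floating-point rounding, but in the pooled sample it is about 0.16. Nothing in this calculation depends on the distance between the loci, which is why such structure can produce disequilibrium even between loci that are very far apart.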

Other factors which may in theory contribute to

linkage disequilibrium are things like mutation and selection.

But again, you are not very likely to see their effects in your real data.