Assistant Professor, Principle Investigator Center for Bioinformatics, School of Life Science

Liping Wei 魏丽萍, Ph.D.

Professor, Director Center for Bioinformatics, School of Life Sciences

Hello everyone, this is the supplementary video for this week.

I will explain here the models and calculation of SNP calling and genotyping.

First, I will introduce briefly the concept of likelihood and Bayesian approach.

Then I will introduce the genotyping model used by MAQ, one of the early mapping programs.

and the one in SNVMix, which is used to handle cancer genome data.

The likelihood function is a term frequently used in statistics.

It is a function of the parameters of a statistical model.

It represents the probability of observing the current data given the values for parameters.

The likelihood function plays a key role in statistical inference.

Bayesian approach, or Bayesian statistics, is a major branch of statistics.

It is also considered a counterpart of frequentist statistics.

This approach follows the Bayesian rule to treat indeterminate parameters, theta, as random
variables.

Then it transform the problem of estimating parameters based on data

into the problem of estimating the posterior probability distribution of the parameters, given
the observed data.

We can write these procedures by Bayesian theorem as this formula:

the posterior probability distribution equals to the prior probability distribution multiplied by the
likelihood function.

Please note that all the three terms in this formula are functions of Î¸.

This is a simple demonstration of likelihood function.

Assume that we toss again and again a coin whose probability of landing head-side up is not
equal to that of landing tail-side up.

We also assume that these two probabilities are fixed.

Letâs denote by Î¸ the probability of landing head- side up in a trial.Then the probability of landing
tail-side up in a trial will be 1-Î¸ .

Letâs try to calculate the probability of getting a sequence of HTHH(Head-Tail-Head-Head) after
4 trials.

It is easy to work out the result, which is Î¸*(1-Î¸)*Î¸*Î¸=Î¸^3*(1-Î¸).

Students having studied statistics course will immediately tell that the model used here is 4
Bernoulli sampling processes.

The HTHH sequence observed in this problem is the Data.

What we have calculated is the probability of observing such data, given the value of Î¸, the
parameter for this model.

If we did not know the probability of the coin landing head up, and we want to estimate it from
data,

then a straightforward way is to find out the parameter that maximizes the likelihood function
â which is the MLE method.

We could do the MLE by differentiating the likelihood function with respect to the parameter
Î¸.

Set the derivative to 0 and the MLE of Î¸ can be obtained.

In this example, the MLE of Î¸ is 3/4 .

If, however, we had only known that there are 3 head-side ups in the sequence, we might need
to modify the model.

We can switch to binomial distribution model, and find out that the likelihood function is the
number of combinations of 4 elements taken 3 at

a time, multiplied by the Î¸^3*(1-Î¸) before,

making it 4*Î¸^3*(1-Î¸) .

Given the same data, the MLE of Î¸ under this model is still 3/4.

Similar ideas are used in SNP calling and genotyping based on NGS data.

A statistical model is used to depict the NGS data and to estimate the genotype.

The statistical models used by bioinformatics tools in practice are diverse and more complex.

To make it easier to understand, Dr. Gao has used in class a very simplified model, with only
the erroneously sequenced part considered.

This model is overly simplified compared to models used in practice.

There is no âcorrectâ or âincorrectâ statistical models in practice.

It is just that you can âfeelâ whether it is good or not by checking the power of this model in fitting
and depicting data, and by checking how much of

the result is correct.

Sometimes data in practice violate the assumptions of classical simple models.

Thatâs why the models used in practice often appear complex.

They might even take advantage of some technical details of algorithm implementation.

If youâre interested in statistical models and algorithms used for SNP calling and genotyping,
you can check these original papers for the

several representative tools.

Of course there are more tools than I have listed here.

I have listed here only several models that, in my opinion, are easier to understand.

MAQ is one of the early mapping programs.

The original paper and supplementary materials have explained in detail the genotyping model it
employs.

This model might be simpler than other models developed later, and I will introduce it in detail
here.

You can see next how the model used in practice is adapted according to some features of the
data observed.

First, letâs review our problem again.

We get a lot of reads by NGS sequencing.

After mapping to the reference genome, a locus might be covered by multiple reads, and might
correspond to multiple bases (from the reads).

Each base from the reads has a probability of being erroneously sequenced, which is
evaluated by baseQ and denoted by Îµ.

What we want to do is to get the genotype of this locus.

For common diploid species, the genotype of a locus in an individual can have three types.

They are homozygosity with the reference allele/base, heterozygosity, and homozygosity
with the alternative allele/base.

For a diploid individual, we focus only on the two bases, denoted by b and bâ, that are the most
dominantly sequenced.

Assume that they are sequenced k and (n-k) times, respectively.

Now letâs calculate the likelihoods for each genotype.

Letâs tackle with the heterozygosity first.

If the genotype is heterozygosity, then it is expected to have lots of âbâs and âbââs here.

If baseQ is high enough, the variation of k (or (n- k)) depends primarily on the variation of binomial
variation.

We can thus ignore the effects of baseQ, and use a binomial distribution model with the probability
being 0.5 and the total number being n.

The likelihood function is written at the lower-right corner of this slide.

In fact, this can be obtained by setting Î¸=1/2 in the binomial distribution example mentioned
earlier.

If the genotype is homozygosity (with the reference base or the alternative base), then the
base not seen in this homozygosity must be from

sequencing error (if we include the error coming from PCR in early stage).

Therefore, the paper defined the symbol Î±_nk as the probability of n bases having exactly k
erroneously sequenced.

his is just the likelihood function we want.

If the bases from the reads have the same baseQ, and their probabilities of being
erroneously sequenced are independent of each

other,

then we can easily work out Î±_nk using binomial distribution.

We denote it by Î±-bar.

However, the paper mentioned that the reads data in practice do not have identical baseQ
values.

Also, the independence between probabilities of being erroneously sequenced does not seem
hold.

For some locus, if one is sequencing error, others may more probably be sequencing errors.

Therefore, the paper defined another variable Î²_nk as the probability of n bases being
observed to have more than k erroneously

sequenced given more than (k-1) erroneously sequenced.

We can deduce by definition the following 4 formulae.

Letâs start from the case where the probabilities of being erroneously sequenced are the same
and independent of each other.

First we write down the formulae for Î±-bar and Î²- bar here.

As you can see, Î²-bar can be defined by Î±-bar.

Next we assume that the real Î² is the corresponding Î²-bar to the power of f_k.

The f_k is larger than 0 and not larger than 1.

Feeding these variables to the formulae with Î± and Î² in the previous slide, we can get an Î±
taking into account the dependency between

probabilities of being erroneously sequenced.

Here we define c_nk as the slide shows.

As the paper has claimed, c_nk is not sensitive to fluctuations in Îµ-bar when f_k=1 .

Also, when compared to the production term of Îµ- bar on the right, the c_nk contributes less to
Î±_nk.

Therefore, the paper made an approximation here for the case where the probabilities of being
erroneously sequenced are not the same.

This C_nk can be pre-computed given Î²ï¼Îµbar and f.

This formula is an approximation to the situation that each base has different quality.

The Îµ-bar is a weighted harmonic mean of Îµ, the probability of being erroneously sequenced.

The dependency between probabilities of being erroneously sequenced is determined by f_k.

In principle, it can be estimated from data.

But MAQ used a simple formula instead to calculate it, and it is said that this formula works
well.

In practice, c_nk(Îµ-bar) has been calculated in advance and stored in a table for the program to
check.

Also, the paper claimed that in practice, mapped reads with different directions tend to be
independent of each other.

Thus the likelihood calculated in practice is the formula below, which seems relatively more
complex.

The likelihood for the other type of homozygosity can be calculated similarly.

Now we get the likelihoods for all three genotypes.

We can also consider the fact that these three genotypes do not necessarily occur with the
same probability.

For example, there might be more homozygous loci than heterozygous ones.

In this case, we can define prior probabilities for the three genotypes.

Multiplying the prior probability with the corresponding likelihood and normalizing them
will result in the posterior probabilities of the

three genotypes.

The complete formulae are very long.

hatâs all for the main algorithm of calling genotype by MAQ.

Currently, most genotypers use similar methods to get the Bayesian posterior probabilities of
different genotypes.

Finally, let me introduce briefly the model used by SNVMix.

SNVMix is a tool to call SNVs (single nucleotide variations) and genotype for cancer genomes.

he figure on the right displays the model used.

Such graph visualizing the relationship of probabilities are often called the probability graph
model.

First, let me explain the subscripts in this figure.

âiâ stands for some position in the genome.

âjâ stands for the index of a read covering position i.

âkâ stands for the type of genotype selected.

In this model, the value of G_i is k.

Next letâs have a look at the nodes.

Circles stand for random variables.

Rounded rectangles represent fixed constants.

Shaded nodes can have their values observed.

Unshaded nodes cannot have their values observed, and their values need to be estimated.

Specifically, G_i is the genotype at position i,

and it is the ultimate variable we need to estimate.

a_j^i denotes whether the âj baseâ (i.e. the base the j-th read have on the locus in question) on
locus i matches the reference allele.

q_j^i is the probability of the base being correctly called

z_j^i indicates whether the j-th read aligned to its stated position.

r_j^i is the probability of the alignment being correct

Î¼_k is the parameter of the Bernoulli distribution the a_j^i has (see next slide).

How the random variables from SNVMix model relates to each other are described by these formulae.

This model assumes that the genotype of each locus, G_i, comes from a multinomial distribution with the parameter being Ï.

Multinomial distribution can be explained by referring to binomial distribution (or Bernoulli distribution for only one trial).

A random variable has a Bernoulli distribution if it can have a value of either 0 or 1, but not any other values.

A random variable has a multinomial distribution if there are only several values this random variable could be.

For example, each of the three genotypes here can be chosen with a probability.

These probabilities are determined by a vectorial parameter, Ï.

SNVMix model considers Ï not to be fixed, but to be subject to tuning according to the data.

Thatâs because SNVMix is specifically designed for cancer genomes where variations are much higher.

An inconstant parameter is a random variable in Bayesian models.

We need to set a prior probability for it to model and estimate it.

Thus SNVMix uses as the prior of Ï the Dirichlet distribution, which is conjugated to multinomial distribution.

The parameter, Î´, is a fixed vector (1000, 100, 100).

Next comes the distribution for each âtheoretically realâ (i.e. incapable of being observed directly) base a_j^i mapped to this locus.

SNVMix assumes that this distribution conforms to a Bernoulli distribution,

where the probability of choosing reference allele is Î¼_k, and the subscript k is the value of the genotype, G_i.

Simple models often consider Î¼ to be 0 or 1 for homozygosity, and 0.5 for heterozygosity.

This changes to a random variable that can be fitted and tuned by observed data in the SNVMix tackling cancer genome.

Likewise, SNVMix set for Î¼ as the prior distribution a Gamma distribution with parameters Î± and Î².

There are three sets of Î¼, Î± and Î² for different genotypes.

Homozygous genotypes have Î± being 1000 and Î² being 1 (or vice versa).

Heterozygous genotypes have both Î± and Î² being 500.

Aside from generating models for these unobservable âa_j^iâs, SNVMix also models base calling error and mapping error.

As shown above, z_j^i is the random variable

that denotes whether the mapping is correct and cannot be observed directly.

The prior probability is set as a Bernoulli distribution with the parameter being 0.5.

This means that the two prior probabilities of correct mapping and incorrect mapping are the same 0.5 .

Modeling q_j^i, the probability of correctly calling base, needs to be divided into two cases as described below.

If the mapping is not correct, then this probability is 0.5 .

If the mapping is correct, then this probability needs to be determined by âtheoretically realâ bases and calculated from base quality

We also need to consider two cases for modeling r_j^i, the probability of mapping correctly.

If the ârealâ mapping is not correct, then this probability is the probability of mapping incorrectly implied by mapping quality.

Otherwise, if the ârealâ mapping is correct, then this probability is the probability of mapping correctly implied by mapping quality.

SNVMix model is more complex compared to that used by MAQ.

This model has introduced many latent variables than are fitted and estimated from observed data.

The estimation of these latent variables is done using the EM algorithm (Expectation-Maximization Algorithm).

Its main idea is the idea of MLE introduced at the beginning of this supplementary video.

It uses some recursion to optimize locally to find out the locally maximum of likelihood.

EM algorithm is a classical algorithm involving more than simple mathematics.

You are encouraged to find by yourself related learning materials, or to consult teachers and classmates majoring in statistics.

It might be possible to have EM explained in detail in later lectures of this MOOC.