So far, we've examined statistical tests for the overrepresentation of categories in a gene set. This leads us on to looking at measures of the relationships between sets of genes. So, for example, genes that are found in a meta-analysis to be associated with diabetes, and genes that are members of a particular pathway.

Now, something you might see very often in the literature is the use of the Jaccard index. This is essentially a measure of the similarity between two sets: how large the overlap between the sets is with respect to the combination of the two sets. Here, on the right, we can see an illustration of this. Let's say we have two sets, A and B, where A is the big yellow circle and B is the smaller red circle. You can read this exactly like a Venn diagram: genes in the region that is only yellow belong only to set A, genes in the region that is only red belong only to set B, and the intersection of the two circles contains the genes that belong to both sets. So, in these terms, the Jaccard index is given by the size of the intersection, which is the number of genes that are present in both sets, divided by the size of the union, which is the total number of unique genes that we have in the two sets. The Jaccard index is a number that varies between zero and one. The value is 1 when the sets are identical, and 0 when there is no overlap between the two sets.

The analysis of the enrichment of gene sets is something that has, perhaps, developed from overrepresentation analysis, and it is a somewhat more sophisticated approach. It has a similar aim, in that it attempts to find relevant gene sets, but it does so without drawing hard thresholds. So we don't, for example, just take the most differentially expressed genes and use those to determine the gene sets.
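To make the Jaccard index described above concrete, here is a minimal Python sketch. The function name `jaccard_index` and the gene identifiers (`G1` to `G7`) are placeholders made up for illustration; they are not taken from the lecture. Returning 0.0 when both sets are empty is a convention assumed here, since the ratio is undefined in that case.

```python
def jaccard_index(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is undefined.
        # Returning 0.0 is an assumption made for this sketch.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical gene sets (placeholder identifiers, not real data).
disease_genes = {"G1", "G2", "G3", "G4"}
pathway_genes = {"G3", "G4", "G5", "G6", "G7"}

# 2 shared genes out of 7 unique genes in total.
print(jaccard_index(disease_genes, pathway_genes))
```

Two identical sets give 1, and two sets with no genes in common give 0, which matches the range described in the lecture.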
In gene set enrichment analysis, we actually use the whole gene expression profile in a weighted fashion. So, here, I can give a brief overview of the aims and the advantages of this approach.

A microarray profile measures the expression levels of a large number of genes, something on the scale of the whole genome. A typical biological question is to try to understand the difference in expression between two sets of samples. These might be diseased and non-diseased samples, or control samples and samples that have undergone some kind of perturbation. Traditionally, in the first instance, we might look for individual genes which are differentially expressed between the two classes, and then use that as the basis of further analysis, perhaps overrepresentation analysis.

There are two potential problems with this. The first is that biological data, and gene expression data in particular, can be noisy. There are lots of sources of variation, and it may be the case that no single gene stands out above the noise as being significantly differentially expressed. Alternatively, many, many genes might be identified as differentially expressed, but it is then difficult to interpret the list of genes and infer a unified biological theme for them. So, there is a problem of biological interpretation.

Methods for the analysis of the enrichment of gene sets, such as GSEA, instead of looking for individual genes that are differentially expressed, look for whole collections of genes that are collectively differentially expressed, and this has two main advantages. The first is that there is more statistical power here. A set of genes is more likely to stand out above the noise than individual genes. We expect a larger signal-to-noise ratio, in other words. Secondly, if the gene sets are chosen carefully, such that the genes in each set all belong to a particular biological theme, then the biological interpretation is built into the analysis, or it's almost automatic.
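The signal-to-noise argument can be illustrated with a toy calculation in Python. This is only a sketch under strong simplifying assumptions, and it is not the statistic that GSEA itself computes: it assumes every gene in the set shifts by the same amount between the two classes and that the noise on each gene is independent with the same standard deviation, which real expression data generally do not satisfy. The function names and the numbers are hypothetical.

```python
import math


def snr_single_gene(shift, noise_sd):
    """Signal-to-noise ratio for one gene: expression shift over noise SD."""
    return shift / noise_sd


def snr_gene_set_mean(shift, noise_sd, n_genes):
    """Signal-to-noise ratio for the mean shift across a gene set.

    Assumes independent noise, so the standard deviation of the mean
    of n_genes values is noise_sd / sqrt(n_genes).
    """
    return shift / (noise_sd / math.sqrt(n_genes))


# Hypothetical values: a small shift that is buried in the noise
# when each gene is considered on its own.
shift, noise_sd = 0.5, 1.0
for n_genes in (1, 4, 25, 100):
    print(n_genes, snr_gene_set_mean(shift, noise_sd, n_genes))
```

Under these assumptions the ratio grows with the square root of the number of genes in the set, which is one way to see why a coordinated but modest change across a set can stand out when no individual gene does. Correlated noise between genes would reduce this gain.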