Right, in this week's lab we're exploring cis-elements, which are really important for regulating the expression level of a given gene. They determine where and when a given gene is active: in particular tissues or cell types, or in response to a perturbation or environmental stress. What we're looking at in question a is the output of a word count program, which looks at the frequency of occurrence of particular k-mers. We take the promoters of a set of genes, in this case the APETALA3 (AP3)-coexpressed genes, and we ask whether particular k-mers, 6-mers in this case, are over-represented in those promoters relative to random sets of promoters.
Question a asks what the colour of the background distribution is, and that colour is blue, here. So this is just randomly picked promoter sets. The distribution for the input cluster, i.e. the promoters of the genes that are coexpressed with AP3, is shown in red, here. In this case we're looking at several 6-mers. Each row consists of two distributions, the blue distribution and the red distribution, and we see the distance between those two distributions gradually decreasing as the significance decreases. These first two are quite significant, and in fact it appears that the first two could overlap. The first k-mer is TAATTA and the second one is ATTAAT, and the first four nucleotides of the second 6-mer, ATTA, match the last four nucleotides of the first one.
All right, question c asks: are the two top 6-mers similar to any known motif? If we look, here's the second one actually, we see that it's similar to the AGAMOUS binding sequence from Arabidopsis thaliana, as is the first one, here. So again, AGAMOUS pops up as a consensus binding site.
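To make the word count idea a bit more concrete, here is a minimal sketch in Python of counting 6-mers in an input promoter set and comparing one of them against randomly picked promoter sets of the same size. This is not the lab's actual program: the function names, the number of random sets and the simple empirical p-value are all assumptions made for illustration.

```python
import random
from collections import Counter


def count_kmers(sequences, k=6):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def background_counts(all_promoters, kmer, set_size, n_samples=1000, seed=0):
    """Count `kmer` in many randomly picked promoter sets (the 'blue' distribution).

    The sampling scheme is a hypothetical stand-in for however the real
    tool builds its background.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        sample = rng.sample(all_promoters, set_size)
        results.append(count_kmers(sample, len(kmer))[kmer])
    return results


def empirical_p(observed, background):
    """Fraction of random sets whose count is at least the observed count."""
    hits = sum(1 for b in background if b >= observed)
    return (hits + 1) / (len(background) + 1)
```

The observed count for the input cluster (the red distribution in the lab output) would come from `count_kmers(cluster_promoters)[kmer]`, and a small `empirical_p` would suggest that the k-mer is over-represented relative to the background.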
It appears that these elements could be binding sites for the AGAMOUS transcription factor.
The next part of the lab uses TF2Network for questions d and e. Here we're asking if there are any enriched transcription factor binding sites in the promoters of the AP3-coexpressed genes. Question d asks: which gene products show the most PWM Mapping or Protein-DNA edges with this gene set file? There are four with 18 hits each, and one of these is ATMYC2, here, which from its annotation seems to be involved in the response to wounding, desiccation and ABA. It's not clear how this relates to flower development... but it is also worth noting that AGAMOUS binding sites are not enriched in these promoters! Question e asks: what do you notice when you slide the smaller Protein-protein slider all the way to the right? Well, we see that three of the proteins (ATMYC2, ATMYB24 and AGL4) all interact with themselves.
Question f asks: do you think that the two motifs from the Promomer analysis, those two top 6-mers, are part of a larger motif? What you can do is take the consensus sites for those two k-mers, motif1 and motif2, and then use a tool called Cistome to view where those motifs occur in the promoters of our AP3-coexpressed genes. So we paste in our motifs here for those two 6-mers, we paste in the AGI IDs, just the IDs, of the AP3-coexpressed genes, and then we click Map, and this is the kind of output that we get. The green and the red in this case denote the two separate motifs, and we don't see a lot of overlap. In fact we can ask what the overlap is by clicking on the Merge button down in the bottom left, and we see only a few cases where those motifs actually do overlap. So they don't seem to be part of a larger motif.
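The reasoning in question f can be sketched in a few lines of Python: first check how far the end of one 6-mer matches the start of the other, then map both words onto a promoter and see whether their sites actually overlap. This is only an illustration of the idea and not how Cistome works internally; the function names are hypothetical, and only the forward strand is searched, which is a simplifying assumption.

```python
def suffix_prefix_overlap(a, b):
    """Length of the longest proper suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def find_sites(sequence, motif):
    """Start positions (0-based) of every occurrence of `motif`, overlapping ones included."""
    sequence, motif = sequence.upper(), motif.upper()
    starts = []
    i = sequence.find(motif)
    while i != -1:
        starts.append(i)
        i = sequence.find(motif, i + 1)
    return starts


def overlapping_sites(sequence, motif1, motif2):
    """Pairs of (motif1 start, motif2 start) whose sites share at least one base."""
    pairs = []
    for s1 in find_sites(sequence, motif1):
        for s2 in find_sites(sequence, motif2):
            if s1 < s2 + len(motif2) and s2 < s1 + len(motif1):
                pairs.append((s1, s2))
    return pairs
```

For the two top 6-mers, `suffix_prefix_overlap("TAATTA", "ATTAAT")` returns 4, so the words could in principle be two halves of one longer site. Whether they really are depends on how often `overlapping_sites` finds them together in the promoters, which is what the Merge view is showing.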
Questions g and h deal with a really nice open-access database called JASPAR, developed by Wyeth Wasserman's group at the University of British Columbia, and others. Because we saw that there could be an AGAMOUS binding site in the promoters of those genes, we want to explore the AGAMOUS binding sites in more detail. JASPAR is nicer for this because instead of containing just a consensus binding site for a given transcription factor, it actually contains a position-specific scoring matrix (PSSM), so we know the preference at each position for binding. We can use that information to understand our coexpressed genes in a bit more detail.
By specifying the species identifier 3702 in this search interface here, you can pull up all the records for Arabidopsis transcription factors that are in JASPAR. When we do that, we see the output here. Here is AGAMOUS, the transcription factor whose binding site we think is present in the promoters of the AP3-coexpressed genes. The identifiers you can just get from this column here, the first column. And how was the specificity for the AGAMOUS entry determined? We can click on that entry and here, under the type, we see that the method used to determine the binding specificity of the AGAMOUS transcription factor was SELEX, and we talked about that in the lecture.
All right, so now we revisit the question of whether or not AGAMOUS or AGAMOUS-Like 15 binding sites are statistically enriched in the promoters of our AP3-coexpressed genes, by taking that position-specific scoring matrix from the JASPAR database and using that to search, instead of these consensus motifs, which contain less information... 
so the PSSM is a better way to search, and when we search with those, we actually don't see any enrichment of sites that match the PSSM in the promoters of the AP3-coexpressed genes, so we would probably say that AGAMOUS doesn't bind there, all right.
So next, we're using a tool called MEME to discover motifs in a more probabilistic way. Promomer is a word count method, whereas MEME is a more probabilistic method for discovering motifs. And here, the output that we get describing these motifs is a PSSM, not just a 6-mer word or something like that. MEME stands for Multiple EM for Motif Elicitation, and what we're doing here is taking our set of AP3-coexpressed promoters, the actual sequences we've provided, uploading those to the main interface, and then clicking Submit after adjusting a few parameters, as per the lab report. Here's the output of the MEME algorithm, and we see that there's a good motif in the promoters of the AP3-coexpressed genes. I would call it a good motif because it's got a pretty low E-value. The second one is perhaps not as good, but it's still below a 0.05 threshold.
Now, does that motif match anything in the databases of known binding specificities for known transcription factors? If we use the CisBP database and take the PSSM for this motif here, we can query against all of the binding specificities that have been entered into that database and see if there are any matches. The top matches are actually to DOF5.8 and REM19, which is potentially involved in regulating flowering. This is potentially of note, as AP3 is involved in flowering actually... is involved in regulating floral development at least. And if we look in the JASPAR database, we see that DOF5.8, again, and CDF5 are two of the best matches. Perhaps not as convincing a match... 
I would say that REM19 looks a little bit more like the motif that we discovered with MEME. In the case of the CDF5 match, it's also known that CDF5 is involved in regulating flowering time in Arabidopsis. So we could do some hand waving and say that our discovered motif could potentially be the target of DOF5.8, CDF5 or REM19 in regulating the AP3-coexpressed genes.
All right, now we turn to some human data, and here we're looking at the promoters of insulin-coexpressed genes. Obviously, insulin is important for regulating our metabolism: when we take in glucose, we get some insulin production going on, and this helps to regulate glucose levels in the blood. When we feed in the promoters of insulin-coexpressed genes from human, we get some pretty good-looking motifs back again. At least this first one seems quite believable, with a very low E-value of 1 × 10^-76. Depending on when you run the algorithm (there's an element of stochasticity), that motif may look like four or five Cs followed by TGT and then another four or five Cs. The peculiar thing about this first motif is that it occurs many, many times in the promoter of the insulin gene. The motif is represented by these red boxes in this larger overview of where the motifs map. And interestingly, if we look at which motifs are known to be present in the insulin gene, from this particular publication, we don't really see any sort of repetitive stretch. So, here's our thousand-base-pair-long promoter, and remember from the lab we said that it started 300 base pairs downstream of the transcription start site and continued 700 base pairs upstream.
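As a rough way of seeing how often that C-rich element recurs, and where it falls relative to the transcription start site, here is a small Python sketch. It uses a regular expression as a crude stand-in for the MEME motif, which is really a PSSM, so it will not reproduce MEME's hits exactly. It also assumes that index 0 of the 1000 bp sequence is 700 bp upstream of the transcription start site, so that index 700 is reported as position 0; that coordinate convention and the names used are assumptions for illustration.

```python
import re

UPSTREAM = 700  # assumed number of bases before the transcription start site

# Four or five Cs, then TGT, then four or five Cs (consensus described in the lab)
MOTIF = re.compile(r"C{4,5}TGTC{4,5}")


def motif_positions(promoter):
    """Return (position relative to the TSS, matched sequence) for each non-overlapping hit.

    Negative positions are upstream of the assumed transcription start site.
    """
    hits = []
    for match in MOTIF.finditer(promoter.upper()):
        hits.append((match.start() - UPSTREAM, match.group()))
    return hits
```

Calling `motif_positions` on the insulin promoter sequence from the lab would give a list of hits that can be compared against the red boxes in the MEME overview.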
Here's the corresponding map of known motifs from this publication by Kay and Doherty, and in fact we don't see, as I mentioned, this repetitive element anywhere. So we may have actually discovered a new element in the promoter of the insulin gene, and that would be cool. Now, does that element actually match anything known in the databases? When we search using a tool called TomTom, we don't really see any particularly great matches. This is the best match in the JASPAR database, but these other Cs at the end aren't present. In the case of the second-best match, sure, there are a lot of Cs, but we're missing the G and the neighboring Ts on either side of that G. So it could be a bona fide novel cis-element in the promoter of insulin.
All right, by the end of Lab 6, which comprises the labs, the boxes and the lectures, you should know the main technologies for identifying transcription factor binding sites in vitro. You should be able to name some methods for identifying potential transcription factor binding sites in silico, such as word counting and the more probabilistic Gibbs sampling methods. You should also understand how the word count and Gibbs sampling methods work, and the differences between those two methods. You should understand why we would want to generate results that reference a background set of all promoters: it gives you a feeling for how much over-representation there is relative to that background set. You should also know how to use online tools such as TF2Network or JASPAR to identify known cis-elements in promoters of coexpressed genes. And you should also be able to use the online version of MEME to generate elements from a set of coexpressed genes' promoters.
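One idea that runs through the JASPAR and MEME parts of the lab is that a position-specific scoring matrix carries more information than a consensus word. A minimal Python sketch of that idea follows: it turns a matrix of base counts into log-odds scores and slides it along a sequence. The pseudocount, the uniform 0.25 background and the function names are assumptions for illustration, and the counts you pass in would be your own (for example, a matrix copied from a JASPAR record); no real JASPAR matrix is reproduced here.

```python
import math

BASES = "ACGT"


def counts_to_pssm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts (list of dicts) into log2-odds scores."""
    pssm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pssm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pssm


def score_window(pssm, window):
    """Sum the per-position scores; bases other than A, C, G, T contribute 0."""
    return sum(column.get(base, 0.0) for column, base in zip(pssm, window))


def scan(pssm, sequence):
    """Score every window of the motif's width along the forward strand."""
    sequence = sequence.upper()
    width = len(pssm)
    return [
        (i, score_window(pssm, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]
```

Unlike an exact 6-mer match, every window gets a graded score, so a site that differs from the consensus at a weakly constrained position can still score well. Deciding what score counts as a hit, and whether hits are enriched relative to a background set of promoters, is a separate step that this sketch leaves out.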