In this session we will look at the package AnnotationHub. Let's load the package. AnnotationHub is an interface to a lot of different online resources. The idea is that you create a hub, which is a local database describing a lot of different online data that's out there. You take this local database, you query it, you figure out which data you want, and then you go online and retrieve them. This is a scalable way to access tons of different data, and as time goes on we hope to see more and more data resources become part of AnnotationHub. It's also a relatively new Bioconductor package, so a lot of people are not fully aware of it. I will say that in my own research, using AnnotationHub has really transformed my capacity to examine many data sets from many online resources simultaneously.
The way you use it is you ask to create a local AnnotationHub, like this. The first time you use it, it'll go online and download the database, which is usually very quick. If it has already downloaded the database, it'll just use it from a cache. So let's see what this looks like. It interfaces to 35,000 different data sets. That's a lot. There's a snapshot date, and that's a key thing: the AnnotationHub provides a snapshot of what the different data resources looked like on that specific date. Then, for each data set, there's information such as who provided the data set, what species it is on, what kind of data it is, and so on and so forth.
Let's look at one of the specific data sets stored inside the AnnotationHub. We can do that using standard subsetting, knowing that an AnnotationHub essentially works a little bit like a vector. So let's look at the first element. Here we can see this is a FASTA DNA sequence for a species. It is provided by Ensembl, and it's in the form of a FASTA record. The way you retrieve objects from an AnnotationHub is that you use double brackets. So if I wrote ah[[1]], it would go online and retrieve the object.
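The steps just described can be sketched roughly as follows. This assumes the AnnotationHub package is already installed from Bioconductor and that the hub is stored in a variable called `ah` (the variable name is an assumption). The record count and snapshot date you see will depend on when you run it.

```r
library(AnnotationHub)

# Create the local hub. The first call downloads the metadata database;
# later calls reuse the local cache.
ah <- AnnotationHub()

# Printing the hub shows the number of records, the snapshot date,
# and a summary of the metadata fields (provider, species, data class, ...).
ah
length(ah)
snapshotDate(ah)

# Single brackets subset the hub, much like a vector, without
# retrieving the underlying data.
ah[1]

# Double brackets would go online and retrieve the actual object.
# Left commented out here because the download can be large.
# obj <- ah[[1]]
```

Nothing is retrieved until the double-bracket line is run, so the hub itself is cheap to explore.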
Let's look a little bit at what kinds of data providers and species we have here. For the providers we have, for example, a relatively small set of well-known players in the online database world. We see Ensembl, UCSC, NCBI, GEO, dbSNP, the BroadInstitute, and as time goes on we expect to see more and more resources being added to the data providers. We can also look at stuff like species. There are a lot, probably because a lot of bacteria are represented with relatively little data. It's fair to say that the main one here is human, along with the various model organisms like Drosophila and mouse.
So how do you actually search this database? Because right now we just have 35,000 records, and how do we know what's in there and [INAUDIBLE]? The way you use the database is that you search or you select; progressively you narrow down the database to a few elements, and then you download what you want. You can do this in two ways. One way is to use the function called subset, which selects a subset of the data that has certain properties. One thing that's very natural to do for a lot of us who work on human is to say we only want data sets where the species is Homo sapiens. Okay, something happens, and we can see that we have indeed gotten rid of 11,000 records. We only have 24,000 records left. That's still a lot of data, and it's a little unwieldy to work with.
To search it you can use the function called query, which searches for a search term in all the different components of the database. Let's say I'm interested in histone modification data. I'm going to query my AnnotationHub and look for a very specific histone modification. Let's see how much we have there. Okay, that helped a little bit, but we still have 2,000 records on this specific histone modification. Let's say I'm very interested in a specific cell type, and I have some idea of what the cell type is called.
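A minimal sketch of the narrowing-down steps so far, under a few assumptions: the hub is in a variable called `ah`, and the histone mark searched for is `"H3K4me3"`. The talk does not name the mark at this point, so that string is only an illustrative placeholder. The counts you get will differ from the ones quoted, since the hub changes over time.

```r
library(AnnotationHub)
ah <- AnnotationHub()

# Which data providers and species are represented?
head(unique(ah$dataprovider))
head(unique(ah$species))

# Keep only the records annotated as human.
ah_human <- subset(ah, species == "Homo sapiens")
length(ah_human)

# query() searches the metadata fields for a term.
# "H3K4me3" is an assumed example, not taken from the talk.
ah_mark <- query(ah_human, "H3K4me3")
length(ah_mark)
```

Each step returns a smaller hub object, so subset and query can be chained in whatever order is convenient.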
So now I'm going to search for the histone modification and the cell type. The way you do that is to put the separate search terms in a vector. Let's take an ENCODE Tier 1 cell line; I happen to know that it's Gm12878, because I've studied that cell line quite a bit. Let's see what comes out of that. So now we have narrowed the database down to 11 records of this particular histone modification on this particular cell line. If you look a little closer at the printout, you'll see that there are data sets coming from the Broad, some data coming from the University of Washington, and something called E116, which turns out to be from the Roadmap Epigenomics project. But each of these also has multiple files: something called .broadPeak, something called .narrowPeak. What are they? Well, AnnotationHub doesn't really provide you with information about this. It interfaces to online sources, but you could, for example, take this particular data set title, search for it in Google, and hope to find some documentation. Sometimes there is very little to be had, unfortunately.
Another way of working with this is to use a more spreadsheet-like interface, which you can do with a function called display. Note that I assign the return value of display to a new variable. That's because I'm going to select some things and then send them from the spreadsheet back into R. So let's run this. The way it pops up in RStudio is as this little window. Not that we can see much, so I'm going to maximize it. I'm going to make it bigger here, and now we have something that looks a little more useful. We have the IDs, the data providers, the different species and so on, and we can search inside here; let's stick to the same search term as we had before. And now I'm selecting a couple of rows.
And I can return them by clicking return rows to R session. So now we enter R once more. I get back to my console, and ah2 now contains the five rows that I've just selected.
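The two-term search and the spreadsheet-style selection described above might look something like this. As before, `"H3K4me3"` is an assumed placeholder for the histone mark, and `ah`, `ah_hits`, and `ah2` are assumed variable names. Depending on your AnnotationHub version, `display` may be deprecated or unavailable, so treat that part as a sketch and only run it interactively.

```r
library(AnnotationHub)
ah <- AnnotationHub()
ah <- subset(ah, species == "Homo sapiens")

# Several terms in a character vector: by default a record must
# match all of them. "H3K4me3" is an assumed example mark.
ah_hits <- query(ah, c("H3K4me3", "Gm12878"))
ah_hits

# Titles often hint at the file type, e.g. .broadPeak vs .narrowPeak.
head(ah_hits$title)

# display() opens a browser-based table, so it only makes sense in an
# interactive session. The rows selected there are returned to R.
if (interactive()) {
  ah2 <- display(ah_hits)
  ah2
}
```

Assigning the result of `display` matters: without the assignment, the rows selected in the table are not kept once control returns to the console.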