0:39
First, gene expression signatures are highly informative of global cellular state.
The cellular state could be induced by disease, physiology, or drugs.
Second, the Connectivity Map project is potentially transformative.
I think most people should have heard about the Connectivity Map or the
CMAP project published in 2006 by the Broad Institute.
They applied 164 small molecules
to a few cell lines and then measured the gene expression profiles using Affymetrix array technology.
They were able to connect these gene expression signatures of chemically
1:22
perturbed cells to disease and pathophysiology, and that
provides a systematic approach to finding connections between drugs and disease.
Obviously, the more gene expression signatures we have, the more resources
we can utilize to find a meaningful connection for our own research.
But common gene expression profiling methods,
like the Affymetrix array technology used in the original CMAP project,
are expensive and do not scale well to high throughput.
So the LINCS L1000 project was launched as part of the LINCS project to generate a large
number of gene expression signatures using the L1000 technology.
The LINCS project produces the largest set of perturbations to cell lines, but
it still covers only a small fraction of all possibilities.
2:53
The reason that the L1000 technology is cost effective is that
it only measures 978 genes instead of the tens of thousands of genes measured by common methods.
Thus, the L1000 technology is suitable for gene expression measurements in high
throughput. So
why are 978 genes enough to reasonably represent the whole transcriptome?
The short answer is that gene expression levels are correlated.
If you know the expression of one gene, you can often infer the expression of another.
To demonstrate this idea, the CMAP team compared the connections between all
the available gene expression data on GEO, and they found that only 978 carefully
picked genes are necessary to recover 80% of the connections.
The criteria for the landmark genes are that they are minimally redundant,
widely expressed in different cellular contexts, and
contain inferential value.
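The idea that a measured landmark gene can stand in for a correlated, unmeasured gene can be sketched in a few lines of Python. This is only a toy illustration: the numbers are made up, and the single-predictor least-squares fit is an assumption chosen for simplicity, not the CMAP team's actual inference model.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical training profiles: one measured landmark gene and one
# unmeasured gene whose expression tracks it closely.
landmark = [2.0, 4.0, 6.0, 8.0, 10.0]
target = [5.1, 8.9, 13.2, 16.8, 21.0]

slope, intercept = fit_line(landmark, target)

def infer_target(landmark_value):
    """Predict the unmeasured gene from a new landmark measurement."""
    return slope * landmark_value + intercept

predicted = infer_target(7.0)
```

With many landmark genes the same idea extends to a multi-predictor model, which is why a minimally redundant, widely expressed set of landmarks is valuable.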
3:58
This slide is an overview of the protocol by which the landmark genes are
measured. First, mRNA is reverse transcribed into cDNA.
Then landmark-gene-specific upstream and downstream probes are annealed
to the cDNA and ligated.
The upstream probe has a unique barcode sequence.
4:45
The previous slide showed the protocol by which a sample is measured in each well.
Hundreds of experiments are measured simultaneously on a 384-well plate.
This slide shows the common experimental setup.
About 360 experiments are normally carried out together in a batch.
Each experiment has two to four replicates.
The figure shows a batch of experiments with three replicates.
The red circles are the control replicates and the blue are experimental replicates.
The replicates of the same experiment
are placed in the same well on separate plates.
So the number of plates equals the number of replicates in the batch.
There are normally 18 control
replicates per plate. The field name for
batch is brew_prefix in the LINCS L1000 metadata.
So this slide shows the data levels of the LINCS L1000 data.
6:29
Level 3 is quantile-normalized data:
gene expression profiles of both directly measured landmark transcripts
and imputed genes, normalized using invariant set scaling followed by
quantile normalization, first within each plate and then across replicate plates,
which means plates in the same batch.
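As a rough picture of what quantile normalization does, here is a minimal Python sketch. It assumes each sample is a plain list of values for the same genes, breaks ties by position, and leaves out the invariant set scaling step entirely; it is not the LINCS pipeline itself.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length profiles (one list per sample)."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        # Rank each value within its own sample (ties broken by position).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Two hypothetical samples measured on the same three genes.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(samples)
```

After normalization every sample shares the same set of values, so differences between samples come only from which gene holds which rank.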
Level 4 is the z-score data: profiles of
differentially expressed genes computed as robust z-scores for
each profile relative to the population control.
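A robust z-score replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The sketch below shows the calculation for one gene across the wells of one plate. The 1.4826 factor is the usual constant that makes the MAD comparable to a standard deviation for normally distributed data; the min_mad floor and the choice of population are assumptions for illustration, and the exact LINCS implementation may differ.

```python
from statistics import median

def robust_zscores(values, min_mad=0.1):
    """Robust z-scores of one gene across a population of profiles."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    # Floor the MAD (assumed value) so near-constant genes do not explode.
    scale = 1.4826 * max(mad, min_mad)
    return [(v - med) / scale for v in values]

# Hypothetical expression values of one gene in five wells of a plate;
# the last well responds strongly to its perturbation.
gene_values = [1.0, 2.0, 3.0, 4.0, 100.0]
z = robust_zscores(gene_values)
```

Because the median and MAD barely move when one well is extreme, the outlier gets a large score without distorting the scores of the other wells.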
7:30
I think it is useful to have an in-depth look
at the LINCS L1000 ID system.
The IDs reflect the experimental setup and link the data to the metadata.
This slide shows the distil_id, which is the ID for
a replicate-level gene expression profile.
Since both level 3 and level 4 data are at the replicate level,
it is this ID that is used as the index for the level 3 and level 4 data.
The distil_id consists of the brew_prefix, a plate index, and a well index.
The brew_prefix corresponds to the batch, as previously mentioned.
Each batch name is in turn made up of three parts:
the perturbagen group, the cell line, and the time point, which indicates that
experiments in the same batch use the same cell line at the same time point.
The perturbagen group is just a broader grouping concept,
consisting of several related batches.
The plate index is the index of a plate within a batch;
together with the brew_prefix, it makes up the unique identifier for a plate.
8:58
This slide shows the sig_id, which is the ID for level 5 data.
sig_id also consists of three parts:
brew_prefix, pert_id and pert_dose.
The pert_id is the compound index of the Broad Institute; it starts with BRD,
the abbreviation for Broad.
In the LINCS data each experiment is determined
by four pieces of information: perturbation, cell line, dose, and time point.
You can see that the sig_id reflects these four essentials.
Together with the perturbagen group, they make up the unique identifier of a level 5
gene expression signature. Sometimes the pert_id is the pert_mfc_id,
which is the ID for the drug-class/batch information.
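To make the structure of the sig_id concrete, here is a small Python sketch that splits one into its parts. The delimiters (colons between the three parts, underscores inside the brew_prefix) and the example ID are assumptions for illustration; check them against real IDs in the metadata before relying on this.

```python
from typing import NamedTuple

class SigId(NamedTuple):
    pert_group: str
    cell_line: str
    time_point: str
    pert_id: str
    pert_dose: str

def parse_sig_id(sig_id: str) -> SigId:
    """Split a sig_id into brew_prefix parts, pert_id, and pert_dose."""
    # Assumed layout: <group>_<cell>_<time>:<pert_id>:<pert_dose>
    brew_prefix, pert_id, pert_dose = sig_id.split(":")
    pert_group, cell_line, time_point = brew_prefix.split("_")
    return SigId(pert_group, cell_line, time_point, pert_id, pert_dose)

# Hypothetical ID, made up for this example.
example = parse_sig_id("LJP005_MCF7_24H:BRD-K00000000:10")
```

The four essentials of an experiment (perturbation, cell line, dose, time point) can then be read directly from the returned fields.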
10:08
If you are an experienced gene expression analyst,
you have probably wondered why there are level 4 data.
Most differential expression analysis methods start with normalized data and
yield differential expression signatures.
10:25
That is, they jump from level 3 to level 5 with no intermediate level 4 step.
The reason for having level 4 data here is that there is a strong
plate batch effect in level 3 data, as shown in the figure.
The figure, a PCA plot, is an example of experimental replicates and control replicates
in level 3 data with the same conditions across plates in the same batch.
The broad pink and the blue dots are control replicates from four plates.
And the yellow dots are experimental replicates
11:03
from four plates.
Normally you would expect control replicates to group together and
experimental replicates to group together.
But here the replicates from the same plate group together,
which makes no biological sense and is surely an artifact.
So level 4 L1000 data are calculated to correct this plate batch effect.
11:30
The level 5 characteristic direction data
are directly computed from level 3 normalized data.
How does this jump overcome the batch effect?
If you carefully observe the figure, you can see that the control replicates
point to the experimental replicate in the same direction in each plate,
which reflects the biology of how the experimental
replicate systematically deviates from the control.
The characteristic direction approach computes the four directions, one per plate, and
averages them to get the level 5 characteristic direction signatures.
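The per-plate reasoning can be pictured with the Python sketch below. It is a deliberate simplification: each plate's direction is just the normalized difference between the experimental replicate and the mean of that plate's controls, which is an assumption standing in for the actual characteristic direction computation. The two-gene profiles are made up.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def plate_direction(controls, experimental):
    """Direction from the mean control profile to the experimental profile."""
    n_genes = len(experimental)
    control_mean = [sum(p[g] for p in controls) / len(controls) for g in range(n_genes)]
    return unit([e - c for e, c in zip(experimental, control_mean)])

def average_direction(directions):
    """Average per-plate directions and renormalize."""
    n_genes = len(directions[0])
    return unit([sum(d[g] for d in directions) / len(directions) for g in range(n_genes)])

# Two hypothetical plates with very different baselines (the plate effect)
# but the same direction of change from control to experiment.
plates = [
    {"controls": [[1.0, 1.0], [1.0, 3.0]], "experimental": [4.0, 6.0]},
    {"controls": [[10.0, 11.0], [10.0, 13.0]], "experimental": [16.0, 20.0]},
]
directions = [plate_direction(p["controls"], p["experimental"]) for p in plates]
signature = average_direction(directions)
```

Because each direction is computed within a plate, the plate-specific baseline cancels out before the directions are averaged.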
12:15
So now let's see what perturbations and cell lines we have
in LINCS L1000 data.
There are more than 20,000 small-molecule compounds,
among which there are 1,300 FDA-approved drugs,
about 5,585 bioactive tool compounds, and
more than 2,000 screening hits.
And there are also 22,000 genetic constructs for knocking down genes or
over-expressing genes.
12:48
They cover 900 targets or pathways of FDA-approved drugs,
600 candidate disease genes, and more than 500 community nominations,
which are genes of interest to the biology community in general.
The data set covers more than 18 cell types, including primary cells,
cancer cell lines, stem-cell lines, and
differentiated cell lines from different tissue types.
So here we arrive at the question that I think is
probably most important to the audience.
13:25
Where to find the LINCS L1000 data?
There are three locations to find and download the data:
They are the LINCS cloud, GEO, and the Ma'ayan Lab website.
The LINCS cloud hosts about 95% of the data, from level 1 to
level 4, with chemical perturbations, gene knockdowns, and
gene over-expression perturbations.
GEO holds about 5% of
the data, in the LJP005 to LJP009
perturbation group datasets.
It consists of only chemical perturbations, and
you can find it on GEO under the accession GSE70138.
14:40
You have to first register for an account to access the LINCS cloud website.
On entering the website, there are four icons in the upper left corner.
The first icon provides access to several apps
that help users interact with the data.
They are easy to use and will not be covered in this presentation.
The second, the API icon, offers functionality to download and
analyze the data and the metadata.
Under it, the first icon, API, is the HTTP service to search and
query metadata that is actively updated.
The following Data icon enables users to download level 3 and
level 4 data and associated metadata.
The Code icon provides code in various programming languages to
parse and analyze the downloaded data.
The last function enables users
to analyze the data on the cloud without downloading it.
Note that the metadata that can be accessed through the API are generally
more complete and accurate than the directly downloaded metadata.
This slide shows the services provided by the LINCS cloud API.
After downloading the data, you will get a basic data matrix
with row IDs and column IDs. The row IDs are probe IDs for
each gene, and they can be queried with the GeneInfo service.
The column ID for level 3 and level 4 is the distil_id, and
it can be queried with the InstInfo service.
16:22
The column ID for level 5 data is the sig_id, and
it can be queried with the SigInfo service.
As you have probably realized, although the moderated z-score level 5
data are not available for download,
you can still query them using the SigInfo service.
16:42
Here is the workflow for processing data downloaded from the LINCS cloud.
This workflow uses level 3 data as an example,
but it is also applicable to level 4 data.
First, download the big matrix file onto your computer,
16:59
then download the code in your preferred language and use the parse_gctx() function
to parse the big matrix file with the annot_only option set to true.
The result will be a list of column IDs, the cid, and a list of row IDs, the rid.
Query the GeneInfo API with the row IDs to gather information about
each gene.
Query the InstInfo API to find the column IDs of the gene expression profiles
that are interesting to your project, then use the parse_gctx() function again
with the cid option set to the selected column IDs returned by the API.
The matrix will be sliced according to the selected column IDs, and
the output is a submatrix that contains only the data of interest.
Notice that by default,
the parse_gctx() function will try to parse the whole matrix into memory.
This is not practical, since it requires hundreds of gigabytes of available memory.
That's why we need to slice the matrix by column IDs to analyze the data.
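The two-pass pattern in this workflow (read the IDs first, then load only the columns you need) can be sketched in Python as follows. Everything here is a toy stand-in: toy_parse is a hypothetical function that only mimics the annot_only and cid behavior described above, the matrix lives in memory, and the metadata dictionary plays the role of an InstInfo query result. It is not the real parse_gctx() code or its full signature.

```python
# Toy in-memory stand-in for a matrix file that is too large to load whole.
TOY_MATRIX = {
    "rid": ["probe_1", "probe_2"],
    "cid": ["inst_A", "inst_B", "inst_C"],
    "columns": {
        "inst_A": [1.0, 2.0],
        "inst_B": [3.0, 4.0],
        "inst_C": [5.0, 6.0],
    },
}

# Hypothetical per-profile metadata, standing in for an InstInfo query result.
TOY_INST_INFO = {
    "inst_A": {"cell_line": "CELL_X"},
    "inst_B": {"cell_line": "CELL_Y"},
    "inst_C": {"cell_line": "CELL_X"},
}

def toy_parse(source, annot_only=False, cid=None):
    """Mimic the two modes described above: IDs only, or a column slice."""
    if annot_only:
        return {"rid": list(source["rid"]), "cid": list(source["cid"])}
    wanted = set(source["cid"]) if cid is None else set(cid)
    kept = [c for c in source["cid"] if c in wanted]
    return {
        "rid": list(source["rid"]),
        "cid": kept,
        "columns": {c: list(source["columns"][c]) for c in kept},
    }

# Pass 1: read only the row and column IDs.
ids = toy_parse(TOY_MATRIX, annot_only=True)

# Use the metadata to pick the profiles of interest.
selected = [c for c in ids["cid"] if TOY_INST_INFO[c]["cell_line"] == "CELL_X"]

# Pass 2: load only the selected columns.
submatrix = toy_parse(TOY_MATRIX, cid=selected)
```

The point of the design is that the expensive step, loading values, only ever touches the columns selected through the metadata.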
The LINCS L1000 data on GEO is much easier to handle.
The GEO page lists several files to download covering data from
level two to level four.
Both data and the metadata are assembled in a single file and
there's no need to query a separate API.
The files are in GCT format, which is a simplified version of the GCTX format.
You can still use the parse_gctx() function to parse the GCT files.
As you can see, there are two level 4 files.
In one, the z-scores are computed relative to the population background, and
in the other, the z-scores are computed relative to vehicle controls.
Generally, I think the Broad Institute prefers the file
computed relative to the population background.
If you want to get the level 5 characteristic direction signatures
processed by the Ma'ayan Lab, you need to install MongoDB.
MongoDB is one of the most popular NoSQL databases; it stores
data as objects rather than as rows in tables.
In this L1000 database,
each signature is represented as an object with data and metadata as attributes.
The files downloaded from our web page are MongoDB
files that can be directly used in the database.
Instructions for how to set this up are provided on this web page.
The only thing that requires your attention is that the database does not store gene
metadata.
The gene metadata needs to be downloaded separately in three files.
19:53
The LINCS cloud row ID (rid) JSON file is an array of probe IDs
matching the order of genes in the full characteristic direction in the LINCS cloud collection.
The order of landmark genes in this array matches the order of
landmark genes in the landmark characteristic direction.
The GSE70138 row ID JSON file is an array of probe IDs
matching the order of the genes in the GSE70138 collection.
The API row metadata file holds the metadata information for
each probe ID, downloaded from the Broad API.
It can be used to convert probe IDs to gene symbols and
also to determine whether a probe ID is a landmark gene.
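Putting these three files to use comes down to aligning a signature vector with the probe ID array and then looking up each probe in the row metadata. The Python sketch below shows that alignment. The probe IDs, gene symbols, and field names (symbol, is_landmark) are hypothetical placeholders; the real files may use different keys.

```python
# Hypothetical contents of a row ID array file: probe IDs in vector order.
probe_order = ["probe_1", "probe_2", "probe_3"]

# Hypothetical characteristic direction vector for one signature.
signature = [0.12, -0.40, 0.05]

# Hypothetical row metadata keyed by probe ID.
probe_meta = {
    "probe_1": {"symbol": "GENE_A", "is_landmark": True},
    "probe_2": {"symbol": "GENE_B", "is_landmark": False},
    "probe_3": {"symbol": "GENE_C", "is_landmark": True},
}

def annotate(values, order, meta):
    """Attach probe ID, gene symbol, and landmark flag to each value."""
    if len(values) != len(order):
        raise ValueError("vector length does not match the probe ID array")
    return [
        {
            "probe_id": pid,
            "symbol": meta[pid]["symbol"],
            "is_landmark": meta[pid]["is_landmark"],
            "value": v,
        }
        for pid, v in zip(order, values)
    ]

annotated = annotate(signature, probe_order, probe_meta)
landmark_only = {row["symbol"]: row["value"] for row in annotated if row["is_landmark"]}
```

The length check matters in practice, because pairing a vector with the wrong collection's probe array would silently mislabel every gene.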
Here are some apps developed by the BD2K-LINCS DCIC
that use LINCS L1000 data.
LIFE is a search engine developed by the University of Miami that
integrates all LINCS content, leveraging a semantic knowledge model and
common LINCS metadata standards.
iLINCS is a computational biology app developed by the University of Cincinnati
that aims to provide statistical methods and computational tools for
integrated analysis of the data produced by the LINCS program.
L1000CDS2 is a search engine developed in our lab to search for
level 5 characteristic direction signatures that
either mimic or reverse user input signatures.
Slicr is a metadata search engine for LINCS L1000 data deposited on GEO, and
it provides customized download of level 3 and level 5 data.
GEO2Enrichr is a Chrome and a Firefox extension that helps users
extract the signatures from studies deposited in GEO.
Although it does not directly use LINCS L1000 data, GEO2Enrichr
pipes the GEO signatures to L1000CDS2 to search for
similar or reverse signatures in the LINCS L1000 database.
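A search for signatures that mimic or reverse an input signature can be pictured as ranking a library by similarity to the query. The Python sketch below uses cosine similarity for this; that scoring choice, the function names, and the three-gene library are assumptions for illustration and are not taken from the L1000CDS2 implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_signatures(query, library, mode="mimic"):
    """Rank library signatures; 'mimic' puts the most similar first,
    'reverse' puts the most opposite first."""
    scored = [(sig_id, cosine(query, vec)) for sig_id, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=(mode == "mimic"))

# Hypothetical library of three signatures over the same three genes.
library = {
    "sig_1": [1.0, 0.5, -1.0],
    "sig_2": [-1.0, -0.5, 1.0],
    "sig_3": [0.0, 1.0, 0.0],
}
query = [2.0, 1.0, -2.0]

mimics = rank_signatures(query, library, mode="mimic")
reversers = rank_signatures(query, library, mode="reverse")
```

In this toy example sig_1 points the same way as the query and sig_2 points the opposite way, so they top the mimic and reverse rankings respectively.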
The last slide is a summary of resources that might help
your research with LINCS L1000 data.