Perform and explore enrichment analyses based on Ligand Discovery primary screening data
The repository contains a large amount of orthogonal data collected from the public domain. The data corresponds to protein annotations of multiple types, including (but not limited to):
- Domain, family, phylogeny groups
- Active site, binding site
- Protein class
- Molecular function
- Biological processes
- Pathways
- Complexes
- Subcellular localization
- Cellular component
- Drug target classes
- Target druggability
- Disease category
All annotations are available here.
For each fragment, we ranked proteins by their Log2FC (z-normalized) and performed a ranksum (GSEA-like) enrichment test across all annotations. The figure below shows some annotations found to be enriched for fragment C001.
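The ranking and rank-sum test described above can be sketched in Python. This is a minimal illustration and not the project's own code: the function name `ranksum_enrichment`, the synthetic data, and the use of SciPy's Mann-Whitney U test as the rank-sum statistic are all assumptions. It does not compute a GSEA-style normalized enrichment score.

```python
import numpy as np
from scipy.stats import mannwhitneyu, zscore


def ranksum_enrichment(log2fc, in_set):
    """One-sided rank-sum test: do annotated proteins have higher Log2FC?

    log2fc : per-protein Log2FC values for one fragment (hypothetical input)
    in_set : boolean mask, True for proteins carrying the annotation
    """
    log2fc = np.asarray(log2fc, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    # z-normalize as in the text; the rank-based test itself is unaffected by it
    z = zscore(log2fc)
    stat, p_value = mannwhitneyu(z[in_set], z[~in_set], alternative="greater")
    # U divided by the number of pairs: probability that an annotated protein
    # outranks a non-annotated one (0.5 means no enrichment)
    auc = stat / (in_set.sum() * (~in_set).sum())
    return auc, p_value


# Synthetic example: 500 proteins, 20 of them annotated and shifted upwards
rng = np.random.default_rng(0)
log2fc = rng.normal(size=500)
in_set = np.zeros(500, dtype=bool)
in_set[:20] = True
log2fc[in_set] += 1.5
auc, p_value = ranksum_enrichment(log2fc, in_set)
print(f"AUC={auc:.2f}, P={p_value:.2e}")
```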
Conventional ranksum enrichment analysis.
In addition, we performed hypergeometric tests based on binarized data, as well as on the top-25, top-50, top-100, top-250 and top-500 proteins. The protein universe used was the basal proteome of HEK293T. We designed a primitive version of the Streamlit App to navigate the very large number of enrichment results. Two limitations became apparent:
- A panel displaying a large number of top enrichment results was necessary in order to extract biological insights.
- Promiscuity of proteins propagates to promiscuity of enrichment results, resulting in frequently occurring annotation terms.
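The top-k hypergeometric tests mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `topk_hypergeom` and the toy protein identifiers are hypothetical, and the ranked list is assumed to cover the whole protein universe (in the project, the basal proteome of HEK293T).

```python
from scipy.stats import hypergeom


def topk_hypergeom(ranked_proteins, annotation_set, k):
    """Over-representation of an annotation among the top-k ranked proteins.

    ranked_proteins : protein identifiers sorted by decreasing Log2FC;
                      assumed here to be the full protein universe
    annotation_set  : set of proteins carrying the annotation
    """
    universe = len(ranked_proteins)
    annotated = sum(p in annotation_set for p in ranked_proteins)
    hits = sum(p in annotation_set for p in ranked_proteins[:k])
    # P(X >= hits) when drawing k proteins at random from the universe
    p_value = hypergeom.sf(hits - 1, universe, annotated, k)
    return hits, p_value


# Toy example: 1000 ranked proteins, 20 annotated ones placed near the top
ranked = [f"P{i}" for i in range(1000)]
annotation = set(ranked[:40:2])
for k in (25, 50, 100, 250, 500):
    hits, p_value = topk_hypergeom(ranked, annotation, k)
    print(f"top-{k}: hits={hits}, P={p_value:.2e}")
```

With many annotations and several cut-offs per fragment, the resulting P-values would normally need a multiple-testing correction before interpretation; the source does not state which correction, if any, was applied.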
To address the above limitations, we provide the following two plot types.
The leaderboard below corresponds to fragment C170. Vacuolar proteins (a GO Cellular Component term) are enriched for this fragment. In the leading edge of this enrichment result, we find TMEM59, TPP1, etc. The normalized enrichment score (NES) is 5.95, and the P-value is 2.9e-09. Red marks vacuolar proteins with high Log2FC values, and blue marks those with lower Log2FC values. The dot at the right is colored by category (in this case, localization).
Leaderboard plot. The leaderboard can have an arbitrary length (10, 50, 100...).
Here we focus on one particular annotation (Vacuolar Lumen) and fragment (C175).
In-depth plot. (Left) The promiscuity plot highlights proteins in the annotation (coloured). Filled circles correspond to the leading edge. Color denotes promiscuity (blue) or specificity (red). (Center) On top, the ranksum plot, including circles denoting the result of a hypergeometric test at top-25, top-50, top-250 and top-500. Empty dots denote a non-significant result (P-value > 0.05). At the bottom, the top-10 proteins in the leading edge, colored and positioned by promiscuity. (Right) In the upper-left panel, the expected normalized enrichment score (NES) of this annotation across other fragments is shown (mean and standard deviation), along with fragments of the same pull-down (in black). In the upper-right panel, a griddified projection of annotations is shown (coloured by promiscuity), in order to locate the annotation with respect to the rest of the annotations. In the lower-left panel, the number of proteins in the leading edge at different degrees of promiscuity is shown. In the lower-right panel, proteins are projected (and griddified) by sequence similarity, and the leading edge proteins are highlighted (coloured by promiscuity).
The current Streamlit App capitalizes on these two display items to provide informative navigation of the enrichment results. The following is a mockup of the Streamlit App:
Mockup of the Streamlit Protein Set Enrichment Analysis App. The two main pages are highlighted. On the left, we sketch the leaderboard page, focused on a given fragment. On the right, we sketch the focus page, specific to a fragment-category pair.
Below we use the case of fragment C310 to illustrate the pages of the protein enrichment app.
Table view, filtering for SQSTM1: Leaderboard page, table view, where SQSTM1 is used as a filtering gene in the leading edge.
Plot view: Plot view of the leaderboard page
Table view, focused on localization terms: Enriched terms, in a table view. At the bottom, there is the possibility to explore proteins.
Basic plots: Basic enrichment plots. Fill color of the curve indicates strength of enrichment signal.
Advanced plots: Advanced enrichment plots. Please see above for interpretation.
First, you need to download a few large data files. These files need to be unzipped in the protein-set-enrichment-analysis/ folder.
- Data: https://ligand-discovery.s3.eu-central-1.amazonaws.com/protein-set-enrichment-analysis/data.zip
- Results: https://ligand-discovery.s3.eu-central-1.amazonaws.com/protein-set-enrichment-analysis/results.zip
- Cache: https://ligand-discovery.s3.eu-central-1.amazonaws.com/protein-set-enrichment-analysis/cache.zip
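For convenience, the download and unzip step can be scripted. The sketch below is not part of the repository: the helper names `extract_archive` and `fetch_all` are hypothetical, and it assumes the script is run from inside the protein-set-enrichment-analysis/ folder. Only the three URLs listed above are used.

```python
import urllib.request
import zipfile
from pathlib import Path

# Base URL and archive names taken from the links listed above
BASE_URL = (
    "https://ligand-discovery.s3.eu-central-1.amazonaws.com/"
    "protein-set-enrichment-analysis"
)
ARCHIVES = ["data.zip", "results.zip", "cache.zip"]


def extract_archive(zip_path, dest_dir):
    """Unzip one archive into dest_dir and return the names of its members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()


def fetch_all(dest_dir="."):
    """Download each archive (skipping ones already present) and unzip it."""
    for name in ARCHIVES:
        zip_path = Path(dest_dir) / name
        if not zip_path.exists():
            print(f"Downloading {name} ...")
            urllib.request.urlretrieve(f"{BASE_URL}/{name}", zip_path)
        extract_archive(zip_path, dest_dir)
```

Calling `fetch_all(".")` from the repository root downloads the three archives and unpacks them in place. The files are large, so the call is left out of the block above.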
This app has very few dependencies. You can install them as follows:
pip install -r requirements.txt
Then, you can simply run the app as follows:
streamlit run app/app.py
In case you don't want to use the cache data, we have a much more complete version of the app that dynamically creates plots, etc. We do not recommend using this version of the app unless you are a developer of the Ligand Discovery project.
To install the dynamic version of the app, we recommend using Conda. Make sure a C++ compiler is installed:
conda install -c conda-forge cxx-compiler
Install the necessary dependencies:
pip install -r requirements_dynamic.txt
Finally, you can run the app as follows:
streamlit run app/app_dynamic.py
You can also run an app for quick exploration, such as identifying good enrichment signals for further inspection.
streamlit run app/explore.py
This project was carried out at the Georg Winter Lab, based at CeMM, Vienna.