This repository contains the code for our software, Functional Feature Amplification Via Entropy Sorting (FFAVES) and Entropy Sorting Feature Weighting (ESFW). A detailed description of both methods can be found in our paper, Entropy sorting of single-cell RNA sequencing data reveals the inner cell mass in the human pre-implantation embryo.
- Retrieve the repository with:
git clone https://github.com/aradley/FFAVES.git
- Navigate to the directory the repository was cloned into, for example:
cd FFAVES/
- Run the following on the command line:
python setup.py install
Dependencies for FFAVES and ESFW are outlined in the requirements.txt file. Dependencies will be automatically installed by python setup.py install.
The Synthetic_Data folder in this repository provides the synthetic data described in our paper, and all the code needed to re-create the results. The process of re-creating the synthetic data results should be relatively quick.
The human pre-implantation embryo data used in our publication is available for download at https://data.mendeley.com/datasets/689pm8s7jc. Likewise, the code required for re-creating the results can be found in the linked directory. Because re-creating the results for the human pre-implantation embryo data would require significant computational resources, the linked repository also contains a minimal set of FFAVES and ESFW outputs that are required for re-creating the plots presented in the paper.
The main input for FFAVES and ESFW is a discretised state matrix (rows are samples and columns are features) where each sample of a feature is represented as existing in one of two states, 0 or 1. In our paper we discretise scRNA-seq data such that genes can be considered as active or inactive (it does not matter whether 1 or 0 indicates active or inactive, but we chose 0 to indicate inactive). Discretisation in this manner is often appropriate for other data types as well; examples include chromatin accessibility/inaccessibility or genome sequence methylation/non-methylation. We remind potential users that discretisation of your data need not be perfect. As long as discretisation is carried out in a reasonable and rational manner, FFAVES has built-in methodology to attempt to correct sub-optimal discretisation.
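As an illustration of the input format, the following is a minimal sketch of turning a counts matrix into a 0/1 state matrix with NumPy. It is not part of the FFAVES/ESFW code: the function name `discretise`, the `threshold` parameter and the toy `counts` array are assumptions made for this example, and the default of treating any value greater than 0 as active is only one reasonable choice.

```python
import numpy as np


def discretise(counts, threshold=0.0):
    """Return a 0/1 state matrix with the same shape as `counts`.

    Hypothetical helper for illustration only. Rows are samples and
    columns are features; values above `threshold` become 1 (active),
    everything else becomes 0 (inactive).
    """
    counts = np.asarray(counts)
    return (counts > threshold).astype(np.int8)


# Toy counts matrix: 3 samples (rows) x 3 features (columns).
counts = np.array([
    [0, 3, 0],
    [5, 0, 1],
    [0, 0, 2],
])

states = discretise(counts)
print(states)
```

Raising `threshold` gives a stricter definition of "active"; whichever rule is used, the result should contain only 0's and 1's and keep samples as rows and features as columns.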
- We advise carefully going through the code provided alongside the human pre-implantation data before undertaking your own analysis, since there are relatively simple, but carefully thought out, workflows for applying FFAVES and ESFW. For example, in our feature selection workflow for the human embryo data, we provide a simple rationale for simultaneously picking the optimum number of cycles of FFAVES imputation applied to the data and the feature weight threshold to use when picking a sub-set of highly structured features.
- For single cell RNA sequencing data, we have found that discretising gene expression matrices such that any value greater than 0 becomes a 1 works well. The assumption is that any observed gene expression is a gene being actively expressed, and that the automatic sub-optimal discretisation correction built into FFAVES will fix any incorrectly discretised data points. However, if you wish to use more sophisticated discretisation methods you are very welcome to. For example, techniques already exist to discretise ATAC-seq data into open/closed regions of the genome.
- We strongly advise that after application of ESFW for feature selection, you do not perform log transformations or PCA reductions on your data prior to generating embeddings. We have found on multiple datasets that such transformations remove the high resolution of gene expression dynamics that are revealed by ESFW feature selection. Of course, for some datasets transformations may be beneficial, and as such we simply suggest at least starting without transforming the counts matrices in any way.
- Following from the previous point, rather than transforming the data, we suggest using the correlation metric (rather than the conventional Euclidean metric) for generating distance matrices or UMAP/tSNE embeddings. The correlation metric is a scale-free metric that will identify when cells have similar gene expression profiles in a manner that is less sensitive to genes having higher/lower average gene expression in relation to each other. However, if you are using UMI count data, differences in average gene expression between genes can be dramatically reduced. Hence, if you are using UMI data, the Euclidean distance may be comparable/more appropriate.
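
To make the last point concrete, the sketch below computes a cell-by-cell correlation distance matrix (1 minus the Pearson correlation between rows) with NumPy and compares it with the Euclidean distance on toy data. It is an illustration only, not the code used in our paper: the function name `correlation_distance` and the `cells` array are assumptions made for this example.

```python
import numpy as np


def correlation_distance(X):
    """Pairwise correlation distance between the rows of X.

    Hypothetical helper for illustration only. The distance is
    1 - Pearson correlation, so it ranges from 0 (same profile shape)
    to 2 (perfectly anti-correlated). Rows with zero variance give NaN.
    """
    X = np.asarray(X, dtype=float)
    return 1.0 - np.corrcoef(X)


# Toy data: 3 cells (rows) x 4 genes (columns).
cells = np.array([
    [1, 2, 3, 4],      # cell A
    [10, 20, 30, 40],  # cell B: same profile as A, 10x the magnitude
    [4, 3, 2, 1],      # cell C: reversed profile
])

D = correlation_distance(cells)

# Euclidean distances from cell A, for comparison.
euclid_ab = np.linalg.norm(cells[0] - cells[1])
euclid_ac = np.linalg.norm(cells[0] - cells[2])

print(np.round(D, 3))
print(euclid_ab, euclid_ac)
```

In this toy case the correlation distance treats cells A and B as near-identical despite the difference in scale, whereas the Euclidean distance places A closer to the reversed profile C than to B.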