File Definitions

Brian Haas edited this page Oct 18, 2018 · 27 revisions

Several files may be needed, depending on the analysis. These input files, as well as the files output by inferCNV, are described here.

Input Files

Raw Counts Matrix for Genes x Cells

InferCNV is compatible with both smart-seq2 and 10x single cell transcriptome data, and presumably other methods (not tested). The counts matrix can be generated using any conventional single cell transcriptome quantification pipeline, yielding a matrix of genes (rows) vs. cells (columns) containing assigned read counts.

The format might look like so:

MGH54_P16_F12 MGH54_P12_C10 MGH54_P11_C11 MGH54_P15_D06 MGH54_P16_A03 ...
A2M 0 0 0 0 0 ...
A4GALT 0 0 0 0 0 ...
AAAS 0 37 30 21 0 ...
AACS 0 0 0 0 2 ...
AADAT 0 0 0 0 0 ...
... ... ... ... ... ... ...

The matrix can be provided as a tab-delimited file. Sparse matrices are now supported as well.
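As a rough illustration of the layout above, here is a minimal Python sketch that reads such a tab-delimited matrix using only the standard library. The function name `read_counts_matrix` is hypothetical and not part of inferCNV. The sketch assumes integer counts and that the header row may omit a label above the gene-name column, as in the example.

```python
import csv
import io

def read_counts_matrix(handle):
    """Parse a tab-delimited genes x cells matrix into (cell_names, {gene: counts})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]
    # Assumption: the header may have no label above the gene-name column,
    # so it can be one field shorter than the data rows. If it is the same
    # length, treat its first field as that label and drop it.
    if rows and len(header) == len(rows[0]):
        header = header[1:]
    counts = {}
    for row in rows:
        gene, values = row[0], row[1:]
        if len(values) != len(header):
            raise ValueError(
                f"row for {gene} has {len(values)} values, expected {len(header)}"
            )
        counts[gene] = [int(v) for v in values]  # assumes integer read counts
    return header, counts

# A small excerpt of the example matrix shown above.
example = (
    "MGH54_P16_F12\tMGH54_P12_C10\tMGH54_P11_C11\n"
    "A2M\t0\t0\t0\n"
    "AAAS\t0\t37\t30\n"
)
cells, counts = read_counts_matrix(io.StringIO(example))
print(cells)
print(counts["AAAS"])
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.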

Sample annotation file

The sample annotation file defines the different cell types and, optionally, indicates how the cells should be grouped by sample (i.e., patient). The format is simply two tab-delimited columns with no column header.

MGH54_P2_C12    Microglia/Macrophage
MGH36_P6_F03    Microglia/Macrophage
MGH54_P16_F12   Oligodendrocytes (non-malignant)
MGH54_P12_C10   Oligodendrocytes (non-malignant)
MGH36_P1_B02    malignant_MGH36
MGH36_P1_H10    malignant_MGH36

The first column is the cell name, and the second column indicates the known cell type. If you have several types of known normal cells (e.g., immune cells, normal fibroblasts), you can label each with its cell type. Otherwise, you can group them all as 'normal'. If multiple 'normal' types are defined separately, the expression distribution for normal cells will be explored according to each normal cell grouping, as opposed to treating them all as a single normal group. They'll also be clustered and plotted in the heatmap according to normal cell grouping.

The sample (i.e., patient) information is encoded in the cell name as "{patient}_...", so whatever string precedes the first underscore is used as the sample name. If the sample name is encoded in the cell name, then the tumor cells can be clustered and plotted according to sample (patient) in the heatmap.

Only those cells listed in the sample annotation file will be analyzed by inferCNV. This is useful when your cells of interest are a subset of the total counts matrix, since you do not need to create a new matrix containing only that subset.
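The conventions described above (two columns, sample name before the first underscore, analysis restricted to annotated cells) can be sketched in Python as follows. The helper names `read_annotations` and `sample_of` are hypothetical, and the cell `MGH99_P1_A01` is made up to stand in for an unannotated cell. This only illustrates the file format and is not inferCNV's own implementation.

```python
import csv
import io
from collections import defaultdict

def read_annotations(handle):
    """Return {cell_name: cell_type} from a two-column, headerless, tab-delimited file."""
    annotations = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        annotations[row[0].strip()] = row[1].strip()
    return annotations

def sample_of(cell_name):
    """Sample (patient) name: the text before the first underscore."""
    return cell_name.split("_", 1)[0]

example = (
    "MGH54_P2_C12\tMicroglia/Macrophage\n"
    "MGH54_P16_F12\tOligodendrocytes (non-malignant)\n"
    "MGH36_P1_B02\tmalignant_MGH36\n"
    "MGH36_P1_H10\tmalignant_MGH36\n"
)
annotations = read_annotations(io.StringIO(example))

# Group cell names by their annotated cell type.
cells_by_type = defaultdict(list)
for cell, cell_type in annotations.items():
    cells_by_type[cell_type].append(cell)

# Only annotated cells are analyzed; MGH99_P1_A01 is a made-up unannotated cell.
matrix_cells = ["MGH54_P2_C12", "MGH36_P1_B02", "MGH99_P1_A01"]
kept = [cell for cell in matrix_cells if cell in annotations]

print(dict(cells_by_type))
print(kept)
print(sample_of("MGH36_P1_B02"))
```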

Gene ordering file

The gene ordering file provides the chromosomal location for each gene. The format is tab-delimited and has no column header, simply providing the gene name, chromosome, and gene span:

WASH7P  chr1    14363   29806
LINC00115       chr1    761586  762902
NOC2L   chr1    879584  894689
MIR200A chr1    1103243 1103332
SDF4    chr1    1152288 1167411
UBE2J2  chr1    1189289 1209265

Every gene in the counts matrix to be analyzed should have the corresponding gene name and location info provided in this gene ordering file.

Note, only those genes that exist in both the counts matrix and the gene ordering file will be included in the inferCNV analysis.
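To check ahead of time which genes would be kept, one could compare the two files along these lines. This is a minimal sketch: `read_gene_order` is a hypothetical helper, and the gene list standing in for the counts matrix is illustrative.

```python
import csv
import io

def read_gene_order(handle):
    """Return {gene: (chromosome, start, end)}, preserving file order."""
    positions = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        gene, chrom, start, end = row[:4]
        positions[gene] = (chrom, int(start), int(end))
    return positions

example = (
    "WASH7P\tchr1\t14363\t29806\n"
    "NOC2L\tchr1\t879584\t894689\n"
    "SDF4\tchr1\t1152288\t1167411\n"
)
positions = read_gene_order(io.StringIO(example))

# Illustrative gene names standing in for the rows of a counts matrix.
matrix_genes = ["A2M", "NOC2L", "SDF4", "WASH7P"]
matrix_gene_set = set(matrix_genes)

# Genes present in both files, in gene ordering file order.
shared = [gene for gene in positions if gene in matrix_gene_set]
# Genes in the matrix with no location info; these would be left out.
missing = [gene for gene in matrix_genes if gene not in positions]

print(shared)
print(missing)
```

Reporting the `missing` list before a run makes it easier to spot gene-naming mismatches between the matrix and the gene ordering file.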

Some Genomic Position Files have been generated from common references and made available at TrinityCTAT.

If you need to construct your own custom genomic positions file, see instructions for creating a genomic position file.

References File

(Optional, useful when working with controls/reference files)

  • This is a simple text file with the names of the cells that should be treated as references or controls.
  • Cell names should be identical to the cell names in the Expression Matrix.
  • Cell names should be comma delimited and can be on an arbitrary number of lines.
  • Example References File
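A file following the rules above (comma-delimited names spread over any number of lines) could be read with a sketch like this one. The function name `parse_reference_cells` is hypothetical, and the handling of stray whitespace and trailing commas is an assumption.

```python
def parse_reference_cells(text):
    """Collect comma-delimited cell names spread over any number of lines."""
    names = []
    for line in text.splitlines():
        for field in line.split(","):
            name = field.strip()
            if name:  # ignore empty fields from trailing commas or blank lines
                names.append(name)
    return names

example = "MGH54_P2_C12,MGH36_P6_F03\nMGH54_P16_F12,\n\nMGH54_P12_C10\n"
reference_cells = parse_reference_cells(example)
print(reference_cells)
```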

Output Files

A directory of output files is generated per run. This output directory is located alongside the output PDF and has the same name as the PDF (excluding the extension). Several files are provided in the directory to enable further analysis.

Please let us know if there are other files that would be helpful as you explore your results!

expression_pre_vis_transform.txt

This is the expression matrix after all data manipulation except the last transform for data visualization. The last step of preparing data for visualization allows one to bound measurements (using the --vis_bound_threshold argument). Although helpful in making visualization more vivid in the presence of outliers, this may not be as appropriate for additional analysis. The matrix before this bounding is given here.

observations.txt

All observations and associated measurements as shown in the visualization.

references.txt

(Optional, only generated when reference cells are indicated)

All references and associated measurements as shown in the visualization.

*_members.txt

If groups of observations are generated (for instance, by the --obs_groups argument), the names of samples (cells) in each group of observations are recorded in separate files. The file names indicate the cluster group and the method by which the observations were clustered. "General" indicates clustering using all genomic positions; a contig name indicates clustering just by that contig (see --obs_cluster_contig).

observation_groupings.txt

If groups of observations are generated (for instance by the --obs_groups argument), sample (cell) name, cluster membership, and color (shown in the figure) are recorded here.

observations_dendrogram.txt

A newick output of the observation matrix dendrogram so that it can be reconstructed.
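If only the leaf (cell) order of the dendrogram is needed, a small regular expression may be enough. The sketch below assumes a single plain Newick tree with unquoted labels; the helper `newick_leaf_names` and the example tree are hypothetical, and a dedicated Newick parser would be a better choice for anything beyond this simple case.

```python
import re

# A leaf label directly follows "(" or ","; internal node labels follow ")".
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")

def newick_leaf_names(newick):
    """Return leaf labels in the order they appear in a plain Newick string."""
    return LEAF_PATTERN.findall(newick)

# Made-up tree with branch lengths, for illustration only.
example = "((cellA:0.1,cellB:0.2):0.3,(cellC:0.4,cellD:0.5):0.6);"
print(newick_leaf_names(example))
```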
