File Definitions

There are several files that may be needed depending on the analysis. These files, as well as the files output by inferCNV, are described here.

Input Files

Expression Matrix

(REQUIRED, this is the data matrix)

The input data matrix is expected to be log(TPM+1). If your data is TPM data, use the --transform command line argument and the data will be transformed.
The file should be tab delimited.
The matrix is expected to be genes (rows) by cells (columns), with both the genes and the cells labeled.
Gene names in the expression matrix should match gene names in the genomic positions file.
Example: see example_expression.txt in the example directory of the download.
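
As a rough illustration of the expected input, the sketch below reads a tab-delimited genes-by-cells matrix and applies log(TPM+1) to raw TPM values. It is not inferCNV's own code: the function names are made up for this example, the header row is assumed to list the cell names (with or without a leading corner label), and base 2 is an assumption because the base of the logarithm is not stated here.

```python
import csv
import io
import math


def read_matrix(handle):
    """Read a tab-delimited genes (rows) by cells (columns) matrix.

    Assumes the first row holds cell names and the first column of every
    following row holds the gene name (hypothetical layout for this sketch).
    """
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    genes = [row[0] for row in body]
    values = [[float(x) for x in row[1:]] for row in body]
    n_cells = len(values[0])
    # Tolerate a header with or without a label above the gene column.
    cells = header[-n_cells:]
    return genes, cells, values


def log_tpm_plus_one(values, base=2.0):
    """Return log(TPM + 1) for every entry; the base is an assumption."""
    return [[math.log(v + 1.0, base) for v in row] for row in values]


# Tiny made-up matrix: two genes, two cells.
demo = "\tcell1\tcell2\nGENE_A\t0\t3\nGENE_B\t7\t1\n"
genes, cells, tpm = read_matrix(io.StringIO(demo))
log_tpm = log_tpm_plus_one(tpm)
```

If the matrix is already log(TPM+1), only the reading step applies; transforming twice would distort the values.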

Genomic Position Files

(Optional, specifies which genes are viewed and their order)

This is a tab delimited file of 4 columns (gene name, contig/chr, start position, stop position).
Gene names should match the expression matrix row labels.
This is used to order the expression data in genomic order.
Contigs/chr will be ordered by first appearance in this file.
Example Position File
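
The ordering described above can be sketched as follows. This is only an illustration under stated assumptions, not inferCNV's implementation: the function names are hypothetical, the file is assumed to have no header row, genes within a contig are assumed to be sorted by start position, and genes absent from the positions file are assumed to be dropped from view.

```python
import csv
import io


def genomic_order(handle):
    """Return gene names ordered by contig (first appearance), then start."""
    contig_rank = {}
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        gene, contig, start, stop = row[:4]
        # Contigs are ranked in the order they first appear in the file.
        contig_rank.setdefault(contig, len(contig_rank))
        records.append((contig_rank[contig], int(start), int(stop), gene))
    # Assumption: within a contig, genes are ordered by start position.
    records.sort()
    return [gene for _, _, _, gene in records]


def reorder_rows(genes, values, ordered_genes):
    """Reorder matrix rows to follow ordered_genes, keeping shared genes only."""
    index = {gene: i for i, gene in enumerate(genes)}
    kept = [gene for gene in ordered_genes if gene in index]
    return kept, [values[index[gene]] for gene in kept]


# Made-up positions: chr2 appears first, so it is ordered before chr1.
positions = "G3\tchr2\t50\t90\nG1\tchr1\t500\t900\nG2\tchr1\t100\t200\n"
order = genomic_order(io.StringIO(positions))
kept_genes, kept_values = reorder_rows(["G1", "G2", "G4"], [[1.0], [2.0], [4.0]], order)
```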

Making a Genomic Position File

Some Genomic Position Files have been generated from common references and made available at TrinityCTAT.
To generate a Genomic Position File from a GTF file, please use the gtf_to_position_file.py script provided in the src directory.

# By default, gene_id is used as the name of your feature
python ./src/gtf_to_position_file.py your_reference.gtf your_gen_pos.txt

# You can change which GTF attribute key is used; here transcript_id is used.
python ./src/gtf_to_position_file.py --attribute_name transcript_id your_reference.gtf your_gen_pos.txt

(This command should work in both Python 2.X and 3.X environments).

References File

(Optional, useful when working with controls/reference files)

This is a simple text file with the names of the cells that should be treated as references or controls.
Cell names should be identical to the cell names in the Expression Matrix.
Cell names should be comma delimited and can be on an arbitrary number of lines.
Example References File
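
A minimal sketch of reading such a file, assuming only what is stated above (comma-delimited names spread over any number of lines). The function names are hypothetical, and stripping surrounding whitespace from each name is an assumption made for convenience.

```python
import io


def read_reference_cells(handle):
    """Collect comma-delimited cell names from any number of lines."""
    cells = []
    for line in handle:
        for name in line.split(","):
            name = name.strip()  # assumption: surrounding whitespace is not significant
            if name:
                cells.append(name)
    return cells


def missing_references(reference_cells, matrix_cells):
    """Return reference names that do not appear among the matrix cell names."""
    known = set(matrix_cells)
    return [cell for cell in reference_cells if cell not in known]


# Made-up references file spanning several lines.
demo_refs = "cell1,cell2,\ncell3\n\ncell4, cell5\n"
reference_cells = read_reference_cells(io.StringIO(demo_refs))
unknown = missing_references(reference_cells, ["cell1", "cell2", "cell3", "cell4"])
```

Checking for unknown names before a run is a cheap way to catch mismatches between the references file and the expression matrix column labels.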

Output Files

A directory of output files is generated per run. This output directory is created in the same location as the output pdf and shares its name (excluding the extension). Several files are provided in the directory to enable further analysis.

Please let us know if there are other files that would be helpful as you explore your results!

expression_pre_vis_transform.txt

This is the expression matrix after all data manipulation except the last transform for data visualization. The last step of preparing data for visualization allows one to bound measurements (using the --vis_bound_threshold argument). Although bounding helps make the visualization more vivid in the presence of outliers, bounded values may be less appropriate for additional analysis. The matrix before this bounding is given here.
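
To show why bounded values differ from the pre-visualization matrix, here is a generic sketch of bounding (clamping) measurements to a range. It is not taken from inferCNV: the function name and the lower/upper parameters are hypothetical, and the exact form accepted by --vis_bound_threshold is not described here.

```python
def bound_values(values, lower, upper):
    """Clamp every measurement to the closed interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return [[min(max(v, lower), upper) for v in row] for row in values]


# Made-up measurements with two outliers.
raw = [[-5.0, 0.2, 9.0]]
bounded = bound_values(raw, -1.0, 1.0)
```

After bounding, the two outliers can no longer be told apart from values sitting exactly at the bounds, which is the information the pre-visualization matrix preserves.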

observations.txt

All observations and associated measurements as shown in the visualization.

references.txt

(Optional, only generated when reference cells are indicated)

All references and associated measurements as shown in the visualization.

*_members.txt

If groups of observations are generated (for instance, by the --obs_groups argument), the names of the samples (cells) in each group are recorded in separate files. The file names indicate the cluster group and the method by which the observations were clustered. "General" indicates clustering using all genomic positions; a contig name indicates clustering by that contig alone (see --obs_cluster_contig).

observation_groupings.txt

If groups of observations are generated (for instance, by the --obs_groups argument), the sample (cell) name, cluster membership, and color (as shown in the figure) are recorded here.

observations_dendrogram.txt

The observation matrix dendrogram in Newick format, so that it can be reconstructed.
