UserGuide: Data exploration and visualization
Some genes are not expressed in any sample, and others are expressed at extremely low levels. In addition, sequencing depth can differ between samples, so raw counts are not directly comparable. For these reasons, gene expression is rarely considered at the level of raw counts; it is common practice to transform raw counts onto a scale that accounts for library size differences.
Here raw counts are converted to CPM (counts per million) and log-CPM values using the cpm function in edgeR. CPMs are calculated by dividing the read counts of each gene by the total counts of its sample and multiplying by one million. By default, a gene must have more than 0.5 CPM in at least 3 samples; otherwise, it is removed. The log-CPM values are used for exploratory plots. When log=TRUE, the cpm function adds a small offset to the CPM values before converting to the log2 scale, which avoids taking the logarithm of zero.
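The CPM calculation and the filtering rule described above can be sketched in base R. This is only an illustration on made-up counts, not AskoR's or edgeR's actual code: the variable names are hypothetical, and the log-CPM line uses a constant offset of 1 as a simplification (edgeR's cpm scales its offset by library size, so its values will differ slightly).

```r
# Toy count matrix (made-up values): 4 genes (rows) x 4 samples (columns).
counts <- matrix(
  c(  0,   0,   0,   0,
      5,   8,   6,   7,
    500, 650, 480, 700,
      1,   0,   2,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("S", 1:4))
)

# Counts per million: divide each column by its library size, times 1e6.
lib_size <- colSums(counts)
cpm_mat  <- sweep(counts, 2, lib_size, "/") * 1e6

# Hypothetical local copies of the two thresholds described above.
threshold_cpm <- 0.5   # a sample "passes" if its CPM is above this value
replicate_cpm <- 3     # minimum number of passing samples to keep a gene

# Keep a gene if enough samples exceed the CPM threshold.
keep <- rowSums(cpm_mat > threshold_cpm) >= replicate_cpm
# keep: gene1 FALSE, gene2 TRUE, gene3 TRUE, gene4 FALSE

# Simplified log-CPM (assumption: constant offset of 1 to avoid log2(0)).
log_cpm <- log2(cpm_mat + 1)

filtered_counts <- counts[keep, , drop = FALSE]
```

In this toy example gene1 is never expressed and gene4 exceeds the threshold in only two samples, so both are removed. The library sizes here are tiny, which is why the CPM values are unrealistically large.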
You can specify another "CPM threshold" or "minimum samples" value by changing the following options and then running the GEfilt function:
# CPM threshold
parameters$threshold_cpm = 0.5
# minimum number of samples that must exceed the CPM threshold
parameters$replicate_cpm = 3
# run filtering
asko_filt<-GEfilt(data, parameters)
# Number of genes remaining after filtering:
dim(asko_filt$counts)[1]
## [1] 10991
The filtered data are stored in an object, here called asko_filt.
In the folder DEG_test/DataExplore/, you should find the images representing your data before and after filtering.
These graphs allow you to check that you are filtering your data correctly or sufficiently. If, however, the peak of non-expressed or very weakly expressed genes remains high, you can modify "CPM threshold" and/or "minimum samples" options.
The boxplots show that the distributions of raw log-intensities are not identical across samples, but they do not differ greatly. If a sample lies far above or below the blue horizontal line, it may need to be investigated further.
Barplots and density graphs representing the data before and after filtering. These graphs make it possible to check that the data have been filtered correctly and sufficiently. If the peak of non-expressed or very weakly expressed genes remains high, the "CPM threshold" and/or "minimum samples" options can be modified.
You may notice that the legend of the density graphs sits too far below the graph. You can correct this with the option parameters$densinset, which modifies the position of the legend. You can also set the number of legend columns with parameters$legendcol. You may have to try several values to get a correct position.
Finally, re-run the GEfilt function:
# Set the position of the legend below the density graph
parameters$densinset = 0.20
# Set the number of columns for the legend
parameters$legendcol = 8
# run filtering
asko_filt<-GEfilt(data, parameters)
Since the sequencing depth might differ between samples, a per-sample library size normalization must be performed before samples can be compared.
By default, we use the TMM (trimmed mean of M values) normalization method to calculate effective library sizes, which are then used as part of the per-sample normalization. This method assumes that most genes are not differentially expressed; it is implemented in the edgeR Bioconductor package as the default normalization method.
You can choose another normalization method by modifying parameters$normal_method (default: "TMM"; allowed methods: "TMM", "RLE", "upperquartile" or "none").
# run normalization
asko_norm<-GEnorm(asko_filt, asko_data, data, parameters)
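To illustrate why effective library sizes matter, here is a minimal base R sketch on made-up counts. It deliberately uses a simplified median-of-ratios (RLE-style) scaling rather than TMM, because it is shorter to write out; it is not the code that edgeR or AskoR runs, and all variable names are hypothetical. It relies on the same assumption as TMM, namely that most genes are not differentially expressed.

```r
# Toy counts (made-up): genes 1-4 are unchanged, gene5 is 10x higher in S2.
counts <- cbind(S1 = c(100, 100, 100, 100, 100),
                S2 = c(100, 100, 100, 100, 1000))
rownames(counts) <- paste0("gene", 1:5)
lib_size <- colSums(counts)

# Naive CPM: gene1 looks lower in S2 only because gene5 inflates S2's total.
naive_cpm <- sweep(counts, 2, lib_size, "/") * 1e6

# Median-of-ratios scaling: compare each sample to a per-gene geometric mean.
# This simplified form requires strictly positive counts.
geo_mean    <- exp(rowMeans(log(counts)))
size_factor <- apply(counts, 2, function(u) median(u / geo_mean))

# Express the scaling as factors on the library size, rescaled to product 1.
norm_factors <- size_factor / lib_size
norm_factors <- norm_factors / exp(mean(log(norm_factors)))

# Effective library size = library size x normalization factor.
eff_lib_size <- lib_size * norm_factors
norm_cpm     <- sweep(counts, 2, eff_lib_size, "/") * 1e6
```

With the naive CPM, the four unchanged genes appear to be lower in S2. After dividing by the effective library sizes, they have the same value in both samples and only gene5 differs.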
You can also generate a heatmap by setting parameters$CompleteHeatmap to TRUE:
parameters$CompleteHeatmap=TRUE
asko_norm<-GEnorm(asko_filt, asko_data, data, parameters)
The normalized data are stored in an object, here called asko_norm.
In the folder DEG_test/DataExplore/, you should find the images representing your data after normalization. Two files are also generated automatically in the DEG_test/NormCountsTables/ folder, because they are used for co-expression analysis: "DEG_test_CPM_NormCounts.txt" and "DEG_test_CPM_NormMeanCounts.txt".
Before proceeding with the computations for differential expression, it is possible to produce a plot showing the sample relations based on multidimensional scaling. Ideally, the intergroup variability, which represents the differences between the experimental conditions, is greater than the intragroup variability, which reflects technical or biological variability.
From the CPM matrix, AskoR produces Multidimensional Scaling (MDS) and Principal Component Analysis (PCA) plots displaying the coordinates of the samples on 3 axes, a heatmap of the correlations between samples (with dendrograms) in which the condition of each sample is encoded by a color, and a correlogram. These plots give an overview of the samples, help visualize clustering among replicates, and help identify outliers or inconsistencies among biological or technical replicates.
GEcorr(asko_norm,parameters)
Several graphics will be saved in the DEG_test/DataExplore/ folder.
MDS and PCA arrange the points on the plot so that the distance between each pair of points corresponds as closely as possible to the dissimilarity between those two samples. In the plot above, samples from the same condition are more similar to each other than to samples from the other conditions. The conditions appear as clusters of different colors: all condition B samples are on one side of the diagonal and all condition A samples are on the other.
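For readers who want to see how such sample coordinates can be obtained, the following base R sketch computes a classical MDS and a PCA on a simulated log-CPM matrix. The data, the group effect, and the variable names are all made up for illustration, and AskoR may compute its distances and axes differently.

```r
set.seed(1)
# Simulated log-CPM matrix (made-up): 200 genes x 6 samples, two conditions.
condition <- rep(c("A", "B"), each = 3)
gene_mean <- rnorm(200, mean = 5, sd = 2)
log_cpm <- matrix(gene_mean + rnorm(200 * 6, sd = 0.5), nrow = 200,
                  dimnames = list(NULL, paste0(condition, 1:3)))
# Shift the first 100 genes in condition B to create a group effect.
log_cpm[1:100, condition == "B"] <- log_cpm[1:100, condition == "B"] + 3

# Classical MDS on Euclidean distances between samples (samples as rows).
d   <- dist(t(log_cpm))
mds <- cmdscale(d, k = 3)

# PCA on the same matrix; genes are centered but not scaled here.
pca <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Plot the first two MDS axes, one color per condition.
plot(mds[, 1], mds[, 2], pch = 19,
     col = ifelse(condition == "A", "blue", "red"),
     xlab = "MDS axis 1", ylab = "MDS axis 2")
```

With Euclidean distances, classical MDS and PCA give the same sample coordinates up to the sign of each axis, which is why the two plots often look alike. In this simulation the first axis separates condition A from condition B, because the intergroup differences are larger than the intragroup noise.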
Correlogram plot generated by AskoR to describe sample correlations. Each cell represents a pairwise comparison and each correlation coefficient is represented by an ellipse whose ‘diameter’, direction, and color depict the accordance for that pair of samples. Highly correlated samples are depicted as thin blue ellipses, while poorly correlated samples are depicted as red ellipses with wide diameters.
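The numbers behind a correlogram or a correlation heatmap can be sketched in base R as follows. This is an illustration on simulated data with hypothetical variable names, assuming Pearson correlation and complete-linkage clustering; the source does not state which correlation measure or clustering method AskoR uses.

```r
set.seed(2)
# Simulated log-CPM matrix (made-up): 200 genes x 6 samples, two conditions.
condition <- rep(c("A", "B"), each = 3)
gene_mean <- rnorm(200, mean = 5, sd = 2)
log_cpm <- matrix(gene_mean + rnorm(200 * 6, sd = 0.5), nrow = 200,
                  dimnames = list(NULL, paste0(condition, 1:3)))
# Shift the first 100 genes in condition B to create a group effect.
log_cpm[1:100, condition == "B"] <- log_cpm[1:100, condition == "B"] + 3

# Pairwise Pearson correlation between samples (columns).
cor_mat <- cor(log_cpm, method = "pearson")

# Hierarchical clustering on 1 - correlation, the kind of grouping a
# dendrogram next to a heatmap would display.
hc     <- hclust(as.dist(1 - cor_mat), method = "complete")
groups <- cutree(hc, k = 2)

print(round(cor_mat, 2))
print(groups)
```

Replicates of the same condition should show higher correlation coefficients with each other than with samples from the other condition, and cutting the tree into two groups should recover the two conditions. A replicate that clusters with the wrong condition is a candidate outlier.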