Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation. #5732
Merged
09eadc9 (samuelklee): Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV met…
4ad1d58 (samuelklee): Updated links for documentation on topological persistence.
8262ffd (samuelklee): Added ARCHIVED headers.
00e8c29 (samuelklee): Addressed PR comments.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,10 +3,10 @@ | |
\usepackage{amsthm} | ||
\usepackage{algorithm} | ||
\usepackage{algpseudocode} | ||
\usepackage{lmodern} | ||
\usepackage{graphicx} | ||
\usepackage{color} | ||
\usepackage{mathrsfs} | ||
\usepackage{fancyhdr} | ||
|
||
%Put an averaged random variable between brackets | ||
\DeclareMathOperator*{\argmax}{\arg\!\max} | ||
|
@@ -86,7 +86,12 @@ | |
|
||
\begin{document} | ||
|
||
\title{Notes on CNV Methods} | ||
\pagestyle{fancy} | ||
\lhead{} | ||
\chead{(ARCHIVED) Notes on CNV Methods} | ||
\rhead{} | ||
|
||
\title{(ARCHIVED) Notes on CNV Methods} | ||
|
||
\author{Mehrtash Babadi} | ||
\email{[email protected]} | ||
|
@@ -100,10 +105,10 @@ | |
\email{[email protected]} | ||
\affiliation{Broad Institute, 75 Ames Street, Cambridge, MA 02142} | ||
|
||
\date{\today} | ||
\date{January 12, 2017} | ||
|
||
\begin{abstract} | ||
Some notes on current and proposed methods used in the GATK CNV and ACNV workflows. | ||
These notes describe methods that were implemented or proposed for various iterations of the GATK somatic and germline CNV pipelines. The majority of the methods are deprecated and have been superseded by improved methods in GATK 4.0 onwards. (Archived on \today.) | ||
\end{abstract} | ||
|
||
\maketitle | ||
|
@@ -124,7 +129,7 @@ \section{Steps in the GATK CNV and ACNV Workflows} \label{recapseg-overview} | |
|
||
\subsection{Coverage collection} | ||
|
||
\SL{Details of coverage collection go here.} This is implemented by the GATK command-line tool \texttt{CalculateTargetCoverage}. | ||
This is implemented by the GATK command-line tool \texttt{CalculateTargetCoverage}. | ||
|
||
\subsection{Creation of a panel of normals} | ||
We cannot simply divide the coverage of each target by the average sequencing depth to obtain an estimate of its copy ratio. This is because the coverage of different targets is heavily biased by factors including the efficiency of their baits, GC content, and mappability. In order to detect CNVs, we must determine these systematic effects on the coverage of each target in the absence of CNVs, which requires a panel of normal samples (PoN) that are representative of the sequencing conditions of the case sample. PoN samples must also be created using the same baits as the case sample. | ||
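The role of the PoN can be illustrated with a deliberately simplified sketch. It is not the tangent normalization used by the GATK workflow; it only divides the fractional coverage of a case sample by the per-target median fractional coverage of the normals. The function names and the toy counts are assumptions made for illustration.

```python
from statistics import median


def fractional_coverage(counts):
    """Divide each target's read count by the sample's total read count."""
    total = sum(counts)
    return [c / total for c in counts]


def naive_copy_ratio(case_counts, pon_counts):
    """Hypothetical helper: estimate per-target copy ratios for a case sample.

    pon_counts is a list of normal samples, each a list of per-target counts
    collected with the same baits as the case sample.
    """
    pon_fractions = [fractional_coverage(sample) for sample in pon_counts]
    n_targets = len(case_counts)
    # Per-target baseline: the systematic bias observed in the absence of CNVs.
    baseline = [median(sample[t] for sample in pon_fractions) for t in range(n_targets)]
    case_fractions = fractional_coverage(case_counts)
    return [case_fractions[t] / baseline[t] for t in range(n_targets)]


# Toy data: three targets with very different capture efficiencies.
pon = [[100, 400, 50], [200, 800, 100], [150, 600, 75]]
case = [100, 400, 75]  # the third target carries a 1.5x gain
ratios = naive_copy_ratio(case, pon)
```

In this toy case the raw counts of the third target are the lowest of the three, yet its estimated copy ratio is 1.5 times that of the first target once the per-target bias learned from the normals is divided out.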
|
@@ -169,7 +174,7 @@ \subsection{Segmentation by tangent-normalized coverage} | |
Finally, the tangent-normalized coverage vector is passed to CBS to obtain coverage segments. This is implemented by the GATK tool \texttt{PerformSegmentation}. | ||
|
||
\subsection{Calling of events from coverage segments} \label{gatk-cnv-caller} | ||
\SL{Description of caller goes here.} This is performed by the GATK tool \texttt{CallSegments}, which is the final step in the GATK CNV portion of the case-sample workflow. | ||
This is performed by the GATK tool \texttt{CallSegments}, which is the final step in the GATK CNV portion of the case-sample workflow. | ||
|
||
\subsection{Collection of allele counts at het sites} | ||
The first step in the GATK ACNV portion of the case-sample workflow is to gather the necessary allele-count data. This procedure is implemented by the GATK tool \texttt{GetHetCoverage}. | ||
|
@@ -301,8 +306,8 @@ \subsection{Detection of het sites using a Bayesian model} \label{bayesian-het-c | |
|
||
\begin{figure} | ||
\center | ||
\includegraphics[scale=0.7]{figs/AlleleFractionPrior1.pdf} | ||
\includegraphics[scale=0.7]{figs/AlleleFractionPrior2.pdf} | ||
\includegraphics[scale=0.7]{figures/AlleleFractionPrior1.pdf} | ||
\includegraphics[scale=0.7]{figures/AlleleFractionPrior2.pdf} | ||
\caption{Two examples of the \REF~allele fraction prior $P(f|\mathsf{Het})$ at Het sites based on minimum/maximum non-germline cells and maximum copy number. The blue lines denote the continuous approximation given in Eq.~\eqref{eq:AFpriorcont}. The discontinuous orange lines denote the result with discrete copy number summation given in Eq.~ (the delta function peak at $f=1/2$ is not shown).} | ||
\label{fig:AFprior} | ||
\end{figure} | ||
|
@@ -386,7 +391,7 @@ \subsection{Allelic model} \label{allelic-model} | |
\begin{figure} | ||
$ | ||
\begin{array}{c} | ||
\includegraphics[width=0.8\linewidth]{ACNV_model.png} | ||
\includegraphics[width=0.8\linewidth]{figures/ACNV_model.png} | ||
\end{array} | ||
$ | ||
\label{graphical_model} | ||
|
@@ -502,7 +507,7 @@ \subsection{Calling segments after allelic CNV workflow} \label{ACNV-caller} | |
\begin{figure} | ||
$ | ||
\begin{array}{c} | ||
\includegraphics[width=0.8\linewidth]{ACNV_caller_model.png} | ||
\includegraphics[width=0.8\linewidth]{figures/ACNV_caller_model.png} | ||
\end{array} | ||
$ | ||
\label{acnv_caller_fig} | ||
|
@@ -751,8 +756,8 @@ \subsubsection{The model} | |
|
||
\begin{figure} | ||
\center | ||
\includegraphics[scale=0.7]{figs/{gauss_poisson_0.1}.pdf} | ||
\includegraphics[scale=0.7]{figs/{gauss_poisson_10}.pdf} | ||
\includegraphics[scale=0.7]{figures/{gauss_poisson_0.1}.pdf} | ||
\includegraphics[scale=0.7]{figures/{gauss_poisson_10}.pdf} | ||
\caption{Gaussian approximation to the Poisson likelihood (see Eq.~\ref{eq:gaussian_approx_coverage}). The left and right panels show $\mathrm{Poisson}(n|\alpha\,e^b)$ and $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$, respectively, for $\alpha=0.1$ (top) and $\alpha=10.0$ (bottom). The black lines show $b = \ln(n/\alpha)$, the maximum likelihood bias estimate. The Gaussian approximation breaks down at $n=0$ (no coverage). It also slightly overestimates the variance at small $n$. Otherwise, it is an excellent approximation.} | ||
\label{fig:gaussian_approx_coverage} | ||
\end{figure} | ||
|
@@ -1022,13 +1027,13 @@ \subsubsection{Results} | |
\center | ||
$D=10$\\ | ||
\vspace{10pt} | ||
\includegraphics[scale=0.45]{figs/comp_random_events_10.pdf} | ||
\includegraphics[scale=0.45]{figures/comp_random_events_10_compressed.pdf} | ||
\vspace{20pt} | ||
\includegraphics[scale=0.45]{figs/comp_corr_events_10.pdf} | ||
\includegraphics[scale=0.45]{figures/comp_corr_events_10_compressed.pdf} | ||
$D=20$\\ | ||
\vspace{10pt} | ||
\includegraphics[scale=0.45]{figs/comp_random_events_20.pdf} | ||
\includegraphics[scale=0.45]{figs/comp_corr_events_20.pdf} | ||
\includegraphics[scale=0.45]{figures/comp_random_events_20_compressed.pdf} | ||
\includegraphics[scale=0.45]{figures/comp_corr_events_20_compressed.pdf} | ||
\caption{Comparison of PCA with the probabilistic coverage model in different modes. Top two rows: $D=10$; random events, correlated events. Bottom two rows: $D=20$; random events, correlated events.} | ||
\label{fig:comp_regularizer} | ||
\end{figure} | ||
|
@@ -1224,6 +1229,8 @@ \section{Marginalizing out latent variables of the allelic model} \label{margina | |
% | ||
Algorithm \ref{phi_calculation} shows the entire computation. | ||
|
||
See the ipython notebook \texttt{docs/CNV/allele-fraction-model-approximation.ipynb} for some plots that justify this approximation. | ||
|
||
\begin{algorithm} | ||
\begin{algorithmic}[1] | ||
\State $n = a + r$ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,6 +8,7 @@ | |
\usepackage{color} | ||
\usepackage{mathrsfs} | ||
\usepackage{bm} | ||
\usepackage{fancyhdr} | ||
|
||
%Put an averaged random variable between brackets | ||
\newcommand{\ave}[1]{\left\langle #1 \right\rangle} | ||
|
@@ -88,7 +89,12 @@ | |
|
||
\begin{document} | ||
|
||
\title{A probabilistic model for coverage bias estimation and CNV detection} | ||
\pagestyle{fancy} | ||
\lhead{} | ||
\chead{(ARCHIVED) A probabilistic model for coverage bias estimation and CNV detection} | ||
\rhead{} | ||
|
||
\title{(ARCHIVED) A probabilistic model for coverage bias estimation and CNV detection} | ||
|
||
\author{Mehrtash Babadi} | ||
\email{[email protected]} | ||
|
@@ -102,10 +108,10 @@ | |
\email{[email protected]} | ||
\affiliation{Broad Institute, 75 Ames Street, Cambridge, MA 02142} | ||
|
||
\date{\today} | ||
\date{January 12, 2017} | ||
|
||
\begin{abstract} | ||
These notes exclusively cover the target coverage model in the GATK CNV pipeline. | ||
These notes describe the coverage model for a previous version of the current GATK \texttt{GermlineCNVCaller} pipeline. This version of the pipeline (sometimes referred to elsewhere as ``gCNV Spark'') was developed primarily in the deprecated \texttt{gatk-protected} repository and was removed prior to the release of GATK 4.0. Some of the material below is covered in less detail in the archived notes on CNV methods. (Archived on \today.) | ||
\end{abstract} | ||
|
||
\maketitle | ||
|
@@ -160,13 +166,13 @@ \section{The model} | |
\label{eq:m_sigma_def} | ||
\end{align} | ||
|
||
% \begin{figure} | ||
% \center | ||
% \includegraphics[scale=0.7]{figs/{gauss_poisson_0.1}.pdf} | ||
% \includegraphics[scale=0.7]{figs/{gauss_poisson_10}.pdf} | ||
% \caption{Gaussian approximation to the Poisson likelihood (see Eq.~\ref{eq:gaussian_approx_coverage}). The left and right panels show $\mathrm{Poisson}(n|\alpha\,e^b)$ and $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$, respectively for $\alpha=0.1$ (top) and $\alpha=10.0$ (bottom). The black lines show $b = \ln(n/\alpha)$ the maximum likelihood bias estimate. The Gaussian approximation breaks down at $n=0$ (no coverage). It also slightly overestimates the variance at small $n$. Otherwise, it is an excellent approximation.} | ||
% \label{fig:gaussian_approx_coverage} | ||
% \end{figure} | ||
\begin{figure} | ||
\center | ||
\includegraphics[scale=0.7]{figures/{gauss_poisson_0.1}.pdf} | ||
\includegraphics[scale=0.7]{figures/{gauss_poisson_10}.pdf} | ||
\caption{Gaussian approximation to the Poisson likelihood (see Eq.~\ref{eq:gaussian_approx_coverage}). The left and right panels show $\mathrm{Poisson}(n|\alpha\,e^b)$ and $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$, respectively, for $\alpha=0.1$ (top) and $\alpha=10.0$ (bottom). The black lines show $b = \ln(n/\alpha)$, the maximum likelihood bias estimate. The Gaussian approximation breaks down at $n=0$ (no coverage). It also slightly overestimates the variance at small $n$. Otherwise, it is an excellent approximation.} | ||
\label{fig:gaussian_approx_coverage} | ||
\end{figure} | ||
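The approximation in the caption above can be checked numerically. The following is a minimal sketch that evaluates $\mathrm{Poisson}(n|\alpha\,e^b)$ and $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$ at the same point; the function names are assumptions made for illustration and are not part of the GATK code base.

```python
import math


def poisson_pmf(n, rate):
    """Poisson probability of n counts at the given rate, computed in log space."""
    return math.exp(n * math.log(rate) - rate - math.lgamma(n + 1))


def gaussian_approx(n, alpha, b):
    """n^{-1} N(b | ln(n/alpha), n^{-1}); only defined for n >= 1."""
    mean = math.log(n / alpha)
    variance = 1.0 / n
    normal_pdf = math.exp(-((b - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )
    return normal_pdf / n


def compare(n, alpha, b):
    """Return (exact Poisson likelihood, Gaussian approximation) at bias b."""
    return poisson_pmf(n, alpha * math.exp(b)), gaussian_approx(n, alpha, b)


# Well-covered target: evaluate at the maximum likelihood bias b = ln(n / alpha).
exact_hi, approx_hi = compare(50, 10.0, math.log(50 / 10.0))
# Poorly covered target: a single read.
exact_lo, approx_lo = compare(1, 1.0, 0.0)
```

At the maximum likelihood bias the two expressions differ only by the Stirling correction to $n!$, so the agreement improves as $n$ grows and the expression is undefined at $n=0$, which is consistent with the behavior described in the caption.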
|
||
Note that $\Sigma_{st}$ can be thought of as the width of the distribution of $b_{st}$ about its maximum likelihood estimate such that in the limit $n_{st}, d_s \rightarrow \infty$, ${\rm Poisson}(n_{st} | d_s c_{st} e^{b_{st}}) \rightarrow \delta(b_{st} - b^*_{st})$ where $b^*_{st} = \lim_{n,d \rightarrow \infty} m_{st}$ is the true bias. The above approximation, while being excellent for well-covered targets (see Fig.~\ref{fig:gaussian_approx_coverage}), inevitably breaks down for targets that are uncovered {\em ex ante} in some samples, such as $Y$ chromosome targets in female samples. To this end, we define a ``sample-target mask matrix'' $\MM_{st}$ such that $\MM_{st} = 0$ if $\PP_{st} = 0$, and $\MM_{st} = 1$ if $\PP_{st} \neq 0$, and we include in the joint likelihood function only those sample-target pairs $(s,t)$ for which $\MM_{st} \neq 0$. The latter is thus written as: | ||
\begin{equation} | ||
|
@@ -416,20 +422,20 @@ \section{Results} | |
|
||
Fig.~\ref{fig:comp_regularizer} compares PCA denoising against our probabilistic model with different features turned on/off (ARD, CNV event regularization) for random and correlated events, respectively. It is clearly observed that the regularized model retains all of the events even when the number of latent features chosen is greater than the true number. | ||
|
||
% \begin{figure} | ||
% \center | ||
% $D=10$\\ | ||
% \vspace{10pt} | ||
% \includegraphics[scale=0.45]{figs/comp_random_events_10.pdf} | ||
% \vspace{20pt} | ||
% \includegraphics[scale=0.45]{figs/comp_corr_events_10.pdf} | ||
% $D=20$\\ | ||
% \vspace{10pt} | ||
% \includegraphics[scale=0.45]{figs/comp_random_events_20.pdf} | ||
% \includegraphics[scale=0.45]{figs/comp_corr_events_20.pdf} | ||
% \caption{Comparison of PCA with the probabilistic coverage model in different modes. Top two rows: $D=10$; random events, correlated events. Bottom two rows: $D=20$; random events, correlated events.} | ||
% \label{fig:comp_regularizer} | ||
% \end{figure} | ||
\begin{figure} | ||
\center | ||
$D=10$\\ | ||
\vspace{10pt} | ||
\includegraphics[scale=0.45]{figures/comp_random_events_10_compressed.pdf} | ||
\vspace{20pt} | ||
\includegraphics[scale=0.45]{figures/comp_corr_events_10_compressed.pdf} | ||
$D=20$\\ | ||
\vspace{10pt} | ||
\includegraphics[scale=0.45]{figures/comp_random_events_20_compressed.pdf} | ||
\includegraphics[scale=0.45]{figures/comp_corr_events_20_compressed.pdf} | ||
\caption{Comparison of PCA with the probabilistic coverage model in different modes. Top two rows: $D=10$; random events, correlated events. Bottom two rows: $D=20$; random events, correlated events.} | ||
\label{fig:comp_regularizer} | ||
\end{figure} | ||
|
||
|
||
\appendix | ||
|
File renamed without changes
File renamed without changes
File renamed without changes.
File renamed without changes.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
File renamed without changes.
File renamed without changes.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Can we add a DEPRECATED or ARCHIVED watermark to each page?
Just added a header to each page---a little less distracting than a watermark.