Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation. #5732

Merged
merged 4 commits into from
Mar 22, 2019
Changes from 3 commits
@@ -8,7 +8,7 @@
},
"outputs": [],
"source": [
"# Here we check the validity of the approximation in Appendix A of CNV-methods.pdf"
"# Here we check the validity of the approximation in Appendix A of docs/CNV/archived/archived-CNV-methods.pdf"
]
},
{
@@ -396,7 +396,9 @@
"collapsed": true
},
"outputs": [],
"source": []
"source": [
""
]
}
],
"metadata": {
@@ -408,7 +410,7 @@
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"version": 3.0
},
"file_extension": ".py",
"mimetype": "text/x-python",
@@ -420,4 +422,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}
Binary file added docs/CNV/archived/archived-CNV-methods.pdf
Binary file not shown.
@@ -3,10 +3,10 @@
\usepackage{amsthm}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{lmodern}
\usepackage{graphicx}
\usepackage{color}
\usepackage{mathrsfs}
\usepackage{fancyhdr}

%Put an averaged random variable between brackets
\DeclareMathOperator*{\argmax}{\arg\!\max}
@@ -86,7 +86,12 @@

\begin{document}

\title{Notes on CNV Methods}
\pagestyle{fancy}
\lhead{}
\chead{(ARCHIVED) Notes on CNV Methods}
\rhead{}

\title{(ARCHIVED) Notes on CNV Methods}
Review comment (Contributor): Can we add a DEPRECATED or ARCHIVED watermark to each page?

Reply (Contributor Author): Just added a header to each page---a little less distracting than a watermark.


\author{Mehrtash Babadi}
\email{[email protected]}
@@ -100,10 +105,10 @@
\email{[email protected]}
\affiliation{Broad Institute, 75 Ames Street, Cambridge, MA 02142}

\date{\today}
\date{January 12, 2017}

\begin{abstract}
Some notes on current and proposed methods used in the GATK CNV and ACNV workflows.
These notes describe methods that were implemented or proposed for various iterations of the GATK somatic and germline CNV pipelines. The majority of the methods are deprecated and have been superseded by improved methods in GATK 4.0 onwards. (Archived on \today.)
\end{abstract}

\maketitle
@@ -124,7 +129,7 @@ \section{Steps in the GATK CNV and ACNV Workflows} \label{recapseg-overview}

\subsection{Coverage collection}

\SL{Details of coverage collection go here.} This is implemented by the GATK command-line tool \texttt{CalculateTargetCoverage}.
This is implemented by the GATK command-line tool \texttt{CalculateTargetCoverage}.

\subsection{Creation of a panel of normals}
We cannot simply divide the coverage of each target by the average sequencing depth to obtain an estimate of its copy ratio. This is because the coverage of different targets is heavily biased by factors including the efficiency of their baits, GC content, and mappability. In order to detect CNVs, we must determine these systematic effects on the coverage of each target in the absence of CNVs, which requires a panel of normal samples (PoN) that are representative of the sequencing conditions of the case sample. PoN samples must also be created using the same baits as the case sample.
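
As a toy illustration of why depth normalization alone is not enough, the sketch below simulates targets with fixed systematic biases. It is a deliberately simplified stand-in (a per-target median over normals), not the tangent normalization used by the workflow; the function names and the toy numbers are hypothetical.

```python
from statistics import median


def naive_copy_ratio(coverage):
    """Divide each target's coverage by the sample's average depth."""
    depth = sum(coverage) / len(coverage)
    return [c / depth for c in coverage]


def estimate_target_bias(pon_coverage):
    """Per-target median of depth-normalized coverage across normal samples.

    Simplifying assumption: the normals carry no CNVs, so any deviation
    from 1.0 reflects systematic effects (baits, GC content, mappability).
    """
    normalized = [naive_copy_ratio(sample) for sample in pon_coverage]
    n_targets = len(pon_coverage[0])
    return [median(sample[t] for sample in normalized) for t in range(n_targets)]


def pon_corrected_copy_ratio(case_coverage, target_bias):
    """Remove the per-target bias learned from the panel of normals."""
    return [r / b for r, b in zip(naive_copy_ratio(case_coverage), target_bias)]


# Toy data: four targets with systematic biases 0.5, 1.0, 1.5, 1.0.
true_bias = [0.5, 1.0, 1.5, 1.0]
pon = [[depth * b for b in true_bias] for depth in (100, 200, 300)]
case = [150 * b for b in true_bias]  # copy-neutral case sample

naive = naive_copy_ratio(case)
corrected = pon_corrected_copy_ratio(case, estimate_target_bias(pon))
```

In this toy setup the naive ratios simply reproduce the target biases, so a copy-neutral sample would appear to carry deletions and amplifications, whereas the PoN-corrected ratios are flat.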
@@ -169,7 +174,7 @@ \subsection{Segmentation by tangent-normalized coverage}
Finally, the tangent-normalized coverage vector is passed to CBS to obtain coverage segments. This is implemented by the GATK tool \texttt{PerformSegmentation}.

\subsection{Calling of events from coverage segments} \label{gatk-cnv-caller}
\SL{Description of caller goes here.} This is performed by the GATK tool \texttt{CallSegments}, which is the final step in the GATK CNV portion of the case-sample workflow.
This is performed by the GATK tool \texttt{CallSegments}, which is the final step in the GATK CNV portion of the case-sample workflow.

\subsection{Collection of allele counts at het sites}
The first step in the GATK ACNV portion of the case-sample workflow is to gather the necessary allele-count data. This procedure is implemented by the GATK tool \texttt{GetHetCoverage}.
@@ -301,8 +306,8 @@ \subsection{Detection of het sites using a Bayesian model} \label{bayesian-het-c

\begin{figure}
\center
\includegraphics[scale=0.7]{figs/AlleleFractionPrior1.pdf}
\includegraphics[scale=0.7]{figs/AlleleFractionPrior2.pdf}
\includegraphics[scale=0.7]{figures/AlleleFractionPrior1.pdf}
\includegraphics[scale=0.7]{figures/AlleleFractionPrior2.pdf}
\caption{Two examples of the \REF~allele fraction prior $P(f|\mathsf{Het})$ at Het sites based on minimum/maximum non-germline cells and maximum copy number. The blue lines denote the continuous approximation given in Eq.~\eqref{eq:AFpriorcont}. The discontinuous orange lines denote the result with discrete copy number summation given in Eq.~ (the delta function peak at $f=1/2$ is not shown).}
\label{fig:AFprior}
\end{figure}
@@ -386,7 +391,7 @@ \subsection{Allelic model} \label{allelic-model}
\begin{figure}
$
\begin{array}{c}
\includegraphics[width=0.8\linewidth]{ACNV_model.png}
\includegraphics[width=0.8\linewidth]{figures/ACNV_model.png}
\end{array}
$
\label{graphical_model}
@@ -502,7 +507,7 @@ \subsection{Calling segments after allelic CNV workflow} \label{ACNV-caller}
\begin{figure}
$
\begin{array}{c}
\includegraphics[width=0.8\linewidth]{ACNV_caller_model.png}
\includegraphics[width=0.8\linewidth]{figures/ACNV_caller_model.png}
\end{array}
$
\label{acnv_caller_fig}
@@ -751,8 +756,8 @@ \subsubsection{The model}

\begin{figure}
\center
\includegraphics[scale=0.7]{figs/{gauss_poisson_0.1}.pdf}
\includegraphics[scale=0.7]{figs/{gauss_poisson_10}.pdf}
\includegraphics[scale=0.7]{figures/{gauss_poisson_0.1}.pdf}
\includegraphics[scale=0.7]{figures/{gauss_poisson_10}.pdf}
\caption{Gaussian approximation to the Poisson likelihood (see Eq.~\ref{eq:gaussian_approx_coverage}). The left and right panels show $\mathrm{Poisson}(n|\alpha\,e^b)$ and $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$, respectively, for $\alpha=0.1$ (top) and $\alpha=10.0$ (bottom). The black lines show $b = \ln(n/\alpha)$, the maximum likelihood bias estimate. The Gaussian approximation breaks down at $n=0$ (no coverage). It also slightly overestimates the variance at small $n$. Otherwise, it is an excellent approximation.}
\label{fig:gaussian_approx_coverage}
\end{figure}
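
The comparison described in the caption above can be checked numerically. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes $n \geq 1$, since the approximation is undefined at $n=0$.

```python
import math


def poisson_pmf(n, rate):
    """Poisson(n | rate), evaluated in log space for numerical stability."""
    return math.exp(n * math.log(rate) - rate - math.lgamma(n + 1))


def gaussian_approx(n, alpha, b):
    """n^{-1} * Normal(b | ln(n / alpha), variance = 1 / n); requires n >= 1."""
    mean = math.log(n / alpha)
    var = 1.0 / n
    density = math.exp(-((b - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return density / n


def relative_error(n, alpha, b):
    """Relative error of the Gaussian form against Poisson(n | alpha * e^b)."""
    exact = poisson_pmf(n, alpha * math.exp(b))
    return abs(gaussian_approx(n, alpha, b) - exact) / exact


# Compare a well-covered target (n = 50) with a poorly covered one (n = 2),
# each evaluated at its maximum likelihood bias b = ln(n / alpha).
err_high_coverage = relative_error(50, 10.0, math.log(50 / 10.0))
err_low_coverage = relative_error(2, 10.0, math.log(2 / 10.0))
```

The error at the peak shrinks as $n$ grows, which is consistent with the caption's remark that the approximation degrades at small $n$.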
@@ -1022,13 +1027,13 @@ \subsubsection{Results}
\center
$D=10$\\
\vspace{10pt}
\includegraphics[scale=0.45]{figs/comp_random_events_10.pdf}
\includegraphics[scale=0.45]{figures/comp_random_events_10_compressed.pdf}
\vspace{20pt}
\includegraphics[scale=0.45]{figs/comp_corr_events_10.pdf}
\includegraphics[scale=0.45]{figures/comp_corr_events_10_compressed.pdf}
$D=20$\\
\vspace{10pt}
\includegraphics[scale=0.45]{figs/comp_random_events_20.pdf}
\includegraphics[scale=0.45]{figs/comp_corr_events_20.pdf}
\includegraphics[scale=0.45]{figures/comp_random_events_20_compressed.pdf}
\includegraphics[scale=0.45]{figures/comp_corr_events_20_compressed.pdf}
\caption{Comparison of PCA with the probabilistic coverage model in different modes. Top two rows: $D=10$; random events, correlated events. Bottom two rows: $D=20$; random events, correlated events.}
\label{fig:comp_regularizer}
\end{figure}
@@ -1224,6 +1229,8 @@ \section{Marginalizing out latent variables of the allelic model} \label{margina
%
Algorithm \ref{phi_calculation} shows the entire computation.

See the IPython notebook \texttt{docs/CNV/allele-fraction-model-approximation.ipynb} for plots that justify this approximation.

\begin{algorithm}
\begin{algorithmic}[1]
\State $n = a + r$
@@ -8,6 +8,7 @@
\usepackage{color}
\usepackage{mathrsfs}
\usepackage{bm}
\usepackage{fancyhdr}

%Put an averaged random variable between brackets
\newcommand{\ave}[1]{\left\langle #1 \right\rangle}
@@ -88,7 +89,12 @@

\begin{document}

\title{A probabilistic model for coverage bias estimation and CNV detection}
\pagestyle{fancy}
\lhead{}
\chead{(ARCHIVED) A probabilistic model for coverage bias estimation and CNV detection}
\rhead{}

\title{(ARCHIVED) A probabilistic model for coverage bias estimation and CNV detection}

\author{Mehrtash Babadi}
\email{[email protected]}
@@ -102,10 +108,10 @@
\email{[email protected]}
\affiliation{Broad Institute, 75 Ames Street, Cambridge, MA 02142}

\date{\today}
\date{January 12, 2017}

\begin{abstract}
These notes exclusively cover the target coverage model in the GATK CNV pipeline.
These notes describe the coverage model for a previous version of the current GATK \texttt{GermlineCNVCaller} pipeline. This version of the pipeline (sometimes referred to elsewhere as ``gCNV Spark'') was developed primarily in the deprecated \texttt{gatk-protected} repository and was removed prior to the release of GATK 4.0. Some of the material below is covered in less detail in the archived notes on CNV methods. (Archived on \today.)
\end{abstract}

\maketitle
@@ -160,13 +166,13 @@ \section{The model}
\label{eq:m_sigma_def}
\end{align}

% \begin{figure}
% \center
% \includegraphics[scale=0.7]{figs/{gauss_poisson_0.1}.pdf}
% \includegraphics[scale=0.7]{figs/{gauss_poisson_10}.pdf}
% \caption{Gaussian approximation to the Poisson likelihood (see Eq.~\ref{eq:gaussian_approx_coverage}). The left and right panels show $\mathrm{Poisson}(n|\alpha\,e^b)$ and $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$, respectively for $\alpha=0.1$ (top) and $\alpha=10.0$ (bottom). The black lines show $b = \ln(n/\alpha)$ the maximum likelihood bias estimate. The Gaussian approximation breaks down at $n=0$ (no coverage). It also slightly overestimates the variance at small $n$. Otherwise, it is an excellent approximation.}
% \label{fig:gaussian_approx_coverage}
% \end{figure}
\begin{figure}
\center
\includegraphics[scale=0.7]{figures/{gauss_poisson_0.1}.pdf}
\includegraphics[scale=0.7]{figures/{gauss_poisson_10}.pdf}
\caption{Gaussian approximation to the Poisson likelihood (see Eq.~\ref{eq:gaussian_approx_coverage}). The left and right panels show $\mathrm{Poisson}(n|\alpha\,e^b)$ and $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$, respectively, for $\alpha=0.1$ (top) and $\alpha=10.0$ (bottom). The black lines show $b = \ln(n/\alpha)$, the maximum likelihood bias estimate. The Gaussian approximation breaks down at $n=0$ (no coverage). It also slightly overestimates the variance at small $n$. Otherwise, it is an excellent approximation.}
\label{fig:gaussian_approx_coverage}
\end{figure}
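
As a brief aside, the Gaussian form quoted in the caption can be recovered by a second-order expansion in $b$ at fixed $n \geq 1$. This is a standard Laplace-type argument sketched here for orientation; it is not quoted from the derivation in these notes.
\begin{align*}
\ln \mathrm{Poisson}(n|\alpha\,e^b) &= n(\ln\alpha + b) - \alpha\,e^b - \ln n!, \\
\partial_b \ln \mathrm{Poisson}(n|\alpha\,e^b) &= n - \alpha\,e^b = 0 \quad\Rightarrow\quad b = \ln(n/\alpha), \\
\partial_b^2 \ln \mathrm{Poisson}(n|\alpha\,e^b)\,\big|_{b = \ln(n/\alpha)} &= -\alpha\,e^{b}\,\big|_{b = \ln(n/\alpha)} = -n.
\end{align*}
Keeping terms up to second order about the maximum and applying Stirling's formula, $n! \approx \sqrt{2\pi n}\,n^n e^{-n}$, gives
\begin{equation*}
\mathrm{Poisson}(n|\alpha\,e^b) \approx \frac{n^n e^{-n}}{n!}\,\exp\left[-\frac{n}{2}\big(b - \ln(n/\alpha)\big)^2\right] \approx \frac{1}{\sqrt{2\pi n}}\,\exp\left[-\frac{n}{2}\big(b - \ln(n/\alpha)\big)^2\right] = n^{-1}\,\mathcal{N}\big(b\,|\ln(n/\alpha), n^{-1}\big).
\end{equation*}
Both steps require $n \geq 1$, and the neglected higher-order terms become relatively more important as $n$ decreases, which matches the behavior described in the caption.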

Note that $\Sigma_{st}$ can be thought of as the width of the distribution of $b_{st}$ about its maximum likelihood estimate such that in the limit $n_{st}, d_s \rightarrow \infty$, ${\rm Poisson}(n_{st} | d_s c_{st} e^{b_{st}}) \rightarrow \delta(b_{st} - b^*_{st})$ where $b^*_{st} = \lim_{n,d \rightarrow \infty} m_{st}$ is the true bias. The above approximation, while being excellent for well-covered targets (see Fig.~\ref{fig:gaussian_approx_coverage}), inevitably breaks down for targets that are uncovered {\em ex ante} in some samples, such as $Y$ chromosome targets in female samples. To this end, we define a ``sample-target mask matrix'' $\MM_{st}$ such that $\MM_{st} = 0$ if $\PP_{st} = 0$, and $\MM_{st} = 1$ if $\PP_{st} \neq 0$, and we include in the joint likelihood function only those sample-target pairs $(s,t)$ for which $\MM_{st} \neq 0$. The latter is thus written as:
\begin{equation}
@@ -416,20 +422,20 @@ \section{Results}

Fig.~\ref{fig:comp_regularizer} compares PCA denoising against our probabilistic model with different features turned on/off (ARD, CNV event regularization) for random and correlated events. The regularized model retains all of the events even when the number of latent features chosen is greater than the true number.

% \begin{figure}
% \center
% $D=10$\\
% \vspace{10pt}
% \includegraphics[scale=0.45]{figs/comp_random_events_10.pdf}
% \vspace{20pt}
% \includegraphics[scale=0.45]{figs/comp_corr_events_10.pdf}
% $D=20$\\
% \vspace{10pt}
% \includegraphics[scale=0.45]{figs/comp_random_events_20.pdf}
% \includegraphics[scale=0.45]{figs/comp_corr_events_20.pdf}
% \caption{Comparison of PCA with the probabilistic coverage model in different modes. Top two rows: $D=10$; random events, correlated events. Bottom two rows: $D=20$; random events, correlated events.}
% \label{fig:comp_regularizer}
% \end{figure}
\begin{figure}
\center
$D=10$\\
\vspace{10pt}
\includegraphics[scale=0.45]{figures/comp_random_events_10_compressed.pdf}
\vspace{20pt}
\includegraphics[scale=0.45]{figures/comp_corr_events_10_compressed.pdf}
$D=20$\\
\vspace{10pt}
\includegraphics[scale=0.45]{figures/comp_random_events_20_compressed.pdf}
\includegraphics[scale=0.45]{figures/comp_corr_events_20_compressed.pdf}
\caption{Comparison of PCA with the probabilistic coverage model in different modes. Top two rows: $D=10$; random events, correlated events. Bottom two rows: $D=20$; random events, correlated events.}
\label{fig:comp_regularizer}
\end{figure}


\appendix
File renamed without changes
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.