Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation. (#5732)

* Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation.
* Updated links for documentation on topological persistence.
* Added ARCHIVED headers.
broadinstitute · Mar 22, 2019 · commit 02b95b3 · 1 parent a980ec6

Showing 42 changed files with 426 additions and 210 deletions.
diff --git a/...CNVs/justifying_gamma_approximation.ipynb → ...allele-fraction-model-approximation.ipynb b/...CNVs/justifying_gamma_approximation.ipynb → ...allele-fraction-model-approximation.ipynb
@@ -8,7 +8,7 @@
    },
    "outputs": [],
    "source": [
-    "# Here we check the validity of the approximation in Appendix A of CNV-methods.pdf"
+    "# Here we check the validity of the approximation in Appendix A of docs/CNV/archived/archived-CNV-methods.pdf"
    ]
   },
   {
@@ -396,7 +396,9 @@
     "collapsed": true
    },
    "outputs": [],
-   "source": []
+   "source": [
+    ""
+   ]
   }
  ],
  "metadata": {
@@ -408,7 +410,7 @@
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
-    "version": 3
+    "version": 3.0
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
@@ -420,4 +422,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 0
-}
+}
diff --git a/docs/CNV/archived/archived-CNV-methods.pdf b/docs/CNV/archived/archived-CNV-methods.pdf
diff --git a/docs/CNVs/CNV-methods.tex → docs/CNV/archived/archived-CNV-methods.tex b/docs/CNVs/CNV-methods.tex → docs/CNV/archived/archived-CNV-methods.tex
@@ -3,10 +3,10 @@
 \usepackage{amsthm}
 \usepackage{algorithm}
 \usepackage{algpseudocode}
-\usepackage{lmodern}
 \usepackage{graphicx}
 \usepackage{color}
 \usepackage{mathrsfs}
+\usepackage{fancyhdr}
 
 %Put an averaged random variable between brackets
 \DeclareMathOperator*{\argmax}{\arg\!\max}
@@ -86,7 +86,12 @@
 
 \begin{document}
 
-\title{Notes on CNV Methods}
+\pagestyle{fancy}
+\lhead{}
+\chead{(ARCHIVED) Notes on CNV Methods}
+\rhead{}
+
+\title{(ARCHIVED) Notes on CNV Methods}
 
 \author{Mehrtash Babadi}
 \email{[email protected]}
@@ -100,10 +105,10 @@
 \email{[email protected]}
 \affiliation{Broad Institute, 75 Ames Street, Cambridge, MA 02142}
 
-\date{\today}
+\date{January 12, 2017}
 
 \begin{abstract}
-Some notes on current and proposed methods used in the GATK CNV and ACNV workflows.
+These notes describe methods that were implemented or proposed for various iterations of the GATK somatic and germline CNV pipelines.  The majority of the methods are deprecated and have been superseded by improved methods in GATK 4.0 onwards.  (Archived on \today.)
 \end{abstract}
 
 \maketitle
@@ -124,7 +129,7 @@ \section{Steps in the GATK CNV and ACNV Workflows} \label{recapseg-overview}
 
 \subsection{Coverage collection}
 
-\SL{Details of coverage collection go here.}  This is implemented by the GATK command-line tool \texttt{CalculateTargetCoverage}.
+This is implemented by the GATK command-line tool \texttt{CalculateTargetCoverage}.
 
 \subsection{Creation of a panel of normals}
 We cannot simply divide the coverage of each target by the average sequencing depth to obtain an estimate of its copy ratio.  This is because the coverage of different targets is heavily-biased by factors including the efficiency of their baits, GC content, and mappability.  In order to detect CNVs, we must determine these systematic effects on the coverage of each target in the absence of CNVs, which requires a panel of normal samples (PoN) that are representative of the sequencing conditions of the case sample.  PoN samples must also be created using the same baits as the case sample.
@@ -169,7 +174,7 @@ \subsection{Segmentation by tangent-normalized coverage}
 Finally, the tangent-normalized coverage vector is passed to CBS to obtain coverage segments.  This is implemented by the GATK tool \texttt{PerformSegmentation}.
 
 \subsection{Calling of events from coverage segments} \label{gatk-cnv-caller}
-\SL{Description of caller goes here.}  This is performed by the GATK tool \texttt{CallSegments}, which is the final step in the GATK CNV portion of the case-sample workflow.
+This is performed by the GATK tool \texttt{CallSegments}, which is the final step in the GATK CNV portion of the case-sample workflow.
 
 \subsection{Collection of allele counts at het sites}
 The first step in the GATK ACNV portion of the case-sample workflow is to gather the necessary allele-count data.  This procedure is implemented by the GATK tool \texttt{GetHetCoverage}.
@@ -301,8 +306,8 @@ \subsection{Detection of het sites using a Bayesian model} \label{bayesian-het-c
 
 \begin{figure}
 \center
-\includegraphics[scale=0.7]{figs/AlleleFractionPrior1.pdf}
-\includegraphics[scale=0.7]{figs/AlleleFractionPrior2.pdf}
+\includegraphics[scale=0.7]{figures/AlleleFractionPrior1.pdf}
+\includegraphics[scale=0.7]{figures/AlleleFractionPrior2.pdf}
 \caption{Two examples of the \REF~allele fraction prior $P(f|\mathsf{Het})$ at Het sites based on minimum/maximum non-germline cells and maximum copy number. The blue lines denote the continuous approximation given in Eq.~\eqref{eq:AFpriorcont}. The discontinuous orange lines denote the result with discrete copy number summation given in Eq.~ (the delta function peak at $f=1/2$ is not shown).}
 \label{fig:AFprior}
 \end{figure}
@@ -386,7 +391,7 @@ \subsection{Allelic model} \label{allelic-model}
 \begin{figure}
 $
 \begin{array}{c}
-\includegraphics[width=0.8\linewidth]{ACNV_model.png} 
+\includegraphics[width=0.8\linewidth]{figures/ACNV_model.png} 
 \end{array}
 $
 \label{graphical_model}
@@ -502,7 +507,7 @@ \subsection{Calling segments after allelic CNV workflow} \label{ACNV-caller}
 \begin{figure}
 $
 \begin{array}{c}
-\includegraphics[width=0.8\linewidth]{ACNV_caller_model.png} 
+\includegraphics[width=0.8\linewidth]{figures/ACNV_caller_model.png} 
 \end{array}
 $
 \label{acnv_caller_fig}
@@ -751,8 +756,8 @@ \subsubsection{The model}
 
 \begin{figure}
 \center
-\includegraphics[scale=0.7]{figs/{gauss_poisson_0.1}.pdf}
-\includegraphics[scale=0.7]{figs/{gauss_poisson_10}.pdf}
+\includegraphics[scale=0.7]{figures/{gauss_poisson_0.1}.pdf}
+\includegraphics[scale=0.7]{figures/{gauss_poisson_10}.pdf}
 \caption{Gaussian approximation to the Poisson likelihood (see Eq.~\ref{eq:gaussian_approx_coverage}). The left and right panels show $\mathrm{Poisson}(n|\alpha\,e^b)$ and $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$, respectively for $\alpha=0.1$ (top) and $\alpha=10.0$ (bottom). The black lines show $b = \ln(n/\alpha)$ the maximum likelihood bias estimate. The Gaussian approximation breaks down at $n=0$ (no coverage). It also slightly overestimates the variance at small $n$. Otherwise, it is an excellent approximation.}
 \label{fig:gaussian_approx_coverage}
 \end{figure}
@@ -1022,13 +1027,13 @@ \subsubsection{Results}
 \center
 $D=10$\\
 \vspace{10pt}
-\includegraphics[scale=0.45]{figs/comp_random_events_10.pdf}
+\includegraphics[scale=0.45]{figures/comp_random_events_10_compressed.pdf}
 \vspace{20pt}
-\includegraphics[scale=0.45]{figs/comp_corr_events_10.pdf}
+\includegraphics[scale=0.45]{figures/comp_corr_events_10_compressed.pdf}
 $D=20$\\
 \vspace{10pt}
-\includegraphics[scale=0.45]{figs/comp_random_events_20.pdf}
-\includegraphics[scale=0.45]{figs/comp_corr_events_20.pdf}
+\includegraphics[scale=0.45]{figures/comp_random_events_20_compressed.pdf}
+\includegraphics[scale=0.45]{figures/comp_corr_events_20_compressed.pdf}
 \caption{Comparison of PCA with the probabilistic coverage model in different modes. Top two rows: $D=10$; random events, correlated events. Bottom two rows: $D=20$; random events, correlated events.}
 \label{fig:comp_regularizer}
 \end{figure}
@@ -1224,6 +1229,8 @@ \section{Marginalizing out latent variables of the allelic model} \label{margina
 %
 Algorithm \ref{phi_calculation} shows the entire computation.
 
+See the ipython notebook \texttt{docs/CNV/allele-fraction-model-approximation.ipynb} for some plots that justify this approximation.
+
 \begin{algorithm}
 \begin{algorithmic}[1]
 \State $n = a + r$

diff --git a/docs/CNVs/target-coverage.tex → .../CNV/archived/archived-coverage-model.tex b/docs/CNVs/target-coverage.tex → .../CNV/archived/archived-coverage-model.tex
@@ -8,6 +8,7 @@
 \usepackage{color}
 \usepackage{mathrsfs}
 \usepackage{bm}
+\usepackage{fancyhdr}
 
 %Put an averaged random variable between brackets
 \newcommand{\ave}[1]{\left\langle #1 \right\rangle}
@@ -88,7 +89,12 @@
 
 \begin{document}
 
-\title{A probabilistic model for coverage bias estimation and CNV detection}
+\pagestyle{fancy}
+\lhead{}
+\chead{(ARCHIVED) A probabilistic model for coverage bias estimation and CNV detection}
+\rhead{}
+
+\title{(ARCHIVED) A probabilistic model for coverage bias estimation and CNV detection}
 
 \author{Mehrtash Babadi}
 \email{[email protected]}
@@ -102,10 +108,10 @@
 \email{[email protected]}
 \affiliation{Broad Institute, 75 Ames Street, Cambridge, MA 02142}
 
-\date{\today}
+\date{January 12, 2017}
 
 \begin{abstract}
-These notes exclusively cover the target coverage model in the GATK CNV pipeline.
+These notes describe the coverage model for a previous version of the current GATK \texttt{GermlineCNVCaller} pipeline.  This version of the pipeline (sometimes referred to elsewhere as ``gCNV Spark'') was developed primarily in the deprecated \texttt{gatk-protected} repository and was removed prior to the release of GATK 4.0.  Some of the material below is covered in less detail in the archived notes on CNV methods.  (Archived on \today.)
 \end{abstract}
 
 \maketitle
@@ -160,13 +166,13 @@ \section{The model}
 \label{eq:m_sigma_def}
 \end{align}
 
-% \begin{figure}
-% \center
-% \includegraphics[scale=0.7]{figs/{gauss_poisson_0.1}.pdf}
-% \includegraphics[scale=0.7]{figs/{gauss_poisson_10}.pdf}
-% \caption{Gaussian approximation to the Poisson likelihood (see Eq.~\ref{eq:gaussian_approx_coverage}). The left and right panels show $\mathrm{Poisson}(n|\alpha\,e^b)$ and $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$, respectively for $\alpha=0.1$ (top) and $\alpha=10.0$ (bottom). The black lines show $b = \ln(n/\alpha)$ the maximum likelihood bias estimate. The Gaussian approximation breaks down at $n=0$ (no coverage). It also slightly overestimates the variance at small $n$. Otherwise, it is an excellent approximation.}
-% \label{fig:gaussian_approx_coverage}
-% \end{figure}
+\begin{figure}
+\center
+\includegraphics[scale=0.7]{figures/{gauss_poisson_0.1}.pdf}
+\includegraphics[scale=0.7]{figures/{gauss_poisson_10}.pdf}
+\caption{Gaussian approximation to the Poisson likelihood (see Eq.~\ref{eq:gaussian_approx_coverage}). The left and right panels show $\mathrm{Poisson}(n|\alpha\,e^b)$ and $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$, respectively for $\alpha=0.1$ (top) and $\alpha=10.0$ (bottom). The black lines show $b = \ln(n/\alpha)$ the maximum likelihood bias estimate. The Gaussian approximation breaks down at $n=0$ (no coverage). It also slightly overestimates the variance at small $n$. Otherwise, it is an excellent approximation.}
+\label{fig:gaussian_approx_coverage}
+\end{figure}
 
 Note that $\Sigma_{st}$ can be thought of as the width of the distribution of $b_{st}$ about its maximum likelihood estimate such that in the limit $n_{st}, d_s \rightarrow \infty$, ${\rm Poisson}(n_{st} | d_s c_{st} e^{b_{st}}) \rightarrow \delta(b_{st} - b^*_{st})$ where $b^*_{st} = \lim_{n,d \rightarrow \infty} m_{st}$ is the true bias. The above approximation, while being excellent for well-covered targets (see Fig.~\ref{fig:gaussian_approx_coverage}), inevitably breaks down for targets that are uncovered {\em ex ante} in some samples, such as $Y$ chromosome targets in female samples. To this end, we define a ``sample-target mask matrix'' $\MM_{st}$ such that $\MM_{st} = 0$ if $\PP_{st} = 0$, and $\MM_{st} = 1$ if $\PP_{st} \neq 0$, and for each sample-target pair $(s,t)$, we only consider targets where the $\MM_{st} \neq 0$ in the joint likelihood function. The latter is thus written as:
 \begin{equation}
@@ -416,20 +422,20 @@ \section{Results}
 
 Fig.~\ref{fig:comp_regularizer} compares PCA denoising against our probabilistic model with different features turned on/off (ARD, CNV event regularization) for random and correlated events, respectively. It is clearly observed that the regularized model retains all of the events even when the number of latent features chosen is greater than the true number.
 
-% \begin{figure}
-% \center
-% $D=10$\\
-% \vspace{10pt}
-% \includegraphics[scale=0.45]{figs/comp_random_events_10.pdf}
-% \vspace{20pt}
-% \includegraphics[scale=0.45]{figs/comp_corr_events_10.pdf}
-% $D=20$\\
-% \vspace{10pt}
-% \includegraphics[scale=0.45]{figs/comp_random_events_20.pdf}
-% \includegraphics[scale=0.45]{figs/comp_corr_events_20.pdf}
-% \caption{Comparison of PCA with the probabilistic coverage model in different modes. Top two rows: $D=10$; random events, correlated events. Bottom two rows: $D=20$; random events, correlated events.}
-% \label{fig:comp_regularizer}
-% \end{figure}
+\begin{figure}
+\center
+$D=10$\\
+\vspace{10pt}
+\includegraphics[scale=0.45]{figures/comp_random_events_10_compressed.pdf}
+\vspace{20pt}
+\includegraphics[scale=0.45]{figures/comp_corr_events_10_compressed.pdf}
+$D=20$\\
+\vspace{10pt}
+\includegraphics[scale=0.45]{figures/comp_random_events_20_compressed.pdf}
+\includegraphics[scale=0.45]{figures/comp_corr_events_20_compressed.pdf}
+\caption{Comparison of PCA with the probabilistic coverage model in different modes. Top two rows: $D=10$; random events, correlated events. Bottom two rows: $D=20$; random events, correlated events.}
+\label{fig:comp_regularizer}
+\end{figure}
 
 
 \appendix
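
The figure caption restored in the hunk above states that $\mathrm{Poisson}(n|\alpha\,e^b)$ is well approximated by $n^{-1}\mathcal{N}(b|\ln(n/\alpha), n^{-1})$ except at $n=0$. The following Python sketch is one way to check this numerically at the maximum likelihood bias $b = \ln(n/\alpha)$. It is an illustration only: the function names are hypothetical and are not taken from GATK or from the archived notes, and the sketch assumes $n \geq 1$, since the approximation is undefined at $n=0$.

```python
import math


def poisson_pmf(n, lam):
    """Poisson probability of count n with mean lam, computed in log space."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


def gaussian_approx(n, alpha, b):
    """The approximation n^{-1} * N(b | ln(n/alpha), 1/n); requires n >= 1."""
    mean = math.log(n / alpha)
    var = 1.0 / n
    density = math.exp(-((b - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return density / n


def relative_error_at_peak(n, alpha):
    """Relative error of the approximation at the ML bias b = ln(n/alpha)."""
    b_star = math.log(n / alpha)
    exact = poisson_pmf(n, alpha * math.exp(b_star))
    approx = gaussian_approx(n, alpha, b_star)
    return abs(approx - exact) / exact


if __name__ == "__main__":
    # Illustrative counts only; alpha values match the two figure panels.
    for alpha in (0.1, 10.0):
        for n in (1, 5, 20, 100):
            print(alpha, n, relative_error_at_peak(n, alpha))
```

At the peak the Poisson mean equals $n$, so the error reduces to the error of Stirling's formula for $n!$ and shrinks roughly as $1/(12n)$, independently of $\alpha$. This checks only the peak height; the caption's remark about the variance at small $n$ would need a comparison away from the peak.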

diff --git a/docs/CNVs/ACNV_caller_model.png → ...NV/archived/figures/ACNV_caller_model.png b/docs/CNVs/ACNV_caller_model.png → ...NV/archived/figures/ACNV_caller_model.png
diff --git a/docs/CNVs/ACNV_model.png → docs/CNV/archived/figures/ACNV_model.png b/docs/CNVs/ACNV_model.png → docs/CNV/archived/figures/ACNV_model.png
diff --git a/docs/CNVs/figs/AlleleFractionPrior1.pdf → ...archived/figures/AlleleFractionPrior1.pdf b/docs/CNVs/figs/AlleleFractionPrior1.pdf → ...archived/figures/AlleleFractionPrior1.pdf
diff --git a/docs/CNVs/figs/AlleleFractionPrior2.pdf → ...archived/figures/AlleleFractionPrior2.pdf b/docs/CNVs/figs/AlleleFractionPrior2.pdf → ...archived/figures/AlleleFractionPrior2.pdf
diff --git a/docs/CNV/archived/figures/comp_corr_events_10_compressed.pdf b/docs/CNV/archived/figures/comp_corr_events_10_compressed.pdf
diff --git a/docs/CNV/archived/figures/comp_corr_events_20_compressed.pdf b/docs/CNV/archived/figures/comp_corr_events_20_compressed.pdf
diff --git a/docs/CNV/archived/figures/comp_random_events_10_compressed.pdf b/docs/CNV/archived/figures/comp_random_events_10_compressed.pdf
diff --git a/docs/CNV/archived/figures/comp_random_events_20_compressed.pdf b/docs/CNV/archived/figures/comp_random_events_20_compressed.pdf
diff --git a/docs/CNVs/figs/gauss_poisson_0.1.pdf → ...NV/archived/figures/gauss_poisson_0.1.pdf b/docs/CNVs/figs/gauss_poisson_0.1.pdf → ...NV/archived/figures/gauss_poisson_0.1.pdf
diff --git a/docs/CNVs/figs/gauss_poisson_10.pdf → ...CNV/archived/figures/gauss_poisson_10.pdf b/docs/CNVs/figs/gauss_poisson_10.pdf → ...CNV/archived/figures/gauss_poisson_10.pdf
diff --git a/docs/CNV/figures/germline-cnv-caller-model/benchmark.pdf b/docs/CNV/figures/germline-cnv-caller-model/benchmark.pdf
diff --git a/docs/CNV/figures/germline-cnv-caller-model/denoising_component.pdf b/docs/CNV/figures/germline-cnv-caller-model/denoising_component.pdf
diff --git a/docs/CNV/figures/germline-cnv-caller-model/full_model.pdf b/docs/CNV/figures/germline-cnv-caller-model/full_model.pdf
diff --git a/docs/CNV/figures/germline-cnv-caller-model/hhmm.pdf b/docs/CNV/figures/germline-cnv-caller-model/hhmm.pdf