Merge pull request #25 from gibsramen/init-release
Initial release - big documentation upgrades
gibsramen authored Apr 5, 2021
2 parents ebe0d4b + 4b41bfb commit 5dc78b0
Showing 14 changed files with 654 additions and 20 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -14,11 +14,12 @@ Currently BIRDMAn requires Python 3.7 or higher.
```bash
conda install -c conda-forge dask biom-format patsy xarray
pip install cmdstanpy
pip install git+git://github.com/arviz-devs/arviz.git
git clone git@github.com:gibsramen/BIRDMAn.git
cd BIRDMAn
pip install .
```

You have to install `cmdstan` through the `cmdstanpy.install_cmdstan` function first.
You have to install `cmdstan` through the `cmdstanpy.install_cmdstan` function first. See the [CmdStanPy documentation](https://mc-stan.org/cmdstanpy/installation.html#install-cmdstan) for details.
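
For example, a minimal sketch of that one-time setup step from Python (see the CmdStanPy docs linked above for options):

```python
import cmdstanpy

# Downloads and builds the CmdStan toolchain that BIRDMAn's Stan models compile against.
cmdstanpy.install_cmdstan()
```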

See the [documentation](https://birdman.readthedocs.io/en/latest/?badge=latest) for details on usage.
15 changes: 6 additions & 9 deletions birdman/visualization.py
@@ -7,7 +7,7 @@
def plot_parameter_estimates(
inference_object: az.InferenceData,
parameter: str,
coord: dict = dict(),
coords: dict = dict(),
num_std: float = 1.0
):
"""Plot credible intervals of estimated parameters.
@@ -18,8 +18,8 @@ def plot_parameter_estimates(
:param parameter: Name of parameter to plot
:type parameter: str
:param coord: Coordinates of parameter to plot
:type coord: dict, optional
:param coords: Coordinates of parameter to plot
:type coords: dict, optional
:param num_std: Number of standard deviations to plot as error bars,
defaults to 1.0
@@ -28,14 +28,14 @@
:returns: matplotlib axes figure
"""
posterior = inference_object.posterior
if len(posterior[parameter].coords) > 3 and not coord:
if len(posterior[parameter].coords) > 3 and not coords:
raise ValueError(
"Must provide coordinates if plotting multi-dimensional parameter"
" estimates!"
)

param_means = posterior[parameter].sel(**coord).mean(["chain", "draw"])
param_stds = posterior[parameter].sel(**coord).std(["chain", "draw"])
param_means = posterior[parameter].sel(**coords).mean(["chain", "draw"])
param_stds = posterior[parameter].sel(**coords).std(["chain", "draw"])
sort_indices = param_means.argsort().data
param_means = param_means.data[sort_indices]
param_stds = param_stds.data[sort_indices]
@@ -45,9 +45,6 @@
ax.errorbar(x=x, y=param_means, yerr=param_stds*num_std, zorder=0)
ax.scatter(x=x, y=param_means, color="black", zorder=2, s=5)

ax.set_xlabel("Feature")
ax.set_ylabel("Differential")

return ax


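A hedged usage sketch of the renamed `coords` argument (the parameter name `"beta"`, the `"covariate"` dimension, and the file name are placeholders, not guaranteed to match a real BIRDMAn model):

```python
import arviz as az
from birdman.visualization import plot_parameter_estimates

# Hypothetical InferenceData saved from an earlier BIRDMAn fit.
inference = az.from_netcdf("results.nc")

# `coords` selects one slice of a multi-dimensional parameter before plotting.
ax = plot_parameter_estimates(
    inference,
    parameter="beta",
    coords={"covariate": "diet[T.high_fat]"},
    num_std=1.0,
)
ax.figure.savefig("beta_estimates.png")
```
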
48 changes: 48 additions & 0 deletions docs/bayesian_inference.rst
@@ -0,0 +1,48 @@
Bayesian inference
==================

Bayesian inference is a technique to learn about a given population through Bayes' Theorem. Given data, :math:`D`, and parameter(s), :math:`\theta`, Bayes' Theorem applies as follows:

.. math::

   P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}

The component parts of this formula are usually described with the following terms:

* :math:`P(\theta)`: *prior* - our "current" knowledge of :math:`\theta` independent of our data
* :math:`P(D|\theta)`: *likelihood* - how likely is the data given our knowledge of the parameters?
* :math:`P(\theta|D)`: *posterior* - the "updated" knowledge of :math:`\theta` after seeing the data
* :math:`P(D)`: *evidence* - the probability of the data over all values of :math:`\theta`

Our goal is to home in on values of :math:`\theta` that best fit the data according to some statistical model we specify. Importantly, we do so by providing both the *data generative distribution* and the *prior distribution* of the parameters.
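
As a toy illustration (not the model BIRDMAn actually fits), one could pair a Binomial data generative distribution for :math:`y` successes out of :math:`n` trials with a Beta prior on the success probability:

.. math::

   y \mid \theta \sim \mathrm{Binomial}(n, \theta), \qquad \theta \sim \mathrm{Beta}(\alpha, \beta)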

One of the biggest hurdles to Bayesian inference is that computing :math:`P(\theta|D)` is hard and more often than not unattainable via analytic methods. Instead, we can draw enough *samples* from the posterior distribution to get a reasonable estimate. We can do this because the posterior distribution is proportional to the product of the likelihood and the prior. Notice how in Bayes' Theorem the evidence is not a function of :math:`\theta` (our value of interest). Computing the evidence is infeasible, especially for 'omics datasets, so it is fortunate that we do not need to do so.
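
In symbols, dropping the evidence leaves a quantity we can evaluate up to a normalizing constant:

.. math::

   P(\theta|D) \propto P(D|\theta)\,P(\theta)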

To address this, scientists often use Markov Chain Monte Carlo (MCMC) methods. In a nutshell, MCMC draws dependent samples whose distribution converges to the posterior, giving us usable estimates of the parameters of interest.

Markov Chain Monte Carlo
------------------------

MCMC is the combination of two techniques: Markov chains and Monte Carlo sampling. These two individual components have their own independent uses, but putting them together has been a revolutionary development in statistics and modeling. We will briefly describe each part of this procedure.

Markov Chain
^^^^^^^^^^^^

Markov Chains describe stochastic processes of transitions between a set of states. Importantly, the probability of moving to the next state depends *only* on the current state (a property often referred to as "memorylessness"). This means that the previous steps in the chain do not matter for the purposes of any particular step. Markov Chains are very useful in predictive modeling, for example when predicting words in a sentence.
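
As a quick sketch unrelated to BIRDMAn itself, simulating a made-up two-state chain shows the memoryless property in action; each step looks only at the current state:

.. code-block:: python

   import numpy as np

   # Hypothetical transition matrix: state 0 = "sunny", state 1 = "rainy".
   # transition[i, j] is the probability of moving from state i to state j.
   transition = np.array([[0.9, 0.1],
                          [0.5, 0.5]])

   rng = np.random.default_rng(42)
   state = 0
   states = [state]
   for _ in range(10):
       # The next state depends only on the current state.
       state = rng.choice(2, p=transition[state])
       states.append(state)
   print(states)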

Monte Carlo
^^^^^^^^^^^

Monte Carlo sampling is a technique for approximating quantities of interest through repeated random sampling, often in cases where computing them analytically is not possible. This is regularly the case in high-dimensional settings, where the analytical solution involves multiple (often intractable) integrals. Instead, we can draw samples from the distribution and use them to approximate the value or values of interest.

For a good demonstration of Monte Carlo simulation, see the classic `Buffon's Needle <https://simonensemble.github.io/2018-04/buffon>`_ example, which uses random sampling to estimate :math:`\pi`.
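
As a minimal sketch (a dartboard-style estimate rather than Buffon's needle itself), we can approximate :math:`\pi` by sampling points uniformly in the unit square and counting how many land inside the quarter circle:

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   n = 1_000_000
   # Sample points uniformly in the unit square.
   x, y = rng.random(n), rng.random(n)
   # The fraction inside the quarter circle approximates pi / 4.
   pi_estimate = 4 * np.mean(x ** 2 + y ** 2 <= 1)
   print(pi_estimate)  # roughly 3.14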

Sampling from the posterior
---------------------------

Combining these two procedures, MCMC uses multiple independent Markov Chains to sample from the posterior distribution. There are many algorithms to perform this sampling, but the one we use in BIRDMAn is the No-U-Turn Sampler (NUTS) implemented in Stan. When sampling, the individual Markov Chains first "warm up" to reach a stationary distribution. This is important: if the chains are not sufficiently warmed up, they may not converge and the resulting parameter estimates will be unreliable.
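
As a rough illustration only (BIRDMAn relies on Stan's NUTS sampler, which is considerably more sophisticated), a random-walk Metropolis sampler for a single parameter might look like the following, where ``log_posterior`` is a hypothetical function returning the unnormalized log posterior:

.. code-block:: python

   import numpy as np

   def log_posterior(theta):
       # Hypothetical unnormalized log posterior: a standard normal target.
       return -0.5 * theta ** 2

   rng = np.random.default_rng(0)
   theta = 0.0
   samples = []
   for _ in range(5000):
       proposal = theta + rng.normal(scale=0.5)
       # Accept with probability min(1, posterior ratio).
       if np.log(rng.random()) < log_posterior(proposal) - log_posterior(theta):
           theta = proposal
       samples.append(theta)
   # Discard the warm-up draws before summarizing.
   print(np.mean(samples[1000:]), np.std(samples[1000:]))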

Resources
---------

For a good overview of Bayesian inference, see `this <https://towardsdatascience.com/bayesian-inference-problem-mcmc-and-variational-inference-25a8aa9bce29>`_ article.
