Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial release - big documentation upgrades #25

Merged
merged 8 commits into from
Apr 5, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,11 +14,12 @@ Currently BIRDMAn requires Python 3.7 or higher.
```bash
conda install -c conda-forge dask biom-format patsy xarray
pip install cmdstanpy
pip install git+git://github.com/arviz-devs/arviz.git
git clone [email protected]:gibsramen/BIRDMAn.git
cd BIRDMAn
pip install .
```

You have to install `cmdstan` through the `cmdstanpy.install_cmdstan` function first.
You have to install `cmdstan` through the `cmdstanpy.install_cmdstan` function first. See the [CmdStanPy documentation](https://mc-stan.org/cmdstanpy/installation.html#install-cmdstan) for details.

See the [documentation](https://birdman.readthedocs.io/en/latest/?badge=latest) for details on usage.
15 changes: 6 additions & 9 deletions birdman/visualization.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
def plot_parameter_estimates(
inference_object: az.InferenceData,
parameter: str,
coord: dict = dict(),
coords: dict = dict(),
num_std: float = 1.0
):
"""Plot credible intervals of estimated parameters.
Expand All @@ -18,8 +18,8 @@ def plot_parameter_estimates(
:param parameter: Name of parameter to plot
:type parameter: str

:param coord: Coordinates of parameter to plot
:type coord: dict, optional
:param coords: Coordinates of parameter to plot
:type coords: dict, optional

:param num_std: Number of standard deviations to plot as error bars,
defaults to 1.0
Expand All @@ -28,14 +28,14 @@ def plot_parameter_estimates(
:returns: matplotlib axes figure
"""
posterior = inference_object.posterior
if len(posterior[parameter].coords) > 3 and not coord:
if len(posterior[parameter].coords) > 3 and not coords:
raise ValueError(
"Must provide coordinates if plotting multi-dimensional parameter"
" estimates!"
)

param_means = posterior[parameter].sel(**coord).mean(["chain", "draw"])
param_stds = posterior[parameter].sel(**coord).std(["chain", "draw"])
param_means = posterior[parameter].sel(**coords).mean(["chain", "draw"])
param_stds = posterior[parameter].sel(**coords).std(["chain", "draw"])
sort_indices = param_means.argsort().data
param_means = param_means.data[sort_indices]
param_stds = param_stds.data[sort_indices]
Expand All @@ -45,9 +45,6 @@ def plot_parameter_estimates(
ax.errorbar(x=x, y=param_means, yerr=param_stds*num_std, zorder=0)
ax.scatter(x=x, y=param_means, color="black", zorder=2, s=5)

ax.set_xlabel("Feature")
ax.set_ylabel("Differential")

return ax


Expand Down
48 changes: 48 additions & 0 deletions docs/bayesian_inference.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
Bayesian inference
==================

Bayesian inference is a technique to learn about a given population through Bayes' Theorem. Given data, :math:`D`, and parameter(s), :math:`\theta`, Bayes' Theorem applies as follows:

.. math::

P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}

The component parts of this formula are usually described with the following terms:

* :math:`P(\theta)`: *prior* - our "current" knowledge of :math:`\theta` independent of our data
* :math:`P(D|\theta)`: *likelihood* - how likely is the data given our knowledge of the parameters?
* :math:`P(\theta|D)`: *posterior* - the "updated" knowledge of :math:`\theta` after seeing the data
* :math:`P(D)`: *evidence* - the probability of the data over all values of :math:`\theta`

Our goal is to hone in on values of :math:`\theta` that best fit the data according to some statistical model we specify. Importantly, we do so by providing both the *data generative distribution* and the *prior distribution* of the parameters.

One of the biggest hurdles to Bayesian inference is that computing :math:`P(\theta|D)` is hard and more often than not unattainable via analytic methods. Instead, we can *sample* from the posterior distribution enough to get a reasonable estimate. We can do this because the posterior distribution is proportional to the product of the likelihood and the prior. Notice how in Bayes' Theorem the evidence is not a function of :math:`\theta` (our value of interest). Computing the evidence is infeasible, especially for 'omics datasets, so it is fortunate that we do not need to do so.

To address this, scientists often use Markov Chain Monte Carlo (MCMC) methods. In a nutshell, MCMC uses dependent sampling of a distribution to attempt convergence around some parameter estimate.

Markov Chain Monte Carlo
------------------------

MCMC is the harmonious convergence of two techniques: Markov Chain and Monte Carlo sampling. These two individual components have their own independent uses, but putting them together has been a revolutionary development in statistics and modeling. We will briefly describe each part of this procedure.

Markov Chain
^^^^^^^^^^^^

Markov Chains describe stochastic processes of transitions between a set of states. Importantly, the probability of moving to the next state is *only* dependent on the current state (often referred to as "memoryless"). This means that the previous steps in the chain do not matter for the purposes of any particular step. Markov Chains are very useful in predictive modeling, for example when predicting words in a sentence.

Monte Carlo
^^^^^^^^^^^

Monte Carlo sampling is a technique for sampling from a distribution, often in cases where computing the underlying distribution analytically is not possible. This is regularly the case in high-dimensional space where the analytical solution comprises multiple integrals. Instead, we can sample from this distribution to approximate a given value or values.

For a good explanation of Monte Carlo simulation, see the classic `Buffon's Needle <https://simonensemble.github.io/2018-04/buffon>`_ example for a good demonstration of Monte Carlo simulation for estimating :math:`\pi`.

Sampling from the posterior
---------------------------

Combining these two procedures, MCMC uses multiple independent Markov Chains to sample from the posterior distribution. There are many algorithms to perform this sampling but the one we use in BIRDMAn is the No-U-Turn-Sampler (NUTS) that is used in Stan. When sampling, the individual Markov Chains first "warm up" to reach a stationary distribution. This is important as if the chains are not sufficiently warned, the chains may not converge properly and they will not estimate the parameters of interest properly.

Resources
---------

For a good overview of Bayesian inference, see `this <https://towardsdatascience.com/bayesian-inference-problem-mcmc-and-variational-inference-25a8aa9bce29>`_ article.
Loading