Improve documentation on InferenceData.map #1252

Closed
kyleabeauchamp opened this issue Jun 20, 2020 · 12 comments

@kyleabeauchamp
Contributor

kyleabeauchamp commented Jun 20, 2020

  1. Is InferenceData.map always done elementwise or is some sort of reduction applied?
  2. What is the function signature (particularly input shape and output shape) expected for a function passed to map? Currently, the docs just say "fun: callable / Function to be applied to each group."

For example, suppose you want to evaluate a new function of the model parameters (iterated for each draw in each chain). Often, one might use pymc3.Deterministic to add those quantities to the compute graph during model sampling, but I imagine that InferenceData.map might be a nice way to run such calculations without having to build them into the original compute graph at training time. I think a couple of examples and some guardrails on what fun ought to be would improve the usability here.

@kyleabeauchamp
Contributor Author

In particular, I have two concrete examples of what you might want to do with map:

  1. Suppose you have a model that uniformly samples (x, y) points in the unit square. A nice application of map might be to transform the posterior samples into polar coordinates, giving you (r, theta).
  2. Suppose you have a Bayesian regression model. Another application of map might be to apply a different scoring function (e.g., https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) to the samples in your posterior.

@kyleabeauchamp
Contributor Author

Here's my current working code for what I'm trying to do. Even after reading #1255, I'm still not sure if that falls under map, another arviz function that needs to be written, or possibly just code that belongs in user-space:

def runner(posterior, func):
    """Apply func to every (chain, draw) slice of the posterior Dataset."""
    results = {}
    n_chains = posterior.dims["chain"]
    n_draws = posterior.dims["draw"]
    for chain in range(n_chains):
        results[chain] = {}
        for draw in range(n_draws):
            point = posterior.isel(chain=chain, draw=draw)
            chunk = func(point)
            results[chain][draw] = chunk
    return results

# TODO: infer shape of chunk and create output xarray object
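One way that TODO might be resolved, sketched here as an assumption rather than tested code: if each chunk returned by func is an xarray object with identical dims and coords, the pieces can be stacked back along new "draw" and "chain" dimensions with xr.concat.

import xarray as xr

def assemble(results):
    # Hypothetical helper, not part of the snippet above: stack the per-draw
    # chunks into a single object with explicit "chain" and "draw" dimensions.
    # Assumes every chunk is an xarray Dataset/DataArray with matching dims.
    per_chain = [
        xr.concat(list(draws.values()), dim="draw")
        for draws in results.values()
    ]
    return xr.concat(per_chain, dim="chain")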

@OriolAbril
Member

I have extended the docs a little. map basically loops over groups to apply the function to all of them (the same way that xarray.Dataset.map loops over data variables, which is why there is no xarray.DataArray.map). This means that data from other groups is not available unless manually passed as an argument or kwarg.

I think map would be a perfect fit for 1. if the transformation had to be done to both prior and posterior; if it is a posterior-only transformation it can still be used and may be more convenient, but would be equivalent to the following:

# posterior-only transformation: the next two lines are equivalent
idata.posterior = func(idata.posterior)
idata = idata.map(func, groups="posterior")
# to apply func to both prior and posterior we can use the latent_vars metagroup
idata = idata.map(func, groups="latent_vars")

Metagroups are still quite experimental, so any comments are very welcome.

I think 2. requires access to both observed_data and posterior predictive? I think it diverges from the groupwise case of map. Having a concrete example is super useful though, so I am linking #1066 here; if you have time to create a minimal example of this, it would surely help in developing inter-group operation capabilities.

@ahartikainen
Contributor

I have been thinking about this too lately.

We would basically need to create a function, then have some simple way to combine InferenceData groups + external data, where the function arguments could be inferred or manually defined.

Something similar to the wrap_xarray_ufunc we already have, but more user friendly, with some predetermined options.

And this would pull out the wanted output.
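As an illustration only (none of these names exist in arviz; this is just a sketch of the kind of wrapper being described), something along these lines:

import xarray as xr

def apply_metric(idata, func, group, var_name, external, core_dim):
    # Hypothetical helper, not an arviz API: pull one variable from an
    # InferenceData group, pair it with external data along core_dim, and let
    # xarray vectorize func over the remaining dims (chain, draw).
    # `external` is assumed to be a DataArray carrying the core_dim dimension.
    data = getattr(idata, group)[var_name]
    return xr.apply_ufunc(
        func,
        external,
        data,
        input_core_dims=[[core_dim], [core_dim]],
        vectorize=True,
    )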

@kyleabeauchamp
Contributor Author

kyleabeauchamp commented Jun 20, 2020

OK, I put together a really simple example of (1), which after your suggestions seems to be working well:

import pymc3 as pm
import numpy as np


n_points = 1000
n_samples = 100
chains = 2

with pm.Model():
    x = pm.Uniform("x", 0.0, 1.0, shape=n_points)
    y = pm.Uniform("y", 0.0, 1.0, shape=n_points)

    tr = pm.sample(n_samples, chains=chains, return_inferencedata=True)

# Do some extra calculations after sampling.
distance = tr.map(
    lambda v, x0, y0: ((v.x - x0) ** 2 + (v.y - y0) ** 2) ** 0.5,
    groups="latent_vars",
    x0=0.0,
    y0=0.0,
).posterior
print(np.linalg.norm(tr.posterior.isel(chain=0, draw=0, x_dim_0=0, y_dim_0=0).to_array()))
print(distance.isel(chain=0, draw=0, x_dim_0=0, y_dim_0=0))

I'll try to think about example (2) now.

@kyleabeauchamp
Contributor Author

kyleabeauchamp commented Jun 20, 2020

Here's my code for example (2). I can't get the sklearn <> map combination working because of the need for an implicit broadcast / aggregation across one dimension. I think making that commented-out line work in an easily comprehensible way might be a win for usability.

import pymc3 as pm
import numpy as np
import sklearn.metrics


n_features, n_observations = 10, 500
n_samples = 100
chains = 2

x = np.random.normal(size=(n_features, n_observations))
epsilon = 0.1 * np.random.normal(size=n_observations)
a0 = np.eye(n_features)[3] * 10.0  # Only 4th feature active
b0 = 25.0
y0 = x.T.dot(a0) + b0 + epsilon


with pm.Model():
    a = pm.Uniform("a", -100, 100, shape=n_features)
    b = pm.Uniform("b", -100, 100)
    sigma = pm.Uniform("sigma", 0, 100)

    mu = pm.Deterministic("mu", pm.math.dot(x.T, a) + b)

    y = pm.Normal("y", mu, sigma, shape=n_observations, observed=y0)

    tr = pm.sample(n_samples, chains=chains, return_inferencedata=True)

# After sampling, calculate additional metrics via map operation on InferenceData
# Try two ways: via sklearn and manual
# sklearn currently not working due to ValueError: Found input variables with inconsistent numbers of samples: [500, 2]
#mse0 = tr.map(lambda x, y0: sklearn.metrics.mean_squared_error(y0, x.mu.values), groups="latent_vars", y0=y0).posterior
mse1 = tr.map(
    lambda x, y0: ((y0 - x.mu) ** 2).mean("mu_dim_0"),
    groups="latent_vars",
    y0=y0,
).posterior

# Compare to manual MSE calculated on first chain and draw
#print(mse0.isel(chain=0, draw=0))
print(mse1.isel(chain=0, draw=0))
print(sklearn.metrics.mean_squared_error(y0, tr.posterior.mu.isel(chain=0, draw=0).values))

@kyleabeauchamp
Contributor Author

kyleabeauchamp commented Jun 20, 2020

PS: the reason I'm pushing for this functionality is that I think it would be great to be able to have a clean separation between "model training" and "model evaluation", where we use the InferenceData interface to do model evaluation (using available metrics in, e.g., sklearn) after the model was sampled.

@OriolAbril
Member

map seems like overkill for these two cases in my opinion. There is not a groupwise component to these calculations.

I'd recommend reading the updated radon notebook from pymc-devs/pymc#3963 to see examples of postprocessing calculations; you can access the different variables as idata.posterior.x and do anything you'd do with xarray objects (whose API is more or less the same as numpy's).

In fact, in the second example, using xarray should take care of all the broadcasting; the issue is that y0 is a numpy array, so xarray cannot broadcast automatically. Using tr.observed_data.y instead will solve the issue if the dims are all annotated.
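For what it's worth, a minimal sketch of that suggestion, assuming the model annotates a shared observation dimension (say "obs") for both y and mu via coords/dims, so xarray can align them by name:

# Sketch only: assumes y and mu were declared with dims=("obs",) in the model,
# so observed_data.y and posterior.mu share the "obs" dimension.
y_obs = tr.observed_data.y
mse = ((y_obs - tr.posterior.mu) ** 2).mean("obs")  # one MSE per (chain, draw)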

@kyleabeauchamp
Contributor Author

Thanks, the new features (coords in pm.Model() and pm.Data) seem extremely useful for ensuring good bookkeeping and reproducibility in these situations.

I still don't actually see how to solve the issue in (2). Basically, the issue there is that I specifically want to avoid re-writing the metric and instead leverage a unit-tested, open-source version from sklearn. To me, the only way to use that existing code is through some sort of "coordinate-aware" function like map, apply, agg, or reduce, where the user specifies which axes are iterated over and which ones are "presented to func".

@OriolAbril
Member

Oh, my bad, I assumed the sklearn function would work; xarray preserves coords and broadcasts automatically with most numpy functions. It looks like you'll have to use xarray.apply_ufunc or arviz.wrap_xarray_ufunc. I am afraid both functions need some improvements, but I hope they are clear enough; if not, maybe this SO answer can help, and in some cases you may need to call xr.broadcast before passing the datasets to apply_ufunc, see pydata/xarray#3032.

@AlexAndorra
Contributor

Closing, as the doc improvement has been merged and released in 0.9.0, but feel free to reopen if needed 😉

@kyleabeauchamp
Contributor Author

kyleabeauchamp commented Jul 5, 2020

FYI, I was able to figure out my use case combining arviz + xarray + sklearn metrics. I spent several hours staring at the docs for wrap_xarray_ufunc and eventually was able to identify the following pattern:

xr.apply_ufunc(f, y, yhat, input_core_dims=[[dim], [dim]], vectorize=True)
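To make the pattern concrete, here is a hedged expansion assuming the regression example above (observed y0 and posterior mu with dimension "mu_dim_0"):

import xarray as xr
import sklearn.metrics

# Sketch only: wrap the observed values so they carry the same core dim name,
# then let apply_ufunc vectorize the sklearn metric over (chain, draw).
y_da = xr.DataArray(y0, dims=["mu_dim_0"])
mse = xr.apply_ufunc(
    sklearn.metrics.mean_squared_error,
    y_da,
    tr.posterior.mu,
    input_core_dims=[["mu_dim_0"], ["mu_dim_0"]],
    vectorize=True,
)
# mse now has dims (chain, draw)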
