
Blobs example a bit confusing #352

Closed
steven-murray opened this issue Aug 4, 2020 · 6 comments · Fixed by #464

Comments

@steven-murray

General information:

  • emcee version: 3.0.2
  • platform: Linux
  • installation method (pip/conda/source/other?): pip

Problem description:

Expected behavior:

There are a few things in the docs that confuse me, and one thing that (if I understand correctly) could be improved in the code. The first confusing thing is this example on the "Blobs" page:

def log_prob(params):
    lp = log_prior(params)
    if not np.isfinite(lp):
        return -np.inf, -np.inf
    ll = log_like(params)
    if not np.isfinite(ll):
        return lp, -np.inf
    return lp + ll, lp

This looks like it's returning lp as the log-likelihood when ll is infinite. That doesn't seem right... if ll is infinite, then shouldn't the returned probability be -np.inf? Are the returns backwards?

Now the next example, for "named blobs and custom dtypes" has this log_prob:

def log_prob(params):
    lp = log_prior(params)
    if not np.isfinite(lp):
        return -np.inf, -np.inf
    ll = log_like(params)
    if not np.isfinite(ll):
        return lp, -np.inf
    return lp + ll, lp, np.mean(params)

Again, this has the same seemingly backwards return when ll is not finite. But it also has a different number of return values when either lp or ll is infinite (2 instead of 3). Is this really allowed? If so, great (though see below), but then I think the docs should make clear that this is really allowed and encouraged.

However, in my own application, I have a log_prob that has dynamic blobs (I let the user specify a bunch of keys that they want to pull out of an object to store as blobs). This works perfectly fine, except that when I do a similar if not np.isfinite(ll): return -np.inf, -np.inf, emcee complains:

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/moves/red_blue.py in propose(self, model, state)
     91 
     92             # Compute the lnprobs of the proposed position.
---> 93             new_log_probs, new_blobs = model.compute_log_prob_fn(q)
     94 
     95             # Loop over the walkers and update them accordingly.

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/ensemble.py in compute_log_prob(self, coords)
    443                 except ValueError:
    444                     dt = np.dtype("object")
--> 445             blob = np.array(blob, dtype=dt)
    446 
    447             # Deal with single blobs properly

ValueError: could not assign tuple of length 1 to structure with 3 fields.

Minimal Example(s)

Let data = np.random.normal(loc=0.2, scale=0.3, size=25) and init_pos = np.array([0.2, 0.3]) + 1e-4 * np.random.normal(size=(100, 2))

Example 1:

def log_prob(p, data):
    mu, sig = p

    # "prior"
    if not 0 < mu < 1:
        return -np.inf, -np.inf

    ll = np.sum(-(data - mu)**2 / (2 * sig**2)) - sig

    return ll, mu*sig, mu+sig

This is almost the same as the docs example -- returning two blobs, but only one blob when returning a -np.inf prior. Set up the sampler with

sampler = emcee.EnsembleSampler(
    nwalkers=100,
    ndim=2,
    log_prob_fn=log_prob,
    args=(data,)
)

It exits with

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-c79fbf8d39e5> in <module>
----> 1 sampler.run_mcmc(init_pos, nsteps=1000, progress=True)

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/ensemble.py in run_mcmc(self, initial_state, nsteps, **kwargs)
    382 
    383         results = None
--> 384         for results in self.sample(initial_state, iterations=nsteps, **kwargs):
    385             pass
    386 

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/ensemble.py in sample(self, initial_state, log_prob0, rstate0, blobs0, iterations, tune, skip_initial_state_check, thin_by, thin, store, progress)
    341 
    342                     # Propose
--> 343                     state, accepted = move.propose(model, state)
    344                     state.random_state = self.random_state
    345 

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/moves/red_blue.py in propose(self, model, state)
     91 
     92             # Compute the lnprobs of the proposed position.
---> 93             new_log_probs, new_blobs = model.compute_log_prob_fn(q)
     94 
     95             # Loop over the walkers and update them accordingly.

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/ensemble.py in compute_log_prob(self, coords)
    443                 except ValueError:
    444                     dt = np.dtype("object")
--> 445             blob = np.array(blob, dtype=dt)
    446 
    447             # Deal with single blobs properly

ValueError: setting an array element with a sequence.

Example 2:
New log_prob where one of the blobs is an array:

def log_prob(p, data):
    mu, sig = p

    # "prior"
    if not 0 < mu < 1:
        return -np.inf, -np.inf

    ll = np.sum(-(data - mu)**2 / (2 * sig**2)) - sig

    return ll, mu*sig, mu+sig, (data - mu)/sig

Exits with

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-c79fbf8d39e5> in <module>
----> 1 sampler.run_mcmc(init_pos, nsteps=1000, progress=True)

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/ensemble.py in run_mcmc(self, initial_state, nsteps, **kwargs)
    382 
    383         results = None
--> 384         for results in self.sample(initial_state, iterations=nsteps, **kwargs):
    385             pass
    386 

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/ensemble.py in sample(self, initial_state, log_prob0, rstate0, blobs0, iterations, tune, skip_initial_state_check, thin_by, thin, store, progress)
    341 
    342                     # Propose
--> 343                     state, accepted = move.propose(model, state)
    344                     state.random_state = self.random_state
    345 

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/moves/red_blue.py in propose(self, model, state)
    102 
    103             new_state = State(q, log_prob=new_log_probs, blobs=new_blobs)
--> 104             state = self.update(state, new_state, accepted, S1)
    105 
    106         return state, accepted

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/moves/move.py in update(self, old_state, new_state, accepted, subset)
     41                     "blobs at that position."
     42                 )
---> 43             old_state.blobs[m1] = new_state.blobs[m2]
     44 
     45         return old_state

ValueError: shape mismatch: value array of shape (41,) could not be broadcast to indexing result of shape (41,3)

Example 3:
Using the same log prob as example 2, but with a HDF5 backend, and sampler defined as:

backend = emcee.backends.HDFBackend("backend.h5")
sampler = emcee.EnsembleSampler(
    nwalkers=100,
    ndim=2,
    log_prob_fn=log_prob,
    args=(data,),
    backend=backend,
    blobs_dtype=[('mu*sig', float), ('mu+sig', float), ('z', (float, 25))]
)

Exits with

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-43-c79fbf8d39e5> in <module>
----> 1 sampler.run_mcmc(init_pos, nsteps=1000, progress=True)

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/ensemble.py in run_mcmc(self, initial_state, nsteps, **kwargs)
    382 
    383         results = None
--> 384         for results in self.sample(initial_state, iterations=nsteps, **kwargs):
    385             pass
    386 

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/ensemble.py in sample(self, initial_state, log_prob0, rstate0, blobs0, iterations, tune, skip_initial_state_check, thin_by, thin, store, progress)
    341 
    342                     # Propose
--> 343                     state, accepted = move.propose(model, state)
    344                     state.random_state = self.random_state
    345 

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/moves/red_blue.py in propose(self, model, state)
     91 
     92             # Compute the lnprobs of the proposed position.
---> 93             new_log_probs, new_blobs = model.compute_log_prob_fn(q)
     94 
     95             # Loop over the walkers and update them accordingly.

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/ensemble.py in compute_log_prob(self, coords)
    443                 except ValueError:
    444                     dt = np.dtype("object")
--> 445             blob = np.array(blob, dtype=dt)
    446 
    447             # Deal with single blobs properly

ValueError: could not assign tuple of length 2 to structure with 3 fields.

If I explicitly return -np.inf, -np.inf, -np.inf, np.zeros(25) when the prior is infinite, then it works fine. However, returning an array of zeros of length 1 but with the correct dtype does not work.

This makes it kind of hard to dynamically return blobs. What I could do is create the relevant blob dtype, then unpack it as a tuple, add the likelihood to the front of the tuple, and return that. In this case, I'd have to write the following:

def log_prob(p, data, blobs_dtype):
    mu, sig = p

    # "prior"
    if not 0 < mu < 1:
        ll = -np.inf
        blobs = np.zeros(1, dtype=blobs_dtype)
        return (ll,) + tuple(blobs[name] for name in blobs.dtype.names)

    ll = np.sum(-(data - mu)**2 / (2 * sig**2)) - sig

    return ll, mu*sig, mu+sig, (data - mu)/sig

backend = emcee.backends.HDFBackend("backend.h5")
sampler = emcee.EnsembleSampler(
    nwalkers=100,
    ndim=2,
    log_prob_fn=log_prob,
    args=(data, blobs_dtype),
    backend=backend,
    blobs_dtype=blobs_dtype
)

This works, but is clearly a lot more brittle and complex. In cases like these, we know that the blobs will never actually be saved because the likelihood is -np.inf anyway. I wonder if you couldn't just ignore the blobs in these cases?

@dfm
Owner

dfm commented Aug 4, 2020

Thanks for reporting! You're absolutely right that there are several typos in the docs. The first example should read:

def log_prob(params):
    lp = log_prior(params)
    if not np.isfinite(lp):
        return -np.inf, None
    ll = log_like(params)
    if not np.isfinite(ll):
        return -np.inf, None
    return lp + ll, lp

And the second should read:

def log_prob(params):
    lp = log_prior(params)
    if not np.isfinite(lp):
        return -np.inf, None, None
    ll = log_like(params)
    if not np.isfinite(ll):
        return -np.inf, None, None
    return lp + ll, lp, np.mean(params)

In summary:

  1. The number of returned objects should be the same, but if the log prob is -np.inf these returned objects can be anything.
  2. You should definitely return -np.inf when the log likelihood fails - that was a typo in the docs!

Let me know if this helps your use cases and we can update the docs.

@steven-murray
Author

Ah, you're right. Returning None for each of the blobs works both with and without setting blobs_dtype, and that is easy to do (even dynamically). However, there is one strange case: setting each blob to 0 (since you say it doesn't matter what they are) without blobs_dtype does result in an error:

~/miniconda3/envs/halomod/lib/python3.7/site-packages/emcee/ensemble.py in compute_log_prob(self, coords)
    443                 except ValueError:
    444                     dt = np.dtype("object")
--> 445             blob = np.array(blob, dtype=dt)
    446 
    447             # Deal with single blobs properly

ValueError: setting an array element with a sequence.

I'm not sure why, exactly, but it's easily avoided by just returning None. That should be made explicit in the docs though!

@dfm
Owner

dfm commented Aug 4, 2020

Can you post exactly the code you're running to get this, with a fixed seed and all the relevant imports and data? I wouldn't expect this branch to ever be hit: the blobs shouldn't be inspected if the log prob is -np.inf, and an error should be thrown earlier if your initial coordinates are invalid, so this suggests that there might be a bug.

@steven-murray
Author

Sure!

import emcee
import numpy as np

def log_prob(p):
    mu, sig = p

    if not 0 < mu < 1:
        return -np.inf, 0, 0, 0

    ll = np.sum(-(data - mu)**2 / (2 * sig**2)) - sig

    return ll, mu*sig, mu+sig, (data - mu)/sig

np.random.seed(1234)
data = np.random.normal(loc=0.2, scale=0.3, size=25)

sampler = emcee.EnsembleSampler(
    nwalkers=100,
    ndim=2,
    log_prob_fn=log_prob,
)

init_pos = np.array([0.2, 0.3]) + 1e-4 * np.random.normal(size=(100, 2))

sampler.run_mcmc(init_pos, nsteps=1000, progress=True)

@lboogaard

Hi, I'm very glad I found this issue after also being confused by the blobs documentation. This thread has a lot of info that would be useful to add to the docs. Here's what I've gathered so far (assuming I understand the intended behaviour correctly):

  1. If your log probability function returns more than one value, the subsequent values are automatically assumed to be blobs
  2. If you don't provide blobs_dtype, the dtype of the extra values is automatically guessed on the first pass
  3. If your log probability returns -np.inf, the blobs are not inspected and can be anything (but you should return the same number of values)

Now regarding 3., I think there might indeed be a bug. Playing with the example from the previous post, it seems that return -np.inf, 0, 0, 0 fails because the last return value is not the same shape as (data - mu)/sig. This is surprising if it's true that the blobs are not inspected. Here's a MWE:

import emcee
import numpy as np

def log_prob(p):
    mu, sig = p

    if not 0 < mu < 1:
        #return -np.inf, None, None, None          # works
        #return -np.inf, 0, 0, np.zeros_like(data) # works
        return -np.inf, 0, 0, 0                    # fails (unless blobs_dtype='object' is explicitly set)

    ll = np.sum(-(data - mu)**2 / (2 * sig**2)) - sig

    return ll, mu*sig, mu+sig, (data - mu)/sig

np.random.seed(1234)
data = np.random.normal(loc=0.2, scale=0.3, size=25)

sampler = emcee.EnsembleSampler(
    nwalkers=100,
    ndim=2,
    log_prob_fn=log_prob,
    #blobs_dtype='object'
)

init_pos = np.array([0.2, 0.3]) + 1e-4 * np.random.normal(size=(100, 2))

state = sampler.run_mcmc(init_pos, nsteps=100, progress=True)
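For what it's worth, the failure mode can be sketched with plain numpy (an illustrative guess at the mechanism, not emcee's exact code path):

```python
import numpy as np

# Two walkers' blob tuples: one hit the prior bound and returned scalar
# placeholders, the other returned its array-valued blob of length 25
ragged = [(0.0, 0.0, 0.0), (0.06, 0.5, np.zeros(25))]

try:
    # With a plain float dtype the rows have incompatible shapes,
    # so numpy refuses to assemble them into a regular array
    np.array(ragged, dtype=float)
except ValueError as err:
    print("fails:", err)

# A None placeholder pushes the guessed dtype to object, which accepts
# anything -- consistent with the "returning None works" reports above
print(np.atleast_1d(None).dtype)  # object
```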

@dfm
Owner

dfm commented Sep 24, 2021

Some of this is the result of how numpy has changed its handling of dtypes (for the better!) since I originally wrote this forever ago. But either way, I'd say that the best approach in this particular case (whenever you have non-trivial blob types) is to provide the dtype explicitly. And don't use object! Instead, here I would use:

blobs_dtype=np.dtype([("a", float), ("b", float), ("c", float, len(data))])

Then your resulting blobs array will be a structured array that is easy to slice, etc.
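As an illustration of that slicing (the field names and the length 25 follow the examples in this thread; the (nsteps, nwalkers) shape mimics what sampler.get_blobs() would return):

```python
import numpy as np

# Structured dtype as suggested above, with one array-valued field
blobs_dtype = np.dtype([("a", float), ("b", float), ("c", float, 25)])

# Mimic a blobs array for 10 steps of 4 walkers
nsteps, nwalkers = 10, 4
blobs = np.zeros((nsteps, nwalkers), dtype=blobs_dtype)

print(blobs["a"].shape)  # (10, 4): one scalar per step per walker
print(blobs["c"].shape)  # (10, 4, 25): the array-valued field keeps its length
```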

znicholls added a commit to znicholls/emcee that referenced this issue Mar 15, 2023
The blobs docs were found to be a bit confusing by some users (including me).
This merge request updates them following the discussion in dfm#352 to clarify
how this interface can be used and the intent behind it.
@dfm dfm closed this as completed in #464 May 23, 2023
dfm pushed a commit that referenced this issue May 23, 2023
3 participants