Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor pm.Simulator (WIP) #4802

Closed
wants to merge 6 commits into from
Closed

Conversation

ricardoV94
Copy link
Member

@ricardoV94 ricardoV94 commented Jun 24, 2021

This is a proposal on how to pm.Simulator could be refactored into V4. It provides a helper method to create a SimulatorRV which tries to behave as a typical RV for prior and posterior predictive sampling, and whose logp is distance(epsilon, sum_stat(value), sim_stat(sim_rv)) where sim_rv is just another reinstantiation of the original SimulatorRV.

The advantages is that we don't need special logic for the ABC kernel in sample_smc as it works just like a normal variable. It can also be used in some conventional samplers (at least it seems to work with Metropolis MCMC). It does away with the limitation of having a single pm.Simulator and so on.

Two other options would be to:

  1. Separate the Simulator object from a pm.Distance likelihood term which may be more transparent / easier to think for the user. The mixing of the two doesn't make a lot of sense if we think of Simulator as a true RV (for example if we use the mean summary statistic, then the logp shape is completely different than the samples shape). It may also make it easier to manipulate parameters like the epsilon by the sampler if we separate the two. Perhaps distance should even be completely the responsibility of the sampler kernel.

  2. Keep most of the unique logic from v3, which keeps (for better or worse) pm.Simulator and ABC on their very specific corner in the library.

CC: @aloctavodia @junpenglao

if sum_stat != "identity":
_log.info(f"Automatically setting sum_stat to identity as expected by {distance}")
sum_stat = "identity"
raise NotImplementedError("KL not refactored yet")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to link a Github issue right here.

rv_type = type(sim_op)

@_logp.register(rv_type)
def logp(op, sim_rv, rvs_to_values, *sim_params, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

locally defined functions often cause problems with pickling.

Also it looks like this logp uses variables from the __new__ scope. Doesn't this, in combination with the register lead to problems when having more than one Simulator?

Copy link
Member Author

@ricardoV94 ricardoV94 Jun 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about pickling. Depends on the point at which the function is used (after obtaining the logp graph, the function is not needed anymore).

The registration part should be fine because I am creating a new type which has a unique name. I also have a test for multiple Simulators with different methods and it seems to be fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also try to add a layer of indirection when creating the simulator so that I can attach what are now the local variables to the tag, in which case a single logp function / dispatcher would be enough.

pymc3/smc/smc.py Outdated Show resolved Hide resolved
pymc3/tests/test_smc.py Outdated Show resolved Hide resolved
@@ -144,7 +147,7 @@ def abs_diff(eps, obs_data, sim_data):
)

with pm.Model() as self.SMABC_potential:
a = pm.Normal("a", mu=0, sigma=1)
a = pm.Normal("a", mu=0, sigma=1, initval=0.5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was the initval added? Maybe this indicates an underlying problem?

pymc3/tests/test_smc.py Outdated Show resolved Hide resolved
@ricardoV94
Copy link
Member Author

@michaelosthege thanks for the pointers. This was not meant for implementation review yet, just proof of concept but some of the points you raise are already useful.

@ricardoV94 ricardoV94 marked this pull request as ready for review June 27, 2021 15:45
@ricardoV94 ricardoV94 marked this pull request as draft June 27, 2021 15:46
Copy link
Member

@aloctavodia aloctavodia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like where this is going!

I am not sure about separating the Simulator from the Distance, they both together define a pseudolikelihood. So it seems reasonable to have them as parts of the same object. For the same reason making the distance part of the sampling method is weird. Actually. I think the first version of SMC-ABC was like that. We can of course discuss the benefits of trying to automatically tune epsilon, nut I think having to define a single value of epsilon is actually a good feature in practice as allows to control the level of accuracy and the cost of the simulation with minimal hand tuning.

pymc3/distributions/simulator.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jul 5, 2021

Codecov Report

Merging #4802 (729daea) into main (125256f) will increase coverage by 0.99%.
The diff coverage is 89.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4802      +/-   ##
==========================================
+ Coverage   71.97%   72.97%   +0.99%     
==========================================
  Files          85       85              
  Lines       13839    13838       -1     
==========================================
+ Hits         9961    10098     +137     
+ Misses       3878     3740     -138     
Impacted Files Coverage Δ
pymc3/distributions/simulator.py 79.13% <86.36%> (+55.94%) ⬆️
pymc3/aesaraf.py 91.34% <100.00%> (+0.07%) ⬆️
pymc3/distributions/distribution.py 66.46% <100.00%> (+1.66%) ⬆️
pymc3/smc/sample_smc.py 96.72% <100.00%> (+4.72%) ⬆️
pymc3/smc/smc.py 99.31% <100.00%> (+26.71%) ⬆️
pymc3/distributions/__init__.py 100.00% <0.00%> (ø)
pymc3/distributions/discrete.py 99.00% <0.00%> (+0.05%) ⬆️
pymc3/parallel_sampling.py 86.70% <0.00%> (+0.94%) ⬆️
... and 1 more

@ricardoV94
Copy link
Member Author

ricardoV94 commented Jul 5, 2021

@michaelosthege was totally right about the pickling issues. I didn't spot them in the first iteration because the default in sample_smc was parallel=False.

In contrast to SMC, the vanilla Metropolis stepper works fine in multiprocessing because the logp graph is compiled before the forking step, whereas SMC tries to pickle the entire model to be sent to each process. only on fork multiprocess

I temporarily set the ABC tests to all be singleprocess.

@@ -156,6 +146,29 @@ def sample_smc(
%282007%29133:7%28816%29>`__
"""

if kernel is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we going to select if the kernel is metropolis, independent Metropolis-Hasting, HMC, etc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't thought about that yet. We might use the kernel argument for that, in which case I'll remove the DeprecationWarning. This is just temporarily.

@michaelosthege
Copy link
Member

@michaelosthege was totally right about the pickling issues. I didn't spot them in the first iteration because the default in sample_smc was parallel=False.

In contrast to SMC, the vanilla Metropolis stepper works fine in multiprocessing because the logp graph is compiled before the forking step, whereas SMC tries to pickle the entire model to be sent to each process. only on fork multiprocess

I temporarily set the ABC tests to all be singleprocess.

Might be a good idea to add include the test_smc.py in the Windows tests then.

@ricardoV94
Copy link
Member Author

ricardoV94 commented Jul 6, 2021

The pickling issue is proving non-trivial to overcome.

It is caused by the dynamic RV class creation (as well as user Ops created at runtime with the new helpers).
The logp dispatching does not seem to be an issue, but in any case I standardized that in my last commitl.

I tried another approach that would require users to explicit subclass from the SimulatorRV instead of creating one dynamically, and these can be pickled / used in multiprocess. However as soon as the classes are defined inside a function (such as the setup_class in TestSMCABC it fails to pickle again). This method also requires more effort from the users.

The old pm.DensityDist faced similar issues and required some workarounds to deal with the logp method. I am not sure how/if this approach can be adapted to V4 to use with RandomVariables (DensityDist hasn't been refactored and might be deprecated altogether). I fiddled with it a bit and did not succeed.

Any suggestions?

@ricardoV94 ricardoV94 closed this Aug 4, 2021
@ricardoV94 ricardoV94 mentioned this pull request Aug 4, 2021
@ricardoV94 ricardoV94 deleted the restore_abc branch January 31, 2022 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
request discussion SMC Sequential Monte Carlo v4
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants