
Remove Text, SQLite and HDF5 backends #4231

Merged
merged 11 commits into from Nov 19, 2020

Conversation

fonnesbeck
Member

Since the deprecation warning has been present for over half a year, this PR removes all storage backends other than NDArray.

This might also be a good place to remove save/load for NDArray, or that can be done in a separate PR.
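For reference, a sketch of the old pattern these backends supported, where a backend instance is passed to pm.sample via the trace argument (the model and directory name are placeholders):

import pymc3 as pm

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0)        # placeholder model
    db = pm.backends.Text("mcmc")         # on-disk text backend removed by this PR
    trace = pm.sample(500, trace=db)      # draws are streamed to the "mcmc" directory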

@fonnesbeck fonnesbeck changed the title [WIP] Remove Text, SQLite and HDF5 backends Remove Text, SQLite and HDF5 backends Nov 18, 2020
@codecov

codecov bot commented Nov 19, 2020

Codecov Report

Merging #4231 (fea10a1) into master (e51b9d3) will decrease coverage by 0.75%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4231      +/-   ##
==========================================
- Coverage   88.93%   88.18%   -0.75%     
==========================================
  Files          92       89       -3     
  Lines       14788    14358     -430     
==========================================
- Hits        13152    12662     -490     
- Misses       1636     1696      +60     
| Impacted Files | Coverage Δ | |
|---|---|---|
| pymc3/backends/base.py | 86.92% <ø> (-0.39%) | ⬇️ |
| pymc3/__init__.py | 100.00% <100.00%> (ø) | |
| pymc3/backends/__init__.py | 100.00% <100.00%> (ø) | |
| pymc3/sampling.py | 86.49% <100.00%> (-0.40%) | ⬇️ |
| pymc3/variational/opvi.py | 85.29% <100.00%> (-0.03%) | ⬇️ |
| pymc3/tests/backend_fixtures.py | 78.97% <0.00%> (-19.61%) | ⬇️ |
| pymc3/backends/ndarray.py | 87.36% <0.00%> (-5.27%) | ⬇️ |
| pymc3/distributions/continuous.py | 93.04% <0.00%> (-0.12%) | ⬇️ |

... and 6 more

@michaelosthege
Member

michaelosthege left a comment

So much deleted code! 🤩

In principle I approve, but before someone accidentally hits "merge": do we want to drop the backends already in 3.10?
I'd say yes, but we have a major version bump coming up and could do it there too.

@AlexAndorra
Contributor

AlexAndorra commented Nov 19, 2020

As the warnings have been in place for a while now, and as the ArviZ netcdf backend is already available, I think we can go ahead and deprecate this in 3.10.
Maybe the next major version will make return_inferencedata=True the default for pm.sample, in which case it's good that users have already experienced and transitioned to az.to_netcdf.
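A minimal sketch of that migration path (the model and file name are placeholders):

import arviz as az
import pymc3 as pm

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0)            # placeholder model
    idata = pm.sample(return_inferencedata=True)

az.to_netcdf(idata, "trace.nc")               # persist the trace to disk via ArviZ
idata_loaded = az.from_netcdf("trace.nc")     # reload it in a later session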

@DanPen

DanPen commented Dec 11, 2020

I'm pretty confused. Why were these backends removed? I'm running out of memory trying to sample a model, and writing to disk seemed like a pretty good solution. Unsure what to do.

@michaelosthege
Member

@DanPen they were removed because we couldn't think of examples that wouldn't work with the standard backend, and we saw no activity on GitHub/Discourse indicating that they were being used at all.
Also, we are interested in transitioning to an xarray-based backend, because it would integrate seamlessly with InferenceData.

Now to your problem: can you give some details about the size of the model and the sampling strategy? Running out of memory sounds like you either have a memory leak, your model is astronomically gigantic, or you're including Deterministic variables in the trace that take a lot of memory.

@DanPen

DanPen commented Dec 11, 2020

@michaelosthege Thanks for the quick response!

I'm following the Dependent density regression tutorial with my own dataset. As I increase the number of observations to ~500, my 32 GB of memory gets used up. I'm sampling 4 Metropolis chains with 20,000 samples and 10,000 burn-in. Memory usage seems to increase linearly as sampling continues.

Should I open a new issue?

@michaelosthege
Member

@DanPen that's roughly 266 kB per chain per iteration, equivalent to about 33,000 float64 values. I doubt that your model has that many parameters (if it does, you should give up on Metropolis right now). So something other than the model parameters must be chewing up that memory: either a memory leak, or a Deterministic that has many more dimensions than the actual parameters.
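A quick back-of-the-envelope check of that figure, using the numbers reported above:

# 32 GB consumed over 4 chains of 20,000 draws + 10,000 tuning steps
total_bytes = 32e9
chain_iterations = 4 * (20_000 + 10_000)        # 120,000 chain-iterations in total
bytes_per_iteration = total_bytes / chain_iterations
print(bytes_per_iteration)                      # ≈ 266,666 bytes ≈ 266 kB
print(bytes_per_iteration / 8)                  # ≈ 33,333 float64 values per chain per iteration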

You can print(model.ndim) and also print the shapes of all the variables in the trace (after sampling a few iterations).

In a Jupyter notebook:

with model:
    # short pilot run just to inspect variable shapes
    idata = pm.sample(return_inferencedata=True, tune=70, draws=30)
idata  # displaying the InferenceData shows the dimensions of every variable

@DanPen

DanPen commented Dec 13, 2020

@michaelosthege Cool. ndim=200 and the posterior's dimensions are alpha_dim_0: 40, beta_dim_0: 40, chain: 4, delta_dim_0: 40, draw: 30, gamma_dim_0: 40, mu_dim_0: 490, mu_dim_1: 40, tau_dim_0: 40, w_dim_0: 490, w_dim_1: 40.

Not sure I understand why mu and w are 2D.

@michaelosthege
Member

@DanPen Then it's indeed the 2D Deterministic variables mu and w that blow up the size of the posterior.
The other variables make up the 200 free parameters, but the (490, 40)-shaped mu and w add a combined 39,200 numbers per chain per iteration.

In such cases it's better not to save them at all. You can do this by not wrapping them in Deterministic, which should reduce your memory footprint by roughly 197x ((39,200 + 200) / 200); see the sketch below.
If you need them later, you can always reconstruct them from the free parameters (that's why it's called Deterministic...).
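A minimal sketch of that pattern, assuming a dependent-density-regression-like setup (all names, shapes, and the predictor x are illustrative; the likelihood is elided):

import numpy as np
import pymc3 as pm

x = np.random.randn(490, 1)  # hypothetical predictor: 490 observations

with pm.Model() as model:
    alpha = pm.Normal("alpha", 0.0, 1.0, shape=40)
    beta = pm.Normal("beta", 0.0, 1.0, shape=40)

    # A plain symbolic expression: the (490, 40) values are NOT stored in the trace.
    # Wrapping it as pm.Deterministic("mu", ...) would store 19,600 extra numbers
    # per chain per iteration.
    mu = alpha + beta * x

    # ... likelihood built from mu would go here ...

    idata = pm.sample(draws=100, tune=100, return_inferencedata=True)

# Reconstruct mu afterwards from the stored free parameters; result has
# shape (chain, draw, 490, 40) via NumPy broadcasting.
alpha_post = idata.posterior["alpha"].values[..., None, :]
beta_post = idata.posterior["beta"].values[..., None, :]
mu_post = alpha_post + beta_post * x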
