Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to register modules to be deeply serialized #417

Merged
merged 74 commits into from
Aug 1, 2021

Conversation

Samreay
Copy link
Contributor

@Samreay Samreay commented Apr 20, 2021

This PR is based on the work done by @kinghuang in PR391, but takes on the feedback provided by @ogrisel and adds testing.

Fixes #206

Issue Summary

To summarise the issue, in many cases cloudpickle is used to send code for remote execution. This is used in dask, prefect, mlflow and many libraries. For local functions, this works perfectly fine. But for any non-local function or class, cloudpickle assumes that external modules and packages are available at the location of deserialization. This may either not be the case, or the version of the package available at the end point may be different.

This PR adds the option to register modules for deep serialization by providing a register_deep_serialization function which takes either a name or a module. This is the original register_dynamic_module by @kinghuang.

import cloudpickle
from tests import external

cloudpickle.register_deep_serialization("tests.external")  # string name works
cloudpickle.register_deep_serialization(external)          # You can pass the module itself
cloudpickle.register_deep_serialization("tests")           # or the parent string/module

output = cloudpickle.dumps(external.an_external_function)

Original dumps:

b'\x80\x05\x95+\x00\x00\x00\x00\x00\x00\x00\x8c\x0etests.external\x94\x8c\x14an_external_function\x94\x93\x94.'

dumps after registering tests.external for deep serialization:

b'\x80\x05\x95<\x02\x00\x00\x00\x00\x00\x00\x8c\x17cloudpickle.cloudpickle\x94\x8c\r_builtin_type\x94\x93\x94\x8c\nLambdaType\x94\x85\x94R\x94(h\x02\x8c\x08CodeType\x94\x85\x94R\x94(K\x00K\x00K\x00K\x00K\x01KCC\x04d\x01S\x00\x94N\x8c\x11this is something\x94\x86\x94))\x8c5/Users/samreay/Projects/cloudpickle/tests/external.py\x94\x8c\x14an_external_function\x94K\x04C\x02\x00\x01\x94))t\x94R\x94}\x94(\x8c\x0b__package__\x94\x8c\x05tests\x94\x8c\x08__name__\x94\x8c\x0etests.external\x94\x8c\x08__file__\x94\x8c5/Users/samreay/Projects/cloudpickle/tests/external.py\x94uNNNt\x94R\x94\x8c\x1ccloudpickle.cloudpickle_fast\x94\x8c\x12_function_setstate\x94\x93\x94h\x19}\x94}\x94(h\x14h\r\x8c\x0c__qualname__\x94h\r\x8c\x0f__annotations__\x94}\x94\x8c\x0e__kwdefaults__\x94N\x8c\x0c__defaults__\x94N\x8c\n__module__\x94h\x15\x8c\x07__doc__\x94N\x8c\x0b__closure__\x94N\x8c\x17_cloudpickle_submodules\x94]\x94\x8c\x0b__globals__\x94}\x94u\x86\x94\x86R0.'

Modules can be unregistered via unregister_deep_serialization

Tests

One of the example tests with an explicit use case is shown above. On top of this, tests have been added to _lookup_module_and_qualname using the _cloudpickle_testpkg package, and also to the new _is_explicitly_serialized_module function.

@codecov
Copy link

codecov bot commented Apr 20, 2021

Codecov Report

Merging #417 (1e9a48d) into master (0c62ae0) will increase coverage by 0.36%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #417      +/-   ##
==========================================
+ Coverage   91.99%   92.36%   +0.36%     
==========================================
  Files           4        4              
  Lines         687      720      +33     
  Branches      141      150       +9     
==========================================
+ Hits          632      665      +33     
  Misses         34       34              
  Partials       21       21              
Impacted Files Coverage Δ
cloudpickle/cloudpickle.py 88.31% <100.00%> (+1.09%) ⬆️
cloudpickle/cloudpickle_fast.py 96.87% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c62ae0...1e9a48d. Read the comment docs.

@kinghuang
Copy link

This PR is based on the work done by @kinghuang in PR391, but takes on the feedback provided by @ogrisel and adds testing.

Thanks for picking this up, @Samreay!

@Samreay
Copy link
Contributor Author

Samreay commented May 5, 2021

@ogrisel, any thoughts on the PR? Is there someone I should direct it to?

@damienrj
Copy link

damienrj commented May 10, 2021

This is also of great interest to me! Very nice. Maybe @ogrisel would be the person to review?

@Ben-Epstein
Copy link

Hello, following up here, is there any plan on merging or moving forward? I see all but 1 test passing

@ogrisel
Copy link
Contributor

ogrisel commented Jun 14, 2021

This sounds like a legit need and the implementation looks fine from a quick look at it. However I am not a big fan of the "deep_serialization" naming in the API and comments. What about calling this "pickle_by_value" or "pickling_by_value" instead?

@Samreay
Copy link
Contributor Author

Samreay commented Jun 14, 2021

@ogrisel Happy to update this to pickle_by_value() if we can lock that name in. The deep notation was taken from @pierreglaser's comment here which I liked because of the similarity to copy/deepcopy. However, if you believe there is still ambiguity, it's an easy change.

@Samreay
Copy link
Contributor Author

Samreay commented Jun 14, 2021

Functions renamed to register_pickle_by_value and unregister_pickle_by_value

Copy link
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please document the change in the changelog.

@pierreglaser ok with you?

@Samreay
Copy link
Contributor Author

Samreay commented Jun 15, 2021

Changes now documented :)

@ogrisel
Copy link
Contributor

ogrisel commented Jun 15, 2021

The failure of CloudPickleTest.test_locally_defined_class_with_type_hints on python nightly is unrelated. The memory usage on macos might be random but I am not 100% sure.

@Samreay
Copy link
Contributor Author

Samreay commented Jun 15, 2021

Yeah, will be interesting to see if they roll back the string typehints given how many libraries its going to impact. Ive seen no memory issues when testing this locally on my macbook, can you point me to which test / build step is concern and I can try to dig a little deeper?

@ogrisel
Copy link
Contributor

ogrisel commented Jun 15, 2021

The macos failure looks random and is probably unrelated.

Copy link
Member

@pierreglaser pierreglaser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a first batch of comments - mostly nitpicks, but I raised a few more structural suggestions/questions.

As an important side point, I think that as of now, calling cloudpickle.dumps(module) on a module registered for pickling by value will still be pickled by reference: It is a behavior that we may want to change, cc @ogrisel

cloudpickle/cloudpickle.py Outdated Show resolved Hide resolved
tests/cloudpickle_test.py Outdated Show resolved Hide resolved
tests/cloudpickle_test.py Outdated Show resolved Hide resolved
tests/cloudpickle_test.py Outdated Show resolved Hide resolved
tests/cloudpickle_test.py Outdated Show resolved Hide resolved
tests/cloudpickle_test.py Outdated Show resolved Hide resolved
tests/cloudpickle_test.py Outdated Show resolved Hide resolved
tests/cloudpickle_test.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle.py Outdated Show resolved Hide resolved
@pierreglaser
Copy link
Member

Feel free to request my review again when all requested changes are made. I think also that a test making sure modules registered for pickling by value are actually pickled by value when doing a cloudpickle.dumps(module) call.

@Samreay
Copy link
Contributor Author

Samreay commented Jun 15, 2021

@pierreglaser Ive added the module serialisation test in, however it is currently broken, with the main issue being the TypeVar being passed into _should_pickle_by_reference from somewhere (maybe the module reduce?). My lack of knowledge in how the fundamentals is making debugging difficult, would you be able to look over the changes and offer any suggestions on what is going wrong?

@ogrisel
Copy link
Contributor

ogrisel commented Jul 28, 2021

Thanks, I will try to review soon.

Copy link
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice contrib and tests!

Just a few suggestions to further improve the doc + code nitpicks.

cloudpickle/cloudpickle.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle.py Outdated Show resolved Hide resolved
Comment on lines 2425 to 2427
assert w.run(
lambda: LocalClass().method() == LocalClass().method()
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert w.run(
lambda: LocalClass().method() == LocalClass().method()
)
assert (
w.run(lambda: LocalClass().method())
== LocalClass().method()
)

tests/cloudpickle_test.py Show resolved Hide resolved
tests/cloudpickle_test.py Outdated Show resolved Hide resolved
README.md Outdated
Since `cloudpickle 1.7.0`, it is possible to extend the use of serialization by
value to functions or classes coming from **any pure Python module**. This feature
is useful when the said module is unavailable in the unpickling environment
(making traditional serialization by reference ineffective). To this end,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the motivation (interactive dev on a distributed cluster with long running worker processes) could be made more explicit by reusing the text I suggest in the docstring of register_pickle_by_value above in this review.

@ogrisel
Copy link
Contributor

ogrisel commented Jul 30, 2021

@pierreglaser once my review comments are addressed, feel free to merge and release cloudpickle. But before releasing, please run the downstream ci tests in the release PR to try to avoid introducing any regression in this quite big release.

@pierreglaser
Copy link
Member

@ogrisel thanks for the review! Should we bump cloudpickle's version to 2.0.0 for this release, or stick with 1.7.0?

@ogrisel
Copy link
Contributor

ogrisel commented Aug 1, 2021

+1 for 2.0.0 :)

@ogrisel ogrisel merged commit 343da11 into cloudpipe:master Aug 1, 2021
@ogrisel
Copy link
Contributor

ogrisel commented Aug 1, 2021

Merged! Thank you very much @kinghuang @Samreay and @pierreglaser!

@pierreglaser
Copy link
Member

Woot 🎉!! Thanks a lot @Samreay for the contribution!

@bonelli
Copy link

bonelli commented Aug 2, 2021

Thank you everybody, IMHO this is a great improvement towards general purpose clusters.
When do you think 2.0.0 will be tagged and released as a package? If possible I'd wait that moment to use this in my new docker images on my k8s cluster.

@pierreglaser
Copy link
Member

Releasing will be done once #432 is addressed - hopefully within the next couple of weeks this should be done.

@Samreay
Copy link
Contributor Author

Samreay commented Aug 2, 2021

Thanks @pierreglaser and @ogrisel for all the commits getting this ready. Fingers crossed #432 is a simple fix! :)

wip-sync pushed a commit to NetBSD/pkgsrc-wip that referenced this pull request Apr 4, 2022
2.0.0
=====

- Python 3.5 is no longer supported.

- Support for registering modules to be serialised by value. This allows code
  defined in local modules to be serialised and executed remotely without those
  local modules installed on the remote machine.
  ([PR #417](cloudpipe/cloudpickle#417))

- Fix a side effect altering dynamic modules at pickling time.
  ([PR #426](cloudpipe/cloudpickle#426))

- Support for pickling type annotations on Python 3.10 as per [PEP 563](
  https://www.python.org/dev/peps/pep-0563/)
  ([PR #400](cloudpipe/cloudpickle#400))

- Stricter parametrized type detection heuristics in
  _is_parametrized_type_hint to limit false positives.
  ([PR #409](cloudpipe/cloudpickle#409))

- Support pickling / depickling of OrderedDict KeysView, ValuesView, and
  ItemsView, following similar strategy for vanilla Python dictionaries.
  ([PR #423](cloudpipe/cloudpickle#423))

- Suppressed a source of non-determinism when pickling dynamically defined
  functions and handles the deprecation of co_lnotab in Python 3.10+.
  ([PR #428](cloudpipe/cloudpickle#428))

1.6.0
=====

- `cloudpickle`'s pickle.Pickler subclass (currently defined as
  `cloudpickle.cloudpickle_fast.CloudPickler`) can and should now be accessed
  as `cloudpickle.Pickler`. This is the only officially supported way of
  accessing it.
  ([issue #366](cloudpipe/cloudpickle#366))

- `cloudpickle` now supports pickling `dict_keys`, `dict_items` and
  `dict_values`.
  ([PR #384](cloudpipe/cloudpickle#384))

1.5.0
=====

- Fix a bug causing cloudpickle to crash when pickling dynamically created,
  importable modules.
  ([issue #360](cloudpipe/cloudpickle#354))

- Add optional dependency on `pickle5` to get improved performance on
  Python 3.6 and 3.7.
  ([PR #370](cloudpipe/cloudpickle#370))

- Internal refactoring to ease the use of `pickle5` in cloudpickle
  for Python 3.6 and 3.7.
  ([PR #368](cloudpipe/cloudpickle#368))

1.4.1
=====

- Fix incompatibilities between cloudpickle 1.4.0 and Python 3.5.0/1/2
  introduced by the new support of cloudpickle for pickling typing constructs.
  ([issue #360](cloudpipe/cloudpickle#360))

- Restore compat with loading dynamic classes pickled with cloudpickle
  version 1.2.1 that would reference the `types.ClassType` attribute.
  ([PR #359](cloudpipe/cloudpickle#359))

1.4.0
=====

**This version requires Python 3.5 or later**

- cloudpickle can now all pickle all constructs from the ``typing`` module
  and the ``typing_extensions`` library in Python 3.5+
  ([PR #318](cloudpipe/cloudpickle#318))

- Stop pickling the annotations of a dynamic class for Python < 3.6
  (follow up on #276)
  ([issue #347](cloudpipe/cloudpickle#347))

- Fix a bug affecting the pickling of dynamic `TypeVar` instances on Python 3.7+,
  and expand the support for pickling `TypeVar` instances (dynamic or non-dynamic)
  to Python 3.5-3.6 ([PR #350](cloudpipe/cloudpickle#350))

- Add support for pickling dynamic classes subclassing `typing.Generic`
  instances on Python 3.7+
  ([PR #351](cloudpipe/cloudpickle#351))

1.3.0
=====

- Fix a bug affecting dynamic modules occuring with modified builtins
  ([issue #316](cloudpipe/cloudpickle#316))

- Fix a bug affecting cloudpickle when non-modules objects are added into
  sys.modules
  ([PR #326](cloudpipe/cloudpickle#326)).

- Fix a regression in cloudpickle and python3.8 causing an error when trying to
  pickle property objects.
  ([PR #329](cloudpipe/cloudpickle#329)).

- Fix a bug when a thread imports a module while cloudpickle iterates
  over the module list
  ([PR #322](cloudpipe/cloudpickle#322)).

- Add support for out-of-band pickling (Python 3.8 and later).
  https://docs.python.org/3/library/pickle.html#example
  ([issue #308](cloudpipe/cloudpickle#308))

- Fix a side effect that would redefine `types.ClassTypes` as `type`
  when importing cloudpickle.
  ([issue #337](cloudpipe/cloudpickle#337))

- Fix a bug affecting subclasses of slotted classes.
  ([issue #311](cloudpipe/cloudpickle#311))

- Dont pickle the abc cache of dynamically defined classes for Python 3.6-
  (This was already the case for python3.7+)
  ([issue #302](cloudpipe/cloudpickle#302))

1.2.2
=====

- Revert the change introduced in
  ([issue #276](cloudpipe/cloudpickle#276))
  attempting to pickle functions annotations for Python 3.4 to 3.6. It is not
  possible to pickle complex typing constructs for those versions (see
  [issue #193]( cloudpipe/cloudpickle#193))

- Fix a bug affecting bound classmethod saving on Python 2.
  ([issue #288](cloudpipe/cloudpickle#288))

- Add support for pickling "getset" descriptors
  ([issue #290](cloudpipe/cloudpickle#290))

1.2.1
=====

- Restore (partial) support for Python 3.4 for downstream projects that have
  LTS versions that would benefit from cloudpickle bug fixes.

1.2.0
=====

- Leverage the C-accelerated Pickler new subclassing API (available in Python
  3.8) in cloudpickle. This allows cloudpickle to pickle Python objects up to
  30 times faster.
  ([issue #253](cloudpipe/cloudpickle#253))

- Support pickling of classmethod and staticmethod objects in python2.
  arguments. ([issue #262](cloudpipe/cloudpickle#262))

- Add support to pickle type annotations for Python 3.5 and 3.6 (pickling type
  annotations was already supported for Python 3.7, Python 3.4 might also work
  but is no longer officially supported by cloudpickle)
  ([issue #276](cloudpipe/cloudpickle#276))

- Internal refactoring to proactively detect dynamic functions and classes when
  pickling them.  This refactoring also yields small performance improvements
  when pickling dynamic classes (~10%)
  ([issue #273](cloudpipe/cloudpickle#273))

1.1.1
=====

- Minor release to fix a packaging issue (Markdown formatting of the long
  description rendered on pypi.org). The code itself is the same as 1.1.0.

1.1.0
=====

- Support the pickling of interactively-defined functions with positional-only
  arguments. ([issue #266](cloudpipe/cloudpickle#266))

- Track the provenance of dynamic classes and enums so as to preseve the
  usual `isinstance` relationship between pickled objects and their
  original class defintions.
  ([issue #246](cloudpipe/cloudpickle#246))

1.0.0
=====

- Fix a bug making functions with keyword-only arguments forget the default
  values of these arguments after being pickled.
  ([issue #264](cloudpipe/cloudpickle#264))

0.8.1
=====

- Fix a bug (already present before 0.5.3 and re-introduced in 0.8.0)
  affecting relative import instructions inside depickled functions
  ([issue #254](cloudpipe/cloudpickle#254))

0.8.0
=====

- Add support for pickling interactively defined dataclasses.
  ([issue #245](cloudpipe/cloudpickle#245))

- Global variables referenced by functions pickled by cloudpickle are now
  unpickled in a new and isolated namespace scoped by the CloudPickler
  instance. This restores the (previously untested) behavior of cloudpickle
  prior to changes done in 0.5.4 for functions defined in the `__main__`
  module, and 0.6.0/1 for other dynamic functions.

0.7.0
=====

- Correctly serialize dynamically defined classes that have a `__slots__`
  attribute.
  ([issue #225](cloudpipe/cloudpickle#225))

0.6.1
=====

- Fix regression in 0.6.0 which breaks the pickling of local function defined
  in a module, making it impossible to access builtins.
  ([issue #211](cloudpipe/cloudpickle#211))

0.6.0
=====

- Ensure that unpickling a function defined in a dynamic module several times
  sequentially does not reset the values of global variables.
  ([issue #187](cloudpipe/cloudpickle#205))

- Restrict the ability to pickle annotations to python3.7+ ([issue #193](
  cloudpipe/cloudpickle#193) and [issue #196](
  cloudpipe/cloudpickle#196))

- Stop using the deprecated `imp` module under Python 3.
  ([issue #207](cloudpipe/cloudpickle#207))

- Fixed pickling issue with singleton types `NoneType`, `type(...)` and
  `type(NotImplemented)` ([issue #209](cloudpipe/cloudpickle#209))

0.5.6
=====

- Ensure that unpickling a locally defined function that accesses the global
  variables of a module does not reset the values of the global variables if
  they are already initialized.
  ([issue #187](cloudpipe/cloudpickle#187))

0.5.5
=====

- Fixed inconsistent version in `cloudpickle.__version__`.

0.5.4
=====

- Fixed a pickling issue for ABC in python3.7+ ([issue #180](
  cloudpipe/cloudpickle#180)).

- Fixed a bug when pickling functions in `__main__` that access global
  variables ([issue #187](
  cloudpipe/cloudpickle#187)).

0.5.3
=====
- Fixed a crash in Python 2 when serializing non-hashable instancemethods of built-in
  types ([issue #144](cloudpipe/cloudpickle#144)).

- itertools objects can also pickled
  ([PR #156](cloudpipe/cloudpickle#156)).

- `logging.RootLogger` can be also pickled
  ([PR #160](cloudpipe/cloudpickle#160)).

0.5.2
=====

- Fixed a regression: `AttributeError` when loading pickles that hold a
  reference to a dynamically defined class from the `__main__` module.
  ([issue #131]( cloudpipe/cloudpickle#131)).

- Make it possible to pickle classes and functions defined in faulty
  modules that raise an exception when trying to look-up their attributes
  by name.

0.5.1
=====

- Fixed `cloudpickle.__version__`.

0.5.0
=====

- Use `pickle.HIGHEST_PROTOCOL` by default.

0.4.4
=====

- `logging.RootLogger` can be also pickled
  ([PR #160](cloudpipe/cloudpickle#160)).

0.4.3
=====

- Fixed a regression: `AttributeError` when loading pickles that hold a
  reference to a dynamically defined class from the `__main__` module.
  ([issue #131]( cloudpipe/cloudpickle#131)).

- Fixed a crash in Python 2 when serializing non-hashable instancemethods of built-in
  types. ([issue #144](cloudpipe/cloudpickle#144))

0.4.2
=====

- Restored compatibility with pickles from 0.4.0.
- Handle the `func.__qualname__` attribute.

0.4.1
=====

- Fixed a crash when pickling dynamic classes whose `__dict__` attribute was
  defined as a [`property`](https://docs.python.org/3/library/functions.html#property).
  Most notably, this affected dynamic [namedtuples](https://docs.python.org/2/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields)
  in Python 2. (cloudpipe/cloudpickle#113)
- Cloudpickle now preserves the `__module__` attribute of functions (cloudpipe/cloudpickle#118).
- Fixed a crash when pickling modules that don't have a `__package__` attribute (cloudpipe/cloudpickle#116).

0.4.0
=====

* Fix functions with empty cells
* Allow pickling Logger objects
* Fix crash when pickling dynamic class cycles
* Ignore "None" mdoules added to sys.modules
* Support WeakSets and ABCMeta instances
* Remove non-standard `__transient__` support
* Catch exception from `pickle.whichmodule()`

0.3.1
=====

* Fix version information and ship a changelog

 0.3.0
=====

* Import submodules accessed by pickled functions
* Support recursive functions inside closures
* Fix `ResourceWarnings` and `DeprecationWarnings`
* Assume modules with `__file__` attribute are not dynamic

0.2.2
=====

* Support Python 3.6
* Support Tornado Coroutines
* Support builtin methods
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[New feature] full code pickle for arbitrary modules
7 participants