Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: derive from C-pickler for fast serialization #253

Merged
merged 70 commits into from
Jun 7, 2019

Conversation

pierreglaser
Copy link
Member

@pierreglaser pierreglaser commented Mar 7, 2019

Summary:

This PR proposes a new Cloudpickler class, that inherits from the C
_pickle.Pickler instead of the python pickle._Pickler, allowing 10x+
speedups for the serialization of large builtin objects such as dicts, lists..

Disclaimer: a new start

Moving from the python to the c Pickler requires a fair amount of changes.
For this reason, instead of simply adapting the current code to respect the new
constraints, I started back from scratch. This allows a new, clean API and
structure, that will be hopefully easier to understand for everyone.

I made a lot of comments, (sometimes overly verbose), to ease the review
process of this PR. Eventually, I hope the information they contain can be
transfered to a proper project documentation.

Implementation:

Changes to python

As opposed to the python pickler, The CPickler does not expose the save_*
family of functions, as well as low level isntructions such as write. These
methods can can neither be patched, or called, and the only customization
option we had initially was the dispatch table, that is called for all types
BUT a few special cases, including classes and functions, the two principal
use-cases of cloudpickle.

As this makes it simply impossible to modify pickling behavior for such types,
we patched the C pickler for it to allow a user defined reduction callback for
functions and classes. This idea was suggested by @pitrou.

The direct consequence is that functions and classes now have to follow the
save_reduce-load_build pickling/depickling process. Unfortunaltely,
this API is not well suited for custom builtin-type saving: in particular,
the state setting part of load_build (function that reconstructs an object from
a reduce value) assumes all attributes of an object are writeable, which is not
the case for C types (especially function.__globals__ and
function.__closure__)

For this reason, we also changed the API of save_reduce, allowing to add a
custom state_setter, that will be called at unpickling time.

You can view the totals changes in this diff

Individual PRs to CPython:

Changes to cloudpickle

Functions and classes are the two main types affected by this PR. The main
challenge was to make the saving process fit into the save_reduce API.

Outside of these types, the actuall reduction process remains intact.

However, now that any customization must return a tuple, I decided to adopt a
new naming, hopefully clearer naming style for functions. You will see by
yourselves.

How to build this version locally

Until the final release of Python 3.8, you need to build python from upstream's master branch

git clone [email protected]:python/cpython.git
cd cpython
./configure
make

To be able to use external modules you need a virtual environment, using for
example the venv module:

./python -m venv /path/to/local/virtualenv

Clone and install cloudpickle and its dependencies

cd /path/to/cloudpickle
git clone [email protected]:cloudpipe/cloudpickle.git
git fetch origin pull/ID/head:fast-cloudpickle
/path/to/local/virtualenv/bin/python -mpip install -rdev-requirements.txt
/path/to/local/virtualenv/bin/python -mpip install .

Finally, rum the tests:

/path/to/local/virtualenv/bin/python -mpytest tests/

Bechmarks:

  • Benchmarks of a "concrete", end-to-end use-case using loky can be found
    here. To run the benchmarks, you also need the master version of loky.

@dimosped
Copy link

dimosped commented Mar 11, 2019

Thank you for putting this PR together.
Do you have an expected date this will merge on master?

Also any plans for making this change backwards compatible < 3.8 or even 2.7 ?

@pierreglaser
Copy link
Member Author

pierreglaser commented Mar 11, 2019

Do you have an expected date this will merge on master?

It will all depend on whether cpython core developers agree on the patches we proposed. I would say (rough estimate) between zero and two months.

Also any plans for making this change backwards compatible < 3.8 or even 2.7 ?

This is still an open question: doing this would mean including the new _pickle.c code to cloudpickle and maintaining it.

@ogrisel
Copy link
Contributor

ogrisel commented Mar 18, 2019

This is still an open question: doing this would mean including the new _pickle.c code to cloudpickle and maintaining it.

Another option (for Python 3.6+ only) would be to include the change in the pickle5 backport and make it a soft dependency for cloudpickle.

pierreglaser added a commit to pierreglaser/loky that referenced this pull request Mar 18, 2019
If a Pickler instance already has methods in its dispatch table, do not
override them. This bug showed up in
cloudpipe/cloudpickle#253, where the
Cloudpickler instances derives from the C Pickler, and most custom
saving methods are inside dispatch_table, not dispatch.
tomMoral pushed a commit to joblib/loky that referenced this pull request Mar 18, 2019
If a Pickler instance already has methods in its dispatch table, do not
override them. This bug showed up in
cloudpipe/cloudpickle#253, where the
Cloudpickler instances derives from the C Pickler, and most custom
saving methods are inside dispatch_table, not dispatch.
Copy link
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all the new helpers functions and reducers should be made private (with a leading _ in their name so as to make it explicit that this is not part of the public API and can change without a deprecation notice).

Furthermore I find this PR a bit hard to review because some the helpers are duplicates from existing helpers in cloudpickle.cloudpickle and it's hard to know if they are exactly the same or not. Hence I suggest the following:

cloudpickle/__init__.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Mar 22, 2019

@pierreglaser I'm sorry about my bad memory, but did you post a PR and/or a Python issue about your changes? I don't remember.

If a change is needed in pickle protocol 5, you'd better tell me also (and/or post a thread on python-dev to gather feedback).

@pitrou
Copy link
Member

pitrou commented Mar 22, 2019

Another option (for Python 3.6+ only) would be to include the change in the pickle5 backport and make it a soft dependency for cloudpickle.

That would be theoretically ok with me. Whether I concretely agree or not depends on how harder it makes to maintain the backport (right now it works with simple diff scripts and bit a manual work whenever there's a conflict).

@pierreglaser
Copy link
Member Author

did you post a PR and/or a Python issue about your changes?

@pitrou there is an issue in the bug tracker, with patches. I was waiting for feedback before submitting a PR.

If a change is needed in pickle protocol 5, you'd better tell me also

There is no change to the pickle protocol (no new opcode) itself, however I extended the save_reduce API, by adding a state_setter keyword argument.

@pitrou
Copy link
Member

pitrou commented Mar 22, 2019

Nowadays PRs are easier to read than raw patches, so I'd rather have that.

@ogrisel
Copy link
Contributor

ogrisel commented Mar 28, 2019

It seems that this PR needs to be updated to take the merge of #254 into account.

Copy link
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small batch of comments:

cloudpickle/cloudpickle.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a first review. Overall this looks good to me.

Have you tried to run some of the downstream test suites? I guess we have to do that manually because of the need to run against a patched python master.

cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
tests/cloudpickle_test.py Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/__init__.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
cloudpickle/cloudpickle_fast.py Outdated Show resolved Hide resolved
@pierreglaser
Copy link
Member Author

The more I think about it, the less I am convinced about the need for globals_ref to be a CloudPickler attribute. Having globals_ref as a cloudpickle global variable will not impact the CloudPickler memoization process, and allows us to get rid of this weird signature for functions-related reducers. @ogrisel WDYT?

@ogrisel
Copy link
Contributor

ogrisel commented Apr 26, 2019

We don't want to accumulate objects in a global datastructure. This can be considered a memory leak (although there would be only empty dicts there if I am correct).

I much prefer to avoid global variables if there is a natural way not have any (as in the current code).

@pierreglaser
Copy link
Member Author

pierreglaser commented May 9, 2019

@ogrisel will cloudpickle.cloudpickle and cloudpickle.cloudpickle_fast be considered private after this PR is merged? More precisely, should cloudpickle.cloudpickle still be expected to work on Python versions >= 3.8?

(the CI breaks because cloudpickle.cloudpickle instanciantes a global variable - _cell_set_template_code - using the types.CodeType constructor, whose signature changed in 3.8 to allow for positional-only parameters)

@ogrisel
Copy link
Contributor

ogrisel commented May 10, 2019

More precisely, should cloudpickle.cloudpickle still be expected to work on Python versions >= 3.8?

I think it would be a good idea to ensure forward compatibility for the pure Python Pickler implementation if it's not to complicated to get.

@ogrisel
Copy link
Contributor

ogrisel commented May 13, 2019

I merged #246 which will impact this PR significantly unfortunately.

@ogrisel
Copy link
Contributor

ogrisel commented May 13, 2019

@pierreglaser would it be easier for your to redo this PR only once #262 is merged? In which case I would put the priority on reviewing it.

cloudpickle/__init__.py Outdated Show resolved Hide resolved
@ogrisel
Copy link
Contributor

ogrisel commented Jun 7, 2019

This comment #253 (comment) does not seem to be addressed.

@pierreglaser
Copy link
Member Author

There was a race condition with your comment and my last commit :) I'll cherry-pick in a separate PR.

@ogrisel
Copy link
Contributor

ogrisel commented Jun 7, 2019

Haha, as you wish. Please add an entry to the changelog so that we can merge this PR :)

@pierreglaser
Copy link
Member Author

I changed my mind I'll leave it as is, as this is part of the refactoring of this PR to change this helper. I'll add an entry to the changelog and make a last pass.

@pierreglaser
Copy link
Member Author

@ogrisel this LGTM as well!

@ogrisel ogrisel merged commit 167e163 into cloudpipe:master Jun 7, 2019
@ogrisel
Copy link
Contributor

ogrisel commented Jun 7, 2019

Merged! Thank you very much @pierreglaser!

@pierreglaser pierreglaser deleted the fast-cloudpickle branch June 12, 2019 15:01
wip-sync pushed a commit to NetBSD/pkgsrc-wip that referenced this pull request Apr 4, 2022
2.0.0
=====

- Python 3.5 is no longer supported.

- Support for registering modules to be serialised by value. This allows code
  defined in local modules to be serialised and executed remotely without those
  local modules installed on the remote machine.
  ([PR #417](cloudpipe/cloudpickle#417))

- Fix a side effect altering dynamic modules at pickling time.
  ([PR #426](cloudpipe/cloudpickle#426))

- Support for pickling type annotations on Python 3.10 as per [PEP 563](
  https://www.python.org/dev/peps/pep-0563/)
  ([PR #400](cloudpipe/cloudpickle#400))

- Stricter parametrized type detection heuristics in
  _is_parametrized_type_hint to limit false positives.
  ([PR #409](cloudpipe/cloudpickle#409))

- Support pickling / depickling of OrderedDict KeysView, ValuesView, and
  ItemsView, following similar strategy for vanilla Python dictionaries.
  ([PR #423](cloudpipe/cloudpickle#423))

- Suppressed a source of non-determinism when pickling dynamically defined
  functions and handles the deprecation of co_lnotab in Python 3.10+.
  ([PR #428](cloudpipe/cloudpickle#428))

1.6.0
=====

- `cloudpickle`'s pickle.Pickler subclass (currently defined as
  `cloudpickle.cloudpickle_fast.CloudPickler`) can and should now be accessed
  as `cloudpickle.Pickler`. This is the only officially supported way of
  accessing it.
  ([issue #366](cloudpipe/cloudpickle#366))

- `cloudpickle` now supports pickling `dict_keys`, `dict_items` and
  `dict_values`.
  ([PR #384](cloudpipe/cloudpickle#384))

1.5.0
=====

- Fix a bug causing cloudpickle to crash when pickling dynamically created,
  importable modules.
  ([issue #360](cloudpipe/cloudpickle#354))

- Add optional dependency on `pickle5` to get improved performance on
  Python 3.6 and 3.7.
  ([PR #370](cloudpipe/cloudpickle#370))

- Internal refactoring to ease the use of `pickle5` in cloudpickle
  for Python 3.6 and 3.7.
  ([PR #368](cloudpipe/cloudpickle#368))

1.4.1
=====

- Fix incompatibilities between cloudpickle 1.4.0 and Python 3.5.0/1/2
  introduced by the new support of cloudpickle for pickling typing constructs.
  ([issue #360](cloudpipe/cloudpickle#360))

- Restore compat with loading dynamic classes pickled with cloudpickle
  version 1.2.1 that would reference the `types.ClassType` attribute.
  ([PR #359](cloudpipe/cloudpickle#359))

1.4.0
=====

**This version requires Python 3.5 or later**

- cloudpickle can now all pickle all constructs from the ``typing`` module
  and the ``typing_extensions`` library in Python 3.5+
  ([PR #318](cloudpipe/cloudpickle#318))

- Stop pickling the annotations of a dynamic class for Python < 3.6
  (follow up on #276)
  ([issue #347](cloudpipe/cloudpickle#347))

- Fix a bug affecting the pickling of dynamic `TypeVar` instances on Python 3.7+,
  and expand the support for pickling `TypeVar` instances (dynamic or non-dynamic)
  to Python 3.5-3.6 ([PR #350](cloudpipe/cloudpickle#350))

- Add support for pickling dynamic classes subclassing `typing.Generic`
  instances on Python 3.7+
  ([PR #351](cloudpipe/cloudpickle#351))

1.3.0
=====

- Fix a bug affecting dynamic modules occuring with modified builtins
  ([issue #316](cloudpipe/cloudpickle#316))

- Fix a bug affecting cloudpickle when non-modules objects are added into
  sys.modules
  ([PR #326](cloudpipe/cloudpickle#326)).

- Fix a regression in cloudpickle and python3.8 causing an error when trying to
  pickle property objects.
  ([PR #329](cloudpipe/cloudpickle#329)).

- Fix a bug when a thread imports a module while cloudpickle iterates
  over the module list
  ([PR #322](cloudpipe/cloudpickle#322)).

- Add support for out-of-band pickling (Python 3.8 and later).
  https://docs.python.org/3/library/pickle.html#example
  ([issue #308](cloudpipe/cloudpickle#308))

- Fix a side effect that would redefine `types.ClassTypes` as `type`
  when importing cloudpickle.
  ([issue #337](cloudpipe/cloudpickle#337))

- Fix a bug affecting subclasses of slotted classes.
  ([issue #311](cloudpipe/cloudpickle#311))

- Dont pickle the abc cache of dynamically defined classes for Python 3.6-
  (This was already the case for python3.7+)
  ([issue #302](cloudpipe/cloudpickle#302))

1.2.2
=====

- Revert the change introduced in
  ([issue #276](cloudpipe/cloudpickle#276))
  attempting to pickle functions annotations for Python 3.4 to 3.6. It is not
  possible to pickle complex typing constructs for those versions (see
  [issue #193]( cloudpipe/cloudpickle#193))

- Fix a bug affecting bound classmethod saving on Python 2.
  ([issue #288](cloudpipe/cloudpickle#288))

- Add support for pickling "getset" descriptors
  ([issue #290](cloudpipe/cloudpickle#290))

1.2.1
=====

- Restore (partial) support for Python 3.4 for downstream projects that have
  LTS versions that would benefit from cloudpickle bug fixes.

1.2.0
=====

- Leverage the C-accelerated Pickler new subclassing API (available in Python
  3.8) in cloudpickle. This allows cloudpickle to pickle Python objects up to
  30 times faster.
  ([issue #253](cloudpipe/cloudpickle#253))

- Support pickling of classmethod and staticmethod objects in python2.
  arguments. ([issue #262](cloudpipe/cloudpickle#262))

- Add support to pickle type annotations for Python 3.5 and 3.6 (pickling type
  annotations was already supported for Python 3.7, Python 3.4 might also work
  but is no longer officially supported by cloudpickle)
  ([issue #276](cloudpipe/cloudpickle#276))

- Internal refactoring to proactively detect dynamic functions and classes when
  pickling them.  This refactoring also yields small performance improvements
  when pickling dynamic classes (~10%)
  ([issue #273](cloudpipe/cloudpickle#273))

1.1.1
=====

- Minor release to fix a packaging issue (Markdown formatting of the long
  description rendered on pypi.org). The code itself is the same as 1.1.0.

1.1.0
=====

- Support the pickling of interactively-defined functions with positional-only
  arguments. ([issue #266](cloudpipe/cloudpickle#266))

- Track the provenance of dynamic classes and enums so as to preseve the
  usual `isinstance` relationship between pickled objects and their
  original class defintions.
  ([issue #246](cloudpipe/cloudpickle#246))

1.0.0
=====

- Fix a bug making functions with keyword-only arguments forget the default
  values of these arguments after being pickled.
  ([issue #264](cloudpipe/cloudpickle#264))

0.8.1
=====

- Fix a bug (already present before 0.5.3 and re-introduced in 0.8.0)
  affecting relative import instructions inside depickled functions
  ([issue #254](cloudpipe/cloudpickle#254))

0.8.0
=====

- Add support for pickling interactively defined dataclasses.
  ([issue #245](cloudpipe/cloudpickle#245))

- Global variables referenced by functions pickled by cloudpickle are now
  unpickled in a new and isolated namespace scoped by the CloudPickler
  instance. This restores the (previously untested) behavior of cloudpickle
  prior to changes done in 0.5.4 for functions defined in the `__main__`
  module, and 0.6.0/1 for other dynamic functions.

0.7.0
=====

- Correctly serialize dynamically defined classes that have a `__slots__`
  attribute.
  ([issue #225](cloudpipe/cloudpickle#225))

0.6.1
=====

- Fix regression in 0.6.0 which breaks the pickling of local function defined
  in a module, making it impossible to access builtins.
  ([issue #211](cloudpipe/cloudpickle#211))

0.6.0
=====

- Ensure that unpickling a function defined in a dynamic module several times
  sequentially does not reset the values of global variables.
  ([issue #187](cloudpipe/cloudpickle#205))

- Restrict the ability to pickle annotations to python3.7+ ([issue #193](
  cloudpipe/cloudpickle#193) and [issue #196](
  cloudpipe/cloudpickle#196))

- Stop using the deprecated `imp` module under Python 3.
  ([issue #207](cloudpipe/cloudpickle#207))

- Fixed pickling issue with singleton types `NoneType`, `type(...)` and
  `type(NotImplemented)` ([issue #209](cloudpipe/cloudpickle#209))

0.5.6
=====

- Ensure that unpickling a locally defined function that accesses the global
  variables of a module does not reset the values of the global variables if
  they are already initialized.
  ([issue #187](cloudpipe/cloudpickle#187))

0.5.5
=====

- Fixed inconsistent version in `cloudpickle.__version__`.

0.5.4
=====

- Fixed a pickling issue for ABC in python3.7+ ([issue #180](
  cloudpipe/cloudpickle#180)).

- Fixed a bug when pickling functions in `__main__` that access global
  variables ([issue #187](
  cloudpipe/cloudpickle#187)).

0.5.3
=====
- Fixed a crash in Python 2 when serializing non-hashable instancemethods of built-in
  types ([issue #144](cloudpipe/cloudpickle#144)).

- itertools objects can also pickled
  ([PR #156](cloudpipe/cloudpickle#156)).

- `logging.RootLogger` can be also pickled
  ([PR #160](cloudpipe/cloudpickle#160)).

0.5.2
=====

- Fixed a regression: `AttributeError` when loading pickles that hold a
  reference to a dynamically defined class from the `__main__` module.
  ([issue #131]( cloudpipe/cloudpickle#131)).

- Make it possible to pickle classes and functions defined in faulty
  modules that raise an exception when trying to look-up their attributes
  by name.

0.5.1
=====

- Fixed `cloudpickle.__version__`.

0.5.0
=====

- Use `pickle.HIGHEST_PROTOCOL` by default.

0.4.4
=====

- `logging.RootLogger` can be also pickled
  ([PR #160](cloudpipe/cloudpickle#160)).

0.4.3
=====

- Fixed a regression: `AttributeError` when loading pickles that hold a
  reference to a dynamically defined class from the `__main__` module.
  ([issue #131]( cloudpipe/cloudpickle#131)).

- Fixed a crash in Python 2 when serializing non-hashable instancemethods of built-in
  types. ([issue #144](cloudpipe/cloudpickle#144))

0.4.2
=====

- Restored compatibility with pickles from 0.4.0.
- Handle the `func.__qualname__` attribute.

0.4.1
=====

- Fixed a crash when pickling dynamic classes whose `__dict__` attribute was
  defined as a [`property`](https://docs.python.org/3/library/functions.html#property).
  Most notably, this affected dynamic [namedtuples](https://docs.python.org/2/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields)
  in Python 2. (cloudpipe/cloudpickle#113)
- Cloudpickle now preserves the `__module__` attribute of functions (cloudpipe/cloudpickle#118).
- Fixed a crash when pickling modules that don't have a `__package__` attribute (cloudpipe/cloudpickle#116).

0.4.0
=====

* Fix functions with empty cells
* Allow pickling Logger objects
* Fix crash when pickling dynamic class cycles
* Ignore "None" mdoules added to sys.modules
* Support WeakSets and ABCMeta instances
* Remove non-standard `__transient__` support
* Catch exception from `pickle.whichmodule()`

0.3.1
=====

* Fix version information and ship a changelog

 0.3.0
=====

* Import submodules accessed by pickled functions
* Support recursive functions inside closures
* Fix `ResourceWarnings` and `DeprecationWarnings`
* Assume modules with `__file__` attribute are not dynamic

0.2.2
=====

* Support Python 3.6
* Support Tornado Coroutines
* Support builtin methods
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants