[Bug]: pickle_library option is ignored #28558
Comments
.take-issue |
After digging into this more, I see that a
However, the way I'm interpreting various related parts of the code, there appear to be some inconsistencies:
So far, I have been unable to decipher where there should be logic that uses the internal
The only thing I've managed to do that works for this case is to modify
I've forked and cloned the repo, but for the life of me, I cannot get the Python test suite to run without failure. I have managed to get the word count example to run successfully (as mentioned in the contributing guide), but I cannot determine how to run a more comprehensive set of tests that covers pickling. As a result, I am not confident that I can test my proposed fix (or add new tests, if necessary). The suite also takes extremely long to run: I killed the test process after 15 minutes, then spent probably another 15 minutes trying to kill all of the spawned processes, which seem to restart themselves and are not cleaned up when I halt the main test process via Ctrl-C.
Does anybody have some guidance here? |
@tvalentyn can you take a look at this issue? |
This is not the case, functions are pickled using |
What is the nature of the elements in your collection? I am wondering why PickleCoder is used for encoding them. Is that intentional?
I think we should have been using DeterministicFastPrimitivesCoder there. Feel free to send a PR. Also note that it is possible to create custom coders in your pipeline and use them. |
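For readers unfamiliar with what a custom coder involves, here is a minimal sketch of the shape of one. It is a standard-library stand-in, not Beam code: the class name `SortedJsonCoder` is hypothetical, and in a real pipeline the class would presumably subclass `apache_beam.coders.Coder`, which defines the same three methods.

```python
import json


class SortedJsonCoder:
    """Hypothetical duck-typed stand-in for a Beam coder (stdlib only).

    It mirrors the encode / decode / is_deterministic methods of a coder,
    but does not inherit from any Beam class.
    """

    def encode(self, value):
        # Sorting keys makes equal dicts encode to identical bytes.
        return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")

    def decode(self, encoded):
        return json.loads(encoded.decode("utf-8"))

    def is_deterministic(self):
        # Equal values always produce the same bytes, so this is safe to claim.
        return True


coder = SortedJsonCoder()
round_tripped = coder.decode(coder.encode({"b": 1, "a": [1, 2]}))
```

Note that a coder like this only affects how elements of a collection are encoded. It does not change how functions passed to transforms are pickled.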
Perhaps that should be the case, but that is not what I am experiencing, which is why I'm reporting this. Apache Beam is very new to me, so it could very well be that I simply don't know what I'm doing, and I'm missing something important. I'll attempt to summarize and clarify what I'm doing and what I'm encountering:
My goal is to drop a problematic variable ( Is there something I'm missing in order to make that happen? |
@tvalentyn, would you mind expanding on the following? How would I go about doing so?
|
Thanks for the link, but given that example, I think that perhaps we're talking about 2 different things. The coder issue I'm having is with respect to the internal beam mechanism used to pickle objects to pass between processes. What's happening is a failure to pickle a function I'm using, not the data I'm dealing with. Specifically, the error occurs while attempting to pickle the function returned by the higher-order
from kerchunk.combine import drop
...
| CombineReferences(
    concat_dims=["time"],
    identical_dims=["lat", "lon", "channel"],
    mzz_kwargs={"preprocess": drop("lst_unc_sys")},
)
The call to
That indicates that it cannot pickle the
Given that the indicated function is not a top-level function, I created a top-level function with the same logic as the local function, and used that as the value of |
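The distinction between a local function and a top-level function can be reproduced with the standard library alone. The sketch below is illustrative only: `make_dropper` and `drop_prefix` are hypothetical names, and their logic is an assumption, not kerchunk's actual `drop` implementation.

```python
import functools
import pickle


def make_dropper(name):
    """Higher-order function that returns a nested (local) function."""
    def preprocess(refs):
        return {k: v for k, v in refs.items() if not k.startswith(name)}
    return preprocess


def drop_prefix(name, refs):
    """Top-level function with the same logic as the nested one."""
    return {k: v for k, v in refs.items() if not k.startswith(name)}


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# The nested function cannot be found by its qualified name, so pickle rejects it.
local_fn = make_dropper("lst_unc_sys")

# functools.partial binds the argument while keeping a top-level function underneath.
top_level_fn = functools.partial(drop_prefix, "lst_unc_sys")
```

The standard `pickle` module serializes functions by reference to their qualified name, which is why the nested function fails while the `functools.partial` wrapper can be pickled whenever `drop_prefix` is importable. Libraries such as cloudpickle and dill can serialize functions by value instead.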
What happened?
When constructing a Pipeline, the option pickle_library is checked during construction, raising a ValueError when the value supplied is not one of the allowed values ("default", "dill", or "cloudpickle").
Unfortunately, however, it is ignored when it comes time to use pickling.
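To make the reported behavior concrete, here is a minimal standard-library sketch of how an option can be validated at construction and still have no effect. The `Options` class and `serialize` function are hypothetical and are not Beam's code. Only the three allowed values come from this report.

```python
import pickle

# Allowed values as listed in this report.
ALLOWED_PICKLE_LIBRARIES = ("default", "dill", "cloudpickle")


class Options:
    """Hypothetical options holder that validates its value up front."""

    def __init__(self, pickle_library="default"):
        if pickle_library not in ALLOWED_PICKLE_LIBRARIES:
            raise ValueError(f"Unknown pickle_library: {pickle_library!r}")
        self.pickle_library = pickle_library


def serialize(value, options):
    """Hypothetical serializer that accepts options but never consults them."""
    # The standard pickle module is hard-coded here, so options.pickle_library
    # has no influence on the result.
    return pickle.dumps(value)


same_bytes = serialize([1, 2, 3], Options("cloudpickle")) == serialize([1, 2, 3], Options("default"))
```

In this sketch an invalid value is rejected immediately, yet every valid value yields the same bytes, because the serializer never reads the option.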
I was able to produce a PicklingError via a pipeline that uses a MultiZarrToZarr preprocessor, like so (abridged from linked issue, but illustrative):
The error produced was the following:
It took quite a bit of digging, but I discovered that I should be able to address this issue by setting the following pipeline options:
save_main_session=True
pickle_library="cloudpickle"
Therefore I tried this:
Unfortunately, this produced the identical error.
After a bit of digging, I discovered that the problem is with the PickleCoder class's _create_impl method (v2.50.0):
Specifically, the method refers directly to pickle.dumps and pickle.loads from the standard pickle module (i.e., import pickle appears at the top of coders.py), rather than from the repo's pickler module, which is the module where the pickle library is set via the specified pipeline options described above.
When I added the import of pickler and locally modified the _create_impl method as follows, my pipeline ran without error:
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components