[Python] ParquetDataset and ParquetPiece not serializable #21625
Comments
Matthew Rocklin / @mrocklin: This is pretty critical for Dask usage. Anyone trying to use PyArrow 0.13 in a Dask workflow will break pretty hard here, and this isn't something we can easily work around on our side. It would be great to know whether this is likely to be resolved quickly, or whether we should warn users strongly away from 0.13. In general, I recommend serialization tests for any project looking to interact with distributed computing libraries in Python. Often this consists of tests like the following for any type that a parallel computing framework might want to interact with:

```python
import pickle

def test_serialization():
    obj = MyObj()
    obj2 = pickle.loads(pickle.dumps(obj))
    assert obj == obj2
```
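A concrete instance of that roundtrip test for the type this issue is about might look like the sketch below (the file path is hypothetical, and comparing the `path` attribute rather than relying on object equality is an assumption about the class):

```python
import pickle

import pyarrow.parquet as pq

def test_parquet_piece_serialization():
    # On pyarrow 0.13.0, the dumps() call raises:
    #   TypeError: no default __reduce__ due to non-trivial __cinit__
    piece = pq.ParquetDatasetPiece('my_data.parquet/part.0.parquet')
    piece2 = pickle.loads(pickle.dumps(piece))
    assert piece.path == piece2.path
```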
Sarah Bird:

```python
import dask.dataframe as dd
from dask.distributed import Client

Client()
df = dd.read_parquet('my_data.parquet', engine='pyarrow')
df.head()
```

Let me know if I can better help.
Sarah Bird: argument_0 object
argument_1 object
argument_2 object
argument_3 object
argument_4 object
argument_5 object
argument_6 object
argument_7 object
arguments object
arguments_len int64
call_stack object
crawl_id int32
document_url object
func_name object
in_iframe bool
operation object
script_col int64
script_line int64
script_loc_eval object
script_url object
symbol object
time_stamp datetime64[ns, UTC]
top_level_url object
value_1000 object
value_len int64
visit_id int64
dtype: object My traceback is: distributed.protocol.pickle - INFO - Failed to serialize (<function safe_head at 0x7f3d57f2c7b8>, (<function _read_pyarrow_parquet_piece at 0x7f3d57ef9268>, <dask.bytes.local.LocalFileSystem object at 0x7f3db58
ea4e0>, ParquetDatasetPiece('javascript_10percent_value_1000_only.parquet/part.0.parquet', row_group=None, partition_keys=[]), ['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4', 'argument_5'
, 'argument_6', 'argument_7', 'arguments', 'arguments_len', 'call_stack', 'crawl_id', 'document_url', 'func_name', 'in_iframe', 'operation', 'script_col', 'script_line', 'script_loc_eval', 'script_url', 'symbol
', 'time_stamp', 'top_level_url', 'value_1000', 'value_len', 'visit_id'], [], False, None, []), 5). Exception: no default __reduce__ due to non-trivial __cinit__
distributed.protocol.core - CRITICAL - Failed to Serialize
Traceback (most recent call last):
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/core.py", line 54, in dumps
for key, value in data.items()
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/core.py", line 55, in <dictcomp>
if type(value) is Serialize}
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 164, in serialize
raise TypeError(msg, str(x)[:10000])
TypeError: ('Could not serialize object of type tuple.', "(<function safe_head at 0x7f3d57f2c7b8>, (<function _read_pyarrow_parquet_piece at 0x7f3d57ef9268>, <dask.bytes.local.LocalFileSystem object at 0x7f3db5
8ea4e0>, ParquetDatasetPiece('javascript_10percent_value_1000_only.parquet/part.0.parquet', row_group=None, partition_keys=[]), ['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4', 'argument_5
', 'argument_6', 'argument_7', 'arguments', 'arguments_len', 'call_stack', 'crawl_id', 'document_url', 'func_name', 'in_iframe', 'operation', 'script_col', 'script_line', 'script_loc_eval', 'script_url', 'symbo
l', 'time_stamp', 'top_level_url', 'value_1000', 'value_len', 'visit_id'], [], False, None, []), 5)")
distributed.comm.utils - INFO - Unserializable Message: [{'op': 'update-graph', 'tasks': {"('head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)": <Serialize: ('read-parquet-head-1-5-read-parquet-daacce
e11e9cff29ad1ee5622ffd6c69', 0)>, "('read-parquet-head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)": <Serialize: (<function safe_head at 0x7f3d57f2c7b8>, (<function _read_pyarrow_parquet_piece at 0x7
f3d57ef9268>, <dask.bytes.local.LocalFileSystem object at 0x7f3db58ea4e0>, ParquetDatasetPiece('javascript_10percent_value_1000_only.parquet/part.0.parquet', row_group=None, partition_keys=[]), ['argument_0', '
argument_1', 'argument_2', 'argument_3', 'argument_4', 'argument_5', 'argument_6', 'argument_7', 'arguments', 'arguments_len', 'call_stack', 'crawl_id', 'document_url', 'func_name', 'in_iframe', 'operation', 's
cript_col', 'script_line', 'script_loc_eval', 'script_url', 'symbol', 'time_stamp', 'top_level_url', 'value_1000', 'value_len', 'visit_id'], [], False, None, []), 5)>}, 'dependencies': {"('head-1-5-read-parquet
-daaccee11e9cff29ad1ee5622ffd6c69', 0)": ["('read-parquet-head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)"], "('read-parquet-head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)": []}, 'keys'
: ["('head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)"], 'restrictions': {}, 'loose_restrictions': None, 'priority': {"('read-parquet-head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)": 0,
"('head-1-5-read-parquet-daaccee11e9cff29ad1ee5622ffd6c69', 0)": 1}, 'user_priority': 0, 'resources': None, 'submitting_task': None, 'retries': None, 'fifo_timeout': '60s', 'actors': None}]
distributed.comm.utils - ERROR - ('Could not serialize object of type tuple.', "(<function safe_head at 0x7f3d57f2c7b8>, (<function _read_pyarrow_parquet_piece at 0x7f3d57ef9268>, <dask.bytes.local.LocalFileSys
tem object at 0x7f3db58ea4e0>, ParquetDatasetPiece('javascript_10percent_value_1000_only.parquet/part.0.parquet', row_group=None, partition_keys=[]), ['argument_0', 'argument_1', 'argument_2', 'argument_3', 'ar
gument_4', 'argument_5', 'argument_6', 'argument_7', 'arguments', 'arguments_len', 'call_stack', 'crawl_id', 'document_url', 'func_name', 'in_iframe', 'operation', 'script_col', 'script_line', 'script_loc_eval'
, 'script_url', 'symbol', 'time_stamp', 'top_level_url', 'value_1000', 'value_len', 'visit_id'], [], False, None, []), 5)")
Traceback (most recent call last):
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/batched.py", line 94, in _background_send
on_error='raise')
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/comm/tcp.py", line 224, in write
'recipient': self._peer_addr})
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/comm/utils.py", line 50, in to_frames
res = yield offload(_to_frames)
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/comm/utils.py", line 43, in _to_frames
context=context))
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/core.py", line 54, in dumps
for key, value in data.items()
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/core.py", line 55, in <dictcomp>
if type(value) is Serialize}
File "/home/bird/miniconda3/envs/pyarrowtest/lib/python3.7/site-packages/distributed/protocol/serialize.py", line 164, in serialize
raise TypeError(msg, str(x)[:10000])
TypeError: ('Could not serialize object of type tuple.', "(<function safe_head at 0x7f3d57f2c7b8>, (<function _read_pyarrow_parquet_piece at 0x7f3d57ef9268>, <dask.bytes.local.LocalFileSystem object at 0x7f3db5
8ea4e0>, ParquetDatasetPiece('javascript_10percent_value_1000_only.parquet/part.0.parquet', row_group=None, partition_keys=[]), ['argument_0', 'argument_1', 'argument_2', 'argument_3', 'argument_4', 'argument_5
', 'argument_6', 'argument_7', 'arguments', 'arguments_len', 'call_stack', 'crawl_id', 'document_url', 'func_name', 'in_iframe', 'operation', 'script_col', 'script_line', 'script_loc_eval', 'script_url', 'symbo
l', 'time_stamp', 'top_level_url', 'value_1000', 'value_len', 'visit_id'], [], False, None, []), 5)") |
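Until this is fixed upstream, one possible client-side workaround (a sketch, not the fix that shipped in Arrow) is to register a custom reducer with `copyreg`, so pickle rebuilds each piece from plain arguments instead of walking the unpicklable Cython objects it references. The keyword names below are taken from the `ParquetDatasetPiece(...)` repr in the log above, but reconstructing pieces this way is an assumption about their internals:

```python
import copyreg

import pyarrow.parquet as pq

def _make_piece(path, row_group, partition_keys):
    # Hypothetical helper: rebuild a piece from plain-Python arguments.
    return pq.ParquetDatasetPiece(path, row_group=row_group,
                                  partition_keys=partition_keys)

def _reduce_piece(piece):
    # Tell pickle to serialize only (path, row_group, partition_keys).
    return _make_piece, (piece.path, piece.row_group, piece.partition_keys)

copyreg.pickle(pq.ParquetDatasetPiece, _reduce_piece)
```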
Wes McKinney / @wesm: FWIW, ParquetDatasetPiece is regarded as an implementation detail and wasn't necessarily intended to be serializable, so we will keep this in mind for the future.
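The general way for such a class to opt in to pickling later is to define `__reduce__` so instances are rebuilt from plain constructor arguments; a minimal illustration of the pattern (not the actual patch that landed in Arrow):

```python
import pickle

class Piece:
    """Stand-in for a piece-like class that pickles by reconstruction."""

    def __init__(self, path, row_group=None, partition_keys=None):
        self.path = path
        self.row_group = row_group
        self.partition_keys = partition_keys or []

    def __reduce__(self):
        # On load, pickle calls Piece(path, row_group, partition_keys).
        return Piece, (self.path, self.row_group, self.partition_keys)

piece = pickle.loads(pickle.dumps(Piece('part.0.parquet')))
assert piece.path == 'part.0.parquet'
```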
Since 0.13.0, ParquetDataset and ParquetDatasetPiece instances are no longer serialisable, which means that dask.distributed cannot pass them between processes in order to load parquet in parallel.
Example:
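A minimal sketch of the failure ('my_data.parquet' is a hypothetical path):

```python
import pickle

import pyarrow.parquet as pq

dataset = pq.ParquetDataset('my_data.parquet')  # hypothetical path

# On pyarrow 0.13.0, both calls fail with:
#   TypeError: no default __reduce__ due to non-trivial __cinit__
# raised while pickling the dataset's Schema (a Cython object).
pickle.dumps(dataset)
pickle.dumps(dataset.pieces[0])
```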
The indicated Schema instance is also referenced by the ParquetDatasetPiece instances.
ref: dask/distributed#2597
Environment: osx python36/conda cloudpickle 0.8.1
arrow-cpp 0.13.0 py36ha71616b_0 conda-forge
pyarrow 0.13.0 py36hb37e6aa_0 conda-forge
Reporter: Martin Durant / @martindurant
Assignee: Krisztian Szucs / @kszucs
Note: This issue was originally created as ARROW-5144. Please see the migration documentation for further details.