Specifying path for pl.read_delta #5785

Closed
2 tasks done
stinodego opened this issue Dec 12, 2022 · 13 comments
Labels
bug Something isn't working python Related to Python Polars

Comments

@stinodego
Contributor

stinodego commented Dec 12, 2022

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I was very excited to start using the Delta reading functionality added in #5761. There seem to be some issues with the path parameter though.

It appears that:

  • Relative paths are not supported
  • "Cloud" filesystems are not supported

Reproducible example

This assumes you have a delta table in the data/test_delta directory.

import polars as pl

abs_path = "/home/stijn/project/data/test_delta"
df = pl.read_delta(abs_path)  # This works fine

rel_path = "data/test_delta"
df = pl.read_delta(rel_path)
# pyarrow.lib.ArrowInvalid: URI has empty scheme: 'data/test_delta'

azure_path = "az://project/test_delta"
# storage_options contains secrets, so it is not listed explicitly here
df = pl.read_delta(azure_path, storage_options=storage_options)
# pyarrow.lib.ArrowInvalid: Unrecognized filesystem type in URI: az://project/test_delta

Expected behavior

I expect the behaviour of pl.read_delta() to be identical to deltalake.DeltaTable(). In the cases above, this is not the case.

Installed versions

---Version info---
Polars: 0.15.3
Index type: UInt32
Platform: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python: 3.11.0 (main, Nov  1 2022, 09:16:00) [GCC 11.2.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.23.5
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: <not installed>

Deltalake version 0.6.4

@stinodego stinodego added bug Something isn't working python Related to Python Polars labels Dec 12, 2022
@stinodego stinodego changed the title Specifying path Specifying path for pl.read_delta Dec 12, 2022
@ritchie46
Member

ritchie46 commented Dec 12, 2022

Hmm.. I see that deltalake.DeltaTable can convert to pyarrow dataset. Maybe we should instantiate via those?

@stinodego
Contributor Author

stinodego commented Dec 12, 2022

> Hmm.. I see that deltalake.DeltaTable can convert to pyarrow dataset. Maybe we should instantiate via those?

Before this new Polars functionality, I would do:

from deltalake import DeltaTable

dt = DeltaTable(path)

df = pl.from_arrow(dt.to_pyarrow_table())

But I'm not sure how this would work in a 'scanning' fashion.

Maybe @chitralverma has some input here :)

@chitralverma
Contributor

Let me check it out today

@chitralverma
Contributor

> Hmm.. I see that deltalake.DeltaTable can convert to pyarrow dataset. Maybe we should instantiate via those?

That's the scan_delta functionality; it works lazily over a PyArrow dataset.

> Before this new Polars functionality, I would do:
>
> from deltalake import DeltaTable
>
> dt = DeltaTable(path)
>
> df = pl.from_arrow(dt.to_pyarrow_table())
>
> But I'm not sure how this would work in a 'scanning' fashion.
>
> Maybe @chitralverma has some input here :)

Hi @stinodego, there are currently two issues with this, and neither was caught by the unit tests:

  1. An import problem, which you did not run into but which exists. I'm working on it first: Import error when using scan_delta #5790
  2. The relative path issue that you have reported here.

For the second issue, the reason is that scan_delta and read_delta expect a table_uri rather than a table location, so paths must be fully qualified.
This works directly with the deltalake package because it handles this internally with classes like DeltaStorageHandler. However, during a Polars lazy scan this class cannot be pickled, and the scan fails with the error below:
TypeError: cannot pickle 'builtins.DeltaFileSystemHandler' object
This issue does not occur with read_delta, because no pickling happens there.
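
As an illustration of the "fully qualified path" requirement, here is a minimal sketch of a caller-side workaround. The helper name `qualify_table_path` is hypothetical and not part of Polars or deltalake. It assumes POSIX-style local paths and relies only on the observation from the report above that an absolute local path works while a relative one does not:

```python
from pathlib import Path
from urllib.parse import urlparse


def qualify_table_path(path: str) -> str:
    """Return `path` unchanged if it has a URI scheme, else make it absolute.

    Hypothetical helper for illustration only; assumes POSIX-style paths
    (a Windows drive letter such as "C:" would be mistaken for a scheme).
    """
    if urlparse(path).scheme:
        # e.g. "az://project/test_delta": already a URI, leave it untouched.
        return path
    # No scheme: treat it as a local path and resolve it against the
    # current working directory. The path does not need to exist.
    return str(Path(path).resolve())


print(qualify_table_path("data/test_delta"))
print(qualify_table_path("az://project/test_delta"))
```

The result could then be passed to pl.read_delta in place of the relative path. This does nothing for the az:// case, which fails for the separate filesystem-support reason described in this comment.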

To get around this, I relied on the pyarrow.fs implementations directly. The issue with azure_path stems from the same pyarrow.fs implementation: Azure is not supported directly, only via fsspec.

My plan is to fix read_delta quickly first; for scan_delta, I am considering relying on fsspec as well.

Hope this clarifies things.

@chitralverma
Contributor

I have also opened an issue on the delta side for this. If they fix it, this should be quite straightforward.

@ritchie46
Member

@chitralverma maybe we can pickle a function that imports the DeltaFileSystemHandler?

If that does not work, we can pickle a string that we can run with eval that imports DeltaFileSystemHandler.
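
For readers unfamiliar with the idea, here is a generic sketch of "pickle a recipe instead of the object". It uses threading.Lock purely as a stand-in for an unpicklable object, and the `rebuild` helper and the recipe tuple layout are assumptions made for illustration. It is not the Polars implementation, and whether it would help with DeltaFileSystemHandler depends on being able to reconstruct that object from plain arguments, which this thread does not establish:

```python
import importlib
import pickle
import threading


def rebuild(recipe):
    """Import a factory by name and call it to recreate the object."""
    module_name, attr_name, args = recipe
    factory = getattr(importlib.import_module(module_name), attr_name)
    return factory(*args)


# Stand-in for an unpicklable handler: a lock cannot be pickled either.
try:
    pickle.dumps(threading.Lock())
    lock_is_picklable = True
except TypeError:
    lock_is_picklable = False

# The recipe is a plain tuple of strings, so it survives a pickle round trip.
recipe = ("threading", "Lock", ())
restored = rebuild(pickle.loads(pickle.dumps(recipe)))

print(lock_is_picklable)
print(type(restored).__name__)
```

The design choice is that only import names and constructor arguments cross the pickle boundary; the object itself is created fresh on the receiving side.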

@chitralverma
Contributor

> @chitralverma maybe we can pickle a function that imports the DeltaFileSystemHandler?
>
> If that does not work, we can pickle a string that we can run with eval that imports DeltaFileSystemHandler.

This is what's happening currently: we call delta_table.to_pyarrow_dataset(), which internally instantiates a DeltaStorageHandler, which in turn relies on a DeltaFileSystemHandler.

I tried some workarounds, but it won't pickle.

@ritchie46
Member

Does it cloudpickle? We could also add a cloudpickle version of _deser_and_exec and _scan_ds_impl.

@chitralverma
Contributor

chitralverma commented Dec 13, 2022

> Does it cloudpickle? We could also add a cloudpickle version of _deser_and_exec and _scan_ds_impl.

Nope

>>> cloudpickle.dumps(dsh)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/chitral/IdeaProjects/big_delta_table/test_venv/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/Users/chitral/IdeaProjects/big_delta_table/test_venv/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'builtins.DeltaFileSystemHandler' object
>>>

Neither does dill

@chitralverma
Contributor

chitralverma commented Dec 13, 2022

So, got some updates from the delta team regarding this here

Apparently, pickling is not natively supported by PyO3 modules.
PyO3/pyo3#100

I've already fixed things for read_delta and will look into a solution for scan_delta.

@ritchie46
Member

Can this one be closed or is this still an issue?

@chitralverma
Contributor

@ritchie46 it should be fine to close, as I have tested this, but shall we keep it open for a while until someone else confirms?

@ritchie46
Member

Nah, I will reopen if it is not. :)
