Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Minimal pyarrow installation should not require libparquet #39006

Closed
raulcd opened this issue Nov 30, 2023 · 4 comments · Fixed by #39316
Closed

[Python] Minimal pyarrow installation should not require libparquet #39006

raulcd opened this issue Nov 30, 2023 · 4 comments · Fixed by #39316

Comments

@raulcd
Copy link
Member

raulcd commented Nov 30, 2023

Describe the bug, including details regarding any error messages, version, and platform.

As part of the work that is being done on conda to divide pyarrow on a minimal package (pyarrow-base) I've realised that currently in order to import pyarrow we require to have libparquet present otherwise I get the following error:

import: 'pyarrow'
Traceback (most recent call last):
  File "/home/conda/feedstock_root/build_artifacts/apache-arrow_1701339661261/test_tmp/run_test.py", line 2, in <module>
    import pyarrow
  File "/home/conda/feedstock_root/build_artifacts/apache-arrow_1701339661261/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.9/site-packages/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib
ImportError: libparquet.so.1400: cannot open shared object file: No such file or directory

We should be able to have a minimal installation and import pyarrow without parquet.

Component(s)

Python

@jorisvandenbossche
Copy link
Member

If I just remove libparquet from my dev env (but not with pyarrow built with pyarrow disabled), I do get the correct behaviour (pyarrow imports fine, importing pyarrow.parquet gives the expected error):

In [1]: import pyarrow as pa

In [2]: import pyarrow.parquet as pq
...
ImportError: The pyarrow installation is not built with support for the Parquet file format (libparquet.so.1500: cannot open shared object file: No such file or directory)

@jorisvandenbossche
Copy link
Member

If I do the same for the released version 14.0 installed with conda, I however do get the same issue. Inspecting the pyarrow/lib.cpython-311-x86_64-linux-gnu.so file installed there (with ldd), this library indeed depends on libparquet.so.1400.

That is not the case of the lib.cpython-...so file that I have in my development environment, according to ldd this library does not link to libparquet. So will need to figure out we have that difference (or if we somehow fixed this on main, recently)

@jorisvandenbossche
Copy link
Member

Locally I don't have parquet encryption enabled, and so after enabling that and rebuilding libarrow and pyarrow, now I see the dependency on libparquet.

The original encryption support was added in #10450 (pyarrow 8.0.0), and recently we also expanded it to support datasets in #34616 (pyarrow 14.0).

Checking my different conda envs with various pyarrow versions, it seems this dependency already existed before 14.0, but not yet for 8.0. It seems to have been added in pyarrow 10.0.

@jorisvandenbossche
Copy link
Member

Checking my different conda envs with various pyarrow versions, it seems this dependency already existed before 14.0, but not yet for 8.0. It seems to have been added in pyarrow 10.0.

I assume this is just because at the time of pyarrow 8.0, conda-forge hadn't yet enabled the parquet encryption feature it its builds. With its latest rebuilds, it has the feature enabled, and then it's clear this dependency was already introduced with pyarrow 8.0, so likely with the original PR adding encryption (#10450).

And I think a likely culprit is the fact that the pyarrow C++ sources (which become libarrow_python.so) depend on libparquet (and pyarrow always depends on libarrow_python.so):

arrow/python/CMakeLists.txt

Lines 335 to 339 in 6101d12

if(PYARROW_BUILD_PARQUET_ENCRYPTION)
if(PARQUET_REQUIRE_ENCRYPTION)
list(APPEND PYARROW_CPP_SRCS ${PYARROW_CPP_SOURCE_DIR}/parquet_encryption.cc)
if(ARROW_BUILD_SHARED)
list(APPEND PYARROW_CPP_LINK_LIBS Parquet::parquet_shared)

So if we want to fix this, we need to create a separate libarrow_python_parquet_encryption.so target?

raulcd added a commit to raulcd/arrow that referenced this issue Dec 20, 2023
…row_python.so to new libarrow_python_parquet_encryption.so
pitrou pushed a commit to raulcd/arrow that referenced this issue Dec 21, 2023
…row_python.so to new libarrow_python_parquet_encryption.so
kou pushed a commit that referenced this issue Dec 22, 2023
…thon.so to new libarrow_python_parquet_encryption.so (#39316)

### Rationale for this change

If I build pyarrow with everything and then I remove some of the Arrow CPP .so in order to have a minimal build I can't import pyarrow because it requires libarrow and libparquet. This is relevant in order to have a minimal build for Conda. Please see the related issue for more information.

### What changes are included in this PR?

Move libarrow parquet encryption for pyarrow to its own shared object.

### Are these changes tested?

I will run extensive CI with extra python archery tests.

### Are there any user-facing changes?

No, and yes :) There will be a new .so on pyarrow but shouldn't be relevant in my opinion.
* Closes: #39006

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou added this to the 15.0.0 milestone Dec 22, 2023
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…row_python.so to new libarrow_python_parquet_encryption.so (apache#39316)

### Rationale for this change

If I build pyarrow with everything and then I remove some of the Arrow CPP .so in order to have a minimal build I can't import pyarrow because it requires libarrow and libparquet. This is relevant in order to have a minimal build for Conda. Please see the related issue for more information.

### What changes are included in this PR?

Move libarrow parquet encryption for pyarrow to its own shared object.

### Are these changes tested?

I will run extensive CI with extra python archery tests.

### Are there any user-facing changes?

No, and yes :) There will be a new .so on pyarrow but shouldn't be relevant in my opinion.
* Closes: apache#39006

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…row_python.so to new libarrow_python_parquet_encryption.so (apache#39316)

### Rationale for this change

If I build pyarrow with everything and then I remove some of the Arrow CPP .so in order to have a minimal build I can't import pyarrow because it requires libarrow and libparquet. This is relevant in order to have a minimal build for Conda. Please see the related issue for more information.

### What changes are included in this PR?

Move libarrow parquet encryption for pyarrow to its own shared object.

### Are these changes tested?

I will run extensive CI with extra python archery tests.

### Are there any user-facing changes?

No, and yes :) There will be a new .so on pyarrow but shouldn't be relevant in my opinion.
* Closes: apache#39006

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment