GH-39006: [Python] Extract libparquet requirements out of libarrow_python.so to new libarrow_python_parquet_encryption.so #39316

Merged
merged 6 commits into from
Dec 22, 2023

Conversation

raulcd
Member

@raulcd raulcd commented Dec 20, 2023

Rationale for this change

If I build pyarrow with everything enabled and then remove some of the Arrow CPP .so files to get a minimal build, I can't import pyarrow, because it requires both libarrow and libparquet. This matters for providing a minimal build for Conda. Please see the related issue for more information.

What changes are included in this PR?

Move libarrow parquet encryption for pyarrow to its own shared object.

Are these changes tested?

I will run extensive CI with extra python archery tests.

Are there any user-facing changes?

No, and yes :) There will be a new .so in pyarrow, but it shouldn't be relevant in my opinion.


⚠️ GitHub issue #39006 has been automatically assigned in GitHub to PR creator.

@raulcd
Member Author

raulcd commented Dec 20, 2023

@github-actions crossbow submit -g python


@jorisvandenbossche
Member

Thanks! This looks good to me, but I am not a CMake expert ;)
I assume you tested locally that a minimal pyarrow without parquet now works? Do we want to add a nightly build that verifies this?

@jorisvandenbossche
Member

Do we want to add a nightly build that verifies this?

Although we have the example-python-minimal-build-* builds, and according to the test logs (eg https://github.com/ursacomputing/crossbow/actions/runs/7269207616/job/19806469446#step:3:4711), they do skip the parquet tests.

So why was this build not failing? (building the C++ doesn't seem to enable Parquet in eg https://github.com/apache/arrow/blob/main/python/examples/minimal_build/build_venv.sh)

@AlenkaF
Member

AlenkaF commented Dec 20, 2023

If I were to move parquet encryption to a separate .so I would do it the same way, so I have nothing to comment.

Regarding the minimal build that wasn't failing:
I was also a bit confused when I tried building locally on main. I built Arrow C++ without parquet but with datasets and parquet encryption, then I built PyArrow without parquet but with dataset and parquet encryption enabled, and it worked: I was able to install pyarrow. I thought I didn't understand the issue, and maybe I still don't =)

@jorisvandenbossche
Member

The Windows failure seems related

@raulcd
Member Author

raulcd commented Dec 20, 2023

Thanks, I am testing a little further locally to validate the use case I am trying to fix works. I'll investigate the Windows failure.

@AlenkaF I am also confused. My expectation is that if Arrow CPP is built without parquet and there's no libparquet.so file, then
set(PYARROW_CPP_ENCRYPTION_LINK_LIBS Parquet::parquet_shared) would fail to build on pyarrow. I'll try to investigate that further too.

@raulcd
Member Author

raulcd commented Dec 20, 2023

ok, the fix seems to work for the use case I am trying to solve.

I've built pyarrow with "everything":

$ python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> import pyarrow.parquet
>>> 

after I do the following:

$ cd ../../dist/
$ find . -iname libparquet.so
./lib/libparquet.so
$ mkdir raul
$ mv lib/libparquet.* raul/
$ find . -iname libparquet.so
./raul/libparquet.so

I get:

$ python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> import pyarrow.parquet
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/raulcd/code/arrow/python/pyarrow/parquet/__init__.py", line 20, in <module>
    from .core import *
  File "/home/raulcd/code/arrow/python/pyarrow/parquet/core.py", line 40, in <module>
    raise ImportError(
ImportError: The pyarrow installation is not built with support for the Parquet file format (libparquet.so.1500: cannot open shared object file: No such file or directory)

as expected. Without the change once I remove libparquet I can't import pyarrow:

$ python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/raulcd/code/arrow/python/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib
ImportError: libparquet.so.1500: cannot open shared object file: No such file or directory
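
The two sessions above differ in where the ImportError surfaces: with the change, `import pyarrow` succeeds and only `import pyarrow.parquet` fails. A minimal sketch of a check that makes this distinction explicit follows. The helper name `probe_optional_modules` is hypothetical and not part of pyarrow; it relies only on the standard `importlib` module.

```python
import importlib


def probe_optional_modules(names):
    """Try to import each module and report whether it loaded.

    Returns a dict mapping each module name to None on success, or to
    the ImportError message on failure.
    """
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError as exc:
            # Covers both a missing module and a missing shared library.
            results[name] = str(exc)
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    # Module names taken from the sessions above.
    report = probe_optional_modules(["pyarrow", "pyarrow.parquet"])
    for name, error in report.items():
        status = "ok" if error is None else "unavailable: " + error
        print(f"{name}: {status}")
```

With the fix in place and libparquet moved away, one would expect `pyarrow` to report ok and `pyarrow.parquet` to report the missing-library message; without the fix, both would be unavailable.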

@jorisvandenbossche
Member

Ah, so the reason it works in the CI build is that Parquet is not built there at all (neither C++ nor pyarrow is built with it), and that works fine. But for conda, we actually do build pyarrow with support for Parquet, and then just leave out the Parquet .so library for the minimal install. And that's the case that was not working.
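
The "build with Parquet, then leave the library out" case can be reproduced in a script in the same way as the manual `mv lib/libparquet.* raul/` step earlier in the thread. This is only a sketch: the function name `move_libraries_aside` is hypothetical, and the demo below runs against placeholder files in a temporary directory rather than a real Arrow install.

```python
import shutil
import tempfile
from pathlib import Path


def move_libraries_aside(lib_dir, stem, backup_dir):
    """Move every file named `<stem>.*` from lib_dir into backup_dir.

    Mirrors `mv lib/libparquet.* raul/`; returns the moved file names,
    sorted alphabetically.
    """
    lib_dir, backup_dir = Path(lib_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(lib_dir.glob(f"{stem}.*")):
        shutil.move(str(path), str(backup_dir / path.name))
        moved.append(path.name)
    return moved


if __name__ == "__main__":
    # Placeholder files only; a real dist/lib directory holds the
    # actual shared objects and their versioned symlinks.
    with tempfile.TemporaryDirectory() as tmp:
        lib = Path(tmp) / "lib"
        lib.mkdir()
        for name in ("libparquet.so", "libparquet.so.1500", "libarrow.so"):
            (lib / name).touch()
        print(move_libraries_aside(lib, "libparquet", Path(tmp) / "raul"))
```

Moving the files into a backup directory rather than deleting them keeps the experiment reversible, which is convenient when comparing behaviour with and without the change.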

@raulcd
Member Author

raulcd commented Dec 20, 2023

The sphinx error is also related to the change. I'll have to investigate that further too.

@pitrou
Member

pitrou commented Dec 20, 2023

I'm curious, is this all that's needed? How does pyarrow.parquet link against the new libarrow_python_parquet_encryption.so?
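
One way to answer a linking question like this on Linux is to list the direct dependencies recorded in a shared object's dynamic section. The sketch below assumes GNU binutils' `readelf` is on the PATH and that the file is an ELF binary; the helper names and the path in the usage comment are placeholders, not part of the PR.

```python
import re
import subprocess

# Matches dynamic-section lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libarrow.so.1500]
_NEEDED = re.compile(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]")


def parse_needed(readelf_output):
    """Extract library names from the NEEDED entries of `readelf -d` output."""
    return _NEEDED.findall(readelf_output)


def needed_libraries(path):
    """Run `readelf -d` on an ELF file and return its NEEDED libraries."""
    completed = subprocess.run(
        ["readelf", "-d", str(path)],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_needed(completed.stdout)


# Usage (placeholder path):
#   needed_libraries("path/to/extension_module.so")
```

If an extension module uses symbols from a library that does not appear in its NEEDED list (directly or through another dependency), loading it can fail with an undefined-symbol error.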

@raulcd
Member Author

raulcd commented Dec 20, 2023

I'm curious, is this all that's needed?

Probably not and I am missing something :) I'll keep working on it but I am surprised that no test- jobs failed. I would have expected tests to fail if there was some linking problem.

@pitrou
Member

pitrou commented Dec 20, 2023

Ok, I'm testing this PR and I get:

$ python -c "import pyarrow.parquet.encryption"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/antoine/arrow/dev/python/pyarrow/parquet/encryption.py", line 19, in <module>
    from pyarrow._parquet_encryption import (CryptoFactory,   # noqa
ImportError: /home/antoine/arrow/dev/python/pyarrow/_parquet_encryption.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN5arrow2py7parquet10encryption11PyKmsClientC1EP7_objectNS2_17PyKmsClientVtableE
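
The undefined symbol is an Itanium-ABI mangled C++ name. A tool such as `c++filt` decodes it fully; as a rough sketch, the leading nested name can also be split by hand, since each component is stored as a decimal length followed by that many characters. The function below is illustrative only and deliberately ignores parameters, templates and substitutions.

```python
def nested_name_components(mangled):
    """Split the leading `_ZN...` nested name of an Itanium-mangled symbol.

    Returns (identifiers, is_constructor). Only length-prefixed source
    names are handled; parameter types are not decoded.
    """
    if not mangled.startswith("_ZN"):
        raise ValueError("expected a symbol starting with _ZN")
    pos = 3
    parts = []
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    # C1, C2 and C3 mark constructor variants in the Itanium C++ ABI.
    is_constructor = mangled[pos:pos + 2] in ("C1", "C2", "C3")
    return parts, is_constructor


SYMBOL = (
    "_ZN5arrow2py7parquet10encryption11PyKmsClient"
    "C1EP7_objectNS2_17PyKmsClientVtableE"
)
parts, is_ctor = nested_name_components(SYMBOL)
print("::".join(parts), "(constructor)" if is_ctor else "")
```

For the symbol in the traceback this prints `arrow::py::parquet::encryption::PyKmsClient (constructor)`, that is, the missing symbol is a constructor of `PyKmsClient` in the `arrow::py::parquet::encryption` namespace.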

@raulcd raulcd requested a review from assignUser as a code owner December 20, 2023 17:25
@pitrou
Member

pitrou commented Dec 20, 2023

@github-actions crossbow submit -g python -g wheel


@pitrou
Member

pitrou commented Dec 20, 2023

@github-actions crossbow submit wheel-windows-*


@pitrou
Member

pitrou commented Dec 20, 2023

The Windows wheel failures look unrelated...

@raulcd
Member Author

raulcd commented Dec 20, 2023

The Windows wheel builds seem to have been failing on the nightly builds for some days.

@pitrou
Member

pitrou commented Dec 20, 2023

The Windows wheel builds seem to have been failing on the nightly builds for some days.

Is an issue open for it?

@raulcd
Member Author

raulcd commented Dec 20, 2023

Is an issue open for it?

I don't think so. I was planning to go over all the nightly failures and open issues in the next few days in preparation for the release, but I haven't had the time yet.

@github-actions github-actions bot removed the awaiting committer review Awaiting committer review label Dec 20, 2023
@github-actions github-actions bot added the awaiting changes Awaiting changes label Dec 20, 2023
@kou
Member

kou commented Dec 21, 2023

#39333 for Windows wheel build failure.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 21, 2023
@raulcd
Member Author

raulcd commented Dec 21, 2023

@github-actions crossbow submit -g python -g wheel


@pitrou
Member

pitrou commented Dec 21, 2023

@github-actions crossbow submit -g python -g wheel

@pitrou
Member

pitrou commented Dec 21, 2023

I rebased to get the Windows wheel fix.


Revision: 6ee40c9

Submitted crossbow builds: ursacomputing/crossbow @ actions-0f80c7792a

Task Status
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-python GitHub Actions
test-debian-11-python-3 Azure
test-fedora-38-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 GitHub Actions
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp38-amd64 GitHub Actions
wheel-manylinux-2014-cp38-arm64 GitHub Actions
wheel-manylinux-2014-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp38-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

@pitrou pitrou requested a review from kou December 21, 2023 13:05
Member

@kou kou left a comment


+1

@kou kou merged commit 51970e0 into apache:main Dec 22, 2023
13 checks passed
@kou kou removed the awaiting change review Awaiting change review label Dec 22, 2023
@github-actions github-actions bot added the awaiting merge Awaiting merge label Dec 22, 2023

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 51970e0.

There was 1 benchmark result with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…row_python.so to new libarrow_python_parquet_encryption.so (apache#39316)

### Rationale for this change

If I build pyarrow with everything and then I remove some of the Arrow CPP .so in order to have a minimal build I can't import pyarrow because it requires libarrow and libparquet. This is relevant in order to have a minimal build for Conda. Please see the related issue for more information.

### What changes are included in this PR?

Move libarrow parquet encryption for pyarrow to its own shared object.

### Are these changes tested?

I will run extensive CI with extra python archery tests.

### Are there any user-facing changes?

No, and yes :) There will be a new .so on pyarrow but shouldn't be relevant in my opinion.
* Closes: apache#39006

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
Development

Successfully merging this pull request may close these issues.

[Python] Minimal pyarrow installation should not require libparquet
5 participants