Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36730: [Python] Add support for Cython 3.0.0 #36745

Closed
wants to merge 12 commits into from

Conversation

kou
Copy link
Member

@kou kou commented Jul 18, 2023

Rationale for this change

Cython 3.0.0 is the latest release. PyArrow should work with Cython 3.0.0.

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@github-actions
Copy link

⚠️ GitHub issue #36730 has been automatically assigned in GitHub to PR creator.

@kou kou marked this pull request as ready for review July 18, 2023 09:31
@pitrou
Copy link
Member

pitrou commented Jul 18, 2023

This may fix the compilation error but there are other issues during tests (see macOS CI?).

@kou
Copy link
Member Author

kou commented Jul 18, 2023

I've fixed CI failures.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jul 19, 2023

@kou there were also some runtime errors in the minimal build that are caused by cython 3 (but it's strange that this is only failing in a single build). Based on your diff, I don't think that will already be fixed here, but let me run them to be sure.

@jorisvandenbossche
Copy link
Member

@github-actions crossbow submit example-python-minimal-build-fedora-conda

@github-actions
Copy link

Revision: dba6bb4

Submitted crossbow builds: ursacomputing/crossbow @ actions-f3bbf75ac8

Task Status
example-python-minimal-build-fedora-conda Github Actions

@pitrou
Copy link
Member

pitrou commented Jul 19, 2023

Can this PR also undo the pinning on various Python builds?

@apache apache deleted a comment from github-actions bot Jul 19, 2023
@kou
Copy link
Member Author

kou commented Jul 19, 2023

@github-actions crossbow submit example-python-minimal-build-fedora-conda

@github-actions
Copy link

Revision: 81493b1

Submitted crossbow builds: ursacomputing/crossbow @ actions-492c6d5c7d

Task Status
example-python-minimal-build-fedora-conda Github Actions

@kou
Copy link
Member Author

kou commented Jul 19, 2023

I've fixed the generator related problem.
I'll rebase on main and unpin Cython version.

@kou kou force-pushed the python-cython-3 branch from 81493b1 to cb6f36d Compare July 19, 2023 23:39
@kou kou requested review from assignUser and raulcd as code owners July 19, 2023 23:39
@kou
Copy link
Member Author

kou commented Jul 20, 2023

Hmm. Can we detect Cython version in .{pyx,pxi}?

@kou
Copy link
Member Author

kou commented Jul 20, 2023

I want to write something like the following:

diff --git a/python/pyarrow/scalar.pxi b/python/pyarrow/scalar.pxi
index f438c8847..61de38203 100644
--- a/python/pyarrow/scalar.pxi
+++ b/python/pyarrow/scalar.pxi
@@ -793,7 +793,10 @@ cdef class MapScalar(ListScalar):
         """
         arr = self.values
         if array is None:
-            raise StopIteration
+            if Cython.__version__ < '3.0.0':
+                raise StopIteration
+            else:
+                return
         for k, v in zip(arr.field('key'), arr.field('value')):
             yield (k.as_py(), v.as_py())
 

@pitrou
Copy link
Member

pitrou commented Jul 20, 2023

There should not be any need to raise StopIteration.

Comment on lines 438 to 444
def __iter__(self):
while True:
yield self.read_next_message()
try:
yield self.read_next_message()
except StopIteration:
# For Cython >= 3.0.0
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it is this code that gives problems to be written in a way that works for both cython 0.29 and 3, but a potential different way of writing it:

def __iter__(self):
    return self

def __next__(self):
    return self.read_next_message()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'll try it.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jul 20, 2023
arr = self.values
if array is None:
if arr is None:
raise StopIteration
for k, v in zip(arr.field('key'), arr.field('value')):
yield (k.as_py(), v.as_py())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you actually test these changes before pushing them?

You can't use yield in a __next__ function.
Either you write __iter__ as a generator (using yield), or you write a pair of __iter__ and __next__ functions (without yield). You can't mix the two styles.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And FWIW, for this case, I would leave the __iter__ as is (in the other cases such as RecordBatchReader, we already have a public "next" method that can just be called from __next__)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I didn't test on local before I pushed a commit because I don't have enough time yesterday.

I reverted the __iter__ + __next__ change.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flight changes LGTM, thanks!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Jul 20, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jul 20, 2023
@github-actions github-actions bot added Component: Documentation awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 20, 2023
@kou
Copy link
Member Author

kou commented Jul 20, 2023

@jorisvandenbossche Do you have any idea why the following failure was happen?

https://github.com/apache/arrow/actions/runs/5615711782/job/15216554355?pr=36745#step:6:6074

 =================================== FAILURES ===================================
_______________________ test_cloudpickle_dataset[False] ________________________

tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_cloudpickle_dataset_False0')
datadir = PosixPath('/opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/data/parquet')
use_legacy_dataset = False

    @pytest.mark.pandas
    @parametrize_legacy_dataset
    def test_cloudpickle_dataset(tempdir, datadir, use_legacy_dataset):
        cp = pytest.importorskip('cloudpickle')
        dataset = _make_dataset_for_pickling(tempdir, use_legacy_dataset)
>       _assert_dataset_is_picklable(
            dataset, pickler=cp, use_legacy_dataset=use_legacy_dataset)

opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/parquet/test_dataset.py:1572: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/parquet/test_dataset.py:1540: in _assert_dataset_is_picklable
    assert is_pickleable(dataset)
opt/conda/envs/arrow/lib/python3.8/site-packages/pyarrow/tests/parquet/test_dataset.py:1538: in is_pickleable
    return obj == pickler.loads(pickler.dumps(obj))
opt/conda/envs/arrow/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py:73: in dumps
    cp.dump(obj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <cloudpickle.cloudpickle_fast.CloudPickler object at 0x7f25aa27aa10>
obj = <pyarrow.parquet.core._ParquetDatasetV2 object at 0x7f25a9ec9690>

    def dump(self, obj):
        try:
>           return Pickler.dump(self, obj)
E           _pickle.PicklingError: Can't pickle <cyfunction ParquetFragmentScanOptions._reconstruct at 0x7f2660177140>: it's not the same object as pyarrow._dataset_parquet.ParquetFragmentScanOptions._reconstruct

opt/conda/envs/arrow/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py:632: PicklingError

@jorisvandenbossche
Copy link
Member

Not directly an idea. We use the same pattern in many other places. Can you try this patch:

--- a/python/pyarrow/_dataset_parquet.pyx
+++ b/python/pyarrow/_dataset_parquet.pyx
@@ -741,7 +741,7 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
             thrift_string_size_limit=self.thrift_string_size_limit,
             thrift_container_size_limit=self.thrift_container_size_limit,
         )
-        return type(self)._reconstruct, (kwargs,)
+        return ParquetFragmentScanOptions._reconstruct, (kwargs,)

(in theory it shouldn't matter for python, but maybe cython 3 has a bug that treats this different)

@kou
Copy link
Member Author

kou commented Jul 21, 2023

Thanks! I've pushed your patch.

@h-vetinari
Copy link
Contributor

If there are problems with compatibility between 0.29 and 3.0 (aside from the few intentional breaking changes), the Cython devs are usually quick to respond if you open an issue

@danepitkin
Copy link
Member

danepitkin commented Aug 10, 2023

The failing pickle test is only when using the cloudpickle module instead of the pickle5/pickle module. Perhaps they don't support cython3.0.0 yet? This is the only test in the entire codebase that tests cloudpickle.. not sure how important it is as a dependency to pyarrow.

@pytest.mark.pandas
@parametrize_legacy_dataset
def test_cloudpickle_dataset(tempdir, datadir, use_legacy_dataset):
    cp = pytest.importorskip('cloudpickle')
    dataset = _make_dataset_for_pickling(tempdir, use_legacy_dataset)
    _assert_dataset_is_picklable(
        dataset, pickler=cp, use_legacy_dataset=use_legacy_dataset)

@pitrou
Copy link
Member

pitrou commented Aug 10, 2023

Well, first, this PR should be rebased, because there are conflicts.

@danepitkin
Copy link
Member

danepitkin commented Aug 10, 2023

I did rebase/squash to test things out here: #37097

I wanted to experiment without polluting this PR. I can merge any fixes I find back into this PR.

@kou
Copy link
Member Author

kou commented Aug 11, 2023

Can we use #37097 instead of this?

@danepitkin
Copy link
Member

Can we use #37097 instead of this?

Sure!

@kou
Copy link
Member Author

kou commented Aug 11, 2023

I close this in favor of #37097.

@kou kou closed this Aug 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI][Python] Cython 3.0 seems to make our build of pyarrow fail
6 participants