ARROW-16838: [Python] Improve schema inference for pandas indexes with extension dtypes #14080

jrbourbeau · 2022-09-08T21:35:21Z

Possible fix for https://issues.apache.org/jira/browse/ARROW-16838. pd.Index objects don't have a .head method, while pd.DataFrame, pd.Series, and pd.Index all support indexing with [:0] to return an empty object of the same type.
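For illustration, a minimal sketch of the behavior described above, assuming a pandas version in which an Index can hold an extension dtype such as "Int64". The variable names are illustrative and not taken from the PR.

```python
import pandas as pd

# An Index and a Series backed by the same nullable integer data.
# (Assumption: the installed pandas keeps "Int64" on the Index.)
idx = pd.Index(pd.array([1, 2, None], dtype="Int64"))
ser = pd.Series(pd.array([1, 2, None], dtype="Int64"))

# Series offers .head(); Index does not, which is the root of the issue.
index_has_head = hasattr(idx, "head")
series_has_head = hasattr(ser, "head")

# Slicing with [:0] works on the Index and keeps its dtype.
empty_idx = idx[:0]
empty_ser = ser.head(0)

print(index_has_head, series_has_head)
print(empty_idx.dtype, empty_ser.dtype)
```

An empty object of the same dtype is all that type inference needs here, so no data has to be converted.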

github-actions · 2022-09-08T21:35:43Z

https://issues.apache.org/jira/browse/ARROW-16838

github-actions · 2022-09-08T21:35:45Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

jrbourbeau · 2022-09-08T22:32:53Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

I'm not quite sure what to do here. I don't see a "Start Progress" link here or over in the corresponding JIRA issue. Can somebody clarify?

jrbourbeau · 2022-09-09T15:42:56Z

Hrm, it looks like the failing Travis CI build happened when building CXX objects, so it is presumably unrelated to the changes in this PR. Not sure if this is a known issue or not.

lidavidm · 2022-09-09T15:52:56Z

Travis: that does tend to be a little flaky.

JIRA: you need to be signed in to JIRA (it's basically asking you to make sure the ticket is assigned to yourself)

jrbourbeau · 2022-09-09T16:20:23Z

Gotcha -- thanks @lidavidm

I've gone ahead and assigned the ticket to myself in JIRA 👍

lidavidm · 2022-09-14T11:56:14Z

@jorisvandenbossche are you perhaps able to take a look here?

jorisvandenbossche

Thanks for the PR!

jorisvandenbossche · 2022-09-14T12:19:35Z

python/pyarrow/pandas_compat.py

@@ -541,7 +541,7 @@ def dataframe_to_types(df, preserve_index, columns=None):
        if _pandas_api.is_categorical(values):
            type_ = pa.array(c, from_pandas=True).type
        elif _pandas_api.is_extension_array_dtype(values):
-            type_ = pa.array(c.head(0), from_pandas=True).type
+            type_ = pa.array(c[:0], from_pandas=True).type


So for the case that c is a Series and not Index, this [:0] currently is always positional (so that's fine). But to be sure I was just checking on pandas main, and this now gives a warning that this will change in the future:

In [9]: s = pd.Series([1, 2], index=[1, 2])

In [10]: s[:0]
<ipython-input-10-48e1b387a0b4>:1: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
  s[:0]
Out[10]: Series([], dtype: int64)

So we should avoid running into this warning.
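As a rough sketch of how to sidestep the warning: the warning text itself points to `.iloc` for positional slicing, and `.head(0)` is also positional. The snippet below turns any FutureWarning into an error to check that neither spelling triggers one; the names are illustrative.

```python
import warnings

import pandas as pd

# Same Series as in the session above: integer labels that differ
# from the positions.
s = pd.Series([1, 2], index=[1, 2])

with warnings.catch_warnings():
    # Escalate FutureWarning to an exception so a deprecation would be loud.
    warnings.simplefilter("error", FutureWarning)
    empty_iloc = s.iloc[:0]   # explicitly positional
    empty_head = s.head(0)    # positional as well

print(len(empty_iloc), len(empty_head))
print(empty_iloc.dtype, empty_head.dtype)
```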

I am thinking that we could maybe also use values[:0] instead?

But that will then lose some information in the case of tz-aware datetimes, I think.
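A small sketch of the concern, assuming `values` here refers to what `Series.values` returns (an assumption, since the surrounding code is not shown in this thread). For tz-aware data that attribute is a plain NumPy datetime64 array, so the time zone is no longer attached to the dtype.

```python
import pandas as pd

# A tz-aware datetime Series (illustrative data).
s = pd.Series(pd.date_range("2022-01-01", periods=2, tz="UTC"))

# The Series dtype carries the time zone.
series_dtype = s.head(0).dtype

# .values is a NumPy datetime64 array, whose dtype has no time zone.
values_dtype = s.values[:0].dtype

print(series_dtype)
print(values_dtype)
```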

Maybe a "dumb" c.head(0) if isinstance(c, Series) else c[:0] might be the easiest after all.

Ah, good catch. Just updated to avoid that deprecation with the if isinstance(c, Series) check
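The check settled on above could be sketched as a small helper. This is only an illustration: `empty_like` is a hypothetical name, and it uses pandas directly rather than pyarrow's internal `_pandas_api` shim that a later commit in this PR switches to.

```python
import warnings

import pandas as pd


def empty_like(c):
    """Return a zero-length object with the same type and dtype as ``c``.

    Hypothetical helper: Series goes through .head(0), which is positional,
    while Index (which has no .head) is sliced with [:0].
    """
    if isinstance(c, pd.Series):
        return c.head(0)
    return c[:0]


# A Series with integer labels and an Index, both with a nullable dtype.
ser = pd.Series(pd.array([1, 2], dtype="Int64"), index=[1, 2])
idx = pd.Index(pd.array([1, 2], dtype="Int64"))

with warnings.catch_warnings():
    # Any FutureWarning from slicing would be raised as an error here.
    warnings.simplefilter("error", FutureWarning)
    empty_ser = empty_like(ser)
    empty_idx = empty_like(idx)

print(type(empty_ser).__name__, empty_ser.dtype)
print(type(empty_idx).__name__, empty_idx.dtype)
```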

jrbourbeau · 2022-09-14T16:29:31Z

Hrm, the linting errors in CI appear to be unrelated to the changes in this PR. Maybe #14113 fixes them? Merging main to see if that helps

jrbourbeau · 2022-09-15T00:54:50Z

The Travis CI failure is, I think, unrelated to the changes in this PR. @jorisvandenbossche let me know if there are any other changes you'd like to see.

jrbourbeau · 2022-09-21T00:54:52Z

Just wanted to check in here -- are there any additional comments, questions, or concerns on this PR?

jorisvandenbossche · 2022-09-21T07:50:55Z

No, looks all good, thanks for the update! (I was just away for a few days)

ursabot · 2022-09-21T10:31:47Z

Benchmark runs are scheduled for baseline = 8ad5e59 and contender = afd3c40. afd3c40 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.24% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.56% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.14% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] afd3c40a ec2-t3-xlarge-us-east-2
[Failed] afd3c40a test-mac-arm
[Failed] afd3c40a ursa-i9-9960x
[Finished] afd3c40a ursa-thinkcentre-m75q
[Finished] 8ad5e598 ec2-t3-xlarge-us-east-2
[Failed] 8ad5e598 test-mac-arm
[Failed] 8ad5e598 ursa-i9-9960x
[Finished] 8ad5e598 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2022-09-21T10:32:02Z

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

jrbourbeau · 2022-09-21T16:09:36Z

Thanks @jorisvandenbossche!

…h extension dtypes (apache#14080) Possible fix for https://issues.apache.org/jira/browse/ARROW-16838. `pd.Index` objects don't have a `.head` method, while `pd.DataFrame`, `pd.Series`, and `pd.Index` all support indexing with `[:0]` to return a empty object of the same type. Authored-by: James Bourbeau <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>

jrbourbeau added 3 commits September 8, 2022 15:49

Avoid head call for better Index support

896f221

Merge branch 'master' of https://github.com/apache/arrow into index-t…

9a29528

Add test coverage

67b43f2

github-actions bot added the Component: Python label Sep 8, 2022

jrbourbeau mentioned this pull request Sep 9, 2022

p2p shuffle fails when partition index is pandas extension dtype dask/distributed#6927

Closed

jorisvandenbossche reviewed Sep 14, 2022

jrbourbeau added 2 commits September 14, 2022 09:44

Avoid future deprecation warning for Series

243305d

Merge branch 'master' of https://github.com/apache/arrow into index-t…

92b2440

jrbourbeau added 2 commits September 14, 2022 13:23

Use _pandas_api shim

734f3f2

Lint

e270147

jorisvandenbossche approved these changes Sep 21, 2022

jorisvandenbossche merged commit afd3c40 into apache:master Sep 21, 2022

jrbourbeau deleted the index-type-inference branch September 21, 2022 16:07

jrbourbeau mentioned this pull request Sep 28, 2022

to_parquet fails for nullable dtype index dask/dask#9186

Closed

hayesgb mentioned this pull request Nov 3, 2022

Add support for use_nullable_dtypes to dd.read_parquet dask/dask#9617

Merged

3 tasks
