Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16838: [Python] Improve schema inference for pandas indexes with extension dtypes #14080

Merged
merged 7 commits into from
Sep 21, 2022

Conversation

jrbourbeau
Copy link
Contributor

Possible fix for https://issues.apache.org/jira/browse/ARROW-16838. pd.Index objects don't have a .head method, while pd.DataFrame, pd.Series, and pd.Index all support indexing with [:0] to return a empty object of the same type.

@github-actions
Copy link

github-actions bot commented Sep 8, 2022

@github-actions
Copy link

github-actions bot commented Sep 8, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@jrbourbeau
Copy link
Contributor Author

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

I'm not quite sure what to do here. I don't see a "Start Progress" link here or over in the corresponding JIRA issue. Can somebody clarify?

@jrbourbeau
Copy link
Contributor Author

Hrm it looks like the failing Travis CI build happened when building CXX objects -- so presumably unrelated to the changes in this PR. Not sure if this is a known issue or not

@lidavidm
Copy link
Member

lidavidm commented Sep 9, 2022

Travis: that does tend to be a little flaky.

JIRA: you need to be signed in to JIRA (it's basically asking you to make sure the ticket is assigned to yourself)

@jrbourbeau
Copy link
Contributor Author

jrbourbeau commented Sep 9, 2022

Gotcha -- thanks @lidavidm

I've gone ahead and assigned the ticket to myself in JIRA 👍

@lidavidm
Copy link
Member

@jorisvandenbossche are you perhaps able to take a look here?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

@@ -541,7 +541,7 @@ def dataframe_to_types(df, preserve_index, columns=None):
if _pandas_api.is_categorical(values):
type_ = pa.array(c, from_pandas=True).type
elif _pandas_api.is_extension_array_dtype(values):
type_ = pa.array(c.head(0), from_pandas=True).type
type_ = pa.array(c[:0], from_pandas=True).type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for the case that c is a Series and not Index, this [:0] currently is always positional (so that's fine). But to be sure I was just checking on pandas main, and this now gives a warning that this will change in the future:

In [9]: s = pd.Series([1, 2], index=[1, 2])

In [10]: s[:0]
<ipython-input-10-48e1b387a0b4>:1: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
  s[:0]
Out[10]: Series([], dtype: int64)

So we should avoid running into this warning.

I am thinking that we could maybe also use values[:0] instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking that we could maybe also use values[:0] instead?

But, that then will loose some information in case of tz-aware datetimes I think.

Maybe a "dumb" c.head(0) if isinstance(c, Series) else c[:0] might be the easiest after all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch. Just updated to avoid that deprecation with the if isinstance(c, Series) check

@jrbourbeau
Copy link
Contributor Author

Hrm, the linting errors in CI appear to be unrelated to the changes in this PR. Maybe #14113 fixes them? Merging main to see if that helps

@jrbourbeau
Copy link
Contributor Author

The Travis CI failures is, I think, unrelated to the changes in this PR. @jorisvandenbossche let me know if there are any other changes you'd like to see

@jrbourbeau
Copy link
Contributor Author

Just wanted to check in here -- are there any additional comment, question, or concerns on this PR?

@jorisvandenbossche
Copy link
Member

No, looks all good, thanks for the update! (I was just away for a few days)

@jorisvandenbossche jorisvandenbossche merged commit afd3c40 into apache:master Sep 21, 2022
@ursabot
Copy link

ursabot commented Sep 21, 2022

Benchmark runs are scheduled for baseline = 8ad5e59 and contender = afd3c40. afd3c40 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.24% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.56% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.14% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] afd3c40a ec2-t3-xlarge-us-east-2
[Failed] afd3c40a test-mac-arm
[Failed] afd3c40a ursa-i9-9960x
[Finished] afd3c40a ursa-thinkcentre-m75q
[Finished] 8ad5e598 ec2-t3-xlarge-us-east-2
[Failed] 8ad5e598 test-mac-arm
[Failed] 8ad5e598 ursa-i9-9960x
[Finished] 8ad5e598 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Sep 21, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@jrbourbeau jrbourbeau deleted the index-type-inference branch September 21, 2022 16:07
@jrbourbeau
Copy link
Contributor Author

Thanks @jorisvandenbossche!

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
…h extension dtypes (apache#14080)

Possible fix for https://issues.apache.org/jira/browse/ARROW-16838. `pd.Index` objects don't have a `.head` method, while `pd.DataFrame`, `pd.Series`, and `pd.Index` all support indexing with `[:0]` to return a empty object of the same type. 

Authored-by: James Bourbeau <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
…h extension dtypes (apache#14080)

Possible fix for https://issues.apache.org/jira/browse/ARROW-16838. `pd.Index` objects don't have a `.head` method, while `pd.DataFrame`, `pd.Series`, and `pd.Index` all support indexing with `[:0]` to return a empty object of the same type. 

Authored-by: James Bourbeau <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants