-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16838: [Python] Improve schema inference for pandas indexes with extension dtypes #14080
ARROW-16838: [Python] Improve schema inference for pandas indexes with extension dtypes #14080
Conversation
|
I'm not quite sure what to do here. I don't see a "Start Progress" link here or over in the corresponding JIRA issue. Can somebody clarify? |
Hrm it looks like the failing Travis CI build happened when building CXX objects -- so presumably unrelated to the changes in this PR. Not sure if this is a known issue or not |
Travis: that does tend to be a little flaky. JIRA: you need to be signed in to JIRA (it's basically asking you to make sure the ticket is assigned to yourself) |
Gotcha -- thanks @lidavidm I've gone ahead and assigned the ticket to myself in JIRA 👍 |
@jorisvandenbossche are you perhaps able to take a look here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
python/pyarrow/pandas_compat.py
Outdated
@@ -541,7 +541,7 @@ def dataframe_to_types(df, preserve_index, columns=None): | |||
if _pandas_api.is_categorical(values): | |||
type_ = pa.array(c, from_pandas=True).type | |||
elif _pandas_api.is_extension_array_dtype(values): | |||
type_ = pa.array(c.head(0), from_pandas=True).type | |||
type_ = pa.array(c[:0], from_pandas=True).type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for the case that c
is a Series and not Index, this [:0] currently is always positional (so that's fine). But to be sure I was just checking on pandas main, and this now gives a warning that this will change in the future:
In [9]: s = pd.Series([1, 2], index=[1, 2])
In [10]: s[:0]
<ipython-input-10-48e1b387a0b4>:1: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
s[:0]
Out[10]: Series([], dtype: int64)
So we should avoid running into this warning.
I am thinking that we could maybe also use values[:0]
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking that we could maybe also use values[:0] instead?
But, that then will loose some information in case of tz-aware datetimes I think.
Maybe a "dumb" c.head(0) if isinstance(c, Series) else c[:0]
might be the easiest after all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch. Just updated to avoid that deprecation with the if isinstance(c, Series)
check
Hrm, the linting errors in CI appear to be unrelated to the changes in this PR. Maybe #14113 fixes them? Merging |
The Travis CI failures is, I think, unrelated to the changes in this PR. @jorisvandenbossche let me know if there are any other changes you'd like to see |
Just wanted to check in here -- are there any additional comment, question, or concerns on this PR? |
No, looks all good, thanks for the update! (I was just away for a few days) |
Benchmark runs are scheduled for baseline = 8ad5e59 and contender = afd3c40. afd3c40 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
Thanks @jorisvandenbossche! |
…h extension dtypes (apache#14080) Possible fix for https://issues.apache.org/jira/browse/ARROW-16838. `pd.Index` objects don't have a `.head` method, while `pd.DataFrame`, `pd.Series`, and `pd.Index` all support indexing with `[:0]` to return a empty object of the same type. Authored-by: James Bourbeau <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…h extension dtypes (apache#14080) Possible fix for https://issues.apache.org/jira/browse/ARROW-16838. `pd.Index` objects don't have a `.head` method, while `pd.DataFrame`, `pd.Series`, and `pd.Index` all support indexing with `[:0]` to return a empty object of the same type. Authored-by: James Bourbeau <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
Possible fix for https://issues.apache.org/jira/browse/ARROW-16838.
pd.Index
objects don't have a.head
method, whilepd.DataFrame
,pd.Series
, andpd.Index
all support indexing with[:0]
to return a empty object of the same type.