-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: use ObjectEngine for indexing for now correctness over performance #60329
Merged
WillAyd
merged 13 commits into
pandas-dev:main
from
jorisvandenbossche:string-dtype-index-engine
Nov 26, 2024
Merged
Changes from 10 commits
Commits
Show all changes
13 commits
Select commit
Hold shift + click to select a range
091baa8
String dtype: use ObjectEngine for indexing for now correctness over …
jorisvandenbossche cfb73f5
add string-specific ObjectEngine subclass for pre-processing of input…
jorisvandenbossche 6892f83
remove xfails
jorisvandenbossche e007299
add tests for get_loc + fix for NA variant of string dtype
jorisvandenbossche bb148ba
support get_indexer
jorisvandenbossche a669d75
update tests
jorisvandenbossche 2a4aed2
Merge remote-tracking branch 'upstream/main' into string-dtype-index-…
jorisvandenbossche 13fa689
Merge remote-tracking branch 'upstream/main' into string-dtype-index-…
jorisvandenbossche fccd220
update xfail for parser test
jorisvandenbossche 8142300
try fix typing
jorisvandenbossche 3c62a8d
Merge remote-tracking branch 'upstream/main' into string-dtype-index-…
jorisvandenbossche 43a3edf
limit get_loc to exact match for now
jorisvandenbossche c546a51
fix for non-infer_string mode
jorisvandenbossche File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this test now means that you can use
np.nan
andpd.NA
interchangeably when indexing? If that's correct, I'm not sure I agree that we should be going that farThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that we are coercing any missing value indicator to NaN upon construction, and so to preserve back compat, I think I prefer we do the same for input to indexing operations.
To express it in terms of get_loc, this works now:
but the same on main with enabling the string dtype:
That is because now the None is no longer in the object dtype index, but has been coerced to NaN.
(on main, trying the above with
np.nan
also fails (see the issue #59879), but that's because theStringEngine
simply wasn't set up to work with missing values, so that is the initial reason I replaced it now with the StringObjectEngine)The above is with
None
, but essentially happens with any other missing value indicator, like pd.NA. MaybeNone
andnp.nan
are the most important ones though, but I would at least prefer that indexing withNone
keeps working for now (we can always start deprecating it, but I wouldn't do that it as a breaking change for 3.0)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this is also already quite inconsistent depending on the data type .. See #59765 for an overview (e.g. also for datetimelike and categorical, we treat all NA-likes as the same in indexing lookups)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - that's a great issue. Thanks for opening it.
Hmm I'm a bit confused by how this relates to all of the missing indicators becoming essentially equal though. On main, this does not work (?):
Definitely understand that there is not an ideal solution here given the inconsistent history, but I don't want to go too far and just start making all of the missing value indicators interchangeable. I think containment logic should land a little closer to equality logic, and in the latter we obviously don't allow this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the first bug that this PR is solving: right now no missing value lookup works, not even NaN itself (which is what is stored in the array). This is because the
StringEngine
simply doesn't handle missing values correctly (when building the hash table, it actually converts it to a sentinel string, but then for any of the lookup methods it doesn't take that into account; it's a bit an incomplete implementation)So by using the ObjectEngine (subclass), that fixes that first issue: ensuring NaN can be found
Missing values don't compare equal (well,
None
does, but we specifically didn't choose that long term as the sentinel moving forward;np.nan
andpd.NA
don't compare equal), so containment is already a bit of a special case anyway compared to equality, when it comes to missing values.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point on the equality. I guess I'm still hung up on the indexing behavior being the same though.
I've lost track of the nuance a bit, but haven't np.nan and pd.NA always had different indexing behavior? I'm just wary of glossing over that as part of this.
Maybe worth some input from @pandas-dev/pandas-core if anyone else has thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the PR to for now just enable exact matching missing values in
get_loc
, so this PR can already be merged (and fix the most glaring bug), and then we can have the discussion around backwards compatibility in #59879 (I don't think the above thread is very easy to follow for other people to chime in, will do a write up on the issue -> see #59879 (comment))