String dtype: use ObjectEngine for indexing for now correctness over performance #60329
Changes from 6 commits
091baa8
cfb73f5
6892f83
e007299
bb148ba
a669d75
2a4aed2
13fa689
fccd220
8142300
3c62a8d
43a3edf
c546a51
@@ -6,6 +6,37 @@
 import pandas._testing as tm


+class TestGetLoc:
+    def test_get_loc(self, any_string_dtype):
+        index = Index(["a", "b", "c"], dtype=any_string_dtype)
+        assert index.get_loc("b") == 1
+
+    def test_get_loc_raises(self, any_string_dtype):
+        index = Index(["a", "b", "c"], dtype=any_string_dtype)
+        with pytest.raises(KeyError, match="d"):
+            index.get_loc("d")
+
+    def test_get_loc_invalid_value(self, any_string_dtype):
+        index = Index(["a", "b", "c"], dtype=any_string_dtype)
+        with pytest.raises(KeyError, match="1"):
+            index.get_loc(1)
+
+    def test_get_loc_non_unique(self, any_string_dtype):
+        index = Index(["a", "b", "a"], dtype=any_string_dtype)
+        result = index.get_loc("a")
+        expected = np.array([True, False, True])
+        tm.assert_numpy_array_equal(result, expected)
+
+    def test_get_loc_non_missing(self, any_string_dtype, nulls_fixture):
+        index = Index(["a", "b", "c"], dtype=any_string_dtype)
+        with pytest.raises(KeyError):
+            index.get_loc(nulls_fixture)
+
+    def test_get_loc_missing(self, any_string_dtype, nulls_fixture):

So this test now means that you can use

The problem is that we are coercing any missing value indicator to NaN upon construction, and so to preserve back compat, I think I prefer we do the same for input to indexing operations. To express it in terms of get_loc, this works now:

>>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", None]).get_loc(None)
2

but the same on main with enabling the string dtype:

>>> pd.options.future.infer_string = True
>>> pd.Index(["a", "b", None]).get_loc(None)
...
KeyError: None

That is because now the None is no longer in the object dtype index, but has been coerced to NaN. The above is with

FWIW this is also already quite inconsistent depending on the data type .. See #59765 for an overview (e.g. also for datetimelike and categorical, we treat all NA-likes as the same in indexing lookups)

Nice - that's a great issue. Thanks for opening it.

Hmm I'm a bit confused by how this relates to all of the missing indicators becoming essentially equal though. On main, this does not work (?):

>>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", None]).get_loc(np.nan)
KeyError: nan

Definitely understand that there is not an ideal solution here given the inconsistent history, but I don't want to go too far and just start making all of the missing value indicators interchangeable. I think containment logic should land a little closer to equality logic, and in the latter we obviously don't allow this

Yes, that's the first bug that this PR is solving: right now no missing value lookup works, not even NaN itself (which is what is stored in the array). This is because the

So by using the ObjectEngine (subclass), that fixes that first issue: ensuring NaN can be found

Missing values don't compare equal (well,

Fair point on the equality. I guess I'm still hung up on the indexing behavior being the same though. I've lost track of the nuance a bit, but haven't np.nan and pd.NA always had different indexing behavior? I'm just wary of glossing over that as part of this. Maybe worth some input from @pandas-dev/pandas-core if anyone else has thoughts

I updated the PR to for now just enable exact matching missing values in

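The point about missing values not comparing equal can be illustrated in plain Python, without pandas. The sketch below is only an illustration of what "exact matching" of missing values could mean; `is_nan` and `get_loc_exact_missing` are hypothetical helpers invented for this example and are not the engine code from this PR.

```python
import math


def is_nan(value):
    """Return True only for a float NaN (None and strings are not NaN)."""
    return isinstance(value, float) and math.isnan(value)


def get_loc_exact_missing(values, key):
    """Hypothetical helper: first position of `key` in `values`.

    Missing values are matched exactly: None only matches None, and NaN
    only matches NaN. Raises KeyError when nothing matches.
    """
    for position, value in enumerate(values):
        if key is None:
            if value is None:
                return position
        elif is_nan(key):
            if is_nan(value):
                return position
        elif value == key:
            return position
    raise KeyError(key)


nan = float("nan")
print(nan == nan)  # False: NaN never compares equal, not even to itself

stored = ["a", "b", nan]
print(get_loc_exact_missing(stored, float("nan")))  # 2
```

Under this assumption, looking up `None` in `stored` raises `KeyError`, because the stored sentinel is NaN rather than None.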
+        index = Index(["a", "b", nulls_fixture], dtype=any_string_dtype)
+        assert index.get_loc(nulls_fixture) == 2
+
+
 class TestGetIndexer:
     @pytest.mark.parametrize(
         "method,expected",
@@ -41,21 +72,41 @@ def test_get_indexer_strings_raises(self, any_string_dtype):
                 ["a", "b", "c", "d"], method="pad", tolerance=[2, 2, 2, 2]
             )

+    @pytest.mark.parametrize("null", [None, np.nan, float("nan"), pd.NA])
+    def test_get_indexer_missing(self, any_string_dtype, null):
+        # NaT and Decimal("NaN") from null_fixture are not supported for string dtype
+        index = Index(["a", "b", null], dtype=any_string_dtype)
+        result = index.get_indexer(["a", null, "c"])
+        expected = np.array([0, 2, -1], dtype=np.intp)
+        tm.assert_numpy_array_equal(result, expected)
+

 class TestGetIndexerNonUnique:
-    @pytest.mark.xfail(reason="TODO(infer_string)", strict=False)
-    def test_get_indexer_non_unique_nas(self, any_string_dtype, nulls_fixture):
-        index = Index(["a", "b", None], dtype=any_string_dtype)
-        indexer, missing = index.get_indexer_non_unique([nulls_fixture])
+    @pytest.mark.parametrize("null", [None, np.nan, float("nan"), pd.NA])
+    def test_get_indexer_non_unique_nas(self, request, any_string_dtype, null):
+        if (
+            any_string_dtype == "string"
+            and any_string_dtype.na_value is pd.NA
+            and isinstance(null, float)
+        ):
+            # TODO(infer_string)
+            request.applymarker(
+                pytest.mark.xfail(
+                    reason="NA-variant string dtype does not work with NaN"
+                )
+            )
+
+        index = Index(["a", "b", null], dtype=any_string_dtype)
+        indexer, missing = index.get_indexer_non_unique([null])

         expected_indexer = np.array([2], dtype=np.intp)
         expected_missing = np.array([], dtype=np.intp)
         tm.assert_numpy_array_equal(indexer, expected_indexer)
         tm.assert_numpy_array_equal(missing, expected_missing)

         # actually non-unique
-        index = Index(["a", None, "b", None], dtype=any_string_dtype)
-        indexer, missing = index.get_indexer_non_unique([nulls_fixture])
+        index = Index(["a", null, "b", null], dtype=any_string_dtype)
+        indexer, missing = index.get_indexer_non_unique([null])

         expected_indexer = np.array([1, 3], dtype=np.intp)
         tm.assert_numpy_array_equal(indexer, expected_indexer)

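The return conventions these tests rely on can be sketched in plain Python. This is a rough illustration only, assuming `None` as the single missing-value sentinel and plain lists instead of NumPy arrays; `get_indexer_sketch` and `get_indexer_non_unique_sketch` are hypothetical names, not pandas functions.

```python
def get_indexer_sketch(values, targets):
    """Position of each target in `values` (assumed unique), or -1 if absent."""
    positions = {value: i for i, value in enumerate(values)}
    return [positions.get(target, -1) for target in targets]


def get_indexer_non_unique_sketch(values, targets):
    """Return (indexer, missing) for possibly duplicated `values`.

    `indexer` lists every matching position (or -1 when a target is absent);
    `missing` lists the positions within `targets` that were not found.
    """
    indexer, missing = [], []
    for target_pos, target in enumerate(targets):
        matches = [i for i, value in enumerate(values) if value == target]
        if matches:
            indexer.extend(matches)
        else:
            indexer.append(-1)
            missing.append(target_pos)
    return indexer, missing


print(get_indexer_sketch(["a", "b", None], ["a", None, "c"]))  # [0, 2, -1]
print(get_indexer_non_unique_sketch(["a", None, "b", None], [None]))  # ([1, 3], [])
```

The two printed results mirror the `expected` arrays in `test_get_indexer_missing` and in the "actually non-unique" part of `test_get_indexer_non_unique_nas`.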
Hmm would it be better to call this `StrEngine`? Or where does the term `StringObject` come from?

It was meant to be read as "string-objectengine", i.e. essentially just the object engine, but we know that we only use it for strings (and so the `_check_type` can be specialized).

But I don't mind the name exactly (although `StrEngine` might also be confusing, because we currently use this for both str and string dtypes)
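
To make the "object engine with a specialized `_check_type`" idea concrete, here is a minimal pure-Python sketch. Both classes are hypothetical stand-ins written for this illustration. They are not the Cython engines in pandas, and the sketch assumes unique values and ignores missing-value keys entirely.

```python
class ObjectEngineSketch:
    """Hypothetical stand-in for a hash-based object engine (not pandas' code)."""

    def __init__(self, values):
        self._values = list(values)
        # Map each value to its position; assumes unique, hashable values.
        self._mapping = {value: i for i, value in enumerate(self._values)}

    def _check_type(self, key):
        # The generic engine only requires that the key is hashable.
        hash(key)

    def get_loc(self, key):
        self._check_type(key)
        try:
            return self._mapping[key]
        except KeyError:
            raise KeyError(key) from None


class StringObjectEngineSketch(ObjectEngineSketch):
    """Same lookup, but the key check is specialized for strings."""

    def _check_type(self, key):
        # Assumption for this sketch: only str keys can ever be present,
        # so anything else is rejected up front with a KeyError.
        if not isinstance(key, str):
            raise KeyError(key)


engine = StringObjectEngineSketch(["a", "b", "c"])
print(engine.get_loc("b"))  # 1
```

With this sketch, `engine.get_loc(1)` raises `KeyError` before any hash lookup happens, which is the behavior `test_get_loc_invalid_value` checks for at the `Index` level.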