Construction of Series from dict containing NaN as key #18496
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -205,4 +205,5 @@ Other

- Improved error message when attempting to use a Python keyword as an identifier in a numexpr query (:issue:`18221`)
- Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
- Fixed initialization of Series from dict containing NaN as key (:issue:`18480`)
initialization -> construction
use double-backticks around Series, NaN and dict
pandas/core/series.py
Outdated
if data else np.nan)

remap_to_mi = False
keys = maybe_mi_keys
this is WAY too complicated, likely slow. pls try to simplify. at the very least, should all be split out to a more accessible location, e.g. move to a helper function, below _sanitize_array is prob ok. call it _dict_to_array or something.
going to need quite some simplification. I am ok with throwing out some cases if that will help.
@@ -625,6 +625,18 @@ def test_constructor_dict(self):
expected.iloc[1] = 1
assert_series_equal(result, expected)

# GH 18480 - NaN key
separate test
f1a10df to d3d9f45
Codecov Report
@@ Coverage Diff @@
## master #18496 +/- ##
==========================================
+ Coverage 91.3% 91.31% +<.01%
==========================================
Files 163 163
Lines 49781 49817 +36
==========================================
+ Hits 45451 45488 +37
+ Misses 4330 4329 -1
Codecov Report
@@ Coverage Diff @@
## master #18496 +/- ##
==========================================
- Coverage 91.44% 91.43% -0.02%
==========================================
Files 157 157
Lines 51379 51374 -5
==========================================
- Hits 46985 46972 -13
- Misses 4394 4402 +8
6d701ec to b10b0ce
pandas/core/series.py
Outdated
remap_to_mi = False
if data:
keys, values = zip(*compat.iteritems(data))
# Workaround for #18485 - part 1/3
as I said before, this giant block of code is not acceptable here. it needs to be moved to a separate function, and made much simpler.
It is smaller than before, but no problem, I'll isolate it. Any comment on the logic?
I mean: any comment other than "made much simpler", which is everybody's dream more than a review comment.
my point is the logic is so dense it's not even understandable. this is too complex; no-one will be able to follow it or vet it for bugs. you need to break this apart into logical abstractions and build things up from much smaller units.
OK, good!
@toobaz don't get me wrong, I am happy to have a fix for this, but it has to be code that is maintainable, limits special cases, and is not completely non-performant.
@jorisvandenbossche @jreback Bringing here a conversation from Gitter: one possible way to approach this (avoiding the problem that |
@toobaz happy to have you modify cython code directly to avoid perf penalty (and it also abstracts away some of the complexity). inline comments a plus!
@toobaz sidenote: can you give the PR a bit more descriptive title?
69b7a12 to 236df68
Since I introduced no (explicit) loops I preferred to keep code in I could add a test at the beginning of |
pandas/core/indexes/base.py
Outdated
locs = self.get_indexer(keys)
order = - np.ones(len(self), dtype=int)
order[locs] = np.arange(len(keys))
values = values[order]
Isn't this just a `reindex`?
Yes, I just wanted to avoid creating a `Series` object, particularly so inside an `Index` method... but I can do it if you prefer.
Since this method is only used in `Series.__init__`, I would then try to refactor to do the reindex there; that would eliminate much of this custom code?
Yes... (that would basically be my previous version of this PR, except that I would move code from `Series.__init__` to `Series._init_from_dict`). Notice however that if we leave `Index._get_values_from_dict` as it is... it is broken (doesn't work for NaNs).
But your previous PR had a lot of complex code checking for tuples etc (I think that part was the main objection), while this one does not have that any more.
OK, then if you're OK with leaving `Index._get_values_from_dict` as it is, I will try again.
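For readers following the "isn't this just a reindex" exchange above, here is a minimal sketch of the idea, not the code from the PR. It assumes a recent pandas; the names `keys`, `values` and `target` are made up for illustration. `Index.get_indexer` reports where each key sits in the target index (`-1` when absent), and `Series.reindex` uses the same kind of lookup to align values and fill the gaps with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: dict keys/values already split into two sequences.
keys = ["b", "a", "c"]
values = [20, 10, 30]
target = pd.Index(["a", "b", "d"])

# Position of each key inside the target index; -1 means "not present".
locs = target.get_indexer(keys)

# The same alignment expressed as a plain reindex: labels missing from
# the source ("d") become NaN, labels missing from the target ("c") are dropped.
aligned = pd.Series(values, index=keys).reindex(target)
print(locs)
print(aligned)
```

Using `reindex` keeps the missing-label handling in one well-tested place instead of hand-rolled position arrays.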
assert_series_equal(result, expected)

# Different NaNs:
d = {1: 'a', 2: 'b', float('nan'): 'c', float('nan'): 'd'}
Can you add a test with `None` key as well? (which already works, but would be nice to assert)
(done below)
# Different NaNs:
d = {1: 'a', 2: 'b', float('nan'): 'c', float('nan'): 'd'}
result = Series(d).sort_values()
expected = Series(['a', 'b', 'c', 'd'], index=[1, 2, np.nan, np.nan])
Is the order of the expected result guaranteed like this? (I mean: I thought we sort the index, but not the values?)
That's why I'm `sort_values()`ing just above ;-)
Ah, didn't see that :-)
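The "Different NaNs" test above relies on a plain Python fact that is easy to miss, so here is a small standalone sketch of it (pure Python, no pandas; variable names are illustrative). NaN never compares equal to anything, so two separately created NaN floats are distinct dict keys, and looking a NaN up only succeeds when the very same object is used.

```python
# Two separate float('nan') objects never compare equal, so a dict
# keeps both of them as distinct keys.
d = {1: "a", 2: "b", float("nan"): "c", float("nan"): "d"}
n_keys = len(d)

nan = float("nan")
lookup = {nan: 1}

# Same object: dict lookup checks identity before equality, so this works.
same_object_hit = lookup[nan]

# A freshly created NaN is a different object and is not equal to the
# stored key, so the lookup raises KeyError.
try:
    lookup[float("nan")]
    fresh_nan_found = True
except KeyError:
    fresh_nan_found = False

print(n_keys, same_object_hit, fresh_nan_found)
```

This is why a key-by-key lookup into the dict cannot be trusted for NaN keys, and why iterating over the dict's items is the safer route.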
Suggestion (but didn't think it fully through): we can make a distinction on whether That doesn't solve the problem with |
In my current version the lookup is not a bottleneck because it is just an iteration on keys/values, precisely what is needed anyway in the "no index" case (where, by the way, no reindexing happens because the indexes are |
04aada8 to d50f170
@@ -181,7 +181,7 @@ def test_concat_empty_series_dtypes(self):
# categorical
assert pd.concat([Series(dtype='category'),
Series(dtype='category')]).dtype == 'category'
assert pd.concat([Series(dtype='category'),
assert pd.concat([Series(np.array([]), dtype='category'),
See #18515
maybe add this issue number here as well
@jreback ping
much simpler thanks, some comments.
pandas/core/series.py
Outdated
@@ -303,6 +293,23 @@ def _can_hold_na(self):

_index = None

def _init_from_dict(self, data, index, dtype):
# Looking for NaN in dict doesn't work ({np.nan : 1}[float('nan')]
can you make this a proper doc-string
Just to double check: given the size of the method now, do you still prefer it isolated (rather than having exactly the same lines of code in `__init__`)?
yes absolutely, having these separate makes them much easier to grok.
pandas/core/series.py
Outdated
keys, values = zip(*compat.iteritems(data))
else:
keys, values = [], []
s = Series(values, index=keys, dtype=dtype)
can you add some comments here
expected = Series(['a', 'b', 'c', 'd'], index=[1, 2, np.nan, None])
assert_series_equal(result, expected)

# MultiIndex:
I would add another test that parametrizes over the missing values specifically (with a fixed structure for d), e.g. parametrize over None, np.nan, float('nan')
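A rough sketch of what such a check could look like, written as a plain loop rather than with `pytest.mark.parametrize` so that it runs standalone. This is not the test that was added in the PR; the helper name `value_at_missing_key` is hypothetical, and it assumes a pandas version that already contains this fix.

```python
import numpy as np
import pandas as pd


def value_at_missing_key(missing):
    """Build a Series from a fixed dict shape and return the value
    stored under the single missing-like key (hypothetical helper)."""
    d = {1: "a", 2: "b", missing: "c"}
    s = pd.Series(d)
    # pd.isna treats None and NaN alike, whatever dtype the index ends up with.
    mask = pd.isna(s.index)
    assert len(s) == 3 and mask.sum() == 1
    return s.to_numpy()[mask][0]


# The three flavours of "missing" named in the review comment.
results = [value_at_missing_key(m) for m in (None, np.nan, float("nan"))]
print(results)
```

In the pandas test suite the same loop would more naturally be a parametrized test.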
pandas/core/base.py
Outdated
@@ -875,7 +875,7 @@ def _map_values(self, mapper, na_action=None):
# we specify the keys here to handle the
# possibility that they are tuples
from pandas import Series, Index
index = Index(mapper, tupleize_cols=False)
can you add a test that hits this (IOW that makes you change this).
Already there (notice here I'm just reverting my previous PR for consistency with the new default behaviour)... shall I mention this issue in that test?
oh, then add this issue number there as well
pandas/core/series.py
Outdated
@@ -303,6 +293,23 @@ def _can_hold_na(self):

_index = None

def _init_from_dict(self, data, index, dtype):
In `frame.py` it is called `_init_dict`, so let's use that, and I would move it directly after the `__init__`
index = Index(_try_sort(data))

try:
data = index._get_values_from_dict(data)
I think this was the only usage of `_get_values_from_dict`, so this could be cleaned up. There are also multiple implementations (for the different types of indices; not sure if those differences are important and are all caught in the new implementation).
The different implementations seem unused and broken, see for instance
In [2]: pd.Index([])._get_values_from_dict({})
Out[2]: array([], dtype=float64)
In [3]: pd.DatetimeIndex([])._get_values_from_dict({})
Out[3]: array([ nan])
however, in principle they do something sensible (not necessarily expected), which is to look for `Timestamp` keys in the dict. The "new implementation", that is the `Series` construction, doesn't care about this (and shouldn't, I think).
I'm OK with removing all of this if you want (in another PR).
It can be removed here (it is this PR that changes the implementation)
done, ping
d50f170 to 4887f78
43e97ca to 1ee3c3e
Small question on a test, for the rest looks good!
@@ -181,7 +181,8 @@ def test_concat_empty_series_dtypes(self):
# categorical
assert pd.concat([Series(dtype='category'),
Series(dtype='category')]).dtype == 'category'
assert pd.concat([Series(dtype='category'),
# GH 18515
assert pd.concat([Series(np.array([]), dtype='category'),
Why do you change this? (it's only the dtype of the category, it is still a categorical series, so for the concat that does not matter)
Yes, the dtype of the category does matter for concat (and rightly so, since conceptually the fact that a categorical of ints is really a categorical is only an implementation detail when you're going to concat it to a non-categorical anyway).
(... or just try that test before and after my change)
How does this PR change the behaviour of empty category series?
Is it clear to you that that test fails with this PR? If yes, please clarify/reformulate what you're asking for.
If this PR changes (fixes) the dtype of categories for `pd.Series(dtype='category')`, please specify so and add an explicit test for this (and a whatsnew bug fix note).
So `data is None` is handled with `data = {}`, and thus this PR affects this behaviour? So this does fix #18515
OK, asking is good. This PR makes (incidentally, but it is the correct behaviour) category series initialized with an empty `dict` behave like category series initialized with an empty `list`, that is, have `object` dtype. This fixes #18515 (but not #17261). Since this test is about concatenating an empty float category `Series` to an empty float (non-category) `Series`, I had to fix the former so that it still had dtype `float64`.
So since you reopened #18515, now I will add "closes #18515" and push again.
OK?
Yep.
And in general, please add commits instead of amending if you make such additional changes. Makes reviewing easier.
done, ping
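To make the point about the dtype of the (empty) categories concrete, the snippet below prints it for three ways of building an empty categorical `Series`. It is only an inspection sketch: the variable names are illustrative, and the printed dtypes depend on the pandas version in use, so no expected output is claimed here.

```python
import numpy as np
import pandas as pd

# Three ways to build an empty categorical Series.
no_data = pd.Series(dtype="category")
from_list = pd.Series([], dtype="category")
from_array = pd.Series(np.array([]), dtype="category")

# All three are categorical; what may differ is the dtype of the
# (empty) categories, which is what matters when concatenating
# with a non-categorical Series.
for name, s in [("no data", no_data), ("list", from_list), ("array", from_array)]:
    print(name, s.dtype, s.cat.categories.dtype)
```

Running it before and after a change like this one is a quick way to see which constructors are affected.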
d1804f2 to 2d75ffc
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -208,5 +209,6 @@ Other

- Improved error message when attempting to use a Python keyword as an identifier in a numexpr query (:issue:`18221`)
- Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`)
- Fixed construction of :class:`Series` from ``dict`` containing ``NaN`` as key (:issue:`18480`)
Construction of a :class:`Series` from a dict
pandas/core/series.py
Outdated
keys, values = zip(*compat.iteritems(data))
else:
keys, values = [], []
# Input is now list-like, so rely on "standard" construction:
blank line here
pandas/core/series.py
Outdated
keys, values = [], []
# Input is now list-like, so rely on "standard" construction:
s = Series(values, index=keys, dtype=dtype)
# Now we just make sure the order is respected, if any
same
pandas/core/series.py
Outdated
s = Series(values, index=keys, dtype=dtype)
# Now we just make sure the order is respected, if any
if index is not None and not index.identical(keys):
s = s.reindex(index)
you could do:
if index is not None:
s = s.reindex(index, copy=False)
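Putting the fragments quoted in this thread together, the approach can be sketched as a standalone function. This is an illustration under assumptions, not the method as merged: `series_from_dict` is a hypothetical name, it uses `dict.items()` instead of `compat.iteritems`, and it leaves out the `copy` argument from the suggestion above.

```python
import numpy as np
import pandas as pd


def series_from_dict(data, index=None, dtype=None):
    """Hypothetical standalone version of the dict-to-Series logic."""
    # Iterate over the items instead of looking keys up in the dict,
    # since a lookup by NaN key is unreliable.
    if data:
        keys, values = zip(*data.items())
    else:
        keys, values = [], []

    # Input is now list-like, so rely on "standard" construction.
    s = pd.Series(list(values), index=list(keys), dtype=dtype)

    # Only when the caller asked for a specific index, align to it.
    if index is not None:
        s = s.reindex(index)
    return s


example = series_from_dict({"a": 1, "b": 2}, index=["b", "a", "z"])
with_nan_key = series_from_dict({1: "a", float("nan"): "c"})
print(example)
print(with_nan_key)
```

The design choice discussed in the review is visible here: all special cases live in the regular list-like constructor and in `reindex`, so the dict path itself stays a few lines long.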
# the are Index() and RangeIndex() which don't compare type equal
empty2 = Series(input_class())
# these are Index() and RangeIndex() which don't compare type equal
blank line before comments
# but are just .equals
assert_series_equal(empty, empty2, check_index_type=False)

empty = Series(index=lrange(10))
so I think these tests got eliminated? can you add another test (or add onto the construction via np.nan, None, float('nan')) which hits this case?
This test is wrong (dtype shouldn't be `float64`), the right one would fail, see #17261. Adding an xfailing test.
The xfailing test would basically be this one with explicit dtype, it's better to change it when fixing #17261. Adding a fixed test for this instead.
done, ping
2d75ffc to 1582c42
thanks!
From pandas-dev#18496 Special cases empty series construction, since the reindex is not necessary.
closes Initialization of Series from dict disregards np.nan key #18480
closes Inconsistent dtype of category in empty Series between dict and list input #18515
tests added / passed
passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
whatsnew entry
This is also a prerequisite for fixing #18455 (which is a prerequisite for fixing #18460). The workaround for #18485 is annoying, but it is easy to remove once the bug is fixed.