-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix Series constructor for Categorical with index #19714
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -690,6 +690,7 @@ Categorical | |||
- Bug in :meth:`Index.astype` with a categorical dtype where the resultant index is not converted to a :class:`CategoricalIndex` for all types of index (:issue:`18630`) | |||
- Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`) | |||
- Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`) | |||
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of incorrect length is given (:issue:`19342`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe say "index of different length". It could be the categorical that's incorrect :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point. Will do.
Codecov Report
@@ Coverage Diff @@
## master #19714 +/- ##
==========================================
+ Coverage 91.66% 91.66% +<.01%
==========================================
Files 150 150
Lines 48969 48975 +6
==========================================
+ Hits 44886 44892 +6
Misses 4083 4083
Continue to review full report at Codecov.
|
1434b63
to
c6b2016
Compare
map(lambda x: x, range(3))]) | ||
def test_constructor_index_mismatch(self, input): | ||
# GH 19342 | ||
pytest.raises(ValueError, Series, input, index=np.arange(4)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also check the error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do!
2c351ea
to
3bc499d
Compare
Hello @cbertinato! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on February 26, 2018 at 14:53 Hours UTC |
3bc499d
to
28b70b8
Compare
# raises an error | ||
idx = np.arange(4) | ||
if compat.PY2: | ||
typs = types.GeneratorType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need all of this, its just confusing just to construct an error message. just do a simpler check on the error.
pandas/core/series.py
Outdated
@@ -210,6 +210,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None, | |||
raise ValueError("cannot specify a dtype with a " | |||
"Categorical unless " | |||
"dtype='category'") | |||
if index is not None and len(index) != len(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should go a little further down after
if index is None:
....
else:
# this the check here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though maybe this should go in _sanitize_array
around L3242
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see an advantage to moving it into _sanitize_array
versus putting it in the if at L230. But I could be missing something. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it’s prob ok here but a bit lower
early failure is good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Proposed a placement in this next push. Placing it in the
if index is None:
...
at ~L226 as an else breaks some tests if data is a scalar or SingleBlockManager
. It's difficult to catch all cases if we put it in _sanitize_array
because of the returns in the if cases. So the next best place appears to be after the call to _sanitize_array
. Not ideal with regard to early failure, but really the only place that I can see to put a single check just to be able to do len(data)
. Not much different from the current location except that it catches cases other than Categorical
.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -690,6 +690,7 @@ Categorical | |||
- Bug in :meth:`Index.astype` with a categorical dtype where the resultant index is not converted to a :class:`CategoricalIndex` for all types of index (:issue:`18630`) | |||
- Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`) | |||
- Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`) | |||
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of different length is given (:issue:`19342`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this in reshaping
28b70b8
to
abe385d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you merge in master and fix the merge conflict. Also a couple linting errors.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -844,6 +844,7 @@ Reshaping | |||
- Improved error message for :func:`DataFrame.merge` when there is no common merge key (:issue:`19427`) | |||
- Bug in :func:`DataFrame.join` which does an *outer* instead of a *left* join when being called with multiple DataFrames and some have non-unique indices (:issue:`19624`) | |||
- :func:`Series.rename` now accepts ``axis`` as a kwarg (:issue:`18589`) | |||
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of different length is given (:issue:`19342`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify error
-> ValueError
@@ -5,6 +5,7 @@ | |||
|
|||
from datetime import datetime, timedelta | |||
from collections import OrderedDict | |||
import types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These imports aren't needed now.
abe385d
to
11522eb
Compare
pandas/core/series.py
Outdated
@@ -238,6 +239,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None, | |||
data = _sanitize_array(data, index, dtype, copy, | |||
raise_cast_failure=True) | |||
|
|||
if index is not None and len(index) != len(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should go a touch higher,
if index is None:
if not is_list_like(data):
data = [data]
index = com._default_index(len(data))
else:
# add here
# create/copy the manager
if isinstance(data, SingleBlockManager):
if dtype is not None:
data = data.astype(dtype=dtype, errors='ignore',
copy=copy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Added a scalar check that lets scalars through, so we are assuming that the Series is shaped correctly when the scalar is broadcast to fit the index, which is probably ok.
if index is None:
...
else:
if isscalar(data) and len(index) != len(data):
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind. A few other inputs that break. np.array
and np.dtype
appear to be just two of them. Unless we add specific checks for these, I think we may need to move it lower, below _sanitize_array
.
11522eb
to
98f1f16
Compare
pandas/core/series.py
Outdated
@@ -226,6 +227,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None, | |||
if not is_list_like(data): | |||
data = [data] | |||
index = com._default_index(len(data)) | |||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can make this an elif here
98f1f16
to
b6df1c8
Compare
i rebased. ping on green. |
pandas/core/series.py
Outdated
# a scalar numpy array is list-like but doesn't | ||
# have a proper length | ||
try: | ||
if len(data) > 1 and len(index) != len(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, did this change? 0-len should be ok, can you add a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0-len gets caught deeper, in the SingleBlockManager, but len 1 gets caught here. It should be let through to be broadcast in _sanitize_array
. I'll add a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, would be ok with catching both cases here, or are they different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I think catching the 0-len case here would be good for consistency. We don't want to catch the len 1 case because it will get broadcast, so it will look something like:
# a scalar numpy array is list-like but doesn't
# have a proper length
try:
if len(data) != 1 and len(index) != len(data):
Unless the intention is not to broadcast a list-like of length 1. One could argue that it would be better to raise an error instead of broadcasting. If one wanted to broadcast a scalar, then just pass a scalar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you would have to show the test which fails for this, len(data) == 0 is valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test test_apply_subset
in tests/io/formats/test_style.py raises an error. The traceback indicates the input to the Series constructor is:
data = ['color: baz'], index = RangeIndex(start=0, stop=2, step=1), dtype = None
This should be valid. Checking that len(data) != 1
lets this case pass.
@@ -418,8 +418,8 @@ def test_constructor_numpy_scalar(self): | |||
# GH 19342 | |||
# construction with a numpy scalar | |||
# should not raise | |||
result = Series(np.array(100), index=np.arange(4)) | |||
expected = Series(100, index=np.arange(4)) | |||
result = Series(np.array(100), index=np.arange(4), dtype='int64') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, ok thanks
pandas/core/series.py
Outdated
# a scalar numpy array is list-like but doesn't | ||
# have a proper length | ||
try: | ||
if len(data) > 1 and len(index) != len(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you would have to show the test which fails for this, len(data) == 0 is valid
pandas/core/series.py
Outdated
# a scalar numpy array is list-like but doesn't | ||
# have a proper length | ||
try: | ||
if len(data) != 1 and len(index) != len(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not convinced about this, what fails for len(data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remove len(data) != 1
and run python -m pytest pandas/tests/io/formats/test_style.py
I get:
try:
if len(index) != len(data):
raise ValueError(
'Length of passed values is {val}, '
'index implies {ind}'
> .format(val=len(data), ind=len(index)))
E ValueError: ('Length of passed values is 1, index implies 2', 'occurred at index A')
pandas/core/series.py:246: ValueError
I should add a test for this case in test_constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok add a test in test_constructors. which is this failing on in test_style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_apply_subset. Input to the Series constructor is:
data = ['color: baz'],
index = RangeIndex(start=0, stop=2, step=1),
dtype = None,
name = 'A',
copy = False,
fastpath = False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this should raise! if the data is a scalar this is ok, but we can't broadcast a list like that (well we can, but we shoudn't)
In [2]: data = ['color: baz']
...: index = pd.RangeIndex(start=0, stop=2, step=1)
...:
In [3]: pd.Series(data, index)
Out[3]:
0 color: baz
1 color: baz
dtype: object
Fixes Series constructor so that ValueError is raised when a Categorical and index of different length are given.
4540878
to
e756c7e
Compare
I agree. It shouldn’t broadcast a list like that. We can remove the check and see if there’s anywhere else where this breaks. If not, then fix the test in test_style? |
I wasn’t sure whether anything else relied on this behavior. |
yes, and add this as an additional test in test_constructor. |
Modified test setup in io/formats/test_style.py accordingly
thanks @cbertinato sometimes the seemingly small changes are hard! |
Thanks for the help and advice! |
Fixes Series constructor so that ValueError is raised when a Categorical and index of incorrect length are given. Closes issue #19342
git diff upstream/master -u -- "*.py" | flake8 --diff