Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix Series constructor for Categorical with index #19714

Merged
merged 9 commits into from
Feb 27, 2018

Conversation

cbertinato
Copy link
Contributor

Fixes Series constructor so that ValueError is raised when a Categorical and index of incorrect length are given. Closes issue #19342

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks.

@@ -690,6 +690,7 @@ Categorical
- Bug in :meth:`Index.astype` with a categorical dtype where the resultant index is not converted to a :class:`CategoricalIndex` for all types of index (:issue:`18630`)
- Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`)
- Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`)
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of incorrect length is given (:issue:`19342`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe say "index of different length". It could be the categorical that's incorrect :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. Will do.

@codecov
Copy link

codecov bot commented Feb 15, 2018

Codecov Report

Merging #19714 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19714      +/-   ##
==========================================
+ Coverage   91.66%   91.66%   +<.01%     
==========================================
  Files         150      150              
  Lines       48969    48975       +6     
==========================================
+ Hits        44886    44892       +6     
  Misses       4083     4083
Flag Coverage Δ
#multiple 90.04% <100%> (ø) ⬆️
#single 41.85% <50%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/series.py 94.44% <100%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e4c50a...f5db9ab. Read the comment docs.

@gfyoung gfyoung added Bug Categorical Categorical Data Type labels Feb 15, 2018
map(lambda x: x, range(3))])
def test_constructor_index_mismatch(self, input):
# GH 19342
pytest.raises(ValueError, Series, input, index=np.arange(4))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also check the error message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

@cbertinato cbertinato force-pushed the issue-19342 branch 2 times, most recently from 2c351ea to 3bc499d Compare February 16, 2018 11:54
@pep8speaks
Copy link

pep8speaks commented Feb 16, 2018

Hello @cbertinato! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 26, 2018 at 14:53 Hours UTC

# raises an error
idx = np.arange(4)
if compat.PY2:
typs = types.GeneratorType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need all of this, its just confusing just to construct an error message. just do a simpler check on the error.

@@ -210,6 +210,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
raise ValueError("cannot specify a dtype with a "
"Categorical unless "
"dtype='category'")
if index is not None and len(index) != len(data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should go a little further down after

if index is None:
    ....
else:
      # this the check here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though maybe this should go in _sanitize_array around L3242

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an advantage to moving it into _sanitize_array versus putting it in the if at L230. But I could be missing something. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it’s prob ok here but a bit lower
early failure is good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Proposed a placement in this next push. Placing it in the

if index is None:
    ...

at ~L226 as an else breaks some tests if data is a scalar or SingleBlockManager. It's difficult to catch all cases if we put it in _sanitize_array because of the returns in the if cases. So the next best place appears to be after the call to _sanitize_array. Not ideal with regard to early failure, but really the only place that I can see to put a single check just to be able to do len(data). Not much different from the current location except that it catches cases other than Categorical.

@@ -690,6 +690,7 @@ Categorical
- Bug in :meth:`Index.astype` with a categorical dtype where the resultant index is not converted to a :class:`CategoricalIndex` for all types of index (:issue:`18630`)
- Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`)
- Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`)
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of different length is given (:issue:`19342`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this in reshaping

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you merge in master and fix the merge conflict. Also a couple linting errors.

@@ -844,6 +844,7 @@ Reshaping
- Improved error message for :func:`DataFrame.merge` when there is no common merge key (:issue:`19427`)
- Bug in :func:`DataFrame.join` which does an *outer* instead of a *left* join when being called with multiple DataFrames and some have non-unique indices (:issue:`19624`)
- :func:`Series.rename` now accepts ``axis`` as a kwarg (:issue:`18589`)
- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of different length is given (:issue:`19342`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify error -> ValueError

@@ -5,6 +5,7 @@

from datetime import datetime, timedelta
from collections import OrderedDict
import types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These imports aren't needed now.

@jreback jreback added this to the 0.23.0 milestone Feb 21, 2018
@@ -238,6 +239,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
data = _sanitize_array(data, index, dtype, copy,
raise_cast_failure=True)

if index is not None and len(index) != len(data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should go a touch higher,

            if index is None:
                if not is_list_like(data):
                    data = [data]
                index = com._default_index(len(data))
           else:
                # add here

            # create/copy the manager
            if isinstance(data, SingleBlockManager):
                if dtype is not None:
                    data = data.astype(dtype=dtype, errors='ignore',
                                       copy=copy)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Added a scalar check that lets scalars through, so we are assuming that the Series is shaped correctly when the scalar is broadcast to fit the index, which is probably ok.

if index is None:
    ...
else:
    if isscalar(data) and len(index) != len(data):
        ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind. A few other inputs that break. np.array and np.dtype appear to be just two of them. Unless we add specific checks for these, I think we may need to move it lower, below _sanitize_array.

@@ -226,6 +227,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
if not is_list_like(data):
data = [data]
index = com._default_index(len(data))
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make this an elif here

@jreback
Copy link
Contributor

jreback commented Feb 24, 2018

i rebased. ping on green.

# a scalar numpy array is list-like but doesn't
# have a proper length
try:
if len(data) > 1 and len(index) != len(data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, did this change? 0-len should be ok, can you add a test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0-len gets caught deeper, in the SingleBlockManager, but len 1 gets caught here. It should be let through to be broadcast in _sanitize_array. I'll add a test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, would be ok with catching both cases here, or are they different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I think catching the 0-len case here would be good for consistency. We don't want to catch the len 1 case because it will get broadcast, so it will look something like:

                # a scalar numpy array is list-like but doesn't
                # have a proper length
                try:
                    if len(data) != 1 and len(index) != len(data):

Unless the intention is not to broadcast a list-like of length 1. One could argue that it would be better to raise an error instead of broadcasting. If one wanted to broadcast a scalar, then just pass a scalar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you would have to show the test which fails for this, len(data) == 0 is valid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test test_apply_subset in tests/io/formats/test_style.py raises an error. The traceback indicates the input to the Series constructor is:

data = ['color: baz'], index = RangeIndex(start=0, stop=2, step=1), dtype = None

This should be valid. Checking that len(data) != 1 lets this case pass.

@@ -418,8 +418,8 @@ def test_constructor_numpy_scalar(self):
# GH 19342
# construction with a numpy scalar
# should not raise
result = Series(np.array(100), index=np.arange(4))
expected = Series(100, index=np.arange(4))
result = Series(np.array(100), index=np.arange(4), dtype='int64')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, ok thanks

# a scalar numpy array is list-like but doesn't
# have a proper length
try:
if len(data) > 1 and len(index) != len(data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you would have to show the test which fails for this, len(data) == 0 is valid

# a scalar numpy array is list-like but doesn't
# have a proper length
try:
if len(data) != 1 and len(index) != len(data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not convinced about this, what fails for len(data)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove len(data) != 1 and run python -m pytest pandas/tests/io/formats/test_style.py I get:

                try:
                    if len(index) != len(data):
                        raise ValueError(
                            'Length of passed values is {val}, '
                            'index implies {ind}'
>                           .format(val=len(data), ind=len(index)))
E                           ValueError: ('Length of passed values is 1, index implies 2', 'occurred at index A')

pandas/core/series.py:246: ValueError

I should add a test for this case in test_constructors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok add a test in test_constructors. which is this failing on in test_style?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_apply_subset. Input to the Series constructor is:

data = ['color: baz'],
index = RangeIndex(start=0, stop=2, step=1), 
dtype = None, 
name = 'A', 
copy = False, 
fastpath = False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this should raise! if the data is a scalar this is ok, but we can't broadcast a list like that (well we can, but we shoudn't)

In [2]: data = ['color: baz']
   ...: index = pd.RangeIndex(start=0, stop=2, step=1)
   ...: 

In [3]: pd.Series(data, index)
Out[3]: 
0    color: baz
1    color: baz
dtype: object

cbertinato and others added 4 commits February 26, 2018 08:35
Fixes Series constructor so that ValueError is raised when a Categorical and index of different length are given.
@cbertinato
Copy link
Contributor Author

I agree. It shouldn’t broadcast a list like that. We can remove the check and see if there’s anywhere else where this breaks. If not, then fix the test in test_style?

@cbertinato
Copy link
Contributor Author

I wasn’t sure whether anything else relied on this behavior.

@jreback
Copy link
Contributor

jreback commented Feb 26, 2018

I agree. It shouldn’t broadcast a list like that. We can remove the check and see if there’s anywhere else where this breaks. If not, then fix the test in test_style?

yes, and add this as an additional test in test_constructor.

Modified test setup in io/formats/test_style.py accordingly
@jreback jreback merged commit e51800b into pandas-dev:master Feb 27, 2018
@jreback
Copy link
Contributor

jreback commented Feb 27, 2018

thanks @cbertinato sometimes the seemingly small changes are hard!

@cbertinato
Copy link
Contributor Author

Thanks for the help and advice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants