BUG: Fix Series constructor for Categorical with index #19714

cbertinato · 2018-02-15T14:04:56Z

Fixes Series constructor so that ValueError is raised when a Categorical and index of incorrect length are given. Closes issue #19342

tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
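
For illustration, the behavior this PR targets can be reproduced roughly as follows. This is a minimal sketch assuming a pandas version that includes the fix; the variable names are illustrative. The exact error text differs between pandas versions, so it is printed rather than compared.

```python
import pandas as pd

# Three categorical values but a four-element index: the lengths disagree.
cat = pd.Categorical(["a", "b", "c"])
index = range(4)

try:
    pd.Series(cat, index=index)
    raised = False
except ValueError as exc:
    # With the fix in place, construction fails instead of silently succeeding.
    raised = True
    print("ValueError:", exc)
```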

TomAugspurger

Looks good! Thanks.

TomAugspurger · 2018-02-15T14:23:32Z

doc/source/whatsnew/v0.23.0.txt

@@ -690,6 +690,7 @@ Categorical
 - Bug in :meth:`Index.astype` with a categorical dtype where the resultant index is not converted to a :class:`CategoricalIndex` for all types of index (:issue:`18630`)
 - Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`)
 - Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`)
+- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of incorrect length is given (:issue:`19342`)


Maybe say "index of different length". It could be the categorical that's incorrect :)

Very good point. Will do.

codecov · 2018-02-15T14:52:05Z

Codecov Report

Merging #19714 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #19714      +/-   ##
==========================================
+ Coverage   91.66%   91.66%   +<.01%     
==========================================
  Files         150      150              
  Lines       48969    48975       +6     
==========================================
+ Hits        44886    44892       +6     
  Misses       4083     4083

Flag	Coverage Δ
#multiple	`90.04% <100%> (ø)`	⬆️
#single	`41.85% <50%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/series.py	`94.44% <100%> (+0.03%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1e4c50a...f5db9ab.

gfyoung · 2018-02-15T19:49:44Z

pandas/tests/series/test_constructors.py

+                                       map(lambda x: x, range(3))])
+    def test_constructor_index_mismatch(self, input):
+        # GH 19342
+        pytest.raises(ValueError, Series, input, index=np.arange(4))


Let's also check the error message.
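
A sketch of what checking the message, and not just the exception type, might look like. `check_length_match` is a hypothetical stand-in for the constructor's validation, not a pandas function; its message format mirrors the one quoted later in this thread. The standard library's `unittest` is used only to keep the sketch dependency-free, whereas the test in this PR uses pytest.

```python
import unittest


def check_length_match(data, index):
    """Hypothetical stand-in for the constructor's length validation."""
    if len(index) != len(data):
        raise ValueError(
            "Length of passed values is {val}, index implies {ind}".format(
                val=len(data), ind=len(index)
            )
        )


case = unittest.TestCase()

# Assert on the message as well as the exception type, so an unrelated
# ValueError raised elsewhere cannot make the test pass by accident.
with case.assertRaisesRegex(
    ValueError, r"Length of passed values is 3, index implies 4"
):
    check_length_match(["a", "b", "c"], range(4))

message_checked = True
```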

pep8speaks · 2018-02-16T11:54:49Z

Hello @cbertinato! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 26, 2018 at 14:53 Hours UTC

jreback · 2018-02-18T16:03:04Z

pandas/tests/series/test_constructors.py

+        # raises an error
+        idx = np.arange(4)
+        if compat.PY2:
+            typs = types.GeneratorType


You don't need all of this; it's confusing just to construct an error message. Just do a simpler check on the error.

jreback · 2018-02-18T16:06:08Z

pandas/core/series.py

@@ -210,6 +210,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
                    raise ValueError("cannot specify a dtype with a "
                                     "Categorical unless "
                                     "dtype='category'")
+                if index is not None and len(index) != len(data):


this should go a little further down after

if index is None:
    ....
else:
    # the check goes here

though maybe this should go in _sanitize_array around L3242

I don't see an advantage to moving it into _sanitize_array versus putting it in the if at L230. But I could be missing something. What do you think?

it’s prob ok here but a bit lower
early failure is good

Ok. Proposed a placement in this next push. Placing it in the

if index is None: ...

at ~L226 as an else breaks some tests if data is a scalar or SingleBlockManager. It's difficult to catch all cases if we put it in _sanitize_array because of the returns in the if cases. So the next best place appears to be after the call to _sanitize_array. Not ideal with regard to early failure, but really the only place that I can see to put a single check just to be able to do len(data). Not much different from the current location except that it catches cases other than Categorical.
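
The trade-off described above can be sketched in plain Python. `sanitize` and `build` are hypothetical simplifications, not `_sanitize_array` or the real constructor. The point is that a scalar has no length until it has been normalized against the index, so a single length check is only safe once that step has run.

```python
def sanitize(data, index):
    """Hypothetical normalization: broadcast scalars, listify everything else."""
    if isinstance(data, (str, bytes)) or not hasattr(data, "__len__"):
        # Scalars have no usable length; repeat them to fit the index.
        return [data] * len(index)
    return list(data)


def build(data, index):
    """Normalize first, then run one length check that covers every input type."""
    values = sanitize(data, index)
    if len(index) != len(values):
        raise ValueError(
            "Length of passed values is {val}, index implies {ind}".format(
                val=len(values), ind=len(index)
            )
        )
    return dict(zip(index, values))


ok = build(5, range(3))  # the scalar is broadcast before the check runs

try:
    build(["a", "b", "c"], range(4))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

The cost, as noted, is that the failure happens later than it would with a check at the top of the constructor.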

jreback · 2018-02-18T16:07:37Z

doc/source/whatsnew/v0.23.0.txt

@@ -690,6 +690,7 @@ Categorical
 - Bug in :meth:`Index.astype` with a categorical dtype where the resultant index is not converted to a :class:`CategoricalIndex` for all types of index (:issue:`18630`)
 - Bug in :meth:`Series.astype` and ``Categorical.astype()`` where an existing categorical data does not get updated (:issue:`10696`, :issue:`18593`)
 - Bug in :class:`Index` constructor with ``dtype=CategoricalDtype(...)`` where ``categories`` and ``ordered`` are not maintained (issue:`19032`)
+- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of different length is given (:issue:`19342`)


put this in reshaping

TomAugspurger

Could you merge in master and fix the merge conflict? There are also a couple of linting errors.

TomAugspurger · 2018-02-20T12:23:21Z

doc/source/whatsnew/v0.23.0.txt

@@ -844,6 +844,7 @@ Reshaping
 - Improved error message for :func:`DataFrame.merge` when there is no common merge key (:issue:`19427`)
 - Bug in :func:`DataFrame.join` which does an *outer* instead of a *left* join when being called with multiple DataFrames and some have non-unique indices (:issue:`19624`)
 - :func:`Series.rename` now accepts ``axis`` as a kwarg (:issue:`18589`)
+- Bug in :class:`Series` constructor with ``Categorical`` where an error is not raised when an index of different length is given (:issue:`19342`)


Could you clarify error -> ValueError

TomAugspurger · 2018-02-20T12:23:47Z

pandas/tests/series/test_constructors.py

@@ -5,6 +5,7 @@

 from datetime import datetime, timedelta
 from collections import OrderedDict
+import types


These imports aren't needed now.

jreback · 2018-02-22T00:00:13Z

pandas/core/series.py

@@ -238,6 +239,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
                data = _sanitize_array(data, index, dtype, copy,
                                       raise_cast_failure=True)

+                if index is not None and len(index) != len(data):


this should go a touch higher,

if index is None:
    if not is_list_like(data):
        data = [data]
    index = com._default_index(len(data))
else:
    # add here

# create/copy the manager
if isinstance(data, SingleBlockManager):
    if dtype is not None:
        data = data.astype(dtype=dtype, errors='ignore', copy=copy)

Ok. Added a scalar check that lets scalars through, so we are assuming that the Series is shaped correctly when the scalar is broadcast to fit the index, which is probably ok.

if index is None:
    ...
else:
    if isscalar(data) and len(index) != len(data):
        ...

Never mind. A few other inputs break with this; np.array and np.dtype appear to be just two of them. Unless we add specific checks for these, I think we may need to move it lower, below _sanitize_array.

jreback · 2018-02-23T01:23:54Z

pandas/core/series.py

@@ -226,6 +227,11 @@ def __init__(self, data=None, index=None, dtype=None, name=None,
                if not is_list_like(data):
                    data = [data]
                index = com._default_index(len(data))
+            else:


can make this an elif here

jreback · 2018-02-24T14:48:08Z

i rebased. ping on green.

jreback · 2018-02-25T16:30:34Z

pandas/core/series.py

+                # a scalar numpy array is list-like but doesn't
+                # have a proper length
+                try:
+                    if len(data) > 1 and len(index) != len(data):


hmm, did this change? 0-len should be ok, can you add a test

0-len gets caught deeper, in the SingleBlockManager, but len 1 gets caught here. It should be let through to be broadcast in _sanitize_array. I'll add a test.

hmm, would be ok with catching both cases here, or are they different?

Yeah. I think catching the 0-len case here would be good for consistency. We don't want to catch the len 1 case because it will get broadcast, so it will look something like:

# a scalar numpy array is list-like but doesn't
# have a proper length
try:
    if len(data) != 1 and len(index) != len(data):

Unless the intention is not to broadcast a list-like of length 1. One could argue that it would be better to raise an error instead of broadcasting. If one wanted to broadcast a scalar, then just pass a scalar.

you would have to show the test which fails for this, len(data) == 0 is valid

The test test_apply_subset in tests/io/formats/test_style.py raises an error. The traceback indicates the input to the Series constructor is:

data = ['color: baz'], index = RangeIndex(start=0, stop=2, step=1), dtype = None

This should be valid. Checking that len(data) != 1 lets this case pass.
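
To make the condition under discussion concrete, here is a plain-Python sketch of the `len(data) != 1` variant. `needs_error` is a hypothetical helper, and `Unsized` is a made-up stand-in for an object such as a zero-dimensional NumPy array, whose `len()` raises `TypeError`.

```python
class Unsized:
    """Made-up stand-in for an object whose len() raises TypeError."""

    def __len__(self):
        raise TypeError("len() of unsized object")


def needs_error(data, index):
    """Return True when the `len(data) != 1` variant of the check would raise."""
    try:
        return len(data) != 1 and len(index) != len(data)
    except TypeError:
        # No usable length: treat it like a scalar and let it broadcast.
        return False


single = needs_error(["color: baz"], range(2))  # length 1 is let through
empty = needs_error([], range(2))               # 0-length is caught
three = needs_error(["a", "b", "c"], range(4))  # ordinary mismatch is caught
unsized = needs_error(Unsized(), range(4))      # scalar-like input is let through
```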

jreback · 2018-02-25T16:31:23Z

pandas/tests/series/test_constructors.py

@@ -418,8 +418,8 @@ def test_constructor_numpy_scalar(self):
        # GH 19342
        # construction with a numpy scalar
        # should not raise
-        result = Series(np.array(100), index=np.arange(4))
-        expected = Series(100, index=np.arange(4))
+        result = Series(np.array(100), index=np.arange(4), dtype='int64')


ahh, ok thanks

jreback · 2018-02-25T21:10:45Z

pandas/core/series.py

+                # a scalar numpy array is list-like but doesn't
+                # have a proper length
+                try:
+                    if len(data) > 1 and len(index) != len(data):


you would have to show the test which fails for this, len(data) == 0 is valid

jreback · 2018-02-26T12:28:23Z

pandas/core/series.py

+                # a scalar numpy array is list-like but doesn't
+                # have a proper length
+                try:
+                    if len(data) != 1 and len(index) != len(data):


still not convinced about this, what fails for len(data)

If I remove len(data) != 1 and run python -m pytest pandas/tests/io/formats/test_style.py I get:

    try:
        if len(index) != len(data):
            raise ValueError(
                'Length of passed values is {val}, '
                'index implies {ind}'
>               .format(val=len(data), ind=len(index)))
E       ValueError: ('Length of passed values is 1, index implies 2', 'occurred at index A')

pandas/core/series.py:246: ValueError

I should add a test for this case in test_constructors.

OK, add a test in test_constructors. Which test is this failing on in test_style?

test_apply_subset. Input to the Series constructor is:

data = ['color: baz'], index = RangeIndex(start=0, stop=2, step=1), dtype = None, name = 'A', copy = False, fastpath = False

So this should raise! If the data is a scalar this is ok, but we can't broadcast a list like that (well, we can, but we shouldn't).

In [2]: data = ['color: baz']
   ...: index = pd.RangeIndex(start=0, stop=2, step=1)
   ...:

In [3]: pd.Series(data, index)
Out[3]:
0    color: baz
1    color: baz
dtype: object

Fixes Series constructor so that ValueError is raised when a Categorical and index of different length are given.

cbertinato · 2018-02-26T13:56:41Z

I agree. It shouldn’t broadcast a list like that. We can remove the check and see if there’s anywhere else where this breaks. If not, then fix the test in test_style?

cbertinato · 2018-02-26T13:57:34Z

I wasn’t sure whether anything else relied on this behavior.

jreback · 2018-02-26T14:17:21Z

I agree. It shouldn’t broadcast a list like that. We can remove the check and see if there’s anywhere else where this breaks. If not, then fix the test in test_style?

yes, and add this as an additional test in test_constructor.
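
A sketch of the outcome agreed on here, again in plain Python with hypothetical names (`validate` is not a pandas function). Once the length-1 exemption is removed, a single-element list with a longer index is an error, while a true scalar would still be broadcast.

```python
def validate(data, index):
    """Hypothetical strict check: any sized input must match the index length."""
    try:
        n = len(data)
    except TypeError:
        return "broadcast"  # true scalars have no length and are broadcast
    if n != len(index):
        raise ValueError(
            "Length of passed values is {val}, index implies {ind}".format(
                val=n, ind=len(index)
            )
        )
    return "ok"


scalar_result = validate(100, range(2))  # scalar: allowed

try:
    validate(["color: baz"], range(2))  # single-element list: now rejected
    message = None
except ValueError as exc:
    message = str(exc)
```

Strings have a length in Python, so a real implementation would need a proper scalar test instead of relying on `len()`; that detail is left out of this sketch.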

Modified test setup in io/formats/test_style.py accordingly

jreback · 2018-02-27T01:13:17Z

thanks @cbertinato sometimes the seemingly small changes are hard!

cbertinato · 2018-02-27T01:45:16Z

Thanks for the help and advice!


TomAugspurger reviewed Feb 15, 2018

cbertinato force-pushed the issue-19342 branch from 1434b63 to c6b2016 on February 15, 2018 15:29

gfyoung added Bug Categorical Categorical Data Type labels Feb 15, 2018

gfyoung reviewed Feb 15, 2018

cbertinato force-pushed the issue-19342 branch 2 times, most recently from 2c351ea to 3bc499d on February 16, 2018 11:54

cbertinato force-pushed the issue-19342 branch from 3bc499d to 28b70b8 on February 16, 2018 11:56

jreback requested changes Feb 18, 2018

cbertinato force-pushed the issue-19342 branch from 28b70b8 to abe385d on February 19, 2018 16:05

TomAugspurger reviewed Feb 20, 2018

cbertinato force-pushed the issue-19342 branch from abe385d to 11522eb on February 20, 2018 14:05

jreback approved these changes Feb 21, 2018

jreback added this to the 0.23.0 milestone Feb 21, 2018

jreback requested changes Feb 22, 2018

cbertinato force-pushed the issue-19342 branch from 11522eb to 98f1f16 on February 22, 2018 13:37

jreback requested changes Feb 23, 2018

cbertinato force-pushed the issue-19342 branch from 98f1f16 to b6df1c8 on February 23, 2018 17:17

jreback approved these changes Feb 24, 2018

jreback requested changes Feb 25, 2018

jreback reviewed Feb 25, 2018

jreback requested changes Feb 25, 2018

jreback requested changes Feb 26, 2018

cbertinato and others added 4 commits February 26, 2018 08:35

BUG: Fix Series constructor for Categorical with index

a47802e

Fixes Series constructor so that ValueError is raised when a Categorical and index of different length are given.

Potential fix for failed tests

e5423b7

revert changes from master

6f65134

accomodate numpy scalar

1297c2b

cbertinato added 4 commits February 26, 2018 08:35

Allow broadcasting of single-element lists

7847923

Fix test for 32-bit environment

bb693c7

Allow list with len 1 to be broadcast

29d9519

Add test for single-element list and index case

e756c7e

cbertinato force-pushed the issue-19342 branch from 4540878 to e756c7e on February 26, 2018 13:36

Disallow broadcasting of single-element lists

f5db9ab

Modified test setup in io/formats/test_style.py accordingly

jreback approved these changes Feb 27, 2018


jreback merged commit e51800b into pandas-dev:master Feb 27, 2018

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018

BUG: Fix Series constructor for Categorical with index (pandas-dev#19714)

5508704

jorisvandenbossche mentioned this pull request Mar 17, 2018

API/REGR: construction of Series with scalar-like / len-1 lists #20391

Closed

toobaz mentioned this pull request Apr 27, 2018

len-1 scalar is accepted as valid input for len >1 Series #18819

Closed
