Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/BUG: Raise when int-dtype coercions fail #21456

Merged
merged 1 commit into from
Jun 20, 2018

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Jun 13, 2018

Related to the Index and Series constructors.

Closes #15832.

cc @ucals (since this is mostly based off what you did in #15859)

@gfyoung gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas labels Jun 13, 2018
@gfyoung gfyoung requested a review from jreback June 13, 2018 06:40
@gfyoung gfyoung force-pushed the int-dtype-coercions branch from e895c12 to 0387ae1 Compare June 13, 2018 07:16
raise OverflowError("The elements provided in the data cannot all be "
"casted to the dtype {dtype}".format(dtype=dtype))

if np.array_equal(arr, casted):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy-self-warning

Since this is in part due to numpy's self-conflict regarding element-wise comparisons, should we care about this showing up (perhaps we swallow the warning)?

@pytest.mark.parametrize("int_dtype", ["uint8", "uint16", "uint32",
"uint64", "int32", "int64",
"int16", "int8"])
@pytest.mark.parametrize("float_dtype", ["float16", "float32", "float64"])
Copy link
Member Author

@gfyoung gfyoung Jun 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps these (i.e. int_dtype, float_dtype, uint_dtype below) should be conftest.py fixtures, if there is sufficient use for them?

Copy link
Member Author

@gfyoung gfyoung Jun 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add them as fixtures to this PR and then can see after merging what can be done with them for other tests in the codebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : BTW, I didn't realize that the comment in your "review" was referencing the one in this diff (I thought I was deleting a duplicate comment). Oops.

@gfyoung gfyoung added this to the 0.24.0 milestone Jun 13, 2018
@gfyoung gfyoung force-pushed the int-dtype-coercions branch from 0387ae1 to 3a65ee9 Compare June 13, 2018 08:09
@@ -36,6 +36,7 @@ Datetimelike API Changes
Other API Changes
^^^^^^^^^^^^^^^^^

- Series and Index constructors now raise when the data is incompatible with the specified dtype (:issue:`15832`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double backticks around Series/Index

say with a passed dtype=

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

incompatible with integer/unsigned integer dtypes.

.. versionadded:: 0.24.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this duplicates maybe_downcast_to_dtype which is used internally, rather have the doc-string of that updated / examples (and can add the copy=)

Copy link
Member Author

@gfyoung gfyoung Jun 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. We're not always down-casting e.g.

Series(np.array([1, 2], dtype="int32"), dtype="int64")

Silly? Yes, but it should work.

@codecov
Copy link

codecov bot commented Jun 13, 2018

Codecov Report

Merging #21456 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21456      +/-   ##
==========================================
+ Coverage   91.92%   91.92%   +<.01%     
==========================================
  Files         153      153              
  Lines       49570    49586      +16     
==========================================
+ Hits        45566    45582      +16     
  Misses       4004     4004
Flag Coverage Δ
#multiple 90.32% <100%> (ø) ⬆️
#single 41.81% <66.66%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/base.py 96.62% <100%> (ø) ⬆️
pandas/core/dtypes/cast.py 88.49% <100%> (+0.26%) ⬆️
pandas/core/series.py 94.19% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2da06c...27caec9. Read the comment docs.

@pandas-dev pandas-dev deleted a comment from jreback Jun 13, 2018
@gfyoung gfyoung force-pushed the int-dtype-coercions branch from 3a65ee9 to 7d5c732 Compare June 13, 2018 21:36


@pytest.fixture(params=["float16", "float32", "float64"])
def float_dtype(request):
Copy link
Member

@jschendel jschendel Jun 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also asked to add these fixtures in #21432. I think it might be best to merge #21432 first to make backporting easier since it's tagged for 0.23.2? Not sure if it really matters though.

Your code looks 95% identical to mine, so I'll just modify mine to match yours prior to pushing so there shouldn't be any significant conflicts regardless of which PR gets merged first. The only potential sources of differences:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschendel : I'll address all of your questions:

  • The merging doesn't particularly matter at this point. I would hold on modifying your fixtures for the moment. Let's ask @jreback in your PR.

  • float16, ah that's true. I'll remove that.

  • I generally prefer a newline after the docstring. Facilitates reading IMO.

return request.param


@pytest.fixture(params=SIGNED_INT_DTYPES)
Copy link
Member

@jschendel jschendel Jun 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

params should be ALL_INT_DTYPES

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, should we rename either the list or the fixture to make them consistent ("all" vs. "any")?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Oops, good catch 😄

  • So this was a decision in semantics. The list is "all" of the integer dtypes, hence the name. However, all_int_dtype as a parameter to the test makes less sense vs. any_int_dtype.

# see gh-15832
msg = "Trying to coerce float values to integers"
with tm.assert_raises_regex(ValueError, msg):
Series([1, 2, 3.5], dtype=any_int_dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is going to cartesian product these, but these look like independent cases, can you split to 2 tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# see gh-15832
msg = "Trying to coerce float values to integers"
with tm.assert_raises_regex(ValueError, msg):
Index([1, 2, 3.5], dtype=any_int_dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, can you split to 2 tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

msg = "could not convert string to float"
with tm.assert_raises_regex(ValueError, msg):
Series(["a", "b", "c"], dtype=float)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have these tests for Index as well (I think we do)?

Copy link
Member Author

@gfyoung gfyoung Jun 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮 : That's a bug!

>>> Index(["a", "b", "c"], dtype=float)
Index([["a", "b", "c"], dtype=object)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #21311

Copy link
Member Author

@gfyoung gfyoung Jun 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschendel @jreback : I'll add an xfail test for this.

@gfyoung gfyoung force-pushed the int-dtype-coercions branch from 5568230 to b85efa7 Compare June 15, 2018 00:43
@jreback
Copy link
Contributor

jreback commented Jun 15, 2018

@gfyoung if you can rebase your fixures should cleanly merge

@gfyoung gfyoung force-pushed the int-dtype-coercions branch 2 times, most recently from 7185086 to 390d914 Compare June 15, 2018 23:54
@gfyoung
Copy link
Member Author

gfyoung commented Jun 16, 2018

@jreback : Tests are finally green again! PTAL.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question, also I think its worthwhile to make a sub-section in the whatsnew with a mini-example

elif inferred in ['floating', 'mixed-integer-float']:
if isna(data).any():
raise ValueError('cannot convert float '
'NaN to integer')

if inferred == "mixed-integer-float":
maybe_cast_to_integer_array(data, dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need data = here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Also added mini-section to whatsnew.

@gfyoung gfyoung force-pushed the int-dtype-coercions branch from 390d914 to 2be269b Compare June 18, 2018 23:41
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you run some constructor asv's to see if any effect.

#
# We didn't do this earlier because NumPy
# doesn't handle `uint64` correctly.
arr = np.asarray(arr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need this? at this point its either an ndarray, Series, Index, which is all ok here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. list can also propagate here. This was prompted by #21432, which introduced an annoying but necessary corner case with uint64.

@@ -4068,6 +4069,9 @@ def _try_cast(arr, take_fast_path):
return arr

try:
if is_float_dtype(dtype) or is_integer_dtype(dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

@gfyoung gfyoung force-pushed the int-dtype-coercions branch from 2be269b to b3d0b4e Compare June 19, 2018 00:30
@gfyoung
Copy link
Member Author

gfyoung commented Jun 19, 2018

@jreback : Tests are green again. PTAL.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment. perf check?

@@ -71,6 +71,32 @@ Datetimelike API Changes
Other API Changes
^^^^^^^^^^^^^^^^^

.. _whatsnew_0240.api.other.incompatibilities
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a colon

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Fixed.

* Related to the Index and Series constructors.

Closes pandas-devgh-15832.

* Add integer dtype fixtures to conftest.py

Can used for subsequent refactoring.
@gfyoung gfyoung force-pushed the int-dtype-coercions branch from b3d0b4e to 27caec9 Compare June 19, 2018 17:01
@gfyoung
Copy link
Member Author

gfyoung commented Jun 19, 2018

perf check?

@jreback : Oh sorry, forgot to respond on that. Perf looks good!

@gfyoung
Copy link
Member Author

gfyoung commented Jun 20, 2018

@jreback : Comments addressed, perf looks good, and CI is green. PTAL.

@jreback jreback merged commit b36b451 into pandas-dev:master Jun 20, 2018
@jreback
Copy link
Contributor

jreback commented Jun 20, 2018

thanks @gfyoung nice patch!


if is_unsigned_integer_dtype(dtype) and (arr < 0).any():
raise OverflowError("Trying to coerce negative values "
"to unsigned integers")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung should maybe_cast_to_integer_array also check for overflows like

arr = np.array([1, 200, 923442])
dtype = np.dtype(np.int8)

?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants