BUG: Fix pd.NA `na_rep` truncated in to_csv #30146

jbman223 · 2019-12-08T21:30:33Z

closes pd.NA na_rep truncated in to_csv #29975
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

brajiang

Very clean fix. The comments are helpful in showing what exactly is being tested and demonstrating that the issue has been resolved.

doc/source/whatsnew/v1.0.0.rst

pandas/core/internals/blocks.py

pep8speaks · 2019-12-09T13:23:30Z

Hello @jbman223! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-01 17:41:44 UTC

doc/source/whatsnew/v1.0.0.rst

jreback · 2019-12-10T12:57:03Z

pandas/core/internals/blocks.py

-            values = values.astype("<U{size}".format(size=itemsize))
+            try:
+                itemsize = writers.word_len(na_rep)
+                values = values.astype("<U{size}".format(size=itemsize))


huh? what raises here

This one was puzzling to me.

list/test_list.py::test_to_csv FAILED [100%] pandas/tests/extension/list/test_list.py:23 (test_to_csv) self = ExtensionBlock: slice(0, 1, 1), 1 x 100, dtype: list slicer = slice(0, 100, None), na_rep = '', quoting = 0 kwargs = {'date_format': None, 'decimal': '.', 'float_format': None} values = <ListArray> [ ['Z', 'r', 'y', 'P', 'D', 'p'], ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'], ... ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'], ['l']] Length: 100, dtype: list mask = array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False,...False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]) def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs): """override to use ExtensionArray astype for the conversion""" values = self.values if slicer is not None: values = values[slicer] mask = isna(values) try: > values[mask] = na_rep ../../core/internals/blocks.py:1782: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <ListArray> [ ['Z', 'r', 'y', 'P', 'D', 'p'], ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'], ... ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'], ['l']] Length: 100, dtype: list key = array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False,...False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]) value = '' def __setitem__(self, key: Union[int, np.ndarray], value: Any) -> None: """ Set one or more values inplace. This method is not required to satisfy the pandas extension array interface. Parameters ---------- key : int, ndarray, or slice When called from, e.g. ``Series.__setitem__``, ``key`` will be one of * scalar int * ndarray of integers. * boolean ndarray * slice object value : ExtensionDtype.type, Sequence[ExtensionDtype.type], or object value or values to be set of ``key``. Returns ------- None """ # Some notes to the ExtensionArray implementor who may have ended up # here. While this method is not required for the interface, if you # *do* choose to implement __setitem__, then some semantics should be # observed: # # * Setting multiple values : ExtensionArrays should support setting # multiple values at once, 'key' will be a sequence of integers and # 'value' will be a same-length sequence. # # * Broadcasting : For a sequence 'key' and a scalar 'value', # each position in 'key' should be set to 'value'. # # * Coercion : Most users will expect basic coercion to work. For # example, a string like '2018-01-01' is coerced to a datetime # when setting on a datetime64ns array. In general, if the # __init__ method coerces that value, then so should __setitem__ # Note, also, that Series/DataFrame.where internally use __setitem__ # on a copy of the data. raise NotImplementedError( > _not_implemented_message.format(type(self), "__setitem__") ) E NotImplementedError: <class 'pandas.tests.extension.list.array.ListArray'> does not implement __setitem__. ../../core/arrays/base.py:334: NotImplementedError During handling of the above exception, another exception occurred: data = <ListArray> [ ['Z', 'r', 'y', 'P', 'D', 'p'], ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'], ... ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'], ['l']] Length: 100, dtype: list def test_to_csv(data): # https://github.com/pandas-dev/pandas/issues/28840 # array with list-likes fail when doing astype(str) on the numpy array # which was done in to_native_types df = pd.DataFrame({"a": data}) > res = df.to_csv() list/test_list.py:29: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ../../core/generic.py:3242: in to_csv formatter.save() ../../io/formats/csvs.py:204: in save self._save() ../../io/formats/csvs.py:328: in _save self._save_chunk(start_i, end_i) ../../io/formats/csvs.py:344: in _save_chunk quoting=self.quoting, ../../core/internals/blocks.py:1785: in to_native_types return super().to_native_types(slicer, na_rep, quoting, **kwargs) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = ExtensionBlock: slice(0, 1, 1), 1 x 100, dtype: list slicer = slice(0, 100, None), na_rep = '', quoting = 0 kwargs = {'date_format': None, 'decimal': '.', 'float_format': None} values = array([[list(['Z', 'r', 'y', 'P', 'D', 'p']), list(['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b']), ...c']), list(['L', 'l', 'M', 'c']), list(['Q', 'p', 'D', 'L', 'z', 'Z', 'i']), list(['l'])]], dtype=object) mask = array([[False, False, False, False, False, False, False, False, False, False, False, False, False, False, Fals...se, False, False, False, False, False, False, False, False, False, False, False, False, False, False]]) itemsize = 0 def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs): """ convert to our native types format, slicing if desired """ values = self.get_values() if slicer is not None: values = values[:, slicer] mask = isna(values) if not self.is_object and not quoting: # try: # itemsize = writers.word_len(na_rep) # values = values.astype("<U{size}".format(size=itemsize)) # except Exception: # values = np.array(values, dtype="object") itemsize = writers.word_len(na_rep) > values = values.astype("<U{size}".format(size=itemsize)) E ValueError: setting an array element with a sequence ../../core/internals/blocks.py:666: ValueError Assertion failed Assertion failed Assertion failed Assertion failed

It seems that when you don't format the array to a string first, the masking causes exceptional behavior, which then fails when casting the array to "<U{size}", but honestly I'm not totally sure.

You can likely skip that test for ListArray. That would be preferable to changing this.

So @TomAugspurger, just to clarify, you think it would be better for me to revert the code change in pandas/core/internals/blocks.py and remove the failing test?

TomAugspurger · 2019-12-10T20:06:42Z

Revert the change in internals, and skip or xfail the test. There are probably other examples in that file for how to do it.

…

On Tue, Dec 10, 2019 at 1:59 PM Jacob Buckheit ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In pandas/core/internals/blocks.py <#30146 (comment)>: > @@ -657,8 +657,11 @@ def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs): mask = isna(values) if not self.is_object and not quoting: - itemsize = writers.word_len(na_rep) - values = values.astype("<U{size}".format(size=itemsize)) + try: + itemsize = writers.word_len(na_rep) + values = values.astype("<U{size}".format(size=itemsize)) So @TomAugspurger <https://github.com/TomAugspurger>, just to clarify, you think it would be better for me to revert the code change in pandas/core/internals/blocks.py and remove the failing test? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#30146?email_source=notifications&email_token=AAKAOITDH4PFBC4EWYUK7QDQX7YK3A5CNFSM4JYAUKPKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCOWQQDQ#discussion_r356248230>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAKAOITAFEAHCRXM5PSCJTLQX7YK3ANCNFSM4JYAUKPA> .

jbman223 · 2019-12-10T20:10:53Z

Revert the change in internals, and skip or xfail the test. There are probably other examples in that file for how to do it.
…
On Tue, Dec 10, 2019 at 1:59 PM Jacob Buckheit @.> wrote: @.* commented on this pull request. ------------------------------ In pandas/core/internals/blocks.py <#30146 (comment)>: > @@ -657,8 +657,11 @@ def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs): mask = isna(values) if not self.is_object and not quoting: - itemsize = writers.word_len(na_rep) - values = values.astype("<U{size}".format(size=itemsize)) + try: + itemsize = writers.word_len(na_rep) + values = values.astype("<U{size}".format(size=itemsize)) So @TomAugspurger https://github.com/TomAugspurger, just to clarify, you think it would be better for me to revert the code change in pandas/core/internals/blocks.py and remove the failing test? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#30146?email_source=notifications&email_token=AAKAOITDH4PFBC4EWYUK7QDQX7YK3A5CNFSM4JYAUKPKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCOWQQDQ#discussion_r356248230>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOITAFEAHCRXM5PSCJTLQX7YK3ANCNFSM4JYAUKPA .

what would you recommend as a failure reason for the skip or xfails?

jreback · 2019-12-11T13:12:16Z

pandas/tests/extension/list/test_list.py

@@ -21,6 +21,7 @@ def data():
    return ListArray(data)


+@pytest.mark.skip(reason="fails with update to na_rep")


why are you skipping? if you don't want to address this now, then need to xfail this and create an issue (otherwise this will likely never be seen again).

what exactly is failing here?

@jbman223 can you respond here?

Hey, I think the test should probably xfail. To be honest, I'm not totally sure what is failing. To me, it looks like you have to cast the array to a string array first to avoid the test failing with an error, but when you cast first, you see the strange behavior of the NA_REP being truncated.

Can you post the failure when this isn't xfailed?

I don't immediately see a reason why it wouldn't work.

@TomAugspurger Here is the failure:

list/test_list.py::test_to_csv FAILED [100%] pandas/tests/extension/list/test_list.py:23 (test_to_csv) self = ExtensionBlock: slice(0, 1, 1), 1 x 100, dtype: list slicer = slice(0, 100, None), na_rep = '', quoting = 0 kwargs = {'date_format': None, 'decimal': '.', 'float_format': None} values = <ListArray> [ ['Z', 'r', 'y', 'P', 'D', 'p'], ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'], ... ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'], ['l']] Length: 100, dtype: list mask = array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False,...False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]) def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs): """override to use ExtensionArray astype for the conversion""" values = self.values if slicer is not None: values = values[slicer] mask = isna(values) try: > values[mask] = na_rep ../../core/internals/blocks.py:1782: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <ListArray> [ ['Z', 'r', 'y', 'P', 'D', 'p'], ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'], ... ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'], ['l']] Length: 100, dtype: list key = array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False,...False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]) value = '' def __setitem__(self, key: Union[int, np.ndarray], value: Any) -> None: """ Set one or more values inplace. This method is not required to satisfy the pandas extension array interface. Parameters ---------- key : int, ndarray, or slice When called from, e.g. ``Series.__setitem__``, ``key`` will be one of * scalar int * ndarray of integers. * boolean ndarray * slice object value : ExtensionDtype.type, Sequence[ExtensionDtype.type], or object value or values to be set of ``key``. Returns ------- None """ # Some notes to the ExtensionArray implementor who may have ended up # here. While this method is not required for the interface, if you # *do* choose to implement __setitem__, then some semantics should be # observed: # # * Setting multiple values : ExtensionArrays should support setting # multiple values at once, 'key' will be a sequence of integers and # 'value' will be a same-length sequence. # # * Broadcasting : For a sequence 'key' and a scalar 'value', # each position in 'key' should be set to 'value'. # # * Coercion : Most users will expect basic coercion to work. For # example, a string like '2018-01-01' is coerced to a datetime # when setting on a datetime64ns array. In general, if the # __init__ method coerces that value, then so should __setitem__ # Note, also, that Series/DataFrame.where internally use __setitem__ # on a copy of the data. raise NotImplementedError( > _not_implemented_message.format(type(self), "__setitem__") ) E NotImplementedError: <class 'pandas.tests.extension.list.array.ListArray'> does not implement __setitem__. ../../core/arrays/base.py:334: NotImplementedError During handling of the above exception, another exception occurred: data = <ListArray> [ ['Z', 'r', 'y', 'P', 'D', 'p'], ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'], ... ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'], ['l']] Length: 100, dtype: list def test_to_csv(data): # https://github.com/pandas-dev/pandas/issues/28840 # array with list-likes fail when doing astype(str) on the numpy array # which was done in to_native_types df = pd.DataFrame({"a": data}) > res = df.to_csv() list/test_list.py:29: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ../../core/generic.py:3242: in to_csv formatter.save() ../../io/formats/csvs.py:204: in save self._save() ../../io/formats/csvs.py:328: in _save self._save_chunk(start_i, end_i) ../../io/formats/csvs.py:344: in _save_chunk quoting=self.quoting, ../../core/internals/blocks.py:1785: in to_native_types return super().to_native_types(slicer, na_rep, quoting, **kwargs) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = ExtensionBlock: slice(0, 1, 1), 1 x 100, dtype: list slicer = slice(0, 100, None), na_rep = '', quoting = 0 kwargs = {'date_format': None, 'decimal': '.', 'float_format': None} values = array([[list(['Z', 'r', 'y', 'P', 'D', 'p']), list(['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b']), ...c']), list(['L', 'l', 'M', 'c']), list(['Q', 'p', 'D', 'L', 'z', 'Z', 'i']), list(['l'])]], dtype=object) mask = array([[False, False, False, False, False, False, False, False, False, False, False, False, False, False, Fals...se, False, False, False, False, False, False, False, False, False, False, False, False, False, False]]) itemsize = 0 def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs): """ convert to our native types format, slicing if desired """ values = self.get_values() if slicer is not None: values = values[:, slicer] mask = isna(values) if not self.is_object and not quoting: # try: # itemsize = writers.word_len(na_rep) # values = values.astype("<U{size}".format(size=itemsize)) # except Exception: # values = np.array(values, dtype="object") itemsize = writers.word_len(na_rep) > values = values.astype("<U{size}".format(size=itemsize)) E ValueError: setting an array element with a sequence ../../core/internals/blocks.py:666: ValueError Assertion failed Assertion failed Assertion failed Assertion failed

Thanks.

So ListArray doesn't implement __setitem__, which means that values[mask] = na_rep can't possibly work.

This seems to indicate that we should convert the extension array to a type that can be set before doing the setitem. Does that make sense?

I think that makes sense, which is how the code was originally. I thought this line (pandas/core/internals/blocks.py, 1777):

values = values.astype(str)

Would be the line that converts it to an array type that implements the __setitem__ method. However, when this line works, the NA_REP is truncated

jreback · 2019-12-27T19:57:25Z

can you merge master and will look again

jreback · 2020-01-01T18:18:13Z

thanks @jbman223

brajiang approved these changes Dec 8, 2019

View reviewed changes

jreback requested changes Dec 8, 2019

View reviewed changes

doc/source/whatsnew/v1.0.0.rst Show resolved Hide resolved

pandas/core/internals/blocks.py Outdated Show resolved Hide resolved

jschendel added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Output-Formatting __repr__ of pandas objects, to_string labels Dec 9, 2019

jschendel added this to the 1.0 milestone Dec 9, 2019

jbman223 changed the title ~~Fix pd.NA na_rep truncated in to_csv~~ BUG: Fix pd.NA na_rep truncated in to_csv Dec 9, 2019

jbman223 requested a review from jreback December 9, 2019 20:57

jreback requested changes Dec 10, 2019

View reviewed changes

jreback requested changes Dec 11, 2019

View reviewed changes

jreback removed this from the 1.0 milestone Dec 11, 2019

jorisvandenbossche added this to the 1.0 milestone Dec 11, 2019

jbman223 and others added 8 commits January 1, 2020 12:37

Fix issue with to_csv na_rep when dtype=string

e75049f

Add what's new entry

397f361

Fix failing test

1a49b7e

Fix tests to respect system line endings

a5a96c2

Move issue to line 59 in whats new docs

1b9a5b5

Revert changes to internals. Add skip to failing test

b96e12b

fixup

8c31e7f

remove xfail

3310fa5

jreback force-pushed the fix_na_rep_truncated branch from d22b1d2 to 3310fa5 Compare January 1, 2020 17:41

jreback approved these changes Jan 1, 2020

View reviewed changes

jreback merged commit f146632 into pandas-dev:master Jan 1, 2020

hweecat pushed a commit to hweecat/pandas that referenced this pull request Jan 1, 2020

BUG: Fix pd.NA na_rep truncated in to_csv (pandas-dev#30146)

0d13ccd

mdruiter mentioned this pull request Jan 30, 2020

REGR: to_csv(na_rep) truncates values #31447

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix pd.NA `na_rep` truncated in to_csv #30146

BUG: Fix pd.NA `na_rep` truncated in to_csv #30146

jbman223 commented Dec 8, 2019 •

edited

Loading

brajiang left a comment

pep8speaks commented Dec 9, 2019 •

edited

Loading

jreback Dec 10, 2019

jbman223 Dec 10, 2019

TomAugspurger Dec 10, 2019 •

edited

Loading

jbman223 Dec 10, 2019

TomAugspurger commented Dec 10, 2019 via email

jbman223 commented Dec 10, 2019

jreback Dec 11, 2019

TomAugspurger Dec 20, 2019

jbman223 Dec 20, 2019

TomAugspurger Dec 20, 2019

jbman223 Dec 20, 2019

TomAugspurger Dec 20, 2019

jbman223 Dec 20, 2019

jreback commented Dec 27, 2019

jreback commented Jan 1, 2020

		@@ -21,6 +21,7 @@ def data():
		return ListArray(data)


		@pytest.mark.skip(reason="fails with update to na_rep")

BUG: Fix pd.NA na_rep truncated in to_csv #30146

BUG: Fix pd.NA na_rep truncated in to_csv #30146

Conversation

jbman223 commented Dec 8, 2019 • edited Loading

brajiang left a comment

Choose a reason for hiding this comment

pep8speaks commented Dec 9, 2019 • edited Loading

Comment last updated at 2020-01-01 17:41:44 UTC

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger Dec 10, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Dec 10, 2019 via email

jbman223 commented Dec 10, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Dec 27, 2019

jreback commented Jan 1, 2020

BUG: Fix pd.NA `na_rep` truncated in to_csv #30146

BUG: Fix pd.NA `na_rep` truncated in to_csv #30146

jbman223 commented Dec 8, 2019 •

edited

Loading

pep8speaks commented Dec 9, 2019 •

edited

Loading

TomAugspurger Dec 10, 2019 •

edited

Loading