Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix pd.NA na_rep truncated in to_csv #30146

Merged
merged 8 commits into from
Jan 1, 2020

Conversation

jbman223
Copy link
Contributor

@jbman223 jbman223 commented Dec 8, 2019

Copy link

@brajiang brajiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean fix. The comments are helpful in showing what exactly is being tested and demonstrating that the issue has been resolved.

doc/source/whatsnew/v1.0.0.rst Show resolved Hide resolved
pandas/core/internals/blocks.py Outdated Show resolved Hide resolved
@jschendel jschendel added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Output-Formatting __repr__ of pandas objects, to_string labels Dec 9, 2019
@jschendel jschendel added this to the 1.0 milestone Dec 9, 2019
@pep8speaks
Copy link

pep8speaks commented Dec 9, 2019

Hello @jbman223! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-01 17:41:44 UTC

@jbman223 jbman223 changed the title Fix pd.NA na_rep truncated in to_csv BUG: Fix pd.NA na_rep truncated in to_csv Dec 9, 2019
@jbman223 jbman223 requested a review from jreback December 9, 2019 20:57
doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
values = values.astype("<U{size}".format(size=itemsize))
try:
itemsize = writers.word_len(na_rep)
values = values.astype("<U{size}".format(size=itemsize))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? what raises here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was puzzling to me.

list/test_list.py::test_to_csv FAILED                                    [100%]
pandas/tests/extension/list/test_list.py:23 (test_to_csv)
self = ExtensionBlock: slice(0, 1, 1), 1 x 100, dtype: list
slicer = slice(0, 100, None), na_rep = '', quoting = 0
kwargs = {'date_format': None, 'decimal': '.', 'float_format': None}
values = <ListArray>
[                    ['Z', 'r', 'y', 'P', 'D', 'p'],
 ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'],
...     ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'],
                                              ['l']]
Length: 100, dtype: list
mask = array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,...False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])

    def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs):
        """override to use ExtensionArray astype for the conversion"""
        values = self.values
        if slicer is not None:
            values = values[slicer]
        mask = isna(values)
    
        try:
>           values[mask] = na_rep

../../core/internals/blocks.py:1782: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ListArray>
[                    ['Z', 'r', 'y', 'P', 'D', 'p'],
 ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'],
...     ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'],
                                              ['l']]
Length: 100, dtype: list
key = array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,...False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])
value = ''

    def __setitem__(self, key: Union[int, np.ndarray], value: Any) -> None:
        """
        Set one or more values inplace.
    
        This method is not required to satisfy the pandas extension array
        interface.
    
        Parameters
        ----------
        key : int, ndarray, or slice
            When called from, e.g. ``Series.__setitem__``, ``key`` will be
            one of
    
            * scalar int
            * ndarray of integers.
            * boolean ndarray
            * slice object
    
        value : ExtensionDtype.type, Sequence[ExtensionDtype.type], or object
            value or values to be set of ``key``.
    
        Returns
        -------
        None
        """
        # Some notes to the ExtensionArray implementor who may have ended up
        # here. While this method is not required for the interface, if you
        # *do* choose to implement __setitem__, then some semantics should be
        # observed:
        #
        # * Setting multiple values : ExtensionArrays should support setting
        #   multiple values at once, 'key' will be a sequence of integers and
        #  'value' will be a same-length sequence.
        #
        # * Broadcasting : For a sequence 'key' and a scalar 'value',
        #   each position in 'key' should be set to 'value'.
        #
        # * Coercion : Most users will expect basic coercion to work. For
        #   example, a string like '2018-01-01' is coerced to a datetime
        #   when setting on a datetime64ns array. In general, if the
        #   __init__ method coerces that value, then so should __setitem__
        # Note, also, that Series/DataFrame.where internally use __setitem__
        # on a copy of the data.
        raise NotImplementedError(
>           _not_implemented_message.format(type(self), "__setitem__")
        )
E       NotImplementedError: <class 'pandas.tests.extension.list.array.ListArray'> does not implement __setitem__.

../../core/arrays/base.py:334: NotImplementedError

During handling of the above exception, another exception occurred:

data = <ListArray>
[                    ['Z', 'r', 'y', 'P', 'D', 'p'],
 ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'],
...     ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'],
                                              ['l']]
Length: 100, dtype: list

    def test_to_csv(data):
        # https://github.com/pandas-dev/pandas/issues/28840
        # array with list-likes fail when doing astype(str) on the numpy array
        # which was done in to_native_types
        df = pd.DataFrame({"a": data})
>       res = df.to_csv()

list/test_list.py:29: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../core/generic.py:3242: in to_csv
    formatter.save()
../../io/formats/csvs.py:204: in save
    self._save()
../../io/formats/csvs.py:328: in _save
    self._save_chunk(start_i, end_i)
../../io/formats/csvs.py:344: in _save_chunk
    quoting=self.quoting,
../../core/internals/blocks.py:1785: in to_native_types
    return super().to_native_types(slicer, na_rep, quoting, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = ExtensionBlock: slice(0, 1, 1), 1 x 100, dtype: list
slicer = slice(0, 100, None), na_rep = '', quoting = 0
kwargs = {'date_format': None, 'decimal': '.', 'float_format': None}
values = array([[list(['Z', 'r', 'y', 'P', 'D', 'p']),
        list(['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b']),
      ...c']), list(['L', 'l', 'M', 'c']),
        list(['Q', 'p', 'D', 'L', 'z', 'Z', 'i']), list(['l'])]],
      dtype=object)
mask = array([[False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, Fals...se, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False]])
itemsize = 0

    def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs):
        """ convert to our native types format, slicing if desired """
        values = self.get_values()
    
        if slicer is not None:
            values = values[:, slicer]
        mask = isna(values)
    
        if not self.is_object and not quoting:
            # try:
            #     itemsize = writers.word_len(na_rep)
            #     values = values.astype("<U{size}".format(size=itemsize))
            # except Exception:
            #     values = np.array(values, dtype="object")
            itemsize = writers.word_len(na_rep)
>           values = values.astype("<U{size}".format(size=itemsize))
E           ValueError: setting an array element with a sequence

../../core/internals/blocks.py:666: ValueError

Assertion failed

Assertion failed

Assertion failed

Assertion failed

It seems that when you don't format the array to a string first, the masking causes exceptional behavior, which then fails when casting the array to "<U{size}", but honestly I'm not totally sure.

Copy link
Contributor

@TomAugspurger TomAugspurger Dec 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can likely skip that test for ListArray. That would be preferable to changing this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So @TomAugspurger, just to clarify, you think it would be better for me to revert the code change in pandas/core/internals/blocks.py and remove the failing test?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 10, 2019 via email

@jbman223
Copy link
Contributor Author

Revert the change in internals, and skip or xfail the test. There are probably other examples in that file for how to do it.

On Tue, Dec 10, 2019 at 1:59 PM Jacob Buckheit @.> wrote: @.* commented on this pull request. ------------------------------ In pandas/core/internals/blocks.py <#30146 (comment)>: > @@ -657,8 +657,11 @@ def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs): mask = isna(values) if not self.is_object and not quoting: - itemsize = writers.word_len(na_rep) - values = values.astype("<U{size}".format(size=itemsize)) + try: + itemsize = writers.word_len(na_rep) + values = values.astype("<U{size}".format(size=itemsize)) So @TomAugspurger https://github.com/TomAugspurger, just to clarify, you think it would be better for me to revert the code change in pandas/core/internals/blocks.py and remove the failing test? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#30146?email_source=notifications&email_token=AAKAOITDH4PFBC4EWYUK7QDQX7YK3A5CNFSM4JYAUKPKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCOWQQDQ#discussion_r356248230>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOITAFEAHCRXM5PSCJTLQX7YK3ANCNFSM4JYAUKPA .

what would you recommend as a failure reason for the skip or xfails?

@@ -21,6 +21,7 @@ def data():
return ListArray(data)


@pytest.mark.skip(reason="fails with update to na_rep")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you skipping? if you don't want to address this now, then need to xfail this and create an issue (otherwise this will likely never be seen again).

what exactly is failing here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbman223 can you respond here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I think the test should probably xfail. To be honest, I'm not totally sure what is failing. To me, it looks like you have to cast the array to a string array first to avoid the test failing with an error, but when you cast first, you see the strange behavior of the NA_REP being truncated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you post the failure when this isn't xfailed?

I don't immediately see a reason why it wouldn't work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger Here is the failure:

list/test_list.py::test_to_csv FAILED                                    [100%]
pandas/tests/extension/list/test_list.py:23 (test_to_csv)
self = ExtensionBlock: slice(0, 1, 1), 1 x 100, dtype: list
slicer = slice(0, 100, None), na_rep = '', quoting = 0
kwargs = {'date_format': None, 'decimal': '.', 'float_format': None}
values = <ListArray>
[                    ['Z', 'r', 'y', 'P', 'D', 'p'],
 ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'],
...     ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'],
                                              ['l']]
Length: 100, dtype: list
mask = array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,...False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])

    def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs):
        """override to use ExtensionArray astype for the conversion"""
        values = self.values
        if slicer is not None:
            values = values[slicer]
        mask = isna(values)
    
        try:
>           values[mask] = na_rep

../../core/internals/blocks.py:1782: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ListArray>
[                    ['Z', 'r', 'y', 'P', 'D', 'p'],
 ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'],
...     ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'],
                                              ['l']]
Length: 100, dtype: list
key = array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,...False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])
value = ''

    def __setitem__(self, key: Union[int, np.ndarray], value: Any) -> None:
        """
        Set one or more values inplace.
    
        This method is not required to satisfy the pandas extension array
        interface.
    
        Parameters
        ----------
        key : int, ndarray, or slice
            When called from, e.g. ``Series.__setitem__``, ``key`` will be
            one of
    
            * scalar int
            * ndarray of integers.
            * boolean ndarray
            * slice object
    
        value : ExtensionDtype.type, Sequence[ExtensionDtype.type], or object
            value or values to be set of ``key``.
    
        Returns
        -------
        None
        """
        # Some notes to the ExtensionArray implementor who may have ended up
        # here. While this method is not required for the interface, if you
        # *do* choose to implement __setitem__, then some semantics should be
        # observed:
        #
        # * Setting multiple values : ExtensionArrays should support setting
        #   multiple values at once, 'key' will be a sequence of integers and
        #  'value' will be a same-length sequence.
        #
        # * Broadcasting : For a sequence 'key' and a scalar 'value',
        #   each position in 'key' should be set to 'value'.
        #
        # * Coercion : Most users will expect basic coercion to work. For
        #   example, a string like '2018-01-01' is coerced to a datetime
        #   when setting on a datetime64ns array. In general, if the
        #   __init__ method coerces that value, then so should __setitem__
        # Note, also, that Series/DataFrame.where internally use __setitem__
        # on a copy of the data.
        raise NotImplementedError(
>           _not_implemented_message.format(type(self), "__setitem__")
        )
E       NotImplementedError: <class 'pandas.tests.extension.list.array.ListArray'> does not implement __setitem__.

../../core/arrays/base.py:334: NotImplementedError

During handling of the above exception, another exception occurred:

data = <ListArray>
[                    ['Z', 'r', 'y', 'P', 'D', 'p'],
 ['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b'],
...     ['Q', 'p', 'D', 'L', 'z', 'Z', 'i'],
                                              ['l']]
Length: 100, dtype: list

    def test_to_csv(data):
        # https://github.com/pandas-dev/pandas/issues/28840
        # array with list-likes fail when doing astype(str) on the numpy array
        # which was done in to_native_types
        df = pd.DataFrame({"a": data})
>       res = df.to_csv()

list/test_list.py:29: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../core/generic.py:3242: in to_csv
    formatter.save()
../../io/formats/csvs.py:204: in save
    self._save()
../../io/formats/csvs.py:328: in _save
    self._save_chunk(start_i, end_i)
../../io/formats/csvs.py:344: in _save_chunk
    quoting=self.quoting,
../../core/internals/blocks.py:1785: in to_native_types
    return super().to_native_types(slicer, na_rep, quoting, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = ExtensionBlock: slice(0, 1, 1), 1 x 100, dtype: list
slicer = slice(0, 100, None), na_rep = '', quoting = 0
kwargs = {'date_format': None, 'decimal': '.', 'float_format': None}
values = array([[list(['Z', 'r', 'y', 'P', 'D', 'p']),
        list(['a', 'Q', 'd', 'J', 'X', 'l', 'd', 's', 'q', 'b']),
      ...c']), list(['L', 'l', 'M', 'c']),
        list(['Q', 'p', 'D', 'L', 'z', 'Z', 'i']), list(['l'])]],
      dtype=object)
mask = array([[False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, Fals...se, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False]])
itemsize = 0

    def to_native_types(self, slicer=None, na_rep="nan", quoting=None, **kwargs):
        """ convert to our native types format, slicing if desired """
        values = self.get_values()
    
        if slicer is not None:
            values = values[:, slicer]
        mask = isna(values)
    
        if not self.is_object and not quoting:
            # try:
            #     itemsize = writers.word_len(na_rep)
            #     values = values.astype("<U{size}".format(size=itemsize))
            # except Exception:
            #     values = np.array(values, dtype="object")
            itemsize = writers.word_len(na_rep)
>           values = values.astype("<U{size}".format(size=itemsize))
E           ValueError: setting an array element with a sequence

../../core/internals/blocks.py:666: ValueError

Assertion failed

Assertion failed

Assertion failed

Assertion failed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

So ListArray doesn't implement __setitem__, which means that values[mask] = na_rep can't possibly work.

This seems to indicate that we should convert the extension array to a type that can be set before doing the setitem. Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense, which is how the code was originally. I thought this line (pandas/core/internals/blocks.py, 1777):

         values = values.astype(str) 

Would be the line that converts it to an array type that implements the __setitem__ method. However, when this line works, the NA_REP is truncated

@jreback jreback removed this from the 1.0 milestone Dec 11, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Dec 11, 2019
@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

can you merge master and will look again

@jreback jreback force-pushed the fix_na_rep_truncated branch from d22b1d2 to 3310fa5 Compare January 1, 2020 17:41
@jreback jreback merged commit f146632 into pandas-dev:master Jan 1, 2020
@jreback
Copy link
Contributor

jreback commented Jan 1, 2020

thanks @jbman223

hweecat pushed a commit to hweecat/pandas that referenced this pull request Jan 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.NA na_rep truncated in to_csv
7 participants