
read_excel with dtype=str converts empty cells to np.nan #20429

Closed

Conversation

nikoskaragiannakis
Contributor

Checklist for other PRs (remove this part if you are doing a PR for the pandas documentation sprint):

@nikoskaragiannakis
Contributor Author

No idea what's going on with this conflict. Any help?

@@ -981,6 +981,8 @@ I/O
- :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
- Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in :`read_excel` where it transforms np.nan to 'nan' if dtype=str is chosen. Now keeps np.nan as they are. (:issue:`20377`)
Contributor

Should be :func:`read_excel`

@cbertinato
Contributor

Try rebasing and then resolving the conflict?

Contributor

@jreback left a comment

@gfyoung do we handle this correctly in read_csv itself?

@@ -981,6 +981,8 @@ I/O
- :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
- Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
Contributor

you are including some other changes here, pls rebase on master.

Contributor Author

it's not mine. I deleted it by mistake and added it back.
You can check master here https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v0.23.0.txt#L985
However, even after rebasing, I keep getting this conflict

Contributor

If you rebased off master and resolved the conflicts in the rebase then it should be ok. Did you fetch the current master before rebasing?

Contributor Author

@nikoskaragiannakis Mar 25, 2018

I did now.

@@ -981,6 +981,8 @@ I/O
- :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
- Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in :`read_excel` where it transforms np.nan to 'nan' if dtype=str is chosen. Now keeps np.nan as they are. (:issue:`20377`)
Contributor

use double back-ticks around dtype=str and around np.nan

@jreback added labels: Dtype Conversions (Unexpected or buggy dtype conversions), IO Excel (read_excel, to_excel) on Mar 22, 2018
@gfyoung
Member

gfyoung commented Mar 23, 2018

@gfyoung do we handle this correctly in read_csv itself?

@jreback : unfortunately, no. Too bad I didn't see this PR earlier. I would have actually suggested fixing it in read_csv and then checking whether the bug still persists in read_excel (since they share parsing engines).

dtypes = output[asheetname].dtypes
output[asheetname].replace('nan', np.nan, inplace=True)
output[asheetname] = output[asheetname].astype(dtypes,
                                               copy=False)
Member

@gfyoung Mar 23, 2018

I worry about this patch being a performance hit against read_excel. The Python parser (in io/parsers.py) processes each of the Excel elements before placing it into a DataFrame. I would look there for the fix, since as I mentioned below, this bug impacts other read_* functions.

Contributor

The result from read_csv is not the same as the one from read_excel with dtype=str. In the former case, empties are read in as np.nan, whereas in the latter they are read in as the string 'nan'.
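For illustration, a minimal sketch of the difference (file names are hypothetical; to_excel needs an Excel writer such as openpyxl or xlsxwriter installed):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["x", np.nan, "y"]})
df.to_csv("tmp.csv", index=False)
df.to_excel("tmp.xlsx", index=False)

pd.read_csv("tmp.csv", dtype=str)["a"].tolist()     # ['x', nan, 'y'] with the default C engine
pd.read_excel("tmp.xlsx", dtype=str)["a"].tolist()  # ['x', 'nan', 'y'] before this fix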

Member

True, though that doesn't change my opinion. The problematic part still likely stems from the engine parsing, which would also affect read_csv (the more popular of the two, IMO). Thus, if we can kill two birds with one stone, that would be even better.

@gfyoung changed the title from "read_excel with dtype=str converst empty cells to np.nan" to "read_excel with dtype=str converts empty cells to np.nan" on Mar 23, 2018
@arnau126

arnau126 commented Mar 23, 2018

read_csv only works correctly (empty cells turn to np.nan) if it uses its default engine CParserWrapper.

In read_excel empty cells turn to the string 'nan', because it uses PythonParser. This parser has a function called self._cast_types, which uses astype_nansafe, which uses astype_unicode and astype_str (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L460).

The latter functions are only used by PythonParser, so I think that we should fix them by using checknull, as astype_intsafe does (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L437).
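Roughly, the idea would be to skip the string conversion for missing values, in the spirit of astype_intsafe. A Python-level sketch of what the Cython loop in lib.pyx would do (illustrative only, not the actual .pyx code):

import numpy as np

def astype_str_sketch(arr):
    result = np.empty(len(arr), dtype=object)
    for i, val in enumerate(arr):
        # checknull-style test: leave missing values (None, NaN) untouched
        result[i] = val if val is None or val != val else str(val)
    return result

astype_str_sketch(np.array(["x", np.nan], dtype=object))  # array(['x', nan], dtype=object)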

@jreback
Contributor

jreback commented Mar 23, 2018

this is my point - the fix needs to be in read_csv not here

@arnau126

I think it's not a matter of read_csv or read_excel.
The fix needs to be in PythonParser which can be used in both read_csv and read_excel.

@cbertinato
Contributor

@arnau126 and @gfyoung have hit the nail on the head. The issue is at least in PythonParser, if not deeper. Using the python parser with read_csv produces the same result as @jreback has pointed out. Question is: which is the expected behavior? np.nan or 'nan'?
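For reference, a minimal reproduction along the lines of the observations above (behaviour before the fix):

import io
import pandas as pd

data = io.StringIO("a,b\n1,\n2,x\n")
pd.read_csv(data, dtype=str, engine="python")["b"].tolist()  # ['nan', 'x'], like read_excel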

@gfyoung
Member

gfyoung commented Mar 23, 2018

I would go with empty string as per the issue.

@nikoskaragiannakis
Contributor Author

In the ticket I mentioned this #20377 (comment).
@cbertinato gave an answer there but, since we're discussing it, I'd like some more opinions.

@cbertinato
Contributor

I would go with empty string as per the issue.

Should the C parser also be brought in line?

@nikoskaragiannakis: @gfyoung suggests an empty string. That sounds like a good idea as it is consistent with the originating DataFrame.

@gfyoung
Member

gfyoung commented Mar 24, 2018

Should the C parser also be brought in line?

Absolutely! Both parsers should be patched.

@cbertinato
Contributor

Sorry to have added confusion. I misinterpreted an earlier comment. I believe @jreback’s suggested changes are the best way to go.

@nikoskaragiannakis
Contributor Author

So, what needs to be done here is to make sure that an existing np.nan is not converted to the string 'nan', right? Does this mean that it should remain np.nan?

@pep8speaks

pep8speaks commented Apr 2, 2018

Hello @nikoskaragiannakis! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on April 08, 2018 at 11:56 Hours UTC

@nikoskaragiannakis
Contributor Author

I made the changes, so now both read_csv and read_excel (as illustrated below):

  • turn empty values to np.nan when dtype=str and na_filter=True
  • turn empty values to empty strings when dtype=str and na_filter=False
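A quick illustration of the intended behaviour (sketch; the blank cell in column b stands in for an empty value):

import io
import pandas as pd

data = "a,b\n1,\n2,x\n"

pd.read_csv(io.StringIO(data), dtype=str, na_filter=True)["b"].tolist()   # [nan, 'x']
pd.read_csv(io.StringIO(data), dtype=str, na_filter=False)["b"].tolist()  # ['', 'x']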

I also ran the performance tests for csv, excel, and series:

io.csv
======
[  0.00%] · For pandas commit hash 7d5f6b20:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  3.33%] ··· Running io.csv.ReadCSVCategorical.time_convert_direct                                                                                                                                      44.4±0.2ms
[  6.67%] ··· Running io.csv.ReadCSVCategorical.time_convert_post                                                                                                                                        70.3±0.1ms
[ 10.00%] ··· Running io.csv.ReadCSVComment.time_comment                                                                                                                                                2.01±0.01ms
[ 13.33%] ··· Running io.csv.ReadCSVDInferDatetimeFormat.time_read_csv                                                                                                                              1.15±0.01ms;...
[ 16.67%] ··· Running io.csv.ReadCSVFloatPrecision.time_read_csv                                                                                                                                    1.64±0.01ms;...
[ 20.00%] ··· Running io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine                                                                                                                      1.81±0.01ms;...
[ 23.33%] ··· Running io.csv.ReadCSVParseDates.time_baseline                                                                                                                                               3.15±0ms
[ 26.67%] ··· Running io.csv.ReadCSVParseDates.time_multiple_date                                                                                                                                          3.10±0ms
[ 30.00%] ··· Running io.csv.ReadCSVSkipRows.time_skipprows                                                                                                                                         21.5±0.04ms;...
[ 33.33%] ··· Running io.csv.ReadCSVThousands.time_thousands                                                                                                                                        18.0±0.02ms;...
[ 36.67%] ··· Running io.csv.ReadUint64Integers.time_read_uint64                                                                                                                                        1.13±0.01ms
[ 40.00%] ··· Running io.csv.ReadUint64Integers.time_read_uint64_na_values                                                                                                                              1.22±0.01ms
[ 43.33%] ··· Running io.csv.ReadUint64Integers.time_read_uint64_neg_values                                                                                                                                1.15±0ms
[ 46.67%] ··· Running io.csv.ToCSV.time_frame                                                                                                                                                         156±0.7ms;...
[ 50.00%] ··· Running io.csv.ToCSVDatetime.time_frame_date_formatting                                                                                                                                   11.6±0.05ms
[ 50.00%] · For pandas commit hash bd8a3cff:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..............................................................................
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 53.33%] ··· Running io.csv.ReadCSVCategorical.time_convert_direct                                                                                                                                      43.0±0.2ms
[ 56.67%] ··· Running io.csv.ReadCSVCategorical.time_convert_post                                                                                                                                        66.9±0.2ms
[ 60.00%] ··· Running io.csv.ReadCSVComment.time_comment                                                                                                                                                2.04±0.02ms
[ 63.33%] ··· Running io.csv.ReadCSVDInferDatetimeFormat.time_read_csv                                                                                                                              1.17±0.01ms;...
[ 66.67%] ··· Running io.csv.ReadCSVFloatPrecision.time_read_csv                                                                                                                                    1.69±0.01ms;...
[ 70.00%] ··· Running io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine                                                                                                                      1.84±0.01ms;...
[ 73.33%] ··· Running io.csv.ReadCSVParseDates.time_baseline                                                                                                                                            3.07±0.02ms
[ 76.67%] ··· Running io.csv.ReadCSVParseDates.time_multiple_date                                                                                                                                       3.10±0.02ms
[ 80.00%] ··· Running io.csv.ReadCSVSkipRows.time_skipprows                                                                                                                                         21.4±0.06ms;...
[ 83.33%] ··· Running io.csv.ReadCSVThousands.time_thousands                                                                                                                                        17.9±0.03ms;...
[ 86.67%] ··· Running io.csv.ReadUint64Integers.time_read_uint64                                                                                                                                        1.13±0.07ms
[ 90.00%] ··· Running io.csv.ReadUint64Integers.time_read_uint64_na_values                                                                                                                              1.25±0.02ms
[ 93.33%] ··· Running io.csv.ReadUint64Integers.time_read_uint64_neg_values                                                                                                                             1.17±0.01ms
[ 96.67%] ··· Running io.csv.ToCSV.time_frame                                                                                                                                                         156±0.3ms;...
[100.00%] ··· Running io.csv.ToCSVDatetime.time_frame_date_formatting                                                                                                                                   11.4±0.03ms
BENCHMARKS NOT SIGNIFICANTLY CHANGED.



io.excel
========
[  0.00%] · For pandas commit hash 7d5f6b20:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running io.excel.Excel.time_read_excel                                                                                                                                                  154±0.2ms;...
[ 50.00%] ··· Running io.excel.Excel.time_write_excel                                                                                                                                                     641ms;...
[ 50.00%] · For pandas commit hash bd8a3cff:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· Running io.excel.Excel.time_read_excel                                                                                                                                                  154±0.1ms;...
[100.00%] ··· Running io.excel.Excel.time_write_excel                                                                                                                                                     641ms;...
BENCHMARKS NOT SIGNIFICANTLY CHANGED.



series_methods
==============
[  0.00%] · For pandas commit hash 7d5f6b20:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  5.56%] ··· Running series_methods.Clip.time_clip                                                                                                                                                       125±0.5μs
[ 11.11%] ··· Running series_methods.Dir.time_dir_strings                                                                                                                                               1.74±0.01ms
[ 16.67%] ··· Running series_methods.Dropna.time_dropna                                                                                                                                                882±20μs;...
[ 22.22%] ··· Running series_methods.IsIn.time_isin                                                                                                                                                 1.63±0.02ms;...
[ 27.78%] ··· Running series_methods.Map.time_map                                                                                                                                                       948±6μs;...
[ 33.33%] ··· Running series_methods.NSort.time_nlargest                                                                                                                                             2.78±0.1ms;...
[ 38.89%] ··· Running series_methods.NSort.time_nsmallest                                                                                                                                            1.96±0.3ms;...
[ 44.44%] ··· Running series_methods.SeriesConstructor.time_constructor                                                                                                                                 314±1ms;...
[ 50.00%] ··· Running series_methods.ValueCounts.time_value_counts                                                                                                                                  2.14±0.03ms;...
[ 50.00%] · For pandas commit hash bd8a3cff:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 55.56%] ··· Running series_methods.Clip.time_clip                                                                                                                                                      121±0.05μs
[ 61.11%] ··· Running series_methods.Dir.time_dir_strings                                                                                                                                               1.75±0.01ms
[ 66.67%] ··· Running series_methods.Dropna.time_dropna                                                                                                                                                874±10μs;...
[ 72.22%] ··· Running series_methods.IsIn.time_isin                                                                                                                                                 1.63±0.01ms;...
[ 77.78%] ··· Running series_methods.Map.time_map                                                                                                                                                       950±8μs;...
[ 83.33%] ··· Running series_methods.NSort.time_nlargest                                                                                                                                             2.80±0.1ms;...
[ 88.89%] ··· Running series_methods.NSort.time_nsmallest                                                                                                                                            1.96±0.3ms;...
[ 94.44%] ··· Running series_methods.SeriesConstructor.time_constructor                                                                                                                                 320±4ms;...
[100.00%] ··· Running series_methods.ValueCounts.time_value_counts                                                                                                                                  2.08±0.08ms;...
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Looks like performance is not significantly affected.

I look forward to your feedback.

@codecov

codecov bot commented Apr 2, 2018

Codecov Report

Merging #20429 into master will increase coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #20429      +/-   ##
==========================================
+ Coverage   91.82%   91.84%   +0.02%     
==========================================
  Files         153      153              
  Lines       49256    49257       +1     
==========================================
+ Hits        45229    45242      +13     
+ Misses       4027     4015      -12
Flag Coverage Δ
#multiple 90.23% <100%> (+0.02%) ⬆️
#single 41.91% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/series.py 93.9% <100%> (ø) ⬆️
pandas/plotting/_converter.py 66.81% <0%> (+1.73%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73cb32e...7341cd1. Read the comment docs.

@TomAugspurger
Contributor

Fixed a merge conflict.

Any concerns with this @gfyoung?

@jreback
Contributor

jreback commented Apr 3, 2018

don't merge, Tom - need to look

util.set_value_at_unsafe(
    result,
    i,
    unicode(arr_i) if arr_i is not np.nan else np.nan)
Member

@gfyoung Apr 3, 2018

Interesting spacing...maybe we should do this instead:

uni_arr_i = unicode(arr_i) if arr_i is not np.nan else np.nan
util.set_value_at_unsafe(result, i, uni_arr_i)

Contributor

use np.isnan here

Contributor Author

Using np.isnan here raises:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
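np.isnan only accepts numeric input, so it fails on the string elements this loop iterates over, e.g.:

import numpy as np

np.isnan("x")
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not
# be safely coerced to any supported types according to the casting rule ''safe''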

Contributor

yeah that is not friendly to strings - ok
use checknull (should already be imported from util)

Contributor Author

When I use that, all hell breaks loose. I get errors in tests like this one https://github.com/pandas-dev/pandas/blob/master/pandas/tests/frame/test_dtypes.py#L533

Is it because they use np.NaN? It looks like checknull checks both np.NaN and np.nan, while before the change I used to check only np.nan.
If that's the case, then I have to modify more tests.

Contributor Author

@gfyoung are you sure this indentation is a big problem? Because if I do what you suggest, then how should I declare uni_arr_i (and str_arr_i) in the cdef?
Would it be ok if I changed it to something like

util.set_value_at_unsafe(
    ...
)

(moving the closing bracket to the next line)?

Member

That would work as well.

Contributor

the nans are the same; iow they point to the same object
go ahead and change tests if need be and I will have a look
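A quick check confirms they are literally the same object:

import numpy as np

np.nan is np.NaN  # True - np.NaN (and np.NAN) are aliases for np.nan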

@@ -1098,6 +1098,7 @@ I/O
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in ``usecols`` parameter in :func:`pandas.io.read_csv` and :func:`pandas.io.read_table` where error is not raised correctly when passing a string. (:issue:`20529`)
- Bug in :func:`read_excel` and :func:`read_csv` where missing values turned to ``'nan'`` with ``dtype=str`` and ``na_filter=True``. Now, they turn to ``np.nan``. (:issue `20377`)
Contributor

can you make the last part a bit more clear. These missing values are converted to the string missing indicator, np.nan

@@ -465,7 +465,11 @@ cpdef ndarray[object] astype_unicode(ndarray arr):
for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, unicode(arr[i]))
arr_i = arr[i]
Contributor

is arr_i in the cdef?

Contributor Author

d'oh!

util.set_value_at_unsafe(
    result,
    i,
    unicode(arr_i) if arr_i is not np.nan else np.nan)
Contributor

use np.isnan here

util.set_value_at_unsafe(
    result,
    i,
    str(arr_i) if arr_i is not np.nan else np.nan)
Contributor

same

@@ -149,6 +149,7 @@ def test_astype_str_map(self, dtype, series):
# see gh-4405
result = series.astype(dtype)
expected = series.map(compat.text_type)
expected.replace('nan', np.nan, inplace=True) # see gh-20377
Contributor

don't use inplace

@nikoskaragiannakis
Contributor Author

Back to you @jreback

@@ -1099,6 +1099,7 @@ I/O
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in ``usecols`` parameter in :func:`pandas.io.read_csv` and :func:`pandas.io.read_table` where error is not raised correctly when passing a string. (:issue:`20529`)
- Bug in :func:`HDFStore.keys` when reading a file with a softlink causes exception (:issue:`20523`)
- Bug in :func:`read_excel` and :func:`read_csv` where missing values turned to ``'nan'`` with ``dtype=str`` and ``na_filter=True``. Now, these missing values are converted to the string missing indicator, ``np.nan``. (:issue `20377`)
Contributor

I mean make this a separate sub-section showing the previous and the new behavior

@@ -4153,4 +4153,8 @@ def _try_cast(arr, take_fast_path):
data = np.array(data, dtype=dtype, copy=False)
subarr = np.array(data, dtype=object, copy=copy)

# GH 20377
Contributor

huh? this should not be necessary (not to mention non-performant)

Contributor Author

Well, this fixes an error I got in this test https://github.com/pandas-dev/pandas/blob/master/pandas/tests/frame/test_dtypes.py#L533

This line https://github.com/nikoskaragiannakis/pandas/blob/7341cd17e11461728969afa159250e882b32dee0/pandas/core/series.py#L4154 does the damage by turning np.nan into 'nan'. But this comes directly from numpy, so I cannot change it.

Am I missing something?

Contributor

how's that an error? that test is correct.

Contributor Author

@nikoskaragiannakis May 18, 2018

I think that with the changes in the .pyx files, np.nan stays a float, even if you try to cast it to str.
For example

pd.DataFrame([np.NaN]).astype(str)

leaves np.NaN a float, while before the changes you used to get a 'nan'.

So I have to ask: do we want calls like the one above to turn np.NaN into 'nan' or not?
If not, then we would be inconsistent with this https://github.com/pandas-dev/pandas/pull/20429/files#diff-6e435422c67fa1384140f92110fb69a7R379

Contributor Author

@@ -369,3 +369,27 @@ def test_no_na_filter_on_index(self):
expected = DataFrame({"a": [1, 4], "c": [3, 6]},
index=Index([np.nan, 5.0], name="b"))
tm.assert_frame_equal(out, expected)

def test_na_values_with_dtype_str_and_na_filter_true(self):
Contributor

can you parameterize this on na_filter (you will need to provide the nan_value as well in the parameterize as they are different)
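Something along these lines, perhaps (a sketch with illustrative names and data, not the exact fixture used in the test module):

import io
import numpy as np
import pandas as pd
import pytest
from pandas.util import testing as tm

@pytest.mark.parametrize("na_filter,na_value", [
    (True, np.nan),  # blanks stay missing
    (False, ""),     # blanks become empty strings
])
def test_read_csv_dtype_str_na_filter(na_filter, na_value):
    data = io.StringIO("a,b\n1,\n2,x\n")
    result = pd.read_csv(data, dtype=str, na_filter=na_filter)
    expected = pd.DataFrame({"a": ["1", "2"], "b": [na_value, "x"]})
    tm.assert_frame_equal(result, expected)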

'c': str,
'd': str})

expected['a'] = expected['a'].astype('float64')
Contributor

move this higher (by the expected), you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list

Contributor Author

move this higher (by the expected)

I'm not sure what you mean here.

you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list

First of all, this is copy-paste from the previous test, which was added for #8212

Do you mean to do

expected = DataFrame({'a': Series([1,2,3,4], dtype='float64'),
                      'b': Series([2.5,3.5,4.5,5.5], dtype='float32'),
                      ...})

?
If I use, for example, for the 'c' column: Series([1, 2, 3, 4], dtype=str), then it will give me ['1', '2', '3', '4'] instead of the expected ['001', '002', '003', '004'].

Contributor

yes exactly. If you need things like '001', then just do it that way, e.g. Series(['001', '002'....])

'b': [2.5, 3.5, 4.5, 5.5],
'c': [1, 2, 3, 4],
'd': [1.0, 2.0, np.nan, 4.0]}).reindex(
columns=['a', 'b', 'c', 'd'])
Contributor

just specify columns=list('abcd') rather than reindex
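i.e. something like (mirroring the fragment above):

expected = DataFrame({'a': [1, 2, 3, 4],
                      'b': [2.5, 3.5, 4.5, 5.5],
                      'c': [1, 2, 3, 4],
                      'd': [1.0, 2.0, np.nan, 4.0]},
                     columns=list('abcd'))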


@@ -529,7 +529,7 @@ def test_astype_str(self):
# consistency in astype(str)
for tt in set([str, compat.text_type]):
result = DataFrame([np.NaN]).astype(tt)
expected = DataFrame(['nan'])
expected = DataFrame([np.NaN], dtype=object)
Contributor

no leave this alone. turning a float nan into string should still work


@@ -149,6 +149,7 @@ def test_astype_str_map(self, dtype, series):
# see gh-4405
result = series.astype(dtype)
expected = series.map(compat.text_type)
expected = expected.replace('nan', np.nan) # see gh-20377
Contributor

remove this, this is equivalent to .astype(str); it's mapping to a string and is correct

@jreback
Contributor

jreback commented Oct 11, 2018

closing as stale, if you want to continue working, pls ping and we can re-open. you will need to merge master.


Successfully merging this pull request may close these issues.

read_excel with dtype=str converts empty cells to the string 'nan'