read_excel with dtype=str converts empty cells to np.nan #20429
Conversation
No idea what's going on with this conflict. Any help?
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -981,6 +981,8 @@ I/O
- :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
- Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in :`read_excel` where it transforms np.nan to 'nan' if dtype=str is chosen. Now keeps np.nan as they are. (:issue:`20377`)
Should be :func:`read_excel`
Try rebasing and then resolving the conflict?
@gfyoung do we handle this correctly in read_csv itself?
doc/source/whatsnew/v0.23.0.txt
Outdated
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
you are including some other changes here, pls rebase on master.
it's not mine. I deleted it by mistake and added it back.
You can check master here https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v0.23.0.txt#L985
However, even after rebasing, I keep getting this conflict
If you rebased off master and resolved the conflicts in the rebase then it should be ok. Did you fetch the current master before rebasing?
i did now
doc/source/whatsnew/v0.23.0.txt
Outdated
- Bug in :`read_excel` where it transforms np.nan to 'nan' if dtype=str is chosen. Now keeps np.nan as they are. (:issue:`20377`)
use double back-ticks around ``dtype=str`` and around ``np.nan``
pandas/io/excel.py
Outdated
dtypes = output[asheetname].dtypes
output[asheetname].replace('nan', np.nan, inplace=True)
output[asheetname] = output[asheetname].astype(dtypes, copy=False)
I worry about this patch being a performance hit against read_excel. The Python parser (in io/parsers.py) processes each of the Excel elements before placing it into a DataFrame. I would look there for the fix, since as I mentioned below, this bug impacts other read_* functions.
The result from read_csv is not the same as the one from read_excel with dtype=str. In the former case, empties are read in as np.nan, whereas in the latter they are read in as the string 'nan'.
True, though that doesn't change my opinion. The problematic part still likely stems from the engine parsing, which would also affect read_csv (the more popular of the two, IMO). Thus, if we can kill two birds with one stone, that would be even better.
The latter functions are only used by …
this is my point - the fix needs to be in read_csv not here
I think it's not a matter of …
I would go with empty string as per the issue.
In the ticket I mentioned this #20377 (comment).
Should the C parser also be brought in line?
@nikoskaragiannakis: @gfyoung suggests an empty string. That sounds like a good idea as it is consistent with the originating DataFrame.
Absolutely! Both parsers should be patched.
Sorry to have added confusion. I misinterpreted an earlier comment. I believe @jreback’s suggested changes are the best way to go.
So, what needs to be done here is to make sure that an existing np.nan stays np.nan instead of being converted to the string 'nan'.
Hello @nikoskaragiannakis! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on April 08, 2018 at 11:56 UTC
I made the changes, so now both parsers behave the same.
I also ran the performance tests for csv, excel, and series:
Looks like performance is not significantly affected. I look forward to your feedback.
Commits: … tests for np.nan (pandas-dev#20377), TST: pep8 (pandas-dev#20377), TST: Correction in a test (pandas-dev#20377)
Codecov Report
@@            Coverage Diff             @@
##           master   #20429      +/-   ##
==========================================
+ Coverage   91.82%   91.84%   +0.02%
==========================================
  Files         153      153
  Lines       49256    49257       +1
==========================================
+ Hits        45229    45242      +13
+ Misses       4027     4015      -12

Continue to review full report at Codecov.
Fixed a merge conflict. Any concerns with this @gfyoung?
don’t merge Tom - need to look
pandas/_libs/lib.pyx
Outdated
util.set_value_at_unsafe(
    result,
    i,
    unicode(arr_i) if arr_i is not np.nan else np.nan)
Interesting spacing...maybe we should do this instead:
uni_arr_i = unicode(arr_i) if arr_i is not np.nan else np.nan
util.set_value_at_unsafe(result, i, uni_arr_i)
use np.isnan here
Using np.isnan here raises:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
yeah that is not friendly to strings - ok
use checknull (should already be imported from util)
When I use that, all hell breaks loose. I get errors in tests like this one https://github.com/pandas-dev/pandas/blob/master/pandas/tests/frame/test_dtypes.py#L533
Is it because they use np.NaN? It looks like checknull checks both np.NaN and np.nan, while before the change I used to check only np.nan.
If that's the case, then I have to modify more tests.
@gfyoung are you sure this indentation is a big problem? Because if I do what you suggest, then how should I declare uni_arr_i (and str_arr_i) in the cdef?
Would it be ok if I changed it to something like
util.set_value_at_unsafe(
    ...
)
(i.e. moved the closing bracket to the next line)?
That would work as well.
the nans are the same; iow they point to the same object
go ahead and change tests if need be and i will have a look
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -1098,6 +1098,7 @@ I/O
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in ``usecols`` parameter in :func:`pandas.io.read_csv` and :func:`pandas.io.read_table` where error is not raised correctly when passing a string. (:issue:`20529`)
- Bug in :func:`read_excel` and :func:`read_csv` where missing values turned to ``'nan'`` with ``dtype=str`` and ``na_filter=True``. Now, they turn to ``np.nan``. (:issue `20377`)
can you make the last part a bit more clear. These missing values are converted to the string missing indicator, np.nan
@@ -465,7 +465,11 @@ cpdef ndarray[object] astype_unicode(ndarray arr):
    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
-       util.set_value_at_unsafe(result, i, unicode(arr[i]))
+       arr_i = arr[i]
is arr_i in the cdef?
d'oh!
pandas/_libs/lib.pyx
Outdated
util.set_value_at_unsafe(
    result,
    i,
    str(arr_i) if arr_i is not np.nan else np.nan)
same - use np.isnan here
pandas/tests/series/test_dtypes.py
Outdated
@@ -149,6 +149,7 @@ def test_astype_str_map(self, dtype, series):
    # see gh-4405
    result = series.astype(dtype)
    expected = series.map(compat.text_type)
    expected.replace('nan', np.nan, inplace=True)  # see gh-20377
don't use inplace
Back to you @jreback
@@ -1099,6 +1099,7 @@ I/O
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in ``usecols`` parameter in :func:`pandas.io.read_csv` and :func:`pandas.io.read_table` where error is not raised correctly when passing a string. (:issue:`20529`)
- Bug in :func:`HDFStore.keys` when reading a file with a softlink causes exception (:issue:`20523`)
- Bug in :func:`read_excel` and :func:`read_csv` where missing values turned to ``'nan'`` with ``dtype=str`` and ``na_filter=True``. Now, these missing values are converted to the string missing indicator, ``np.nan``. (:issue `20377`)
i mean make this a separate sub-section showing the previous and the new
@@ -4153,4 +4153,8 @@ def _try_cast(arr, take_fast_path):
    data = np.array(data, dtype=dtype, copy=False)
    subarr = np.array(data, dtype=object, copy=copy)

    # GH 20377
huh? this should not be necessary (not to mention non-performant)
Well, this fixes an error I got in this test: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/frame/test_dtypes.py#L533
This line https://github.com/nikoskaragiannakis/pandas/blob/7341cd17e11461728969afa159250e882b32dee0/pandas/core/series.py#L4154 does the damage by turning a np.nan to 'nan'. But this comes directly from numpy, so I cannot change it.
Am I missing something?
how's that an error? that test is correct.
I think that with the changes in the .pyx files, np.nan stays a float, even if you try to cast it to str. For example, pd.DataFrame([np.NaN]).astype(str) leaves np.NaN a float, while before the changes you used to get a 'nan'.
So I have to ask: do we want calls like the one above to turn np.NaN into 'nan' or not?
If not, then we would be inconsistent with this: https://github.com/pandas-dev/pandas/pull/20429/files#diff-6e435422c67fa1384140f92110fb69a7R379
@jreback ^
@@ -369,3 +369,27 @@ def test_no_na_filter_on_index(self):
    expected = DataFrame({"a": [1, 4], "c": [3, 6]},
                         index=Index([np.nan, 5.0], name="b"))
    tm.assert_frame_equal(out, expected)

def test_na_values_with_dtype_str_and_na_filter_true(self):
can you parameterize this on na_filter (you will need to provide the nan_value as well in the parameterize as they are different)
'c': str,
'd': str})

expected['a'] = expected['a'].astype('float64')
move this higher (by the expected), you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list
> move this higher (by the expected)

I'm not sure what you mean here.

> you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list

First of all, this is copy-paste from the previous test, which was added for #8212
Do you mean to do
expected = DataFrame({'a': Series([1,2,3,4], dtype='float64'),
                      'b': Series([2.5,3.5,4.5,5.5], dtype='float32'),
                      ...})
?
If I use, for example, for the 'c' column: Series([1, 2, 3, 4], dtype=str), then it will give me ['1', '2', '3', '4'] instead of the expected ['001', '002', '003', '004'].
yes exactly. If you need things like '001', then just do it that way, e.g. Series(['001', '002'....])
'b': [2.5, 3.5, 4.5, 5.5],
'c': [1, 2, 3, 4],
'd': [1.0, 2.0, np.nan, 4.0]}).reindex(
    columns=['a', 'b', 'c', 'd'])
just specify columns=list('abcd') rather than reindex
@@ -529,7 +529,7 @@ def test_astype_str(self):
    # consistency in astype(str)
    for tt in set([str, compat.text_type]):
        result = DataFrame([np.NaN]).astype(tt)
-       expected = DataFrame(['nan'])
+       expected = DataFrame([np.NaN], dtype=object)
no leave this alone. turning a float nan into string should still work
@@ -149,6 +149,7 @@ def test_astype_str_map(self, dtype, series):
    # see gh-4405
    result = series.astype(dtype)
    expected = series.map(compat.text_type)
    expected = expected.replace('nan', np.nan)  # see gh-20377
remove this, this is equiv to .astype(str); it's mapping to a string and is correct
closing as stale, if you want to continue working, pls ping and we can re-open. you will need to merge master.
Checklist for other PRs (remove this part if you are doing a PR for the pandas documentation sprint):
- git diff upstream/master -u -- "*.py" | flake8 --diff