read_excel with dtype=str converts empty cells to np.nan #20429
Conversation
No idea what's going on with this conflict. Any help?
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -981,6 +981,8 @@ I/O
- :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
- Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in :`read_excel` where it transforms np.nan to 'nan' if dtype=str is chosen. Now keeps np.nan as they are. (:issue:`20377`)
Should be :func:`read_excel`
Try rebasing and then resolving the conflict?
@gfyoung do we handle this correctly in read_csv itself?
doc/source/whatsnew/v0.23.0.txt
Outdated
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
you are including some other changes here, pls rebase on master.
it's not mine. I deleted it by mistake and added it back.
You can check master here https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v0.23.0.txt#L985
However, even after rebasing, I keep getting this conflict
If you rebased off master and resolved the conflicts in the rebase then it should be ok. Did you fetch the current master before rebasing?
i did now
doc/source/whatsnew/v0.23.0.txt
Outdated
- Bug in :`read_excel` where it transforms np.nan to 'nan' if dtype=str is chosen. Now keeps np.nan as they are. (:issue:`20377`)
use double back-ticks around ``dtype=str`` and around ``np.nan``
pandas/io/excel.py
Outdated
dtypes = output[asheetname].dtypes
output[asheetname].replace('nan', np.nan, inplace=True)
output[asheetname] = output[asheetname].astype(dtypes, copy=False)
I worry about this patch being a performance hit against read_excel. The Python parser (in io/parsers.py) processes each of the Excel elements before placing it into a DataFrame. I would look there for the fix, since as I mentioned below, this bug impacts other read_* functions.
The result from read_csv is not the same as the one from read_excel with dtype=str. In the former case, empties are read in as np.nan, whereas in the latter they are read in as the string 'nan'.
True, though that doesn't change my opinion. The problematic part still likely stems from the engine parsing, which would also affect read_csv (the more popular of the two, IMO). Thus, if we can kill two birds with one stone, that would be even better.
The latter functions are only used by …
this is my point - the fix needs to be in read_csv not here
I think it's not a matter of …
I would go with empty string as per the issue.
In the ticket I mentioned this #20377 (comment).
Should the C parser also be brought in line?
@nikoskaragiannakis: @gfyoung suggests an empty string. That sounds like a good idea as it is consistent with the originating DataFrame.
Absolutely! Both parsers should be patched.
Sorry to have added confusion. I misinterpreted an earlier comment. I believe @jreback’s suggested changes are the best way to go.
So, what needs to be done here is to make sure that an existing np.nan stays np.nan instead of being converted to the string 'nan'.
Hello @nikoskaragiannakis! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on April 08, 2018 at 11:56 UTC
I made the changes, so now both parsers behave the same.
I also ran the performance tests for csv, excel, and series:
Looks like performance is not significantly affected. I look forward to your feedback.
Commits: … tests for np.nan (pandas-dev#20377), TST: pep8 (pandas-dev#20377), TST: Correction in a test (pandas-dev#20377)
Codecov Report
@@            Coverage Diff             @@
##           master   #20429      +/-   ##
==========================================
+ Coverage   91.82%   91.84%   +0.02%
==========================================
  Files         153      153
  Lines       49256    49257       +1
==========================================
+ Hits        45229    45242      +13
+ Misses       4027     4015      -12

Continue to review full report at Codecov.
Fixed a merge conflict. Any concerns with this @gfyoung?
don’t merge Tom - need to look
pandas/_libs/lib.pyx
Outdated
util.set_value_at_unsafe(
    result,
    i,
    unicode(arr_i) if arr_i is not np.nan else np.nan)
Interesting spacing...maybe we should do this instead:
uni_arr_i = unicode(arr_i) if arr_i is not np.nan else np.nan
util.set_value_at_unsafe(result, i, uni_arr_i)
use np.isnan here
Using np.isnan here raises:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
yeah that is not friendly to strings - ok
use checknull (should already be imported from util)
When I use that, all hell breaks loose. I get errors in tests like this one https://github.com/pandas-dev/pandas/blob/master/pandas/tests/frame/test_dtypes.py#L533
Is it because they use np.NaN? It looks like checknull checks both np.NaN and np.nan, while before the change I used to check only np.nan.
If that's the case, then I have to modify more tests.
@gfyoung are you sure this indentation is a big problem? Because if I do what you suggest, then how should I declare uni_arr_i (and str_arr_i) in the cdef?
Would it be ok if I changed it to something like
util.set_value_at_unsafe(
    ...
)
(i.e. moved the closing bracket to the next line)?
That would work as well.
the nans are the same; iow they point to the same object
go ahead and change tests if need be and i will have a look
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -1098,6 +1098,7 @@ I/O
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in ``usecols`` parameter in :func:`pandas.io.read_csv` and :func:`pandas.io.read_table` where error is not raised correctly when passing a string. (:issue:`20529`)
- Bug in :func:`read_excel` and :func:`read_csv` where missing values turned to ``'nan'`` with ``dtype=str`` and ``na_filter=True``. Now, they turn to ``np.nan``. (:issue `20377`)
can you make the last part a bit more clear. These missing values are converted to the string missing indicator, np.nan
@@ -465,7 +465,11 @@ cpdef ndarray[object] astype_unicode(ndarray arr):
    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
-       util.set_value_at_unsafe(result, i, unicode(arr[i]))
+       arr_i = arr[i]
is arr_i in the cdef?
d'oh!
pandas/_libs/lib.pyx
Outdated
util.set_value_at_unsafe(
    result,
    i,
    str(arr_i) if arr_i is not np.nan else np.nan)
same - use np.isnan here
pandas/tests/series/test_dtypes.py
Outdated
@@ -149,6 +149,7 @@ def test_astype_str_map(self, dtype, series):
    # see gh-4405
    result = series.astype(dtype)
    expected = series.map(compat.text_type)
    expected.replace('nan', np.nan, inplace=True)  # see gh-20377
don't use inplace
Back to you @jreback
@@ -1099,6 +1099,7 @@ I/O
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in ``usecols`` parameter in :func:`pandas.io.read_csv` and :func:`pandas.io.read_table` where error is not raised correctly when passing a string. (:issue:`20529`)
- Bug in :func:`HDFStore.keys` when reading a file with a softlink causes exception (:issue:`20523`)
- Bug in :func:`read_excel` and :func:`read_csv` where missing values turned to ``'nan'`` with ``dtype=str`` and ``na_filter=True``. Now, these missing values are converted to the string missing indicator, ``np.nan``. (:issue `20377`)
i mean make this a separate sub-section showing the previous and the new
@@ -4153,4 +4153,8 @@ def _try_cast(arr, take_fast_path):
    data = np.array(data, dtype=dtype, copy=False)
    subarr = np.array(data, dtype=object, copy=copy)

    # GH 20377
huh? this should not be necessary (not to mention non-performant)
Well, this fixes an error I got in this test: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/frame/test_dtypes.py#L533
This line https://github.com/nikoskaragiannakis/pandas/blob/7341cd17e11461728969afa159250e882b32dee0/pandas/core/series.py#L4154 does the damage by turning a np.nan to 'nan'. But this comes directly from numpy, so I cannot change it.
Am I missing something?
how's that an error? that test is correct.
I think that with the changes in the .pyx files, np.nan stays a float, even if you try to cast it to str. For example, pd.DataFrame([np.NaN]).astype(str) leaves np.NaN a float, while before the changes you used to get a 'nan'.
So I have to ask: do we want calls like the one above to turn np.NaN into 'nan' or not?
If not, then we would be inconsistent with this: https://github.com/pandas-dev/pandas/pull/20429/files#diff-6e435422c67fa1384140f92110fb69a7R379
@jreback ^
@@ -369,3 +369,27 @@ def test_no_na_filter_on_index(self):
    expected = DataFrame({"a": [1, 4], "c": [3, 6]},
                         index=Index([np.nan, 5.0], name="b"))
    tm.assert_frame_equal(out, expected)

def test_na_values_with_dtype_str_and_na_filter_true(self):
can you parameterize this on na_filter (you will need to provide the nan_value as well in the parameterize as they are different)
'c': str,
'd': str})

expected['a'] = expected['a'].astype('float64')
move this higher (by the expected), you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list
> move this higher (by the expected)

I'm not sure what you mean here.

> you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list

First of all, this is copy-paste from the previous test, which was added for #8212
Do you mean to do
expected = DataFrame({'a': Series([1,2,3,4], dtype='float64'),
                      'b': Series([2.5,3.5,4.5,5.5], dtype='float32'),
                      ...})
?
If I use, for example, for the 'c' column: Series([1, 2, 3, 4], dtype=str), then it will give me ['1', '2', '3', '4'] instead of the expected ['001', '002', '003', '004'].
yes exactly. If you need things like '001', then just do it that way, e.g. Series(['001', '002'....])
'b': [2.5, 3.5, 4.5, 5.5],
'c': [1, 2, 3, 4],
'd': [1.0, 2.0, np.nan, 4.0]}).reindex(
    columns=['a', 'b', 'c', 'd'])
just specify columns=list('abcd') rather than reindex
@@ -529,7 +529,7 @@ def test_astype_str(self):
    # consistency in astype(str)
    for tt in set([str, compat.text_type]):
        result = DataFrame([np.NaN]).astype(tt)
-       expected = DataFrame(['nan'])
+       expected = DataFrame([np.NaN], dtype=object)
no leave this alone. turning a float nan into string should still work
@@ -149,6 +149,7 @@ def test_astype_str_map(self, dtype, series):
    # see gh-4405
    result = series.astype(dtype)
    expected = series.map(compat.text_type)
    expected = expected.replace('nan', np.nan)  # see gh-20377
remove this, this is equiv to .astype(str); it's mapping to a string and is correct
closing as stale, if you want to continue working, pls ping and we can re-open. you will need to merge master.
Checklist for other PRs (remove this part if you are doing a PR for the pandas documentation sprint):
- git diff upstream/master -u -- "*.py" | flake8 --diff