Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: fill_value argument for shift #15527

Closed
wants to merge 2 commits into from

Conversation

ResidentMario
Copy link
Contributor

@ResidentMario ResidentMario commented Feb 27, 2017

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Feb 27, 2017

Tests+ are a WIP.

@jreback
Copy link
Contributor

jreback commented Feb 27, 2017

@ResidentMario tip. write tests first! This validates that you are actually testing the right thing.

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Feb 28, 2017

Will try to follow TDD when here going forward.

Right now, this looks like it works for non-datetimes, but with datetimes the fill value is treated as UNIX time. So e.g. pd.Series(pd.date_range('1/1/2015', '1/30/2015')).shift(1, fill_value=0) replaces the first instance with 1970-01-01, not 0.

What behavior should be followed in that case? Trying to read the fill_value input as a datetime maybe?

@ResidentMario
Copy link
Contributor Author

It seems to me that the behavior the emulate here w.r.t. df.shift(fill_value=0) is that of df.iloc[:, 0] = 0. Which does the following by column:

  • int and object columns get 0 (int) and retain their dtype.
  • float columns get 0.0 (float) and retain their dtype.
  • datetime columns get 1970-01-01 (datetime; UNIX timestamp) and retain their dtype.
  • period columns get 0 (int) and become object dtype. (!)
  • categorical columns fail out with ValueError: Cannot setitem on a Categorical with a new category, set the categories first.
  • sparse columns fail with a NotImplementedError.

@jreback
Copy link
Contributor

jreback commented Feb 28, 2017

no the fill value has to be a valid value for that dtype (this is checked already internally)

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Feb 28, 2017

The tests for shift are kind of all over the place (?). For the moment, I'm keeping the one's I'm writing in tests/frame/test_timeseries::TestDataFrameTimeSeriesMethods.test_shift_fillna (for DataFrame.shift) and tests/series/test_timeseries::TestSeriesTimeSeriesMethods.test_shift_fillna (for Series.shift).

As you stated above, the implementation is from a surface level as simple as passing the fill_value argument down to the requisite shift methods (several exist, one for each "kind" of axis), where in most cases there's a _maybe_upcast function which accepts the new fill_value as a keyword argument.

A couple of things I'm seeing in testing, however. I'm finding that in this implementation non-valid values actually get converted to np.nan: such is the fate of ts.shift(1, fill_value="Hello World"), which results in a first value of NaT. I'd have expected a ValueError instead, however...is this the desired behavior?

The other thing is that Categorical and Sparse don't make a call to _maybe_upcast internally. Should something else be done in these cases instead? Maybe these column types should raise a NotImplemented when passed a fill_value instead?

@jreback
Copy link
Contributor

jreback commented Feb 28, 2017

yes .shift tests are a bit of a mixed bag (always glad to consolidate things though - generally separate PR for things like that is good).

So we certainly can raise on a not-None fill_value for certain types (categorical; sparse in theory could work though, but don't spend too much time on, you can simply open an issue)

and to your last part, we are prob not validating things as much as we should so help there would be great as well!

@ResidentMario
Copy link
Contributor Author

Would it be OK to make reworking the fill_value validation a separate issue? This seems like a broader issue, as other functions which have their own validation code paths. For example, with Series.add:

pd.Series([0, 1]).add(pd.Series([np.nan, np.nan]), fill_value='Hello')

Raises:

ValueError: invalid literal for int() with base 10: 'Hello'

On:

.../series.py in _binop(self, other, func, level, fill_value)
   1639             # one but not both
   1640             mask = this_mask ^ other_mask
-> 1641             this_vals[this_mask & mask] = fill_value
   1642             other_vals[other_mask & mask] = fill_value
   1643 

Whereas this implementation results in an upcast to object type:

 >>> pd.Series([0, 1]).shift(1, fill_value="he")
 0    he
 1     0
 dtype: object

It would be good to write up something like a _maybe_fill method that would handle this check across all of the APIs that implement a fill_value and make this behavior consistent.

With this PR, I suggest:

  • Raise a NotImplemented error for categorical/sparse.
  • Write the tests where they are, for now.
  • Merge the current working implementation for non-categorical/non-sparse.
  • File an issue for consolidating shift tests.
  • File an issue for a _maybe_fill consolidator.
  • File an issue for extending fill_value to sparse.

@jreback
Copy link
Contributor

jreback commented Feb 28, 2017

@ResidentMario absolutely!

in general, 1 PR -> 1 'issue'. It makes it easier to reason about / review.

So for example I would leave this PR, and do another that does the validation as you suggest. You can then rebase this one on your new one (until it is merged), then rebase on master.

Very happy with independent (or even dependent PR's as I suggest above), just indicate if that is the case in the top of the PR.

and certainly filing issues is good as well; for bugs / features even if you plan to do them. It keeps things organized.

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Feb 28, 2017
@jreback
Copy link
Contributor

jreback commented Aug 1, 2017

closing as stale but good idea. master has changed a bit since this PR, might be easier now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: fill_value argument for shift
2 participants