DISCUSSION: Add format parameter to .astype when converting to str dtype #17211

Open
topper-123 opened this issue Aug 10, 2017 · 19 comments
Labels
Astype · Enhancement · Needs Discussion (requires discussion from core team before further action) · Strings (string extension data type and string data)

Comments

@topper-123
Contributor

topper-123 commented Aug 10, 2017

I propose adding a string formatting possibility to .astype when converting to str dtype: I think it's reasonable to expect that you can choose the string format when converting to a string dtype, as you're basically freezing a representation of your series, and just using .astype(str) for this is often too crude.

This possibility should take the shape of a format parameter to .astype that takes a string and can only be used when converting to string dtype. This would lessen the reliance on .apply for converting non-strings to more complex strings and make such conversions more readable (IMO) and maybe faster (as we're avoiding .apply, which is slow, though I'm not too knowledgeable about such optimizations).

The current procedure for converting to a complex string goes like this:

In [1]: ser = pd.Series([-1, 1.234])
In [2]: ser.apply("{:+.1f} $".format)
0    -1.0 $
1    +1.2 $
dtype: object

I propose to make this possible:

In [3]: ser.astype(str, format="{:+.1f} $")
0    -1.0 $
1    +1.2 $
dtype: object

If the dtype parameter is not str, setting of the format parameter should raise an exception. If format is not set, the current behaviour will be used. The proposed change is therefore backward compatible.
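
The rules above can be emulated today with a small helper. This is only a sketch of the proposed semantics: the function name astype_with_format is hypothetical and nothing like it exists in pandas.

```python
import pandas as pd


def astype_with_format(ser, dtype, format=None):
    """Hypothetical stand-in for the proposed ser.astype(str, format=...)."""
    if format is None:
        # No format given: fall back to the existing astype behaviour.
        return ser.astype(dtype)
    if dtype is not str:
        # The proposal only allows format together with the str dtype.
        raise TypeError("format can only be used when converting to str")
    # Element-wise formatting, equivalent to ser.apply(format.format).
    return ser.map(format.format)


ser = pd.Series([-1, 1.234])
formatted = astype_with_format(ser, str, format="{:+.1f} $")
print(formatted.tolist())  # ['-1.0 $', '+1.2 $']
```

Because format defaults to None, calls that do not pass it behave exactly like plain .astype, which is the backward-compatibility argument made above.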

Also to consider:

Allowing a placeholder name

Should a placeholder name be available? Then you could do:

In [4]: ser = pd.Series(pd.date_range('2017-03', periods=2, freq='M'))
In [5]: ser.astype(str, format="Y{value.dt.year}-Q{value.dt.quarter}")
0    Y2017-Q1
1    Y2017-Q2
dtype: object

(Note that above we have an implicit parameter on .astype with a default value "value", so adding a placeholder name is transparent. Note also that the above behaviour is already available through ser.dt.strftime, but please look at the principle rather than the concrete example.)
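
For comparison, a named placeholder can already be used with element-wise formatting. The sketch below assumes the two month-end dates from the example are built explicitly (to avoid depending on a particular freq alias). When formatting element by element, each value is a Timestamp, so the template uses value.year rather than value.dt.year.

```python
import pandas as pd

# The two month-end dates from the example, built explicitly.
ser = pd.Series(pd.to_datetime(["2017-03-31", "2017-04-30"]))

# Element-wise: each value is a pandas Timestamp bound to the name "value".
template = "Y{value.year}-Q{value.quarter}"
labels = ser.map(lambda ts: template.format(value=ts))
print(labels.tolist())  # ['Y2017-Q1', 'Y2017-Q2']

# Vectorized alternative built from the .dt accessor.
vectorized = "Y" + ser.dt.year.astype(str) + "-Q" + ser.dt.quarter.astype(str)
print(vectorized.tolist())  # ['Y2017-Q1', 'Y2017-Q2']
```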

A downside to allowing a placeholder name could be the potential for abuse (stuffing too much into the format string) and possibly losing the option to vectorize (though this is not my expertise).

Adding a .format method

Adding a .str.format or .format method to DataFrame/Series could also be considered.

If .format is added to the .str namespace it would only be usable for string dataframes/series (which I'd be quite ok with, if the format parameter is also available on .astype for other data types).

Alternatively, such a method could be available directly on all DataFrames/Series. Then you'd do ser.format('{:+.1f}') rather than ser.astype(str, format='{:+.1f}'). IMO though, it would be inconsistent to have such a string conversion method directly on pandas objects, but not for other types. Why have .format but not .to_numeric as a DataFrame/Series method?

IMO therefore, astype(str, format=...) combined with a .str.format method is better than adding a new .format method for this. So:

  • .astype(str, format=...) makes it very obvious that we're now changing to string datatype, and
  • .str.format(...) makes it clear that we're doing a string manipulation.
@topper-123 topper-123 changed the title DISCUSSION: Add formatting to .astype when converting to str dtype DISCUSSION: Add format parameter to .astype when converting to str dtype Aug 10, 2017
@jorisvandenbossche
Member

A big +1 to adding a more convenient way to achieve string formatting. However, IMO it might be cleaner to have it as a separate format method instead of overloading astype (although we already do this overloading for the 'category' dtype as well).

@chris-b1
Contributor

xref #15550

Personally I'm not a huge fan of the overloaded astype (even our existing uses), as it is at times difficult to reason about going back and forth with numpy. That said, it is convenient and generally does what people actually want.

I'm not sure this is the right api, but in #15550 I suggested something like this to split out the numpy behavior and pandas overrides

s.type.astype(....) <- does exactly what numpy does
s.type.to_datetime
s.type.to_string(format=...)

@topper-123
Contributor Author

topper-123 commented Aug 10, 2017

@jorisvandenbossche, a .format method was my first thought too, but I felt the API got too fragmented:

  • .astype(dtype) for most conversions,
  • .astype('category', ordered=..., categories=...) for categorical series,
  • .format(...) for strings,
  • maybe pd.to_numeric(...) for converting to numeric type

In addition, if we have a .format, why not also have .to_category, .to_numeric, etc.? The API then gets unwieldy quickly. I'd much prefer to keep conversions in one namespace, to make it easier to learn the conversion API.

I find @chris-b1 's idea very, very appealing, and it is kind of similar to how .plot already does it. I'm negative on the name type though, as it might be confused with .dtype and also the Python function type. Otherwise I'm +1.

An idea: could .astype just be the catch-all (almost as today) and we set attributes on that to be specific methods? So:

  • .astype(...) (catchall, mostly like today),
  • .astype.str(...) (my proposal above wrt. strings),
  • .astype.category(...) (like current .astype('category', ...) behaviour),
  • .astype.datetime(...) (to datetime),
  • .astype.numeric(...) (to numeric),
  • etc...

This is very similar to how .plot does it, which I count as a big positive.
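
A rough sketch of such a namespace, using the public accessor-registration hook. The accessor name convert and the method names are hypothetical (the real .astype is a plain method, so the sketch cannot attach attributes to it), and each method just delegates to existing pandas functionality.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("convert")  # hypothetical name
class ConvertAccessor:
    """Toy conversion namespace, loosely modelled on how .plot works."""

    def __init__(self, ser):
        self._ser = ser

    def __call__(self, dtype):
        # Catch-all, same as plain .astype(dtype).
        return self._ser.astype(dtype)

    def str(self, format="{}"):
        # Element-wise string formatting, as in the proposal above.
        return self._ser.map(format.format)

    def numeric(self, **kwargs):
        return pd.to_numeric(self._ser, **kwargs)

    def datetime(self, **kwargs):
        return pd.to_datetime(self._ser, **kwargs)

    def category(self, **kwargs):
        return self._ser.astype(pd.CategoricalDtype(**kwargs))


ser = pd.Series([-1, 1.234])
print(ser.convert.str("{:+.1f} $").tolist())  # ['-1.0 $', '+1.2 $']
print(pd.Series(["1", "2"]).convert.numeric().tolist())  # [1, 2]
```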

@TomAugspurger
Contributor

FWIW, the .astype('category', ordered...) won't be necessary once #16015 is finished. I'd rather not overload .astype any more.

@topper-123
Contributor Author

Sorry, closed by accident.

@topper-123 topper-123 reopened this Aug 10, 2017
@topper-123
Contributor Author

From the above there seems to be support for the idea, but the API for type conversions has not been settled. This issue is primarily about a format parameter in .astype, while a larger API discussion is already taking place in #15550.

The API discussion probably needs to be finished first, and the format functionality should then follow from that. So should I wait a little with this?

@jreback
Contributor

jreback commented Aug 12, 2017

Adding parameters to .astype() is not a great idea. We already have .to_string(...) and could support some kind of .astype(str).str.format(..), I suppose.

Why are you wanting to add complexity to this API?

@topper-123
Contributor Author

A num_series.astype(str).str.format('{:+.1f}') solution wouldn't work, as the format string '{:+.1f}' requires a number as input, and in the astype step we've already cast to strings. So the format string needs to be supplied at the same time as the cast to strings, not after.

The closest equivalent as far as I know would require an apply: num_series.apply('{:+.1f}'.format). I don't think there is any way around .apply ATM for custom formatting of numbers into strings, and the need for such formatting is quite common (e.g. for data presentation).
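
A quick check of the ordering problem, as a sketch: once the values are strings, a numeric format spec is rejected by Python's str.format, so the cast-then-format order cannot work, while formatting the numbers directly does.

```python
import pandas as pd

num_series = pd.Series([-1, 1.234])
fmt = "{:+.1f}".format

# Casting first hands str objects to a numeric format spec, which fails.
try:
    num_series.astype(str).map(fmt)
    cast_first_works = True
except ValueError as err:
    cast_first_works = False
    print("formatting already-cast strings fails:", err)

# Formatting the numbers directly works.
print(num_series.map(fmt).tolist())  # ['-1.0', '+1.2']
```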

Anyway, I understand that I'm clearly in the minority wrt. adding a format parameter to astype. Do you think there's some other acceptable solution, or am I wrong in my understanding, that apply is the only current way to achieve this?

@jreback
Contributor

jreback commented Aug 15, 2017

This is pretty much a special case. .apply is fine here. .to_string() is still the canonical output formatter, which does have a float_format arg.

@jreback jreback closed this as completed Aug 15, 2017
@jorisvandenbossche jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Aug 16, 2017
@jorisvandenbossche
Member

We already have .to_string(...)

This is something completely different. That converts the full dataframe to a string representation, while here it is about converting values to formatted string values inside a dataframe.
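
The difference can be seen by comparing return types. A minimal sketch, passing a callable as float_format to Series.to_string:

```python
import pandas as pd

ser = pd.Series([-1, 1.234])
fmt = "{:+.1f} $".format

# to_string renders the whole Series into one Python string.
rendered = ser.to_string(float_format=fmt)
print(type(rendered).__name__)  # str

# apply keeps a Series, with one formatted string per element.
converted = ser.apply(fmt)
print(type(converted).__name__)  # Series
print(converted.tolist())  # ['-1.0 $', '+1.2 $']
```

Only the second result can be assigned back as a column or processed further with the .str namespace.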

I like having some way to do this (but the question is indeed in what kind of API), but I would also be OK to end the discussion with the decision that it is not important enough to add specialized functionality and that using the s.apply("{..} ..".format) idiom is the recommended way here. But let's at least have that discussion.

@topper-123
Contributor Author

topper-123 commented Aug 17, 2017

Yeah, .to_string is not the same as this.

Pandas does offer conversion through .astype(str), and it has really great support for vectorized string operations (through the .str namespace). So string operations in general are well supported, just not the formatting of numbers into strings.

I agree on the benefits of settling on a "canonical" way of converting numbers to strings and writing up this approved method in the docs, even if .apply("{...}".format) should be the sanctioned way. I could write that up if needed. However, I do think that .apply("{...}".format) is ugly and illogical given that pandas does not use apply for formatting strings.

@jreback
Contributor

jreback commented Nov 26, 2017

OK, given the discussion we are having on #18347, I'm more amenable to this.

@h-vetinari
Contributor

h-vetinari commented Sep 12, 2019

Any update on this? (IOW has a conclusion been reached that could be implemented?)

@TomAugspurger
Contributor

TomAugspurger commented Sep 12, 2019 via email

@peterdhansen

My two cents: if you have a mixed datatype Series

>>> x = pd.Series([np.nan, 1, 2.0, "foo"])
>>> print(x)
0    NaN
1      1
2      2
3    foo
dtype: object
>>> print(x.astype(str))
0    nan
1      1
2    2.0
3    foo
dtype: object
>>> print(x.fillna("").astype(str).str.split())
0       []
1      [1]
2    [2.0]
3    [foo]
dtype: object

None of the above suggestions would let me get a string column where the 2 is formatted like the int.
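
One way to get that result today is an element-wise helper. This is only a sketch: the helper name to_clean_str is hypothetical, and it assumes that missing values should become empty strings, as in the fillna("") call above.

```python
import numpy as np
import pandas as pd


def to_clean_str(value):
    """Hypothetical helper: render integral floats without the trailing '.0'."""
    if pd.isna(value):
        return ""  # assumption: missing values become empty strings
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


x = pd.Series([np.nan, 1, 2.0, "foo"])
print(x.map(to_clean_str).tolist())  # ['', '1', '2', 'foo']
```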

@kylekeppler
Contributor

I am surprised Series.str.format() is not a thing, was that intentional?

@TomAugspurger
Contributor

TomAugspurger commented Sep 17, 2019 via email

@h-vetinari
Contributor

I am surprised Series.str.format() is not a thing, was that intentional?

@TomAugspurger: Probably not intentional.

I'd say it hasn't been intentionally omitted, but it cannot easily be added due to the way the .str-accessor works. You need to have strings before you're even able to call .str, so by the time you get to .str.format(...) the question of the format has already been settled for you (to the default).

Of course it would be hypothetically possible to delay the execution of calculating the string representation, and more than that, allow calling .str also for non-string columns (or rather non-object cols for lack of a string-type), but the trend has been going in the opposite direction - as in disabling the .str-accessor for columns that are not strings.
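
To illustrate the point about the accessor, here is a small sketch: .str is refused on a numeric Series, and after .astype(str) the accessor only sees the default string representation.

```python
import pandas as pd

nums = pd.Series([-1, 1.234])

# The .str accessor is not available on a numeric Series.
try:
    nums.str
    accessor_available = True
except AttributeError as err:
    accessor_available = False
    print(err)

# After the cast, .str works, but the representation is already fixed.
as_text = nums.astype(str)
print(as_text.tolist())  # ['-1.0', '1.234']
print(as_text.str.len().tolist())  # [4, 5]
```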

@TomAugspurger
Contributor

TomAugspurger commented Sep 17, 2019 via email
