Incorrect result when resampling DatetimeIndex due to hidden cast of ints to floats #3707
Since numpy cannot support a native …
@jreback how annoying would it be to roll up a …
@cpcloud you could use the …
yeah i hate maskedarrays... (link doesn't work) :)
If I understand you correctly, rather than trying to determine if an operation would create … I still think that in the case where an int is cast to a float to do an operation, pandas should return floats. It's very non-obvious that an operation that accepts ints and returns ints is actually doing floating-point calculations in the background! My example uses big integers to demonstrate the problem, but it could occur with smaller numbers too. Finally, changing the topic slightly, the behavior I'm actually looking for is to replace …
I can't see how this is easily possible, but I'm not an expert here.
checking for NaNs costs time, and in any event these types of routines are implemented only for floats (they could in theory be done as ints, but this ONLY works if there are NO NaNs at all, which when doing resampling is very rare if you think about it; most of the time you will end up generating NaN for missing periods).

pandas tries to return the input type; it's just more intuitive that way (before 0.11 pandas used to return floats in almost all cases, and THAT was a bit of an issue). Now that pandas supports many dtypes, it tries to return the correct one, and the cast back is done ONLY when there is no loss of result. (Do your above operation with floats: it returns the same result, 0.)

2nd question
… will work; I don't think pandas supports a …
Here's another way of doing this w/o the casting: http://stackoverflow.com/questions/16807836/how-to-resample-a-timeseries-in-pandas-with-a-fill-value
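The linked workaround can also be expressed without a custom aggregation function: reindex onto the complete target index with an integer `fill_value`, which avoids introducing NaN and therefore keeps the `int64` dtype. A minimal sketch (the dates and fill value here are illustrative, not from the thread):

```python
import pandas as pd

s = pd.Series([1, 2], index=pd.to_datetime(['2000-01-01', '2000-01-05']))

# Build the complete daily index and fill the gaps with 0 instead of NaN,
# so the series is never upcast to float64.
full = pd.date_range('2000-01-01', '2000-01-05', freq='D')
filled = s.reindex(full, fill_value=0)
print(filled.dtype)  # int64
```

Because no NaN ever appears, there is no int-to-float round trip and no chance of the precision loss discussed above.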
I'm not sure what's wrong with returning floats when floats have been used to do the calculations, and I personally think that is much clearer than silently casting them back into ints with the subsequent loss of precision, but it sounds like there may be other issues that I am not aware of.

Anyway, it seems that when using ints there are two possible scenarios: (1) the calculation is done entirely with ints without casting, or (2) the ints are cast to floats for the calculation and then cast back to ints when the results are returned. When calling a method, how can I tell which one will happen? It seems this should at least be documented, or even better, a signal could be emitted when precision is lost by casting (perhaps via a mechanism like decimal.Context, which allows the user to decide which conditions they wish to trap).

Thank you for your answers to my second question. The stackoverflow question you linked was actually my post! Using a custom aggregation function as suggested in the answer does work correctly, but unfortunately the performance is about 30 times worse than …

Finally, just want to say thanks for developing and supporting pandas. Despite my complaints, I've found it to be an incredibly useful tool in my work.
yes, unfortunately using a lambda is much slower; that said, it might be worth it to build this type of filling in (e.g. provide a … using …).

however, pandas does try to maintain dtype where possible; that is why the cast back happens. And as I said, there is NO loss of precision (IF numpy allows it, which it does in this case), e.g.
…
It's not true that there is no loss of precision when casting from int to float. The code in my first post demonstrates the problem and how it can affect the results of a resample operation. Another trivial example:
The value of …
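The point about int-to-float precision loss can be seen with any odd integer above 2**53, since float64 carries a 53-bit significand (the specific value below is chosen for illustration):

```python
big = 2**53 + 1          # smallest positive int that float64 cannot represent
print(float(big))        # 9007199254740992.0 -- rounded down to 2**53
print(int(float(big)) == big)  # False: the round trip through float loses 1
```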
In general, sure (but by definition this is true). @jreback pandas does lose precision with these values. For example:

```python
s = Series([18014398509481983] * 10, date_range(start='1/1/2000', periods=10, freq='D'))
s.resample('M')[0] == s[0]      # False
s.resample('M')[0] == s[0] + 1  # True
s += 1
s.resample('M')[0] == s[0]      # True
```

It (…
i think u are misunderstanding what pandas is actually doing. the conversion from float back to int occurs ONLY if numpy compares the values as equal (and there are no NaNs and the original dtype was int to begin with). your above example fails as the numbers don't compare equal in the first place. however, if there is a loss of precision, e.g. your example …

that said, if we could figure out a reasonable way to determine if there possibly would be a loss of precision, then we could raise an error (and let the user decide what to do), as u suggested above. here's a numpy related discussion: …

I think we could do the test before casting, essentially this:
I don't believe this is expensive to do either; the only issue is we prob should do this in lots of places :) The cast back already does this test, so that's not a problem (or it will leave it as float).
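A hypothetical version of such a pre-cast test could be a simple round trip through float64 (the name `values_safe_as_float` is invented for illustration; this is not a pandas API):

```python
import numpy as np

def values_safe_as_float(arr):
    # Cast int -> float64 and back; if every element survives the
    # round trip, a float computation cannot have rounded the inputs.
    return np.array_equal(arr.astype('float64').astype(arr.dtype), arr)

print(values_safe_as_float(np.array([1, 2, 3], dtype='int64')))    # True
print(values_safe_as_float(np.array([2**53 + 1], dtype='int64')))  # False
```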
There are TWO casts being done in the resample example that I originally posted. The first is a cast from int to float; the second is a cast back to int from float. It doesn't matter if the second cast from float to int occurs only if numpy compares the values as equal, because we've already lost the precision in the FIRST cast from int to float. The code @cpcloud posted shows another example of the same problem. You said "your above example fails as the numbers don't compare equal in the first place" which is exactly the point! We CANNOT guarantee that a cast from int to float will not lose precision! (More precisely, if we are working with 64-bit types, I believe that if the number of bits in the integer exceeds 53, then it cannot be represented exactly as a float, as that would exceed the float's significand precision.) Anyway, I propose the following possible solutions:
Personally, I am strongly in favor of 3, but I realize there may be other opinions. My reasons are:
In short, I believe this solution has the least confusing behavior, performs the fastest, and requires the least amount of code. Thoughts?
nice summary. The issue is that the FIRST cast is not 'safe' (as I defined above), while the SECOND is. As far as your solutions: I know you like option 3 (cast most things to float and leave them), but …
So we are left with a gotchas doc, or providing a warning (or exception). I think providing a …

Would you do a PR for a short section in http://pandas.pydata.org/pandas-docs/dev/gotchas.html for 0.11.1?
I've been using pandas for quite a while, but this is my first time venturing into the "development" side of things. I'm not sure how changes are introduced to the code, but I'd be curious to hear the opinions of other developers or users on this issue.

Finally, I realize my use-case is probably unusual and that is why no one has reported this problem before. I'm going to redesign my application to use Decimal objects instead of integers, which will be a big performance hit, but I don't want to risk finding any more "gotchas" like this.
Some calculations could be done using ints directly; however, this means additional code generation and complexity. This is a really narrow case that in practice is very unlikely. Dtype preservation is quite important; it comes up much more in that almost all calculations with, for example, …

There are tradeoffs made in pandas for performance gains. If they are giving an incorrect result, that can be fixed, and it's not really difficult (as I mentioned above), so I think a doc change now is appropriate, and we can fix in 0.12. my 2c. Using …

There are not many gotchas in pandas; in fact, great lengths have been gone to in order to remove many, many more gotchas/outright bugs that exist in numpy. Thanks for bringing this issue up, that's how things get fixed!
OK. I think we've both made all our points. :) Agree that at least a documentation gotcha should be added here, and perhaps a …

Finally, can I make a feature request for adding a …
@gerdemb go ahead and put up the …

If you could do a PR for the docs (or just post here, and I can put it in, either way).. thanks.....
closing in favor of #11199 |
Resurrected in #16674. |
To summarize: pandas should raise when there is a loss of precision in casting from ints to floats for computations (everywhere!)
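A hypothetical guard implementing that behaviour could look like the following (not pandas code; the name `cast_to_float_checked` is invented for illustration):

```python
import numpy as np

def cast_to_float_checked(arr):
    # Refuse the int -> float cast when it would round any element,
    # instead of silently computing on rounded values.
    out = arr.astype('float64')
    if not np.array_equal(out.astype(arr.dtype), arr):
        raise ValueError("casting to float64 would lose integer precision")
    return out
```

With this guard, `cast_to_float_checked(np.array([2**53 + 1]))` would raise, while small integers pass through to float64 unchanged.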
The wrong result is returned in the following case because of unnecessary casting of ints to floats.
This is the offending code in `groupby.py` that does the casting: … and even worse, it sneakily casts the floats back to integers, hiding the problem!
It should be possible to perform this calculation without any casting. For example, the `sum()` of a DataFrame returns the correct result: …

I am working with financial data that needs to stay as int to protect its precision. In my opinion, any operation that casts values to floats to do a calculation, but then returns the results as ints, is very wrong. At the very least, the results should be returned as floats to show that a cast has been done. Even better would be to perform the calculations with ints whenever possible.
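The failure mode can be reproduced with plain numpy, since the mean is computed in float64 internally; a minimal sketch using the same value as the resample example in this thread:

```python
import numpy as np

big = 18014398509481983                 # 2**54 - 1, not exactly representable as float64
vals = np.array([big, big], dtype='int64')

# What the groupby/resample mean path effectively does: cast to float first.
mean_via_float = vals.astype('float64').mean()
print(int(mean_via_float) == big)       # False: the inputs were rounded on the way in
print(int(mean_via_float) == big + 1)   # True: they rounded up to 2**54

# An exact integer mean is possible here because the sum still fits in int64.
print(vals.sum() // len(vals) == big)   # True
```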