Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: added no exception versions of '_string_to_dts' and 'parse_iso_8601_datetime' functions #26220

Merged
merged 14 commits into from
May 7, 2019

Conversation

anmyachev
Copy link
Contributor

  • closes #N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

asv continuous -f 1.05 origin/master HEAD -b ^io.csv -b ^timeseries -a warmup_time=1 -a sample_time=1:

master patch ratio test name
98.4±3ms 76.2±2ms 0.77 timeseries.ToDatetimeFormatQuarters.time_infer_quarter
15.2±0.6ms 9.75±0.04ms 0.64 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY')
37.1±2ms 22.4±0.1ms 0.60 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY')

@codecov
Copy link

codecov bot commented Apr 26, 2019

Codecov Report

Merging #26220 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26220      +/-   ##
==========================================
- Coverage   91.97%   91.97%   -0.01%     
==========================================
  Files         175      175              
  Lines       52379    52379              
==========================================
- Hits        48178    48175       -3     
- Misses       4201     4204       +3
Flag Coverage Δ
#multiple 90.52% <ø> (ø) ⬆️
#single 40.69% <ø> (-0.16%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.71% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65466f0...0598922. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 26, 2019

Codecov Report

Merging #26220 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26220      +/-   ##
==========================================
- Coverage   91.98%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52379    52374       -5     
==========================================
- Hits        48183    48174       -9     
- Misses       4196     4200       +4
Flag Coverage Δ
#multiple 90.53% <ø> (-0.01%) ⬇️
#single 40.72% <ø> (-0.16%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/compat/__init__.py 94.59% <0%> (-0.28%) ⬇️
pandas/io/excel/_util.py 87.32% <0%> (-0.18%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️
pandas/core/reshape/merge.py 94.47% <0%> (-0.03%) ⬇️
pandas/core/reshape/pivot.py 96.52% <0%> (-0.02%) ⬇️
pandas/core/indexes/range.py 97.96% <0%> (-0.01%) ⬇️
pandas/io/pytables.py 90.22% <0%> (-0.01%) ⬇️
pandas/core/dtypes/inference.py 100% <0%> (ø) ⬆️
pandas/core/groupby/groupby.py 97.24% <0%> (+0.01%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f46ab96...cffa8a6. Read the comment docs.

@WillAyd WillAyd added Performance Memory or execution speed performance Datetime Datetime data dtype labels Apr 26, 2019
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the speedup here is purely replacing the "except? -1" clause in the cdef with manual error checking right?

pandas/_libs/tslib.pyx Outdated Show resolved Hide resolved
@anmyachev
Copy link
Contributor Author

So the speedup here is purely replacing the "except? -1" clause in the cdef with manual error checking right?

@WillAyd you're right.

@WillAyd
Copy link
Member

WillAyd commented Apr 27, 2019

OK thanks. Do you know if the question mark in that statement has a performance hit?

Not against this change just surprised to see it make that big of a difference. Certainly being able to leverage the built-in Cython functionality would be simpler and preferable if removing the question mark gets us closer (hence my question)

@anmyachev
Copy link
Contributor Author

OK thanks. Do you know if the question mark in that statement has a performance hit?

The performance degradation from this is unimportant because most likely this is one additional PyErr_Occurred() call, I think.

Not against this change just surprised to see it make that big of a difference. Certainly being able to leverage the built-in Cython functionality would be simpler and preferable if removing the question mark gets us closer (hence my question)

The main performance increase is that we stop throwing a ValueError exception (this is too expensive for python). So removing the question mark will not allow performance gains and will most likely change the logic of date parsing.

@jbrockmendel
Copy link
Member

The main performance increase is that we stop throwing a ValueError exception (this is too expensive for python)

@anmyachev any idea how many more formats/functions you plan to optimize like this? In the past we've been trying to get code out of the C files and to stay python-idiomatic where possible. It looks like you're making an effort to keep the non-idiomatic code contained to np_datetime, which is good.

Could the raising of ValueError be made more efficient upstream in cython? Just spitballing: if the return-code based version is really equivalent to the raising-based version, it seems like cython should be able to figure that out.

iresult[i] = NPY_NAT
string_to_dts_failed = _string_to_dts_noexc(
val, &dts, &out_local,
&out_tzoffset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think cython now supports specifying typed tuples for return types, e.g. cdef (bint, int, int) _string_to_dts_no_exc(...). That would be more idiomatic than pointer-passing. Thoughts @WillAyd ?

@anmyachev
Copy link
Contributor Author

@anmyachev any idea how many more formats/functions you plan to optimize like this? In the past we've been trying to get code out of the C files and to stay python-idiomatic where possible. It looks like you're making an effort to keep the non-idiomatic code contained to np_datetime, which is good.

At the moment, nothing more in this way. This function was "hot" when profiling, and the main slowdown was just in throwing an exception. Other functions with exactly the same problem have not been identified yet.

Could the raising of ValueError be made more efficient upstream in cython? Just spitballing: if the return-code based version is really equivalent to the raising-based version, it seems like cython should be able to figure that out.

Probably, but the main problem is the speed of exceptions in the python itself, and not how сython generates it.

In the first case, we need the function to throw an exception, and in the other it does not, depending on our choice(for this we use want_exc argument). I cannot achieve this purely with the construction of a cython.

If the existence of functions "_string_to_dts_noexc" and "_string_to_dts" is not very good for you, then I can try to make one function, but with an additional parameter (to throw an exception or not) while leaving with the instruction "except? -1". What do you say?

@jbrockmendel
Copy link
Member

If the existence of functions "_string_to_dts_noexc" and "_string_to_dts" is not very good for you [...]

Not a problem at all. Why would we ever use the slow version?

Probably, but the main problem is the speed of exceptions in the python itself, and not how сython generates it.

_string_to_dts is cdef, so it can only be called from C/cython. The idea was that for these functions, cython could make the substitution and only ever do the raising (slow) in a python-exposed function.

@vnlitvinov
Copy link
Contributor

The idea was that for these functions, cython could make the substitution and only ever do the raising (slow) in a python-exposed function.

This is possible if desired, we'd just have to make parse_iso8601 return different error codes to specify correct exception messages.

The exceptions themselves are not that slow, but in this particular case they were taking way more time than parse_iso8601 function itself (because the function on usual data very quickly either parses the string or figures out it's not a ISO8601 datetime - nothing of this requires new allocations), and raising an exception requires allocating a few objects (at least the exception and the traceback objects) and unwinding the stack to store in the traceback. If the only thing we care (as in this particular case) is whether exception happened at all (and we don't care about message or stacktrace), then we should be fine using boolean "error happened" flag.

As for Cython being smart there - I think it might be possible to make a keyword or something that would tell Cython to generate some error-code-based error checking and hide it pretending it's exceptions (like @nogil pretends it's still Pythonic but releases GIL).

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is adding a lot of complexity for a relativily small gain. If you want to remove the exception handling, then just do it and force a check everytime it is used. having 2 function is way more complex code.

@anmyachev
Copy link
Contributor Author

this is adding a lot of complexity for a relativily small gain. If you want to remove the exception handling, then just do it and force a check everytime it is used. having 2 function is way more complex code.

@jreback "noexc" version of functions removed

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this actually have a non-trivial improvemet?

pandas/_libs/tslib.pyx Outdated Show resolved Hide resolved
pandas/tests/tslibs/test_parse_iso8601.py Outdated Show resolved Hide resolved
pandas/_libs/tslib.pyx Outdated Show resolved Hide resolved
@anmyachev
Copy link
Contributor Author

@jreback I simplified changes.

@jreback jreback added this to the 0.25.0 milestone Apr 30, 2019
pandas/_libs/tslibs/conversion.pyx Outdated Show resolved Hide resolved
pandas/_libs/tslibs/conversion.pyx Outdated Show resolved Hide resolved
pandas/_libs/tslibs/conversion.pyx Outdated Show resolved Hide resolved
pandas/_libs/tslibs/conversion.pyx Outdated Show resolved Hide resolved
pandas/_libs/tslibs/conversion.pyx Outdated Show resolved Hide resolved
pandas/_libs/tslibs/conversion.pyx Outdated Show resolved Hide resolved
@anmyachev
Copy link
Contributor Author

@jreback @WillAyd all CI builds (include macOS py35_macos) actually completed successfully.

@anmyachev anmyachev force-pushed the string-to-dts-noexc branch from 030ffa0 to 543257a Compare May 4, 2019 15:19
pandas/_libs/tslibs/conversion.pyx Show resolved Hide resolved
pandas/_libs/tslibs/conversion.pyx Show resolved Hide resolved
@jreback jreback merged commit 2e4e0b9 into pandas-dev:master May 7, 2019
@jreback
Copy link
Contributor

jreback commented May 7, 2019

thanks @anmyachev

@TomAugspurger
Copy link
Contributor

@anmyachev
Copy link
Contributor Author

@TomAugspurger I'll see what's the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants