Speed up parsing of really long files #66
Conversation
Nodes spanning a really large number of lines in particular are taking forever:

- The mapping line -> Node takes forever to construct, so replace it with a pseudo-interval-tree-like data structure, which stores only the start/end of nodes and is lazy/caching.
- Don't iterate over all the lines a node spans.

From local profiling with the example in ipython/ipython#13731, this makes the construction of `self._nodes_by_line` negligible in the `__init__` function, versus taking ~50+ percent of the time. That reduces my test example from 2.35s to 1.43s, with an interpreter startup of ~0.25s.
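For illustration, a minimal sketch of what such a lazy interval structure might look like (the name `CachedIntervalList` comes up later in the thread; everything else here is an assumption, not the PR's actual code):

```python
import ast
from bisect import bisect_right


class CachedIntervalList:
    """Hypothetical sketch of a lazy line -> node lookup.

    Rather than materialising a dict entry for every line a node
    spans, store one (start, end, node) interval per node, sorted
    by start position, and resolve lines on demand, caching the
    results. Needs Python >= 3.8 for end_lineno.
    """

    def __init__(self, tree):
        self._intervals = sorted(
            (
                (node.lineno, node.end_lineno, node)
                for node in ast.walk(tree)
                if getattr(node, "end_lineno", None) is not None
            ),
            key=lambda interval: interval[:2],
        )
        self._starts = [start for start, _, _ in self._intervals]
        self._cache = {}  # line -> list of nodes, filled lazily

    def nodes_at_line(self, line):
        if line not in self._cache:
            # Only intervals starting at or before `line` can contain it.
            hi = bisect_right(self._starts, line)
            self._cache[line] = [
                node
                for start, end, node in self._intervals[:hi]
                if end >= line
            ]
        return self._cache[line]
```

A giant node then costs one interval entry instead of one dict entry per line it spans, and lines that are never queried cost nothing.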
splitlines() is quite a bit faster than a list comprehension; especially on long files this makes a difference.
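The exact comprehension the commit replaces isn't shown here, but a synthetic comparison along these lines illustrates the per-line overhead (numbers vary by machine):

```python
import timeit

source = "x = 1\n" * 100_000  # synthetic long file

# str.splitlines() does the whole split in C.
t_split = timeit.timeit(lambda: source.splitlines(), number=20)

# A Python-level list comprehension pays interpreter overhead
# for every line on top of the split itself.
t_comp = timeit.timeit(
    lambda: [line for line in source.split("\n")], number=20
)

print(f"splitlines: {t_split:.3f}s  comprehension: {t_comp:.3f}s")
```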
b5f1325 to de73be5
Another way of looking at the perf: on the r6 example given in the IPython issue, using line profiler, after this patch it's 65% …
I'm leaving this as a draft, as I want to check the impact on different types of files.
I've tested on Pandas's main file, where …
Hi, I like the idea with the CachedIntervalList, but there is something I want to tell you.
I hope this helps you somehow. Let me know if you want to know more.
I believe I was using 3.11; results are similar-ish on 3.8 (2.26s vs 1.96s on the worst-case example), and 1-2% slower on simpler examples. I think I'm interested in faster tracebacks in general. Currently upstream in IPython, on really long files, I'm reverting to something that does not do syntax highlighting (but is configurable). Another thing I realized is that even if the error is at the beginning of the file, with all the dependencies in the stack we end up parsing the whole file. I'm also wondering if an on-disk cache may help for large files, as right now the cache is only on a per-session basis?
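executing's cache is indeed per-session today; as a thought experiment, an on-disk cache keyed by file contents could look roughly like this (the location, names, and pickling strategy are all assumptions, and real code would need version invalidation and concurrency handling):

```python
import hashlib
import os
import pickle
import tempfile

# Hypothetical cache location.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "executing-cache")


def disk_cached(compute):
    """Cache compute(text) results across sessions, keyed by content hash."""
    def wrapper(text):
        os.makedirs(CACHE_DIR, exist_ok=True)
        key = hashlib.sha256(text.encode()).hexdigest()
        cache_file = os.path.join(CACHE_DIR, key + ".pickle")
        if os.path.exists(cache_file):
            with open(cache_file, "rb") as f:
                return pickle.load(f)
        result = compute(text)
        with open(cache_file, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper
```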
I think I understand the problem. Your use case is that you want to map only some nodes in a (maybe large) file. Generating cache information for nodes which might never be mapped is a waste of time in this case. I think (currently just an idea) that it might be possible to remove the _nodes_by_line structure completely in 3.11. We could use the source position of the bytecode instruction and do some kind of binary search in the AST tree of the file to look up the node. This would give better performance for the first node lookup but maybe slightly worse performance for the next lookups in the same file. @alexmojaki what do you think about it?
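A rough sketch of the descent being proposed, with linear scans over the child lists where the real idea would binary-search them (on 3.11 the position would come from the instruction's `positions` attribute; this is not 15r10nk's committed code):

```python
import ast


def find_node(tree, lineno, col_offset):
    """Find the deepest node whose source span contains (lineno, col_offset).

    Needs Python >= 3.8 for end_lineno/end_col_offset.
    """
    def contains(node):
        end_lineno = getattr(node, "end_lineno", None)
        if end_lineno is None:
            return False
        start = (node.lineno, node.col_offset)
        end = (end_lineno, node.end_col_offset)
        return start <= (lineno, col_offset) <= end

    result = None
    node = tree
    while True:
        for child in ast.iter_child_nodes(node):
            if contains(child):
                result = node = child
                break  # keep descending into this child
        else:
            return result  # no child contains the position; stop here
```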
Hi @Carreau, maybe you want to test this out, or you can tell me how I can reproduce your problem. Please keep in mind that this will currently only work in 3.11; _nodes_by_line is still needed in other implementations.
Hey, sorry for the silence, life has been busy lately. Thanks @Carreau for the contribution. I like this idea, but can you demonstrate a benefit in a real-life situation? The reproduction in ipython/ipython#13731 is really artificial and contrived: it has very few nodes and a single giant node. On the other hand, you're saying that this is slightly slower for a real large file (i.e. pandas). Besides that, the example has a million lines, which really exaggerates problems. I'd also like to see the code you're using to profile and get those percentages; I think that'd be useful in general. I tried just now and couldn't see that … I think the only thing that makes …
@15r10nk
I know. The version which I just committed should support this. There is still no binary search in the child lists, but the concept works (for Python >= 3.8, because end_lineno is required). I think the runtime of the current test suite of executing should not be used as a metric to optimize executing (that was something which I did sometimes ... yay, faster tests 🙂). Analysing the performance of pre-computations like this is quite a challenge, because the input can vary a lot: big files, big expressions, big functions. I got some similar problems during my work on #64. However, I think what we need are some characteristic benchmarks which we could use as a goal for optimization.
I'm using https://pypi.org/project/line-profiler/
With the … You do have to add …
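The actual profiling script from the thread isn't shown; a typical line-profiler setup for this kind of measurement might look like the following (`Source.for_filename` is executing's cached entry point; the target file name is made up):

```python
from line_profiler import LineProfiler

import executing

profiler = LineProfiler()
# Collect line-by-line timings for the __init__ under discussion.
profiler.add_function(executing.Source.__init__)


@profiler
def analyse(path):
    # for_filename builds (and caches) the per-file analysis,
    # including the line -> node mapping this PR touches.
    return executing.Source.for_filename(path)


analyse("big_module.py")  # hypothetical large file
profiler.print_stats()
```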
Were you able to confirm that this branch improves things on the mysterious client file?
No, I'll see if I can.