
Speed up parsing of really long files #66

Closed
wants to merge 3 commits

Conversation

@Carreau commented Feb 13, 2023

Nodes spanning a really large number of lines take forever to process. In particular:

  • The line -> Node mapping takes forever to construct, so replace it with a pseudo-interval-tree-like data structure that stores only the start/end lines of nodes and is lazy/caching (see the sketch below).
  • Don't iterate over all the lines a node spans.

From local profiling with the example in
ipython/ipython#13731,
this makes the construction of self._nodes_by_line negligible in the `__init__` function, versus taking ~50+ percent of the time before.

That reduces my test example from 2.35s to 1.43s, with an interpreter startup of ~0.25s.
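For illustration, a minimal sketch of the kind of lazy start/end structure described above; the class name and API here are made up, and the PR's actual CachedIntervalList differs in the details:

```python
import ast

class LazyNodeIntervals:
    """Toy sketch of the lazy start/end structure described above.
    The name and API are illustrative, not the PR's actual code."""

    def __init__(self, tree):
        # O(number of nodes), regardless of how many lines each node spans.
        # node.end_lineno requires Python >= 3.8.
        self._intervals = [
            (node.lineno, node.end_lineno, node)
            for node in ast.walk(tree)
            if hasattr(node, "lineno")
        ]
        self._by_line = {}  # lazily filled cache: lineno -> [nodes]

    def nodes_at_line(self, lineno):
        # Materialize (and cache) the node list for a line on first request.
        if lineno not in self._by_line:
            self._by_line[lineno] = [
                node for start, end, node in self._intervals
                if start <= lineno <= end
            ]
        return self._by_line[lineno]
```

Construction cost is proportional to the number of nodes rather than to the number of (node, line) pairs, and a line's node list is only built the first time that line is looked up.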

splitlines() is quite a bit faster than a list comprehension; especially on long files, this makes a difference.
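A quick way to check that claim on a synthetic long file (the exact comprehension the previous code used may differ from the one timed here):

```python
import timeit

text = "\n".join("x = %d" % i for i in range(1_000_000))

# str.splitlines runs in C and also handles \r\n and other terminators;
# splitting and then looping in Python pays per-line interpreter overhead.
print(timeit.timeit(text.splitlines, number=10))
print(timeit.timeit(lambda: [line for line in text.split("\n")], number=10))
```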
@Carreau (Author) commented Feb 13, 2023

Another way of looking at the performance is that on the r6 example given in the IPython issue, using line-profiler, Source.__init__ is 20% self.tree = ast.parse(ast_text, filename=filename) and 55% self._nodes_by_line[lineno].append(node).

After this patch, it's 65% self.tree = ast.parse(ast_text, filename=filename) and 15% self.lines = text.splitlines().

Carreau marked this pull request as draft February 13, 2023 13:22
@Carreau (Author) commented Feb 13, 2023

I'm leaving this as a draft, as I want to check the impact on different types of files.

@Carreau (Author) commented Feb 13, 2023

I've tested on pandas's main file where DataFrame is defined, which is ~12k lines. It is maybe 1% slower than the master branch. I think this is expected, as the logic for adding a node that spans only a few lines is a bit slower in this implementation than on the master branch.

@15r10nk (Collaborator) commented Feb 13, 2023

Hi, I like the idea with the CachedIntervalList, but there are some things I want to tell you.

  1. Which Python version are you using? 3.11 uses a new algorithm to find the AST nodes. The runtime there might be different from the runtime in <3.11.
  2. The algorithm in 3.11 can be improved by using different caching. It currently uses _nodes_by_line and searches for the node in the list of nodes on that line. A better way would be a data structure which maps something like (line, offset, end_line, end_offset) to the node (the real implementation would be a bit more complicated; see the sketch after this list). I don't know if this would solve the runtime-complexity issue you are describing, but it might change the way the CachedIntervalList has to be implemented.
  3. I'm currently working on a new NodeFinder for 3.8, 3.9 and 3.10. The goal there is to be able to map almost every bytecode instruction to an AST node, to provide results similar to the 3.11 algorithm. And this NodeFinder has (you might have guessed it) yet another runtime behavior.
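For point 2, a toy version of such a position-keyed mapping might look like this (names are illustrative; as said above, the real implementation would be more complicated):

```python
import ast

def position_index(tree):
    """Toy map: (lineno, col_offset, end_lineno, end_col_offset) -> [nodes]."""
    index = {}
    for node in ast.walk(tree):
        if hasattr(node, "lineno"):
            key = (node.lineno, node.col_offset,
                   node.end_lineno, node.end_col_offset)
            # Several nodes can share one span, e.g. ast.Expr and its value.
            index.setdefault(key, []).append(node)
    return index

tree = ast.parse("a = b + c")
print(position_index(tree)[(1, 4, 1, 9)])  # [<ast.BinOp ...>], i.e. `b + c`
```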

I hope this helps you somehow. Let me know if you want to know more.

@Carreau (Author) commented Feb 14, 2023

> 1. Which Python version are you using?

I believe I was using 3.11; results are similar-ish on 3.8 (2.26s vs 1.96s on the worst-case example), and 1-2% slower on simpler examples.

I think I'm interested in faster tracebacks in general. Currently, upstream in IPython, on really long files I'm reverting to something that does not do syntax highlighting (but it is configurable).

Another thing I realized is that even if the error is at the beginning of the file, with all the dependencies in the stack we end up parsing the whole file.

I'm also wondering if an on-disk cache might help for large files, as right now the cache is only on a per-session basis?
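In case it helps the discussion, a toy sketch of what such a per-file on-disk cache could look like; the cache location, the content-hash key, and the choice to pickle the tree are all assumptions for illustration:

```python
import ast
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "ast-cache"  # hypothetical location

def cached_parse(filename):
    """Parse a file, reusing a pickled AST keyed by the file's content hash.
    A real version would also key on the Python version and handle
    invalidation, locking, and unpickling failures."""
    source = Path(filename).read_bytes()
    key = hashlib.sha256(source).hexdigest()
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / (key + ".pkl")
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())
    tree = ast.parse(source, filename=str(filename))
    cache_file.write_bytes(pickle.dumps(tree))
    return tree
```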

@15r10nk (Collaborator) commented Feb 14, 2023

I think I understand the problem. Your use case is that you want to map only some nodes in a (maybe large) file. Generating cache information for nodes which might never be mapped is a waste of time in this case.

I think (currently just an idea) that it might be possible to remove the _nodes_by_line structure completely in 3.11. We could use the source position of the bytecode instruction and do some kind of binary search in the AST of the file to look up the node.

This would give better performance for the first node lookup, but maybe slightly worse for subsequent lookups in the same file.
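A rough sketch of that descent, assuming the instruction's position is available (in 3.11, e.g., instr.positions from dis.get_instructions); this version scans each node's children linearly, and a sorted child list would allow a true binary search:

```python
import ast

def node_at_position(tree, lineno, col, end_lineno, end_col):
    """Descend to the innermost node whose span contains the target span."""
    node = tree
    while True:
        for child in ast.iter_child_nodes(node):
            if not hasattr(child, "lineno"):
                continue
            starts_before = (child.lineno, child.col_offset) <= (lineno, col)
            ends_after = (end_lineno, end_col) <= (child.end_lineno,
                                                   child.end_col_offset)
            if starts_before and ends_after:
                node = child  # keep descending into the containing child
                break
        else:
            return node  # no child contains the span: innermost match
```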

@alexmojaki what do you think about it?

@15r10nk (Collaborator) commented Feb 15, 2023

Hi @Carreau,
I removed _nodes_by_line almost completely; there are still some problems in the tests, but the approach might work.

#67

Maybe you want to test this out, or you can tell me how to reproduce your problem.

Please keep in mind that this will currently only work on 3.11. _nodes_by_line is still needed for the other implementations.

@alexmojaki (Owner) commented
Hey, sorry for the silence, life has been busy lately. Thanks @Carreau for the contribution.

I like this idea, but can you demonstrate a benefit in a real-life situation? The reproduction in ipython/ipython#13731 is really artificial and contrived: it has very few nodes and a single giant node. On the other hand, you're saying that this is slightly slower for a real large file (i.e. pandas). Besides that, the example has a million lines, which really exaggerates problems.

I'd also like to see the code you're using to profile and get those percentages; I think that'd be useful in general. I tried just now and couldn't see that _nodes_by_line was the problem.

I think the only thing that makes node_linenos returning multiple lines actually necessary is multiline attributes and their parent nodes (see alexmojaki/stack_data#13 (comment)). Narrowing down to those exact expressions probably isn't worth it, but simply excluding ast.Constant might solve a big chunk of the problem in a very simple way (sketched below). That'd take care of docstrings, which might be the biggest problem of this type.
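For concreteness, one possible shape of that exclusion; this is only an illustration of the idea, not executing's actual node_linenos:

```python
import ast

def node_linenos_sketch(node):
    """Yield the lines a node is registered under, but register constants
    only under their first line, so a giant docstring or literal doesn't
    contribute every line it spans."""
    if not hasattr(node, "lineno"):
        return
    if isinstance(node, ast.Constant):
        yield node.lineno
    else:
        end_lineno = getattr(node, "end_lineno", None) or node.lineno
        yield from range(node.lineno, end_lineno + 1)
```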

@15r10nk (Collaborator) commented Feb 19, 2023

> @15r10nk statements_at_line is part of the public API and would still need to work even if not used internally.

I know. The version which I just committed should support this.

There is still no binary search in the child lists, but the concept works (for Python >= 3.8, because end_lineno is required).

I think the runtime of the current test suite of executing should not be used as a metric to optimize executing (that is something I did sometimes ... yay, faster tests 🙂).
The generic checks of the sample_files map (almost) every instruction, which is not the way executing is used most of the time (please tell me if I'm wrong here). Usually only some instructions are mapped; precomputation for the other instructions is a waste of compute in this case.

Pre-computations like _nodes_by_line have an impact on the first lookup, which is usually what end-user code experiences, but no big impact if we map every instruction in the file. That impact can be huge if it triggers an O(n*m) case (for a huge expression over multiple lines).

Analysing the performance of executing is quite a challenge, because the input can vary a lot: big files, big expressions, big functions.

I ran into some similar problems during my work on #64. However, I think what we need are some characteristic benchmarks which we could use as a goal for optimization.

@Carreau (Author) commented Feb 20, 2023

> I'd also like to see the code you're using to profile and get those percentages, I think that'd be useful in general.

I'm using https://pypi.org/project/line-profiler/

$ python -m kernprof -l -v  IPython/__main__.py -c "from r6 import f; f()"

With the r6.py file given in the upstream IPython issue. Unfortunately, the actual slowdown is in a client file that I do not have access to, so I am limited in trying to reproduce it. I'm also fairly certain the client won't be on 3.11 in the near future.

You do have to add @profile to the functions you want profiled.

@alexmojaki (Owner) commented

Were you able to confirm that this branch improves things on the mysterious client file?

@Carreau (Author) commented Feb 20, 2023

> Were you able to confirm that this branch improves things on the mysterious client file?

No, I'll see if I can.

Carreau closed this Sep 16, 2024