Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace tokenize with ast to parse the source code #144

Closed
wants to merge 4 commits into from

Conversation

arnaud-ma
Copy link
Contributor

@arnaud-ma arnaud-ma commented Jul 11, 2024

Instead of parsing by hand token by token, ast returns a python object that just needs to be read. We can still make mistakes when reading this object, but no more inevitable edge cases from parsing the source code.

Closes #97
Closes #130

@@ -219,56 +226,154 @@ def source_line_to_tokens(obj: object) -> Dict[int, List[Dict[str, Union[str, in
return line_to_tokens


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three functions above are public but are no longer used in the code at all. Should they just stay there?

@swansonk14
Copy link
Owner

Hi @arnaud-ma,

Thank you so much for your PR! We really like your ast approach to resolving the multiline assignment issue.

After studying this PR, we realized that some of the ideas in this PR could be applied with fewer changes. This is implemented in: #148.

Best,
Jesse and Kyle

@swansonk14 swansonk14 closed this Aug 24, 2024
@arnaud-ma
Copy link
Contributor Author

arnaud-ma commented Aug 25, 2024

Hi, thanks for the update! The idea behind this PR was to completely replace the use of tokenize with ast. This solves all possible bugs with code parsing. I also think (to test) that there should be a speed up in terms of performance. Moreover in the new PR, the source code is read twice (with inspect.getsource): once for tokenize and once for ast. And it is this reading that is (by far) the most expensive in performance. But I understand that what i proposed was too much change at once

@martinjm97
Copy link
Collaborator

Oh, very interesting! I'd be interested in knowing whether AST resolves: #45. Either you/(Kyle and I) can compare the performance of code using only ast to the current code. If it resolves performance issues, then I think that probably justifies a larger code change. Sorry we missed this aspect of your PR. Thanks again for the contribution.

@arnaud-ma
Copy link
Contributor Author

arnaud-ma commented Aug 27, 2024

Given this code where NewTap is from my PR and Tap is from before #148:

import timeit

from tap import Tap
from tap.tap import NewTap
from tap.utils import _old_get_class_variables, get_class_variables


class SimpleArgumentParser(Tap):
    name: str  # Your name
    language: str = "Python"  # Programming language
    package: str = "Tap"  # Package name
    stars: int  # Number of stars
    max_stars: int = 5  # Maximum stars


class NewSimpleArgumentParser(NewTap):
    name: str  # Your name
    language: str = "Python"  # Programming language
    package: str = "Tap"  # Package name
    stars: int  # Number of stars
    max_stars: int = 5  # Maximum stars


if __name__ == "__main__":
    number = 1000
    number_cls = 100

    t_new_cls = timeit.timeit(
        "NewSimpleArgumentParser()",
        globals=globals(),
        number=number_cls,
    )

    t_old_cls = timeit.timeit(
        "SimpleArgumentParser()",
        globals=globals(),
        number=number_cls,
    )

    t_new = timeit.timeit(
        "get_class_variables(SimpleArgumentParser)",
        globals=globals(),
        number=number,
    )
    t_old = timeit.timeit(
        "_old_get_class_variables(SimpleArgumentParser)",
        globals=globals(),
        number=number,
    )

    increase = (t_old - t_new) / t_old
    increase_cls = (t_old_cls - t_new_cls) / t_old_cls
    print(f"time new get_class_variables: {t_new / number * 1000:.4f} ms per call")
    print(f"time old get_class_variables: {t_old / number_cls * 1000:.4f} ms per call")
    print(f"new is {increase * 100:.2f}% faster")

    print(f"time new class init: {t_new_cls / number * 1000:.4f} ms per call")
    print(f"time old class init: {t_old_cls / number_cls * 1000:.4f} ms per call")
    print(f"new is {increase_cls * 100:.2f}% faster")

I get:

time new get_class_variables: 0.3262 ms per call
time old get_class_variables: 0.5686 ms per call
new is 42.63% faster
time new class init: 17.0593 ms per call
time old class init: 21.7422 ms per call
new is 21.54% faster

It's better, but it won't solve #45 since it is inspect.getsource that takes the majority of the remaining time

@arnaud-ma arnaud-ma deleted the use-ast branch August 27, 2024 22:19
@arnaud-ma
Copy link
Contributor Author

I made a specific PR (#149) related to performance, you can ignore this one

@martinjm97
Copy link
Collaborator

Perfect! Thank you for the update! Kyle and I will discuss it next time we meet. I approved running the tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants