Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken regex for allowed_deserialization_classes #36147

Merged
merged 12 commits into from
Jan 22, 2024

Conversation

tobiaszorzetto
Copy link
Contributor

@tobiaszorzetto tobiaszorzetto commented Dec 9, 2023

This PR aims to fix the problem involving a broken regex in the function utilized for standardizing the allowed_deserialization_classes, as can be seen in issue #34093.

Before these changes, class paths with . in the middle, such as airflow.example, were not matched with paths it was supposed to match. The old regex would substitute the . with a \\... and so, the path mentioned would become airlow\\..example. This new string would be passed as the pattern to be matched, but what it does is match the word airflow, followed by a literal ., then the second . would act as a wildcard to match any character, followed by the word example. Therefore, the original string airflow.example would not match this pattern, because it would be missing an extra character after the . and before the word example. For example, a class such as airflow.texample would match in a path such as the one given before. This error was not caught before, because the test cases didn't include paths with . in the middle, only in the end.

The solution for this issue, as discussed in the PR comments, was to refactor how the flag works. Now, instead of accepting both glob and regex patterns (and then transforming glob patterns into a regexp, in a complex manner), the allowed_deserialization_classes flag now only accepts glob patterns.

If the user still wants to use regexp to determine their allowed deserialization classes, they can use the new flag allowed_deserialization_classes_regexp. In this way, the code will first try to match the classes with a glob pattern, and if it fails will try to match it with a regexp.

These 2 different flags maintain the flexibility that was the goal for PR #28829, simplifying code and test readability, allowing future developers to understand easier whats going on, not needing to understand a complex nesting of regex transformations.

So that these changes could be tested and the functionality evidenced, the unit tests for the serialization were touched and new unit tests were added.

Closes: #34093

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Copy link

boring-cyborg bot commented Dec 9, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

@tobiaszorzetto tobiaszorzetto changed the title Fix broken regex for Fix broken regex for allowed_deserialization_classes Dec 9, 2023
@eladkal eladkal added this to the Airflow 2.8.1 milestone Dec 10, 2023
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Dec 10, 2023
@potiuk
Copy link
Member

potiuk commented Dec 13, 2023

Although alternatively a allowed_serialization_classes could be the glob version and allowed_serialization_classes_regexp could be the regexp version.

Could that work?

Yes. I think that's better.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following @bolkedebruin's performance comment. I think we could improve caching and perforrmance much more there:

  1. the _get_patterns should returned compiled regexps not strings
  2. _match_glob and _match_regexp could use lru_cache as well.

@tobiaszorzetto
Copy link
Contributor Author

Following @bolkedebruin's performance comment. I think we could improve caching and perforrmance much more there:

  1. the _get_patterns should returned compiled regexps not strings
  2. _match_glob and _match_regexp could use lru_cache as well.

Hey @potiuk! we have implemented these changes :). However, I'm thinking if maybe the lru_caches for the _get_patterns fuctions may be unecessary as the _match fuctions, which are the ones calling the prior mentioned, are already wrapped by the cache functions. Do you think they are still a good precaution? As I don't have much experience with the use of cache maybe I'm missing something...

@potiuk
Copy link
Member

potiuk commented Dec 20, 2023

Hey @potiuk! we have implemented these changes :). However, I'm thinking if maybe the lru_caches for the _get_patterns fuctions may be unecessary as the _match fuctions, which are the ones calling the prior mentioned, are already wrapped by the cache functions. Do you think they are still a good precaution? As I don't have much experience with the use of cache maybe I'm missing something...

These two serve different purpose and will work differently (you can compare it to L1 / L2 cache in processors.

Tier 1) cache: __get_patterns will compile all the regular expression exactly once per interpreter run. Even if there different classes attempted to be serialized, they will return the same compiled list of patterns used over and over -(withou thte need to compile them again).

Tier 2) cache: __match functions will have precisely one True/False boolean stored PER class. When you have a method that has string as parameter, the cache works like a dictionary - you will have one value computed per each combination of parameters. The parameters need to be hashable and positional (which they are)

It means that in this case if you run _match on (say) 'my_package.my_module.MyClass- the bool return value for it will be calculated by matching regexps and stored in the dictionary (and every time the same class is checked, it will be very quickly returned from that dictionary. But if you use it formy_package.my_module.MyClass2` - then MyClass2 will be checked against the same list of compiled patterns. Matching will happen (once per class) but the patterns will be reused as they are cached in Level 1)

Once more comment - there is a slight worry for Tier 2) that there will be a lot of classes checked whether they are serializable and the cache will grow a lot - but I think theis will be small. First of all it is unlikely to have vast amount of classes to be serialized. Secondly those classes should already be in memory (because otherwise they would not be checked) so they should fit memory anyway. Keeping directory of Bools hashed by class name is a very small overhead - the cache dictionary will reuse bool True/False objects (True/False are singletons) and the only overhead will be calculated hash based on class module/name.

@potiuk
Copy link
Member

potiuk commented Dec 20, 2023

Just spelchecking /docs needs to be fixed.

@VictorDominguite
Copy link
Contributor

Just spelchecking /docs needs to be fixed.

Hi @potiuk! First of all, we'd like to thank you for the explanation about the caches, it's all clearer now :) . Second of all, this might sound silly, but we are having a little bit of a hard time trying to figure out where the spelling mistake is... maybe we didn't know how to interpret the logs of the spellchecking test. Do you have any tips on how to find this error?

@potiuk
Copy link
Member

potiuk commented Dec 20, 2023

I rebased your PR. There were some changes to auth-manager docs structure recently and it works in the way that if you have not rebased, such changes in document structure might cause docs building to fail and you need to rebase to the latest main to fix it.

@potiuk
Copy link
Member

potiuk commented Dec 20, 2023

Generally - as a rule - if you see unrelated error, rebase to see if this is not a result of being behind main

@potiuk
Copy link
Member

potiuk commented Dec 21, 2023

@bolkedebruin ?

@tobiaszorzetto
Copy link
Contributor Author

Hey @potiuk ! Sorry for the ask, but I was wondering if there was a way to accelerate the process of closing this PR. Thanks!

@bolkedebruin
Copy link
Contributor

Good point. I got sidetracked on this one. I'll take a look

@bolkedebruin bolkedebruin merged commit 20cb70b into apache:main Jan 22, 2024
55 checks passed
Copy link

boring-cyborg bot commented Jan 22, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

jedcunningham pushed a commit that referenced this pull request Feb 9, 2024
---------

Co-authored-by: Victor Dominguite <[email protected]>
Co-authored-by: Elad Kalif <[email protected]>
(cherry picked from commit 20cb70b)
potiuk pushed a commit that referenced this pull request Feb 13, 2024
---------

Co-authored-by: Victor Dominguite <[email protected]>
Co-authored-by: Elad Kalif <[email protected]>
(cherry picked from commit 20cb70b)
ephraimbuddy pushed a commit that referenced this pull request Feb 22, 2024
---------

Co-authored-by: Victor Dominguite <[email protected]>
Co-authored-by: Elad Kalif <[email protected]>
(cherry picked from commit 20cb70b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The allowed_deserialization_classes regex is broken
6 participants