Fix broken regex for allowed_deserialization_classes #36147
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Yes. I think that's better. |
Following @bolkedebruin's performance comment, I think we could improve caching and performance much more there:
- _get_patterns should return compiled regexps, not strings
- _match_glob and _match_regexp could use lru_cache as well.
Hey @potiuk! We have implemented these changes :). However, I'm wondering whether the lru_caches for the _get_patterns functions may be unnecessary, since the _match functions, which are the ones calling them, are already wrapped by the cache functions. Do you think they are still a good precaution? As I don't have much experience with the use of caches, maybe I'm missing something...
These two serve different purposes and will work differently (you can compare it to the L1 / L2 caches in processors).

Tier 1 cache: _get_patterns will compile all the regular expressions exactly once per interpreter run. Even if different classes are attempted to be serialized, it will return the same compiled list of patterns, used over and over (without the need to compile them again).

Tier 2 cache: the _match functions will have precisely one True/False boolean stored per class. When you have a method that has a string as a parameter, the cache works like a dictionary: you will have one value computed for each combination of parameters. The parameters need to be hashable and positional (which they are). It means that in this case if you run ...

One more comment: there is a slight worry for Tier 2 that a lot of classes will be checked for whether they are serializable and the cache will grow a lot, but I think this will be small. First of all, it is unlikely that a vast number of classes will be serialized. Secondly, those classes should already be in memory (because otherwise they would not be checked), so they should fit in memory anyway. Keeping a dictionary of bools hashed by class name is a very small overhead: the cache dictionary will reuse the True/False objects (True/False are singletons), and the only overhead will be the hash calculated from the class module/name.
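The two tiers described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `ALLOWED_REGEXPS`, `get_regexp_patterns` and `match_regexp` are hypothetical, the patterns are made up, and the standard-library `re` module stands in for whatever regex engine the real code uses.

```python
from __future__ import annotations

import re
from functools import lru_cache

# Hypothetical allow-list; in a real deployment this would come from configuration.
ALLOWED_REGEXPS = (r"airflow\..*", r"my_project\.models\..*")


@lru_cache(maxsize=None)
def get_regexp_patterns() -> tuple[re.Pattern, ...]:
    """Tier 1: compile every pattern once; later calls return the cached tuple."""
    return tuple(re.compile(p) for p in ALLOWED_REGEXPS)


@lru_cache(maxsize=None)
def match_regexp(classname: str) -> bool:
    """Tier 2: store one bool per distinct class name."""
    return any(p.fullmatch(classname) for p in get_regexp_patterns())
```

Calling `match_regexp` twice with the same class name only evaluates the patterns the first time; a call with a new class name misses the tier 2 cache but still reuses the compiled patterns from tier 1. `match_regexp.cache_info()` reports the hits and misses if you want to observe this.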
Just spellchecking /docs needs to be fixed.
Hi @potiuk! First of all, we'd like to thank you for the explanation about the caches, it's all clearer now :) . Second of all, this might sound silly, but we are having a little bit of a hard time trying to figure out where the spelling mistake is... maybe we didn't know how to interpret the logs of the spellchecking test. Do you have any tips on how to find this error? |
Co-authored-by: Elad Kalif <[email protected]>
I rebased your PR. There were some changes to the auth-manager docs structure recently, and if you have not rebased, such changes in document structure might cause docs building to fail; you need to rebase to the latest main to fix it.
Generally, as a rule: if you see an unrelated error, rebase to check whether it is a result of being behind.
Hey @potiuk ! Sorry for the ask, but I was wondering if there was a way to accelerate the process of closing this PR. Thanks! |
Good point. I got sidetracked on this one. I'll take a look |
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
--------- Co-authored-by: Victor Dominguite <[email protected]> Co-authored-by: Elad Kalif <[email protected]> (cherry picked from commit 20cb70b)
This PR aims to fix the problem involving a broken regex in the function used for standardizing the `allowed_deserialization_classes`, as can be seen in issue #34093.

Before these changes, class paths with `.` in the middle, such as `airflow.example`, were not matched with the paths they were supposed to match. The old regex would substitute the `.` with `\\..`, and so the path mentioned would become `airflow\\..example`. This new string would be passed as the pattern to be matched, but what it does is match the word `airflow`, followed by a literal `.`, then the second `.` would act as a wildcard to match any character, followed by the word `example`. Therefore, the original string `airflow.example` would not match this pattern, because it would be missing an extra character after the `.` and before the word `example`. For example, a class such as `airflow.texample` would match a pattern such as the one given before. This error was not caught before because the test cases didn't include paths with `.` in the middle, only at the end.

The solution for this issue, as discussed in the PR comments, was to refactor how the flag works. Instead of accepting both glob and regex patterns (and then transforming glob patterns into a regexp in a complex manner), the `allowed_deserialization_classes` flag now only accepts glob patterns.

If the user still wants to use regexps to determine their allowed deserialization classes, they can use the new flag `allowed_deserialization_classes_regexp`. In this way, the code will first try to match the classes with a glob pattern, and if that fails, it will try to match them with a regexp.

These two flags maintain the flexibility that was the goal of PR #28829 while simplifying code and test readability, so that future developers can more easily understand what is going on without needing to untangle a complex nesting of regex transformations.
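The glob-first, regexp-second order described above could look roughly like this. It is only a sketch under stated assumptions: `is_allowed` and its parameters are hypothetical names, the pattern lists are passed in directly rather than read from the two flags, and the standard-library `fnmatch` and `re` modules are used for illustration.

```python
import re
from fnmatch import fnmatchcase


def is_allowed(classname, glob_patterns, regexp_patterns):
    """Return True if classname matches any glob, else fall back to the regexps."""
    # First attempt: plain glob patterns, where "." is an ordinary character.
    if any(fnmatchcase(classname, g) for g in glob_patterns):
        return True
    # Fallback: regular expressions, matched against the whole class path.
    return any(re.fullmatch(r, classname) is not None for r in regexp_patterns)
```

For example, `is_allowed("airflow.example", ["airflow.*"], [])` is accepted through the glob list alone, while a versioned path such as `my_pkg.v2.Model` can be covered by a regexp like `my_pkg\.v\d+\.Model` in the second list.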
To test these changes and demonstrate the functionality, the existing serialization unit tests were updated and new unit tests were added.
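A regression check for the dot-in-the-middle case might look something like the following. These are not the tests added in this PR; the function names are hypothetical, and `OLD_PATTERN` assumes that the `\\.` in the description is the escaped-string form of a single backslash followed by a dot.

```python
import re
from fnmatch import fnmatchcase

# Assumed regex that the old transformation produced for "airflow.example".
OLD_PATTERN = r"airflow\..example"


def test_old_pattern_misses_exact_path():
    # The literal "." is followed by a wildcard, so one extra character is required.
    assert re.fullmatch(OLD_PATTERN, "airflow.example") is None
    assert re.fullmatch(OLD_PATTERN, "airflow.texample") is not None


def test_glob_handles_dot_in_the_middle():
    # With glob matching, "." has no special meaning.
    assert fnmatchcase("airflow.example", "airflow.example")
    assert not fnmatchcase("airflow.texample", "airflow.example")
    assert fnmatchcase("airflow.example.Foo", "airflow.*")
```

The first test documents the faulty behaviour described above, and the second shows that a path with `.` in the middle matches as expected once the pattern is treated as a glob.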
Closes: #34093