-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flaky test_xcom_map_error_fails_task
test
#33178
Comments
cc: @uranusjr |
Another incarnation of the same problem (SQLite produces different error): https://github.com/apache/airflow/actions/runs/5775967682/job/15654656922
|
Possibly related case: https://github.com/apache/airflow/actions/runs/5778361625/job/15659550497#step:5:8291 This is
|
This is what Alternatively I think doing |
I think I applied the run(session=) in all those tests allready thinking that it might help @uranusjr . But to no avail. This is still happening even after I did (or maybe there are other places I missed ?) |
See alll my attempts from last weekend:
I really reach out to you becasue I already exhausted everything I could come up with and look for help. |
BTW., That's the first time I hear about |
I looked up the error messages an it seems like the errors generally happen when there are multiple concurrently running sessions, and the recommended fix is generally to close one of them. But since this happens in a test, and more importantly flaky, I think the error points to there’s another open session somewhere else that is breaking the test. In this case it’d probably be very difficult to pinpoint the offending test since there’s no log on it. The It’s obviously not a very good fix, but considering the situation I wonder if we have a better choice. I’ll see if there’s a good way to toggle that flag more easily (than to pass it manually everywhere). |
Attempt to solve some of the flakiness we saw recently - this change will run closing all opened sqlalchemy sessions before each test. This should be a little slower than before - especially for the very fast tests, but it provide us with much better isolation between the tests - thus avoiding the flakiness that we observe recently - as documented in apache#33178 Hopefully this one Fixes: apache#33178
I have several hypotheses around that.
I had So I think now that changing everywhere
However I find it pretty strange that those test failures are occuring only in the "mapped" versions of tests. There is - I believe - an easy way to verify this hypothesis. I just created this PR #33190 with "full tests needed" - where I added an I believe this is even something that we can geneally leave "for good" in our However, when we exclude that (let's see if we can see the flaky tests with the "close_all_sessions". then whatever is left must be the reason.
The flaky tests are only mapping related. I run probably few 100s (if not thousands) of test jobs recently in my quest to squash the flaky tests and those errors only occur in tests which are related to mapped task instances. Not even once I saw it in other tests. And I saw about 30-40 of such failuers already. For me this is likely that somewhere deep-down the stack there is one forgotten place where we do not pass session but create a new one and they are used together. We just need to find that place. But let's see after I run the "close_all_sessions" change quite a few times and let's see if we can see it happening there. |
BTW. Yes. I absolutely agree that |
Unfortunately no dice.
For me it looks like REALLY there is a bug somewhere in mapping code that we need to track. |
Not yet closed ... |
Attempt to solve some of the flakiness we saw recently - this change will run closing all opened sqlalchemy sessions before each test. This should be a little slower than before - especially for the very fast tests, but it provide us with much better isolation between the tests - thus avoiding the flakiness that we observe recently - as documented in #33178 Hopefully this one Fixes: #33178 (cherry picked from commit 515179f)
I think my deduction and observations were quite right @uranusjr. There is indeed a session management problem in only mapped operators. The out-of-sync operatiors are with allmost 100% certainty caused by this code in airflow/midels/mapped_operator.py and in particular by not passing down the session to
By addind logs for opening and destroying session I caught one case when it happeened and the only extra sessions created were there in I think we should find out a better solution than this hack - it is going to cause us more troubles than it's worth IMHO. I am happy to brainstorm on that or see whether I can come up with some proposal. But yes - it looks like our tests are actually detecting a real problem with our implementation. |
That makes sense, it’s indeed a dangling session object. In a real deployment this code is always only run in a worker process, so the session is collected when the worker terminates and does not cause a leak. Test are however obviously different. Maybe we can use some monkey patching to deal with this. |
I opened #33307 as a possible solution. How can this be validated? |
It happened frequently enough that just running PR with "full tests needed' label few times should give us the answer |
Though, I think it will not fix it. I think the problem is not that we are leaving the dangling session between tests - the problem is that we are running two run_tasks one after the other in the same test - so we are likely keeping the dangling session from the first run while the second is running. That would basically mean that we will have to close the session in between them. |
The monkey-patching fixture in the PR makes all tasks use the same session so I think it has a chance (since there’s nothing to conflict with) |
BTW. From the tests (and after a bit thinking) it's clear that it's not only in run_task. The sessions were leaking also in www and this is quite a bit worse. |
OR maybe not... The www test was something else so maybe that was jumping to conclusions. |
The `render_template_fields` method of mapped operator needs to use database session object to render mapped fields, but it cannot get the session passed by @provide_session decorator, because it is used in derived classes and we cannot change the signature without impacting those classes. So far it was done by creating new session in mapped_operator, but it has the drawback of creating an extra session while one is already created (remnder_template_fields is always run in the context of task run and it always has a session created already in _run_raw_task). It also causes problems in our tests where two opened database session accessed database at the same time and it cases sqlite exception on concurrent access and mysql error on running operations out of sync - likely when the same object was modified in both sessions. This PR changes the approach - rather than creating a new session in the mapped_operator, we are retrieving the session from one stored by the _run_raw_task. It is done by context manager and adequate protection has been added to make sure that: a) the call is made within the context manager b) context manageer is never initialized twice in the same call stack After this change, resources used by running task will be smaller, and mapped tasks will not always open 2 DB sesions. Fixes: apache#33178
The `render_template_fields` method of mapped operator needs to use database session object to render mapped fields, but it cannot get the session passed by @provide_session decorator, because it is used in derived classes and we cannot change the signature without impacting those classes. So far it was done by creating new session in mapped_operator, but it has the drawback of creating an extra session while one is already created (remnder_template_fields is always run in the context of task run and it always has a session created already in _run_raw_task). It also causes problems in our tests where two opened database session accessed database at the same time and it cases sqlite exception on concurrent access and mysql error on running operations out of sync - likely when the same object was modified in both sessions. This PR changes the approach - rather than creating a new session in the mapped_operator, we are retrieving the session from one stored by the _run_raw_task. It is done by context manager and adequate protection has been added to make sure that: a) the call is made within the context manager b) context manageer is never initialized twice in the same call stack After this change, resources used by running task will be smaller, and mapped tasks will not always open 2 DB sesions. Fixes: #33178
The `render_template_fields` method of mapped operator needs to use database session object to render mapped fields, but it cannot get the session passed by @provide_session decorator, because it is used in derived classes and we cannot change the signature without impacting those classes. So far it was done by creating new session in mapped_operator, but it has the drawback of creating an extra session while one is already created (remnder_template_fields is always run in the context of task run and it always has a session created already in _run_raw_task). It also causes problems in our tests where two opened database session accessed database at the same time and it cases sqlite exception on concurrent access and mysql error on running operations out of sync - likely when the same object was modified in both sessions. This PR changes the approach - rather than creating a new session in the mapped_operator, we are retrieving the session from one stored by the _run_raw_task. It is done by context manager and adequate protection has been added to make sure that: a) the call is made within the context manager b) context manageer is never initialized twice in the same call stack After this change, resources used by running task will be smaller, and mapped tasks will not always open 2 DB sesions. Fixes: #33178 (cherry picked from commit ef85c67)
Body
This flaky test appears recently in our jobs and it seems this is a real problem with our code - after few attempts of fixing it, it still appears in our builds:
Example failures:
Committer
The text was updated successfully, but these errors were encountered: