Remove timeouts from db connection #17424
base: develop2
Conversation
Hi @vient, I think this is quite risky. If some DB operation that normally takes milliseconds cannot complete within 10 seconds, then waiting forever will probably not solve the problem but aggravate it, letting all concurrent jobs acting on the DB stall or interlock each other. The fact that one of the concurrent processes fails because of this timeout is a safeguard that lets the other processes succeed. In our experience this happens more because of excessive concurrency over the DB than because of high CPU usage on the machine. Are you using separate caches for separate jobs and still seeing this issue? Or are you sharing the same cache with high concurrency (see the notes about concurrent caches in https://docs.conan.io/2/knowledge/guidelines.html)?
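For reference, here is a minimal, self-contained sketch (not Conan's actual code) of the safeguard being described, assuming the cache DB is SQLite accessed through Python's `sqlite3` module: a connection with a finite `timeout` keeps retrying for that long and then fails with "database is locked", instead of blocking the process indefinitely.

```python
# Minimal sketch (not Conan's actual code) of how a finite sqlite3 connection
# timeout acts as a safeguard: a writer that cannot obtain the lock within
# `timeout` seconds fails instead of blocking forever.
import sqlite3
import threading
import time

DB = "cache_sketch.db"

def holder():
    # Holds a write transaction (and therefore the DB write lock) for 3 seconds.
    con = sqlite3.connect(DB)
    con.execute("CREATE TABLE IF NOT EXISTS refs (name TEXT)")
    con.execute("BEGIN IMMEDIATE")          # take the write lock immediately
    con.execute("INSERT INTO refs VALUES ('pkg/1.0')")
    time.sleep(3)
    con.commit()
    con.close()

t = threading.Thread(target=holder)
t.start()
time.sleep(0.5)                             # let the holder grab the lock

# Second connection with a short timeout: after ~1 second of retries it
# gives up with "database is locked" instead of waiting indefinitely.
con2 = sqlite3.connect(DB, timeout=1.0)
try:
    con2.execute("INSERT INTO refs VALUES ('other/2.0')")
    con2.commit()
except sqlite3.OperationalError as e:
    print("timed out:", e)                  # the safeguard this comment describes
finally:
    con2.close()
    t.join()
```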
Our case looks like this: Jenkins starts pods which use local conan folders; to speed up conan start-up we pass a shared download cache. Also, I wonder if thread starvation may be occurring somehow - we use around 100 conan packages, and if one thread is waiting for the lock while another thread continuously succeeds in acquiring it and then performs a long DB operation because of disk latency, the first thread may hit the 10 s timeout even if every individual DB operation takes <100 ms (see the toy illustration below).
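A toy illustration of that starvation scenario (hypothetical code, not Conan's): a waiter that polls for a lock, roughly the way SQLite's busy handler does, can exhaust a fixed budget even though every individual hold is well under 100 ms, because another thread keeps re-acquiring the lock between polls.

```python
# Hypothetical illustration of the starvation scenario described above:
# a polling waiter loses every poll to a greedy re-acquirer and hits its
# budget even though each individual hold is short.
import threading
import time

lock = threading.Lock()
stop = False

def greedy_worker():
    # Re-acquires the lock in a tight loop; each "db operation" takes ~50 ms.
    while not stop:
        with lock:
            time.sleep(0.05)
        # Releases for only a fraction of a millisecond before grabbing it again.

def polling_waiter(budget=10.0, poll=0.1):
    # Polls for the lock like a busy handler: try, sleep, try again.
    deadline = time.monotonic() + budget
    while time.monotonic() < deadline:
        if lock.acquire(blocking=False):
            lock.release()
            return True
        time.sleep(poll)        # the lock is usually held again by the next poll
    return False                # timed out, even though each hold was short

t = threading.Thread(target=greedy_worker, daemon=True)
t.start()
ok = polling_waiter(budget=2.0)   # shortened budget to keep the demo quick
stop = True
print("acquired" if ok else "starved: hit the timeout")
```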
Yes, but this principle works if you use retries; Conan does not retry DB operations, though, and exits with an error immediately instead. That approach aggravates the situation even more, at least in my case, where this Conan process is the only one using this particular Conan folder anyway. My current understanding is that these timeouts only help point out that there is something wrong with DB operation latency. There are cases like mine, though, where nothing can be done about it and the latency is expected. If you think the timeouts are indeed helpful, may we discuss increasing them, or making the timeout value configurable instead?
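A hedged sketch of the retry approach mentioned above (not something Conan currently does; the helper name is invented for illustration): re-run the DB operation when SQLite reports a lock error, instead of failing the whole command on the first `OperationalError`.

```python
# Sketch of retrying a DB operation on lock timeouts, rather than exiting
# with an error immediately. Helper name and parameters are hypothetical.
import sqlite3
import time

def with_db_retry(operation, attempts=5, delay=1.0):
    """Run `operation` (a callable doing the DB work), retrying on lock errors."""
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as e:
            if "locked" not in str(e) or attempt == attempts - 1:
                raise               # unrelated error, or retries exhausted
            time.sleep(delay)       # back off and let the other process finish

# Usage: wrap the operation that currently fails with
# "Conan failed to acquire database lock".
def insert_ref():
    con = sqlite3.connect("cache_sketch.db", timeout=10)
    try:
        con.execute("CREATE TABLE IF NOT EXISTS refs (name TEXT)")
        con.execute("INSERT INTO refs VALUES ('pkg/1.0')")
        con.commit()
    finally:
        con.close()

with_db_retry(insert_ref)
```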
Thanks for the detailed feedback.
I am a bit surprised that you are not seeing other kinds of race conditions, because, as commented above, the cache is not yet designed for concurrency. It is true that if you are very careful to install/build different packages in the different pods, and avoid races through the right orchestration and build order, the race conditions might be avoided. If there is resource exhaustion at the machine level, the DB may sometimes not be the actual blocker or bottleneck; but if the DB is hitting timeouts even after 10 seconds, then perhaps those timeouts are not the biggest problem anyway. If it helps, I think we could increase those timeouts a bit more, to 15 or even 20 seconds, but I'd still like to keep some limit; waiting forever without any limit seems to me to have more risks than benefits.
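A hypothetical sketch of that compromise - keep a finite limit, but raise it and make it adjustable. The `CONAN_DB_TIMEOUT` environment variable and the function name are invented for illustration; they are not existing Conan configuration.

```python
# Hypothetical sketch: a larger default DB busy timeout that can be overridden,
# while a finite bound is still enforced. Not existing Conan code or config.
import os
import sqlite3

DEFAULT_DB_TIMEOUT = 20.0   # e.g. raised from 10 s to 20 s

def connect_cache_db(path):
    timeout = float(os.environ.get("CONAN_DB_TIMEOUT", DEFAULT_DB_TIMEOUT))
    # sqlite3 keeps retrying for `timeout` seconds before raising
    # "database is locked", so a finite limit still applies.
    return sqlite3.connect(path, timeout=timeout)
```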
Changelog: Fix: Remove DB connection timeouts
As observed in #13924 (comment) and also by me, the error
`Conan failed to acquire database lock`
may be triggered on machines with many cores and high load - in CI, in other words. All the cases I observed were caused by high load. I don't think it is meaningful to fail because of this: the failure is treated as a flap and the build gets restarted, which does not help with the CPU load. I don't know of any other possible reason for this timeout to trigger, so in my opinion it should not exist at all.
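Conceptually, removing the DB connection timeout means the connection keeps waiting for the lock instead of raising after 10 seconds. A sketch only (not the actual diff in this PR): Python's `sqlite3` has no explicit "wait forever" switch, so one way to approximate the behavior is a very large busy timeout.

```python
# Sketch only, not the actual change in this PR: approximate "no timeout"
# by giving the connection an effectively unbounded busy timeout.
import sqlite3

VERY_LONG_TIMEOUT = 24 * 60 * 60   # one day, effectively "no timeout" for CI jobs

def connect_cache_db(path):
    return sqlite3.connect(path, timeout=VERY_LONG_TIMEOUT)
```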