
Remove timeouts from db connection #17424

Open
wants to merge 1 commit into base: develop2

Conversation

vient
Contributor

@vient vient commented Dec 5, 2024

Changelog: Fix: Remove DB connection timeouts

As observed in #13924 (comment) and also by me, Conan's "failed to acquire database lock" error may be triggered on machines with many cores and high load, i.e. in CI. All the cases I observed were caused by high load. I don't think it is meaningful to fail because of this: the failure is treated as a flap and the build gets restarted, which does not help with CPU load. I don't know of any other reason for this timeout to trigger, so in my opinion it should not exist at all.
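For context, the timeout under discussion is the kind of lock-wait timeout that Python's sqlite3 module exposes on connect(). The sketch below is only an illustration of that mechanism, not Conan's actual code; the file name and the 10-second value are assumptions based on the behaviour described above.

```python
import sqlite3

# Hypothetical illustration of the timeout in question (not Conan's actual code).
# sqlite3.connect(timeout=...) controls how long a connection waits for a locked
# database before raising sqlite3.OperationalError ("database is locked").
DB_PATH = "cache.sqlite3"  # placeholder path, not Conan's real cache DB

# Behaviour described in this PR: give up after roughly 10 seconds under contention.
conn_with_timeout = sqlite3.connect(DB_PATH, timeout=10)

# What "removing the timeout" effectively means: wait practically forever for the lock.
conn_without_timeout = sqlite3.connect(DB_PATH, timeout=1e9)

conn_with_timeout.close()
conn_without_timeout.close()
```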

  • Refer to the issue that supports this Pull Request.
  • If the issue has missing info, explain the purpose/use case/pain/need that covers this Pull Request.
  • I've read the Contributing guide.
  • I've followed the PEP8 style guides for Python code.
  • I've opened another PR in the Conan docs repo to the develop branch, documenting this one.

@memsharded memsharded self-assigned this Dec 6, 2024
@memsharded
Member

Hi @vient

I think this is quite risky. If some operation on the DB that normally takes milliseconds cannot complete in 10 seconds, then waiting forever will probably not solve the problem but will aggravate the whole situation, causing all concurrent jobs acting on the DB to stall or interlock with each other. The fact that one of the concurrent processes fails because of this timeout is a safeguard that lets the other processes succeed.

In our experience this happens more because of excessive concurrency over the DB than because of high CPU usage on the machine. Are you using separate caches for separate jobs and still seeing this issue? Or are you sharing the same cache with high concurrency? (See the notes about concurrent caches in https://docs.conan.io/2/knowledge/guidelines.html.)

@vient
Contributor Author

vient commented Dec 7, 2024

Our case looks like this: Jenkins starts pods which use local Conan folders. To speed up Conan startup we pass a shared download cache and set parallel=4, that's it. The server has 256 logical cores, and we see spontaneous "failed to acquire database lock" errors when the server load approaches its capacity. I'll also note that our servers seem to be weak on the disk side; I infer from that that an INSERT into the DB may take a long time when the server is loaded, because there is a big disk queue.

I also wonder whether some kind of thread starvation may be occurring: we use around 100 Conan packages, and if one thread is waiting for the lock while another thread continuously succeeds in acquiring it and then performs a long DB operation because of disk latency, the first thread may hit the 10s timeout even if every individual DB operation takes <100ms.
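A rough back-of-the-envelope sketch of that starvation scenario; all numbers are assumptions for illustration, not measurements from Conan or this CI setup.

```python
# Back-of-the-envelope sketch of the starvation scenario described above.
# All numbers are assumed for illustration, not measured.
timeout_s = 10.0          # the DB lock timeout being discussed
write_latency_s = 0.1     # one INSERT under a deep disk queue (still <100 ms)
competing_writes = 150    # writes from other threads/packages that win the lock first

worst_case_wait_s = competing_writes * write_latency_s
print(f"worst-case wait: {worst_case_wait_s:.1f}s vs timeout {timeout_s:.1f}s")
# -> 15.0s vs 10.0s: the waiting thread can time out even though every
#    individual DB operation stayed under 100 ms.
```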

If some operation on the DB that normally takes milliseconds cannot complete in 10 seconds, then waiting forever will probably not solve the problem but will aggravate the whole situation, causing all concurrent jobs acting on the DB to stall or interlock with each other.

Yes, but this principle works if you use retries. Conan does not retry DB operations, though; it exits with an error immediately instead. That approach aggravates the situation even more, at least in my case, where this Conan process is the only one using this particular Conan folder anyway.
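A hedged sketch of the kind of retry being alluded to here; this is not something Conan currently does according to this discussion, and the function, table, and backoff values are made up for illustration.

```python
import random
import sqlite3
import time

def execute_with_retry(conn, sql, params=(), attempts=5, base_delay=0.2):
    """Illustrative retry wrapper for 'database is locked' errors (not Conan code)."""
    for attempt in range(attempts):
        try:
            with conn:  # commits on success, rolls back on exception
                return conn.execute(sql, params)
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc) or attempt == attempts - 1:
                raise
            # Exponential backoff with a little jitter before trying again.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Example use (hypothetical table and data):
# conn = sqlite3.connect("cache.sqlite3", timeout=10)
# execute_with_retry(conn, "INSERT INTO packages(ref) VALUES (?)", ("pkg/1.0",))
```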


My current understanding is that these timeouts only help point out that something is wrong with DB operation latency. There are cases like mine, though, where nothing can be done about it and it is expected. If you think the timeouts are indeed helpful, could we discuss increasing them or adding an option for the timeout value instead?
There is also a WAL mode, but I don't know much about it.
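For reference, SQLite's WAL mode is enabled with a single pragma; whether it would actually help Conan's cache DB is an open question in this thread, and the path below is a placeholder.

```python
import sqlite3

conn = sqlite3.connect("cache.sqlite3")  # placeholder path, not Conan's real cache DB

# WAL lets readers proceed while a writer is active, which can reduce lock
# contention; writers still serialize against each other.
conn.execute("PRAGMA journal_mode=WAL;")

# busy_timeout (milliseconds) is SQLite's own lock-wait knob, equivalent to
# the timeout= argument of sqlite3.connect().
conn.execute("PRAGMA busy_timeout=10000;")
conn.close()
```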

@memsharded
Member

Thanks for the detailed feedback.

To speed up Conan startup we pass a shared download cache and set parallel=4

I am a bit surprised that you are not seeing other kinds of race conditions because, as commented above, the cache is not yet designed for concurrency. It is true that if you are very careful and make sure to install/build different packages in the different pods, avoiding races through the right orchestration and order of builds, the race conditions might be avoided.

If there is resource exhaustion at the machine level, the DB might sometimes not be the actual blocker or bottleneck; and if the DB is hitting timeouts even after 10 seconds, then perhaps those timeouts are not the biggest problem anyway?

If it helps, I think we might be able to increase those timeouts a bit more, to 15 or even 20 seconds, but I'd still like to keep some limit; waiting forever without any limit seems to me to have more risks than benefits.
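If the configurable-timeout route were taken, the shape of it could be as simple as the following sketch; the environment variable CONAN_DB_LOCK_TIMEOUT and the helper below are hypothetical and do not exist in Conan today.

```python
import os
import sqlite3

# Hypothetical configuration hook for the timeout value; neither the
# environment variable nor the default below is an existing Conan option.
DB_LOCK_TIMEOUT_S = float(os.environ.get("CONAN_DB_LOCK_TIMEOUT", "10"))

def open_cache_db(path):
    # A zero or negative value could be interpreted as "no effective limit".
    timeout = DB_LOCK_TIMEOUT_S if DB_LOCK_TIMEOUT_S > 0 else 1e9
    return sqlite3.connect(path, timeout=timeout)
```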
