Running lxc profile list on a system with lots of profiles results in table: context deadline exceeded #13401
Comments
This is probably due to inefficient queries. Let's look into making fewer queries that return the info needed.
How many profiles is "lots" in this case?
Thanks for the quick reply. I actually don't know, as I can't list them; maybe I could get a count from the dump, but I'd say we are in the hundreds if not 1K+.
Try doing a count directly against the database:
Count gives me 616, and just doing the select and dumping the table is pretty quick:
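A minimal sketch of those queries, assuming the documented `lxd sql global` interface and the `profiles` table:

```sh
# Count the profiles directly in the global database
lxd sql global "SELECT COUNT(*) FROM profiles"

# Select the whole table; this returns quickly even with ~616 rows
lxd sql global "SELECT * FROM profiles"
```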
Cool, thanks. I suspect it's doing a separate query to get each profile's config, rather than a single query with multiple profile IDs whose results LXD then splits up.
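The two query shapes, sketched against assumed table and column names from LXD's global schema:

```sh
# Suspected N+1 pattern: one config query per profile ID,
# so latency multiplies with the profile count
lxd sql global "SELECT key, value FROM profiles_config WHERE profile_id = 1"
lxd sql global "SELECT key, value FROM profiles_config WHERE profile_id = 2"
# ...repeated once per profile

# Batched alternative: one query for all profiles,
# grouped by profile_id client-side in LXD
lxd sql global "SELECT profile_id, key, value FROM profiles_config ORDER BY profile_id"
```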
@tomponline This was fixed with a fairly significant db refactor (see #10463 and #10183) that landed in LXD 5.5. This fun one-liner works great with LXD 5.21.1:
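A sketch of that kind of one-liner (the profile names and count here are assumptions):

```sh
# Hypothetical reproduction: create many profiles, then time the listing
for i in $(seq 1 1000); do lxc profile create "bench-$i"; done
time lxc profile list
```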
It doesn't look to me like it's feasible to backport that set of fixes to 5.0.3; I'm guessing it won't be straightforward to come up with a separate patch for 5.0.3 either, although I haven't done much spelunking to confirm that. Let me know what you think the most reasonable course of action is here.
We upgraded our system to 5.21.1 and get this:
Sooo... not fixed, @MggMuggins, or am I missing something?
@webdock-io Hi! Some news on this :D I managed to reproduce the issue by simulating network latency between two local VMs; I suspect this is why @MggMuggins's reproducer did not quite catch the problem.
Mind that, in my reproduction, we only get a timeout when querying from a non-leader LXD cluster member. That makes sense, since all queries on the leader happen locally, with no latency. Could you confirm this also applies in your case? I suspect this is happening because we make a separate database query for each profile to populate the `UsedBy` field.
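A sketch of the latency simulation, assuming `tc netem` and an `eth0` interface on the non-leader member:

```sh
# Add artificial latency toward the other cluster member
# (interface name and delay are assumptions for this sketch)
sudo tc qdisc add dev eth0 root netem delay 100ms

# Reproduce from the non-leader member, then clean up
lxc profile list
sudo tc qdisc del dev eth0 root netem
```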
Thanks for your efforts. However, we've essentially switched almost 100% of our infrastructure to Incus by now, where this issue has been solved for ages (or, about a day after we reported it there). The huge wait for bug fixes in LXD was a primary reason we switched, as it's untenable for production workloads like ours. Anyway, I believe the issue did not stem from network latency, as this was all happening on a single instance and not a cluster. I believe it was solved in Incus by simply refactoring database code to reduce lookups, doing some caching, things of that nature. But I really don't know the details; you'd have to check the Incus source for that :)
Will do! In any case, thanks for your report and for your availability; we will proceed with the fix all the same.
@tomponline This problem actually relates to a timeout when listing profiles in a standalone environment. To fix this, Incus just increased the timeout for transactions, as can be seen here. The other improvements for listing profiles in the same PR have already been in LXD for quite some time. If we don't want to go down that road, I think we can just close this. I plan on following up on the discussed fix to efficiently populate the `UsedBy` field.
I'd like to avoid increasing the timeout to 30s, as that feels like just papering over the issue rather than fixing it. I suggest instead that we first try importing these:
Yeah I agree
Sure, I have seen those, and they contain some caching logic that could be nice to have. But mind that caching alone would not fix this issue, which is probably why they bumped their timeout.
What is the issue then (I mean the one from the OP that is happening on a single node, not the one you described when accessing from a non-leader over a slow network)?
@tomponline #14315 includes some significant improvements to profile listing. I went from 350ms on average to 290ms on my machine, and the gain should grow the worse the latency is between the LXD server and the database it is reading from. This is the best we can do since we can't reproduce this issue, so I think we can close it after merging those improvements. A similar improvement can be made when populating the `UsedBy` field.
From reading the code, other endpoints for listing entities could use this kind of improvement when populating the `UsedBy` field.
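Sketching the batched shape of such a query (join and table names assumed from LXD's global schema; this is a sketch, not the actual patch):

```sh
# One query joining profiles to the instances that use them; LXD would
# group the rows per profile to build each UsedBy list client-side
lxd sql global "
  SELECT profiles.name, instances.name
  FROM profiles
  JOIN instances_profiles ON instances_profiles.profile_id = profiles.id
  JOIN instances ON instances.id = instances_profiles.instance_id
  ORDER BY profiles.name"
```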
Sounds good, thanks!
Ubuntu Jammy
LXD v5.0.3 and LXD 5.21.1
Running `lxc profile list` on a system with lots of profiles results in `table: context deadline exceeded`.
Running `lxd sql global .dump` returns almost immediately and lists all data in the database.
We have a real use case for supporting a lot of profiles in a remote (we allow our customers to build their own).
Adding and deleting individual profiles seems to work, although it's hard to confirm deletion when we can't list them with LXD.
Is there any way to increase the timeout in LXD to allow listing of our (large, and only growing) profile list? We could start hacking away at SQL queries, but I'd much rather be able to do an `lxc profile list`.
(This use case came up because we actually wanted to make sure the list was cleaned up, so any unused profiles were removed.)
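For example, a workaround sketch that queries the names directly (assuming the `profiles` table):

```sh
# List profile names straight from the global database,
# bypassing the API path that times out
lxd sql global "SELECT name FROM profiles"
```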