Running lxc profile list on a system with lots of profiles results in table: context deadline exceeded #13401

Closed
webdock-io opened this issue Apr 29, 2024 · 17 comments

@webdock-io

webdock-io commented Apr 29, 2024

Ubuntu Jammy
LXD v5.0.3 and LXD 5.21.1

Running lxc profile list on a system with lots of profiles results in the following:

root@lxdremote:~# lxc profile list
Error: Failed to fetch from "profile_device_config" table: Failed to fetch from "profile_device_config" table: context deadline exceeded

Running lxd sql global .dump returns almost immediately and lists all data in the database

We have a real use case for supporting a lot of profiles in a remote (we allow our customers to build their own)

Adding and deleting individual profiles seems to work, although it's hard to confirm deletion when we can't list them with lxd.

Is there any way to increase the timeout in LXD to allow listing our (large, and only going to grow larger) profile list? We could start hacking away at SQL queries, but I'd much rather be able to do an lxc profile list

(this use case came up as we actually wanted to make sure the list was cleaned up so any unused profiles were removed)

@tomponline
Member

This is probably due to inefficient queries. Let's look into making fewer queries that return the info needed.

@tomponline
Member

How many profiles is "lots" in this case?

@webdock-io
Author

Thanks for the quick reply.

I actually don't know as I can't list them - maybe I could get a count from the dump - but I'd say we are in the hundreds if not 1K+

@tomponline
Member

tomponline commented Apr 29, 2024

Try doing lxd sql global 'select * from profiles' or lxd sql global 'select count(*) from profiles'

@webdock-io
Author

Count gives me 616

just doing the select and dumping the table is pretty quick:

time lxd sql global 'select * from profiles'
... stuff
real    0m0.166s
user    0m0.044s
sys     0m0.064s

@tomponline
Member

> Count gives me 616
>
> just doing the select and dumping the table is pretty quick:
>
> time lxd sql global 'select * from profiles'
> ... stuff
> real    0m0.166s
> user    0m0.044s
> sys     0m0.064s

Cool thanks.

Suspect it's doing a separate query to get each profile's config, rather than a single query with multiple profile IDs and then separating the results in LXD.
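
For illustration only, here is a rough sketch of the difference being suspected (not LXD's actual code): instead of running one config query per profile, a single query fetches the rows for all the wanted profile IDs and splits them per profile in memory. The configForProfiles helper and the profiles_config table/column names are assumptions made for the example.

package example

import (
	"context"
	"database/sql"
	"strings"
)

// configForProfiles fetches the config for many profiles in one query with an
// IN clause and splits the rows per profile in memory, instead of issuing one
// query per profile ("N+1" pattern).
func configForProfiles(ctx context.Context, tx *sql.Tx, profileIDs []int) (map[int]map[string]string, error) {
	// Build the "?, ?, ?" placeholder list and the matching argument list.
	placeholders := make([]string, 0, len(profileIDs))
	args := make([]any, 0, len(profileIDs))
	for _, id := range profileIDs {
		placeholders = append(placeholders, "?")
		args = append(args, id)
	}

	query := "SELECT profile_id, key, value FROM profiles_config WHERE profile_id IN (" + strings.Join(placeholders, ", ") + ")"

	rows, err := tx.QueryContext(ctx, query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	// Separate the combined result set back into one config map per profile.
	configs := make(map[int]map[string]string)
	for rows.Next() {
		var id int
		var key, value string
		err := rows.Scan(&id, &key, &value)
		if err != nil {
			return nil, err
		}

		if configs[id] == nil {
			configs[id] = make(map[string]string)
		}
		configs[id][key] = value
	}

	return configs, rows.Err()
}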

@tomponline tomponline added the Bug Confirmed to be a bug label Apr 29, 2024
@tomponline tomponline added this to the lxd-6.1 milestone Apr 29, 2024
@MggMuggins MggMuggins self-assigned this May 15, 2024
@MggMuggins
Contributor

@tomponline This was fixed with a fairly significant db refactor (see #10463 and #10183) that landed in LXD 5.5. This fun one-liner works great with LXD 5.21.1:

for p in p{1..1000}; do lxc profile create $p & done && lxc profile ls

It doesn't look to me like it's feasible to backport that set of fixes to 5.0.3; I'm guessing it won't be straightforward to come up with a separate patch for 5.0.3 either, although I haven't done much spelunking to confirm that. Let me know what you think the most reasonable course of action is here.

@webdock-io
Author

We upgraded our system to 5.21.1 and get this:

root@lxdremote:~# snap refresh lxd --channel=latest/stable
2024-06-04T07:46:48Z INFO Waiting for "snap.lxd.daemon.service" to stop.
lxd 5.21.1-2d13beb from Canonical✓ refreshed
root@lxdremote:~# nano /etc/hosts
root@lxdremote:~# lxc profile list
Error: Failed to fetch from "profile_device_config" table: Failed to fetch from "profile_device_config" table: context deadline exceeded
root@lxdremote:~# lxc --version
5.21.1 LTS

Sooo... Not fixed @MggMuggins or am I missing something?

@tomponline tomponline modified the milestones: lxd-6.1, lxd-6.2 Jun 19, 2024
@MggMuggins MggMuggins removed their assignment Jun 28, 2024
@hamistao
Contributor

hamistao commented Oct 16, 2024

@webdock-io Hi! Some news on this :D

I managed to reproduce the issue by simulating network latency between two local VMs; I suspect this is why @MggMuggins's reproducer did not quite catch the problem.

tc qdisc replace dev enp5s0 root netem delay 100ms # simulate latency of 100ms on enp5s0 interface
for p in p{1..200}; do lxc profile create $p & done && lxc profile ls # this now results in a timeout

Mind that, in my reproduction, we only get a timeout when querying from a non-leader LXD cluster member. That makes sense, since all queries on the leader happen locally, so no latency. Could you confirm this also applies to your case?

I suspect this is happening because we make a separate database query for each profile to populate the usedBy field. I am working on a fix for this now and, if my theory is correct, we should have it merged and working soon. The fix will then be backported to 5.21 in a few more days. Cheers :)

@webdock-io
Author

Thanks for your efforts. However, we've essentially switched all of our infrastructure almost 100% to Incus by now, where this issue has been solved for ages (or about a day after we reported it there).

This huge wait for bug fixes in LXD was a primary reason we switched, as it's untenable for production workloads like ours.

Anyway, I believe the issue did not stem from network latency, as this was all happening on a single instance and not a cluster. I believe it was solved in Incus by simply refactoring database code to reduce lookups, doing some caching, things of that nature. But I really don't know the details, you'd have to check the Incus source for that :)

@hamistao
Contributor

Will do! In any case, thanks for your report and for your availability; we will proceed with the fix all the same.

@hamistao
Contributor

@tomponline This problem actually relates to a timeout when listing profiles in a standalone environment.

To fix this, Incus just increased the timeout for transactions, as can be seen here. The other improvements for listing profiles in the same PR have been in LXD for quite some time. If we don't want to go down that road, I think we can just close this.

I plan on following up on the discussed fix to efficiently populate the usedBy field, but keep in mind this is a separate problem that was uncovered while investigating this one.

@tomponline
Member

> To fix this, Incus just increased the timeout for transactions, as can be seen here. The other improvements for listing profiles in the same PR have been in LXD for quite some time. If we don't want to go down that road, I think we can just close this.

I'd like to avoid increasing the timeout to 30s as that feels like just papering over the issue rather than fixing it to me.

Suggest instead we first try importing these:

lxc/incus#1140
lxc/incus#1314

@hamistao
Contributor

> I'd like to avoid increasing the timeout to 30s as that feels like just papering over the issue rather than fixing it to me.

Yeah I agree

> Suggest instead we first try importing these

Sure, I have seen those and they contain some caching logic that could be nice to have. But mind that caching alone would not fix this issue, which is probably why they bumped their timeout.

@tomponline
Member

> But mind that caching alone would not fix this issue, which is probably why they bumped their timeout.

What is the issue then (I mean the one from the OP that is happening on a single node, not the one you described when accessing from a non-leader over a slow network)?

@hamistao
Contributor

hamistao commented Oct 21, 2024

@tomponline #14315 includes some significant improvements to profile listing. I went from 350ms on average to 290ms on my machine. The improvement should be greater the worse the latency is between the LXD server and the database it is reading from. This is the best we can do since we can't reproduce this issue, so I think we can close this after merging those improvements.

A similar improvement can be made when populating the usedBy field on profiles, but instead of caching profile devices and config, it would cache the instances used for the usedBy field. The improvement would consist of caching InstanceProfiles (the relationship table between Instances and Profiles) before iterating through the profiles in profilesGet:

// Proposed: fetch all the InstanceProfiles objects here, in a single query.

profileDevices, err := dbCluster.GetDevices(ctx, tx.Tx(), "profile")
if err != nil {
	return err
}

for _, profile := range profiles {
	apiProfile, err := profile.ToAPI(ctx, tx.Tx(), profileDevices)
	if err != nil {
		return err
	}

	// profileUsedBy makes a query for InstanceProfiles for each profile;
	// we could simplify this with a single query outside the loop.
	apiProfile.UsedBy, err = profileUsedBy(ctx, tx, profile)
	if err != nil {
		return err
	}

	apiProfiles = append(apiProfiles, apiProfile)
}

From reading the code, other endpoints for listing entities could use this kind of improvement when populating the usedBy field, such as projects, network ACLs and network zones. If this is of interest, I can create an Improvement issue for this proposal.
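
For illustration, here is a rough standalone sketch of that idea (not a patch against LXD's actual code): a single query over an instances_profiles-style join table collects every profile/instance pairing up front, and the listing loop can then build each profile's usedBy list from an in-memory map instead of querying per profile. The usedByForProfiles helper and the table/column names are assumptions made for the example.

package example

import (
	"context"
	"database/sql"
	"fmt"
)

// usedByForProfiles builds a profile-name -> instance-URL-list map with a
// single query, instead of calling a profileUsedBy-style helper once per
// profile inside the listing loop.
func usedByForProfiles(ctx context.Context, tx *sql.Tx) (map[string][]string, error) {
	rows, err := tx.QueryContext(ctx, `
		SELECT profiles.name, instances.name
		FROM instances_profiles
		JOIN profiles ON profiles.id = instances_profiles.profile_id
		JOIN instances ON instances.id = instances_profiles.instance_id`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	usedBy := make(map[string][]string)
	for rows.Next() {
		var profileName, instanceName string
		err := rows.Scan(&profileName, &instanceName)
		if err != nil {
			return nil, err
		}

		// Group instance URLs by profile so the loop in profilesGet can do a map lookup.
		usedBy[profileName] = append(usedBy[profileName], fmt.Sprintf("/1.0/instances/%s", instanceName))
	}

	return usedBy, rows.Err()
}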

@tomponline
Member

Sounds good thanks!
