Running lxc profile list on a system with lots of profiles results in table: context deadline exceeded #13401

Closed
webdock-io opened this issue Apr 29, 2024 · 17 comments

@webdock-io

webdock-io commented Apr 29, 2024

Ubuntu Jammy
LXD v5.0.3 and LXD 5.21.1

Running lxc profile list on a system with lots of profiles results in the following:

root@lxdremote:~# lxc profile list
Error: Failed to fetch from "profile_device_config" table: Failed to fetch from "profile_device_config" table: context deadline exceeded

Running lxd sql global .dump returns almost immediately and lists all data in the database

We have a real use case for supporting a lot of profiles in a remote (we allow our customers to build their own)

Adding and deleting individual profiles seems to work, although it's hard to confirm deletion when we can't list them with lxd.

Is there any way to increase the timeout in LXD to allow listing our (large, and only going to grow larger) profile list? We could start hacking away at SQL queries, but I'd much rather be able to do an lxc profile list

(this use case came up as we actually wanted to make sure the list was cleaned up so any unused profiles were removed)

@tomponline
Member

This is probably due to inefficient queries. Let's look into making fewer queries that return the info needed.

@tomponline
Member

How many profiles is "lots" in this case?

@webdock-io
Author

Thanks for the quick reply.

I actually don't know as I can't list them - maybe I could get a count from the dump - but I'd say we are in the hundreds if not 1K+

@tomponline
Member

tomponline commented Apr 29, 2024

Try doing lxd sql global 'select * from profiles' or lxd sql global 'select count(*) from profiles'

@webdock-io
Author

Count gives me 616

just doing the select and dumping the table is pretty quick:

time lxd sql global 'select * from profiles'
... stuff
real    0m0.166s
user    0m0.044s
sys     0m0.064s

@tomponline
Member

> Count gives me 616
>
> just doing the select and dumping the table is pretty quick:
>
> time lxd sql global 'select * from profiles'
> ... stuff
> real    0m0.166s
> user    0m0.044s
> sys     0m0.064s

Cool thanks.

Suspect it's doing a separate query to get each profile's config, rather than a single query with multiple profile IDs and then separating the results in LXD.
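
For illustration only, here is a rough sketch of the difference being suspected (not LXD's actual code): instead of running one config query per profile, a single query fetches the rows for all the wanted profile IDs and splits them per profile in memory. The configForProfiles helper and the profiles_config table/column names are assumptions made for the example.

package example

import (
	"context"
	"database/sql"
	"strings"
)

// configForProfiles fetches the config for many profiles in one query with an
// IN clause and splits the rows per profile in memory, instead of issuing one
// query per profile ("N+1" pattern).
func configForProfiles(ctx context.Context, tx *sql.Tx, profileIDs []int) (map[int]map[string]string, error) {
	// Build the "?, ?, ?" placeholder list and the matching argument list.
	placeholders := make([]string, 0, len(profileIDs))
	args := make([]any, 0, len(profileIDs))
	for _, id := range profileIDs {
		placeholders = append(placeholders, "?")
		args = append(args, id)
	}

	query := "SELECT profile_id, key, value FROM profiles_config WHERE profile_id IN (" + strings.Join(placeholders, ", ") + ")"

	rows, err := tx.QueryContext(ctx, query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	// Separate the combined result set back into one config map per profile.
	configs := make(map[int]map[string]string)
	for rows.Next() {
		var id int
		var key, value string
		err := rows.Scan(&id, &key, &value)
		if err != nil {
			return nil, err
		}

		if configs[id] == nil {
			configs[id] = make(map[string]string)
		}
		configs[id][key] = value
	}

	return configs, rows.Err()
}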

@tomponline tomponline added the Bug Confirmed to be a bug label Apr 29, 2024
@tomponline tomponline added this to the lxd-6.1 milestone Apr 29, 2024
@MggMuggins MggMuggins self-assigned this May 15, 2024
@MggMuggins
Contributor

@tomponline This was fixed with a fairly significant db refactor (see #10463 and #10183) that landed in LXD 5.5. This fun one-liner works great with LXD 5.21.1:

for p in p{1..1000}; do lxc profile create $p & done && lxc profile ls

It doesn't look to me like it's feasible to backport that set of fixes to 5.0.3; I'm guessing it won't be straightforward to come up with a separate patch for 5.0.3 either, although I haven't done much spelunking to confirm that. Let me know what you think the most reasonable course of action is here.

@webdock-io
Author

We upgraded our system to 5.21.1 and get this:

root@lxdremote:~# snap refresh lxd --channel=latest/stable
2024-06-04T07:46:48Z INFO Waiting for "snap.lxd.daemon.service" to stop.
lxd 5.21.1-2d13beb from Canonical✓ refreshed
root@lxdremote:~# nano /etc/hosts
root@lxdremote:~# lxc profile list
Error: Failed to fetch from "profile_device_config" table: Failed to fetch from "profile_device_config" table: context deadline exceeded
root@lxdremote:~# lxc --version
5.21.1 LTS

Sooo... Not fixed @MggMuggins or am I missing something?

@tomponline tomponline modified the milestones: lxd-6.1, lxd-6.2 Jun 19, 2024
@MggMuggins MggMuggins removed their assignment Jun 28, 2024
@hamistao
Contributor

hamistao commented Oct 16, 2024

@webdock-io Hi! Some news on this :D

I managed to reproduce the issue by simulating network latency between two local VMs; I suspect this is why @MggMuggins's reproducer did not quite catch the problem.

tc qdisc replace dev enp5s0 root netem delay 100ms # simulate latency of 100ms on enp5s0 interface
for p in p{1..200}; do lxc profile create $p & done && lxc profile ls # this now results in a timeout

Mind that, in my reproduction, we only get a timeout when querying from a non-leader LXD cluster member. That makes sense, since all queries on the leader happen locally, so no latency. Could you confirm this also applies to your case?

I suspect this is happening because we make a separate database query for each profile to populate the usedBy field. I am working on a fix for this now and, if my theory is correct, we should have it merged and working soon. The fix will then be backported to 5.21 in a few more days. Cheers :)

@webdock-io
Author

Thanks for your efforts. However, we've essentially switched all of our infrastructure almost 100% to Incus by now, where this issue has been solved for ages (or about a day after we reported it there).

This huge wait for bug fixes in LXD was a primary reason we switched, as it's untenable for production workloads like ours.

Anyway, I believe the issue did not stem from network latency, as this was all happening on a single instance and not a cluster. I believe it was solved in Incus by simply refactoring database code to reduce lookups, doing some caching, things of that nature. But I really don't know the details, you'd have to check the Incus source for that :)

@hamistao
Contributor

Will do! In any case, thanks for your report and for your availability; we will proceed with the fix all the same.

@hamistao
Contributor

@tomponline This problem actually relates to a timeout when listing profiles in a standalone environment.

To fix this, Incus just increased the timeout for transactions, as can be seen here. The other improvements for listing profiles in the same PR have been in LXD for quite some time. If we don't want to go down that road, I think we can just close this.

I plan on following up on the discussed fix to efficiently populate the usedBy field, but keep in mind this is a separate problem that was uncovered while investigating this one.

@tomponline
Member

> To fix this, Incus just increased the timeout for transactions, as can be seen here. The other improvements for listing profiles in the same PR have been in LXD for quite some time. If we don't want to go down that road, I think we can just close this.

I'd like to avoid increasing the timeout to 30s as that feels like just papering over the issue rather than fixing it to me.

Suggest instead we first try importing these:

lxc/incus#1140
lxc/incus#1314

@hamistao
Contributor

> I'd like to avoid increasing the timeout to 30s as that feels like just papering over the issue rather than fixing it to me.

Yeah I agree

> Suggest instead we first try importing these

Sure, I have seen those and they contain some caching logic that could be nice to have. But mind that caching alone would not fix this issue, which is probably why they bumped their timeout.

@tomponline
Member

> But mind that caching alone would not fix this issue, which is probably why they bumped their timeout.

What is the issue then (I mean the one from the OP that is happening on a single node, not the one you described when accessing from a non-leader over a slow network)?

@hamistao
Contributor

hamistao commented Oct 21, 2024

@tomponline #14315 includes some significant improvements to profile listing. I went from 350ms on average to 290ms on my machine. The improvement should be greater the worse the latency is between the LXD server and the database it is reading from. This is the best we can do since we can't reproduce this issue, so I think we can close this after merging those improvements.

A similar improvement can be made when populating the usedBy field on profiles, but instead of caching profile devices and config, it would cache the instances used for the usedBy field. The improvement would consist of caching InstanceProfiles (the relationship table between Instances and Profiles) before iterating through the profiles in profilesGet:

// Proposed: fetch all the InstanceProfiles objects here, in a single query.

profileDevices, err := dbCluster.GetDevices(ctx, tx.Tx(), "profile")
if err != nil {
	return err
}

for _, profile := range profiles {
	apiProfile, err := profile.ToAPI(ctx, tx.Tx(), profileDevices)
	if err != nil {
		return err
	}

	// profileUsedBy makes a query for InstanceProfiles for each profile;
	// we could simplify this with a single query outside the loop.
	apiProfile.UsedBy, err = profileUsedBy(ctx, tx, profile)
	if err != nil {
		return err
	}

	apiProfiles = append(apiProfiles, apiProfile)
}

From reading the code, other endpoints for listing entities could use this kind of improvement when populating the usedBy field, such as projects, network ACLs and network zones. If this is of interest, I can create an Improvement issue for this proposal.
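
For illustration, here is a rough standalone sketch of that idea (not a patch against LXD's actual code): a single query over an instances_profiles-style join table collects every profile/instance pairing up front, and the listing loop can then build each profile's usedBy list from an in-memory map instead of querying per profile. The usedByForProfiles helper and the table/column names are assumptions made for the example.

package example

import (
	"context"
	"database/sql"
	"fmt"
)

// usedByForProfiles builds a profile-name -> instance-URL-list map with a
// single query, instead of calling a profileUsedBy-style helper once per
// profile inside the listing loop.
func usedByForProfiles(ctx context.Context, tx *sql.Tx) (map[string][]string, error) {
	rows, err := tx.QueryContext(ctx, `
		SELECT profiles.name, instances.name
		FROM instances_profiles
		JOIN profiles ON profiles.id = instances_profiles.profile_id
		JOIN instances ON instances.id = instances_profiles.instance_id`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	usedBy := make(map[string][]string)
	for rows.Next() {
		var profileName, instanceName string
		err := rows.Scan(&profileName, &instanceName)
		if err != nil {
			return nil, err
		}

		// Group instance URLs by profile so the loop in profilesGet can do a map lookup.
		usedBy[profileName] = append(usedBy[profileName], fmt.Sprintf("/1.0/instances/%s", instanceName))
	}

	return usedBy, rows.Err()
}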

@tomponline
Member

Sounds good thanks!
