LXD cluster 'lxc list' command extremely laggy #4548
Comments
How long does

with
The slowness is most probably not due to the cluster database (although I can't completely rule out contention issues). To let us profile your specific case, we'd need you to turn on debug logging on all nodes, run lxc list, and attach all logs from that window to this ticket. How many nodes do you have, by the way?
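For reference, one way to capture that kind of debug window on a snap-based node looks roughly like the sketch below; the snap option, the log path and the file names are assumptions based on a typical LXD snap install, not details confirmed in this thread:

```
# Enable daemon debug logging on a snap-based LXD install (assumed setup)
sudo snap set lxd daemon.debug=true
sudo systemctl reload snap.lxd.daemon

# In one terminal, stream daemon log messages while the slow command runs
# (stop it afterwards with kill %1 or fg + Ctrl-C)
lxc monitor --type=logging > lxc-monitor.log &

# In another terminal, time the command and keep the client-side debug output
time lxc list --debug 2> lxc-list-debug.txt

# The daemon log itself normally lives here on snap installs
less /var/snap/lxd/common/lxd/logs/lxd.log
```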
@freeekanayaka for comparison,
This is just a 3-node cluster. I'm basically in the middle of doing some testing; it's not prod, so I'm happy to do anything to narrow it down.
I ran a couple in a row:
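For anyone wanting to collect comparable numbers, a handful of consecutive timed runs can be captured with something like the following plain-shell loop on any cluster node (the run count is arbitrary, and /usr/bin/time here is GNU time, not the bash builtin):

```
# Time several consecutive listings to see whether the latency is consistent
for i in 1 2 3 4 5; do
    /usr/bin/time -f "run $i: %e s" lxc list > /dev/null
done
```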
Thanks for the logs. I can somewhat reproduce the issue in a test cluster on Canonical's OpenStack cloud. I believe at least a good part of the issue is due to database contention. I'll need to profile it further to understand exactly what happens, but I already have some hypotheses about where the bottlenecks could be and how to speed things up (both in general and specifically around contention). It will take a bit to fix, but I think we can get to the bottom of it and improve this area.
I can confirm that I am seeing the same behavior. I am unable to pinpoint exactly where the slowness is occurring, but I agree that it is not the LXD daemon itself, or the requests/responses from the hosts; it would appear to be DB related. @freeekanayaka, if you need any help testing things, feel free to ping me.
+1 to this
I have not found the root cause yet, but I believe I have a good-enough workaround that will at least make these timings predictable and reasonable for now. To improve performance significantly we'll need more work on various parts of the stack. I'm probably going to push a PR tomorrow.
This change makes us use transactions also in test code and in the /internal/sql API endpoint, which was not doing that before. It also drops concurrent calls in the GET /containers and cluster heartbeat code, since at the moment they are hardly going to take advantage of concurrency, as the nodes are going to serialize db reads anyway (and db reads are currently a substantial part of the total time spent handling an API request). The lower-level change to actually serialize reads was committed partly in go-grpc-sql and partly in dqlite. This should mitigate canonical#4548 for now. Moving forward we should start optimizing dqlite to be faster (I believe there are substantial gains to be made there), and perhaps also change the LXD code that interacts with the database to be more efficient (e.g. caching prepared statements, not entering/exiting a transaction for every query, etc). Signed-off-by: Free Ekanayaka <[email protected]>
I'm running into this problem as well, but I'm not running a cluster. I have 10 containers, and
I think I used to have the same problem as 19wolf, which caused our build times to double unexpectedly. I could reproduce this manually with LXD 3.1 from the snap. After a fresh restart of LXD, this statement reported:

This slowdown appears to be persistent until I restart the LXD daemon. After refreshing to edge/git-54d43dc I am at a stable total time for minutes:
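For completeness, switching a snap-based install over to the edge channel to test a fix like that would look something like the following; the channel name and service unit are the usual snap defaults, assumed rather than quoted from the comment above:

```
# Check which LXD snap revision/channel is currently installed
snap list lxd

# Move to the edge channel to pick up the latest development build
sudo snap refresh lxd --edge

# Restart the daemon so timing starts from a fresh state
sudo systemctl restart snap.lxd.daemon
```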
Thanks for the report and for trying out edge. That seems to confirm that #4582 did help. We have some database performance improvements in the pipeline (hopefully very significant ones), but those won't land before 3.3 or 3.4, so stay tuned. Note that recently we've also seen some issues supposedly related to the Go garbage collector. That's still under investigation and might affect you under some circumstances. I believe the cause is not yet completely clear, but @stgraber might be able to provide more details on this.
I was able to replicate the problem really badly on snap/edge git-a305011 (7435) 56MB. I looked into

after the 5m timeout, everything worked. The relevant

So I guess that systemd-resolved misbehaves because the router/firewall in front of the LXD host doesn't support DNSSEC. Until systemd-resolved decides to use a degraded feature set,
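As a sketch of one possible workaround, assuming the router in front of the host really cannot handle DNSSEC (my reading of the comment above, not something verified here), DNSSEC validation can be turned off in systemd-resolved on Ubuntu 18.04:

```
# Check what systemd-resolved has negotiated with the upstream DNS server
systemd-resolve --status | grep -i dnssec

# Disable DNSSEC validation in resolved.conf and restart the resolver
# (handles both a commented "#DNSSEC=..." line and an explicit setting)
sudo sed -i 's/^#\?DNSSEC=.*/DNSSEC=no/' /etc/systemd/resolved.conf
sudo systemctl restart systemd-resolved
```

Whether that is the right long-term fix is a separate question; it simply removes the DNSSEC negotiation that the comment above identifies as the trigger.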
@killua-eu as you point out, this last problem you reported is only partially related to this issue. I think it's the first time we've seen something like that, so if you find anything more please do follow up. More broadly (and tangentially related to what you just reported), our current plans to improve the situation here are:
@freeekanayaka, the problem will likely be a mess to track. With the 16.04 -> 18.04 move, the network stack now involves netplan, the fan network, cloud-init and systemd-resolved, and the problems with snap possibly don't help either. There are quite a number of bugs and blog posts full of painful frustration, e.g. https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1624320. As I dig deeper, I'm a bit at a loss as to what my system actually does. Documentation is sparse, good-practice examples are limited, and there are a number of combinations to try out. I can either post more here or shift it to a new issue; just let me know what's preferred.
@killua-eu thanks for the pointers, interesting read (for some definition of "interesting" :) ). Of course I can't really speak for the systemd-resolved and netplan parts, but it looks like systemd-resolved indeed needs some change, which I hope will eventually get sorted given the issues it has caused. That should probably be enough as long as you use systemd+netplan (I didn't dig too deep, but it seems that the cloud-init/netplan bug now only applies to non-systemd setups?). On the application side, there might be some work to do to improve robustness in snapd and LXD when things like this happen. So yes, please do file another issue with any additional detail you find, so we can properly evaluate whether we need an lxd-level fix (or maybe an lxd-pkg-snap one). In this regard, reading the output of

If you have them, the LXD logs under /var/snap might help too.
Required information
Issue description
When running

lxc list

the time to return information on 11 containers is anywhere from 15 to 60 seconds. I understand that there will be some extra lag on clusters because of the quorum database, but I feel like something is going wrong or timing out on the back-end.

lxc-list-debug.txt
Steps to reproduce
Run lxc list on any of the nodes.

The above command was run right after a previous command that seemed to take even longer. It is consistently slow.
Information to attach
I managed to capture the debug output of one example taking over 1 minute to list out information about 11 containers.
Any relevant kernel output (dmesg)
Container log (lxc info NAME --show-log)
Container configuration (lxc config show NAME --expanded)
Output of the daemon with --debug (alternatively, output of lxc monitor while reproducing the issue)