LXD cluster 'lxc list' command extremely laggy #4548

Closed
1 of 6 tasks
dfsdevops opened this issue May 9, 2018 · 17 comments · Fixed by #4854
Comments

@dfsdevops

Required information

  • Distribution: Ubuntu
  • Distribution version: 18.04
  • The output of "lxc info" or if that fails:

config:
  core.https_address: 10.1.2.41:8443
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 10.1.2.41:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIFTTCCAzWgAwIBAgIRAIYi9SOBaeGArc8ha+7ZyvIwDQYJKoZIhvcNAQELBQAw
    ODEcMBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEYMBYGA1UEAwwPcm9vdEBs
    eGQtZGV2LTAxMB4XDTE4MDUwODIwNTU1N1oXDTI4MDUwNTIwNTU1N1owODEcMBoG
    A1UEChMTbGludXhjb250YWluZXJzLm9yZzEYMBYGA1UEAwwPcm9vdEBseGQtZGV2
    LTAxMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEA0JxeMil0dOsHwO/x
    xIHJHXfC7VlNRP0WXNLb9glwUr3qVEExW+NG5XOdzts5uQm0wV1nhDrZoLr3AUwI
    dmrvREWbn7PQEr0Fw7ZHYPG1r+06b5m0fweEWuV6punvzZSCeiIV5M4w6S+93Xna
    C4fryXuIRIpMrXhhVzSclIGWiB4DA95KLN8oR8FfhEstZ6Jz+sYMMrOxDjbhEYLw
    OSV4MYgFhXlA+KGgeyCOrseiz9hycBeM8zniIxDzFIFTSGWd9B3omb4eK2li6zhb
    FNnNqsCdIOY7HsMJyD2bLJCKsuxXYkpN/XVJCNVOEoU13S1arTnIKPVzXvnFBpZx
    BREIbBQRI5fROYqlrbcMvOq/jEoWD9JQobK0OhfvdrTaJpWENF2ttf4QwANPx/zC
    DtSfIUbJ5I7iP0P+IuaShgwM2LG0aqyEwEpP3lPqV/LAPFiLj77QuxudiFVU0yZO
    iaA2UUFOskTNQNiMdZqD87j2Pp74dDrKSvmxTSwlXwJntUxAKoSUzaMBGhiRsuGp
    uc3atWRG+Ne7BMpepuyvEb2+kmFjexiyqH1pAXaAhNPJHithCiTFaOJQjbHx+n1n
    AxWqGnJQoDTyRm/JA1kyBo66NoDGW5X/RjVYIkD3Dc1svkYP6eWOB0Mi1mYj0qaq
    LG8jZ/e71ak9GSgdHDzXXvrLP0kCAwEAAaNSMFAwDgYDVR0PAQH/BAQDAgWgMBMG
    A1UdJQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwGwYDVR0RBBQwEoIKbHhk
    LWRldi0wMYcECgECKTANBgkqhkiG9w0BAQsFAAOCAgEAPTKfYKvM4tNuoItiIZ7C
    lKVN+73FriGg+xkl4APuT/FqciP4rXQ4eI+XW81BBc4jx1kK5ngmZbOO8tXROBZk
    lHp/6Vz1gxYLeLSbZA5Pi6dnJE5zJnPGrB3vUQgVxdlIF1zI2jtQ9h1h0wUH97nB
    uoK2f0Wd4S+z10bac6P4wGqrzwtFsfnT+077EIdpgtXeqQOxId1GaIfWEKzUhCRO
    Aj6aYw0bU5bigD0OhROI246nKk363X+4kZyMT7FNpUZMpESrw/ZjCBaCOkZkqOEq
    huCVn4yWhZ1sYIMlg9aitNeKaU70gOZDHajI+Lo8m2eemmiQoWGUqN8vu8Tk0Kzo
    s4bFhjPv/NFd9h2LvuulhBqmclPszgyhk9CrSPJL0K0GgcVUMG3IiuxoHBqVQn9M
    nPOVPlf2cYqb4OFC+kdDT3JqG29K+4am8gvV6rAFl+dCsRd6k45keWAB1MjQfl7N
    8whAL6TEXmiT1sb3pWvgd7E53SeIAXEPP/furs0excMJvu/yfZOGhaD/NTZzoC8g
    5WvNNxuuvuWISMJL0uHrn6NcgZ1rZFeOdMwOYewElyEkVLbFMpTacrGj3vYlIEfK
    MXSIiGOGMundoCvlDRRcLCqkYGbiu2Nlab23wpMWBxpFASsaZGgEXRRIn4ujie3t
    OboWBqb/G5F+82vh6wgV2t0=
    -----END CERTIFICATE-----
  certificate_fingerprint: fcc5ff4aea9e3513c7e2791e865bdae6f81b321c6c007a2fd7498d1d727060d0
  driver: lxc
  driver_version: 3.0.0
  kernel: Linux
  kernel_architecture: x86_64
  kernel_version: 4.15.0-20-generic
  server: lxd
  server_pid: 2245
  server_version: 3.0.0
  storage: zfs
  storage_version: 0.7.5-1ubuntu13
  server_clustered: true
  server_name: lxd-dev-01

Issue description

When running lxc list, the time to return information on 11 containers is anywhere from 15 to 60 seconds. I understand that there will be some extra lagginess on clusters because of the quorum database, but I feel like something is going wrong or timing out on the back-end.
lxc-list-debug.txt

Steps to reproduce

  1. Set up a 3-node cluster on the same subnet, with no MAAS server or fan networking involved.
  2. Run lxc list on any of the nodes:
ubuntu@lxd-dev-01:~$ time lxc list

real	0m28.707s
user	0m0.088s
sys	0m0.045s

The above command was run right after a previous command that seemed to take even longer. It is consistently slow.

Information to attach

I managed to capture the debug output of one example taking over 1 minute to list out information about 11 containers.

  • Any relevant kernel output (dmesg)
  • Container log (lxc info NAME --show-log)
  • Container configuration (lxc config show NAME --expanded)
  • Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
  • Output of the client with --debug
  • Output of the daemon with --debug (alternatively output of lxc monitor while reproducing the issue)
@stgraber
Contributor

How long does lxc list --fast take?

@dfsdevops
Author

With --fast I am getting times anywhere between 1 and 10 seconds.
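
As an aside, one way to narrow down where the time goes is to time the raw API calls behind lxc list directly; a minimal sketch, assuming the lxc query subcommand is available in this LXD version:

# Container URLs only; roughly what the server returns before fetching details.
time lxc query /1.0/containers

# Full container objects; closer to what a full lxc list has to assemble.
time lxc query "/1.0/containers?recursion=1"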

@stgraber stgraber added this to the lxd-3.2 milestone May 14, 2018
@stgraber stgraber added the Bug Confirmed to be a bug label May 14, 2018
@freeekanayaka
Contributor

The slowness is most probably not due to the cluster database (although I can't completely rule out contention issues). To let us profile your specific case, we'd need you to turn on debug logging on all nodes, run "lxc list", and attach all logs from that window to this ticket.
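
For anyone wanting to capture such a window, a minimal sketch (relying on the lxc monitor alternative mentioned in the issue checklist above; exact flag support may vary by LXD version):

# On every node, stream daemon log events to a file in the background.
lxc monitor --type=logging > /tmp/lxd-monitor-$(hostname).log 2>&1 &
MONITOR_PID=$!

# On one node, reproduce the slow call; the client's own debug output goes to stderr.
time lxc list --debug 2> /tmp/lxc-list-client-debug.log

# Stop the background monitor and attach the resulting files to the ticket.
kill "$MONITOR_PID"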

How many nodes do you have by the way?

@stgraber
Contributor

@freeekanayaka for comparison, lxc list --fast returns within a second here with over 1k containers, so 10s would be a bit concerning.

@dfsdevops
Author

This is just a 3-node cluster. I'm basically in the middle of doing some testing; it's not prod, so I'm happy to do anything to narrow it down.

@dfsdevops
Author

I ran a couple in a row:

ubuntu@lxd-dev-01:~$ date
Mon May 14 18:06:47 UTC 2018
ubuntu@lxd-dev-01:~$
| cen7         | RUNNING | 10.2.23.245 (eth0) |                                               | PERSISTENT | 0         | lxd-dev-01 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen8         | RUNNING | 10.2.23.77 (eth0)  | fd42:3bf0:270c:cb69:216:3eff:feed:e8ed (eth0) | PERSISTENT | 0         | lxd-dev-02 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen9         | RUNNING | 10.2.23.159 (eth0) | fd42:3bf0:270c:cb69:216:3eff:fe45:f78a (eth0) | PERSISTENT | 0         | lxd-dev-03 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cenmove      | RUNNING | 10.2.23.138 (eth0) |                                               | PERSISTENT | 0         | lxd-dev-01 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| working-dory | RUNNING | 10.2.23.196 (eth0) | fd42:3bf0:270c:cb69:216:3eff:fea0:de56 (eth0) | PERSISTENT | 0         | lxd-dev-02 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+

real    0m17.587s
user    0m0.086s
sys     0m0.053s
ubuntu@lxd-dev-01:~$ time lxc list
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
|     NAME     |  STATE  |        IPV4        |                     IPV6                      |    TYPE    | SNAPSHOTS |  LOCATION  |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen1         | RUNNING | 10.2.23.62 (eth0)  |                                               | PERSISTENT | 0         | lxd-dev-01 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen10        | RUNNING | 10.2.23.178 (eth0) |                                               | PERSISTENT | 0         | lxd-dev-01 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen2         | RUNNING | 10.2.23.243 (eth0) | fd42:3bf0:270c:cb69:216:3eff:fefa:e3cc (eth0) | PERSISTENT | 0         | lxd-dev-02 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen3         | RUNNING | 10.2.23.48 (eth0)  | fd42:3bf0:270c:cb69:216:3eff:fee0:6ae1 (eth0) | PERSISTENT | 0         | lxd-dev-03 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen4         | RUNNING | 10.2.23.35 (eth0)  |                                               | PERSISTENT | 0         | lxd-dev-01 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen6         | RUNNING | 10.2.23.181 (eth0) | fd42:3bf0:270c:cb69:216:3eff:fe5a:87ed (eth0) | PERSISTENT | 0         | lxd-dev-03 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen7         | RUNNING | 10.2.23.245 (eth0) |                                               | PERSISTENT | 0         | lxd-dev-01 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen8         | RUNNING | 10.2.23.77 (eth0)  | fd42:3bf0:270c:cb69:216:3eff:feed:e8ed (eth0) | PERSISTENT | 0         | lxd-dev-02 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cen9         | RUNNING | 10.2.23.159 (eth0) | fd42:3bf0:270c:cb69:216:3eff:fe45:f78a (eth0) | PERSISTENT | 0         | lxd-dev-03 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| cenmove      | RUNNING | 10.2.23.138 (eth0) |                                               | PERSISTENT | 0         | lxd-dev-01 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+
| working-dory | RUNNING | 10.2.23.196 (eth0) | fd42:3bf0:270c:cb69:216:3eff:fea0:de56 (eth0) | PERSISTENT | 0         | lxd-dev-02 |
+--------------+---------+--------------------+-----------------------------------------------+------------+-----------+------------+

real    1m16.871s
user    0m0.092s
sys     0m0.075s

debug-logs.tar.gz

@freeekanayaka
Contributor

Thanks for the logs.

I can reproduce the issue to some extent in a test cluster on Canonical's OpenStack cloud. I believe at least a good part of the issue is due to database contention. I'll need to profile it further to understand exactly what happens, but I already have some hypotheses about where the bottlenecks could be and how to speed things up (both in general and specifically around contention). It will take a bit to fix, but I think we can get to the bottom of it and improve this area.

@grsmith-projects

I can confirm that I am seeing the same behavior. I am unable to pinpoint where the slowness is occurring, but I agree that it is not the LXD daemon itself, or the requests/responses from the hosts. It would appear to be DB related.

@freeekanayaka if you need any help testing things, feel free to ping me

@mbuiro

mbuiro commented May 23, 2018

+1 to this

@freeekanayaka
Contributor

I have not found the root cause yet, but I believe I have a good-enough workaround that will at least make these timings predictable and reasonable for now. To improve performance significantly we'll need more work on various parts of the stack. I'm probably going to push a PR tomorrow.

freeekanayaka added a commit to freeekanayaka/lxd that referenced this issue May 24, 2018
This change makes us use transactions also in the test code and in the /internal/sql
API endpoint, which was not doing that before.

It also drops concurrent calls in the GET /containers and cluster heartbeat
code, since at the moment they are hardly going to take advantage of
concurrency, as the nodes are going to serialize db reads anyway (and db reads
are currently a substantial part of the total time spent handling an API request).

The lower-level change to actually serialize reads was committed partly in
go-grpc-sql and partly in dqlite.

This should mitigate canonical#4548 for now.

Moving forward we should start optimizing dqlite to be faster (I believe there
are substantial gains to be made there), and perhaps also change the LXD code
that interacts with the database to be more efficient (e.g. caching prepared
statements, not entering/exiting a transaction for every query, etc.).

Signed-off-by: Free Ekanayaka <[email protected]>
@19wolf
Contributor

19wolf commented May 27, 2018

I'm running into this problem as well, but I'm not running a cluster.

I have 10 containers, and lxc list takes anywhere from 5 to 20 seconds, while lxc list --fast takes 2-8 seconds.

stgraber pushed a commit that referenced this issue May 28, 2018
@laveolus

laveolus commented Jun 1, 2018

I think I used to have the same problem as 19wolf, which caused our build times to double unexpectedly.
There seems to be a slowdown building up every time we do an lxc exec (lxc file push does not seem to cause it).

I could reproduce this with LXD 3.1 from snap manually:
while true; do time lxc exec sal-build-3937 id; done

After a fresh restart of LXD, this statement reported:
lxc exec sal-build-3937 id 0,02s user 0,01s system 14% cpu 0,218 total
The total time gradually builds up. After a few minutes it reported:
lxc exec sal-build-3937 id 0,03s user 0,02s system 8% cpu 0,542 total

This slowdown appears to persist until I restart the LXD daemon.

After refreshing to edge/git-54d43dc, the total time has stayed stable for minutes:
lxc exec sal-build-3937 id 0,02s user 0,02s system 32% cpu 0,111 total
and our builds are fine again.

@freeekanayaka
Contributor

Thanks for the report and for trying out edge. That seems to confirm that #4582 did help. We have some database performance improvements in the pipeline (hopefully very significant), but that won't happen before 3.3 or 3.4, so stay tuned.

Note that recently we've also seen some issues supposedly related to the Go garbage collector. That's still under investigation and might affect you under some circumstances. I believe the cause is not yet completely clear, but @stgraber might be able to provide more details on this.

@killua-eu

I was able to replicate the problem really badly on snap/edge git-a305011 (7435) 56MB (lxc list won't finish even after 5+ minutes), but with a different cause. It seems the issue is in systemd-resolved; I will look into it later. Letting others know here to save time spent on debugging.

I looked into journalctl | grep snap

Jun 07 00:35:40 kai02 snapd[920]: 2018/06/07 00:35:40.061937 stateengine.go:101: state ensure error: cannot refresh snap-declaration for "core": Get https://api.snapcraft.io/api/v1/snaps/assertions/snap-declaration/16/99T7MUlRhtI3U0QFgl5mXXESAiSwt776?max-format=2: dial tcp: lookup api.snapcraft.io on 127.0.0.53:53: server misbehaving
Jun 07 00:35:40 kai02 systemd[1]: Started Wait until snapd is fully seeded.
Jun 07 00:35:40 kai02 lxd.daemon[892]: ==> Loading snap configuration
Jun 07 00:45:41 kai02 lxd.daemon[892]: Error: LXD still not running after 600s timeout (Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: no such file or directory)
Jun 07 00:45:41 kai02 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jun 07 00:45:41 kai02 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Jun 07 00:45:42 kai02 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Jun 07 00:45:42 kai02 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 1.
Jun 07 00:45:42 kai02 systemd[1]: Stopped Service for snap application lxd.daemon.
Jun 07 00:45:42 kai02 systemd[1]: Started Service for snap application lxd.daemon.
Jun 07 00:45:42 kai02 lxd.daemon[2534]: ==> Loading snap configuration

After the 5m timeout, everything worked. The relevant journalctl | grep systemd-resolve output shows:

Jun 07 00:40:55 kai02 systemd-resolved[801]: Using degraded feature set (UDP) for DNS server 192.168.88.1.
Jun 07 01:00:26 kai02 systemd-resolved[801]: Grace period over, resuming full feature set (UDP+EDNS0) for DNS server 192.168.88.1.
Jun 07 01:00:26 kai02 systemd-resolved[801]: Using degraded feature set (UDP) for DNS server 192.168.88.1.
Jun 07 01:10:34 kai02 systemd-resolved[831]: Positive Trust Anchors:
Jun 07 01:10:34 kai02 systemd-resolved[831]: . IN DS 19036 8 2 49aac11d7b6f6446702e54a1607371607a1a41855200fd2ce1cdde32f24e8fb5
Jun 07 01:10:34 kai02 systemd-resolved[831]: . IN DS 20326 8 2 e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
Jun 07 01:10:34 kai02 systemd-resolved[831]: Negative trust anchors: 10.in-addr.arpa 16.172.in-addr.arpa 17.172.in-addr.arpa 18.172.in-addr.arpa 19.172.in-addr.arpa 20.172.in-addr.arpa 21.172.in-addr.arpa 22.172.in-addr.arpa 23.172.in-addr.arpa 24.172.in-addr.arpa 25.172.in-addr.arpa 26.172.in-addr.arpa 27.172.in-addr.arpa 28.172.in-addr.arpa 29.172.in-addr.arpa 30.172.in-addr.arpa 31.172.in-addr.arpa 168.192.in-addr.arpa d.f.ip6.arpa corp home internal intranet lan local private test
Jun 07 01:10:34 kai02 systemd-resolved[831]: Using system hostname 'kai02'.

So I guess that systemd-resolved misbehaves because the router/firewall in front of the lxd host doesn't support DNSSEC. Until systemd-resolved decides to use a degraded feature set, lxc list won't work.

network:
    ethernets:
        enp0s31f6:
            addresses: []
            dhcp4: true
            optional: true
    version: 2
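
If the degraded-DNSSEC theory above holds, a possible workaround sketch (an assumption on my part, not something verified in this thread) is to disable DNSSEC in systemd-resolved so it stops probing an upstream resolver that cannot support it:

# Set DNSSEC=no in /etc/systemd/resolved.conf (the option usually ships commented out),
# then restart the resolver; adjust if the file is managed elsewhere.
sudo sed -i 's/^#\?DNSSEC=.*/DNSSEC=no/' /etc/systemd/resolved.conf
sudo systemctl restart systemd-resolved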

@freeekanayaka
Contributor

@killua-eu as you point out, this last problem you reported is only partially related to this issue.

I think it's the first time we've seen something like that, so if you find anything more, please do follow up.

More broadly (and tangentially related to what you just reported), our plans to improve the situation here currently are:

  1. There's some significant work going on in the cluster database code. Hopefully that will make things a lot faster for "regular" use (i.e. when everything is working fine, all lxc commands should be faster). This will take another few weeks, so it will likely be released in 3.3 or beyond.

  2. We want to make the cluster more resilient when database nodes go down (see this comment). In general, we should be able to do a better job of detecting pathological situations and avoid lxc (or the API) hanging for too long. This will definitely be post-3.3.

  3. We are aware that snap refreshes are not as robust as we would like, especially if you use clustering. That's sometimes due to snap itself (probably this case), and sometimes due to yet-to-be-identified bugs in the LXD clustering upgrade code (though we don't have clear evidence of this). In any case, the glue scripts and logic implemented in our snap package should be refined to be more defensive and take appropriate action both when snapd misbehaves and when lxd misbehaves (see for instance lxc/lxd-pkg-snap/issues/11).

@killua-eu

killua-eu commented Jun 7, 2018

@freeekanayaka, the problem will likely be a mess to track. With 16.04 -> 18.04, the network stack now has netplan, fan networking, cloud-init and systemd-resolved, and the problems with snap possibly don't help either. There are quite a number of bugs and blog posts full of painful frustration:

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1624320
http://edgeofsanity.net/rant/2017/12/20/systemd-resolved-is-broken.html
systemd/systemd#2683
https://bugs.launchpad.net/cloud-init/+bug/1750884
https://blobfolio.com/2017/05/fix-linux-dns-issues-caused-by-systemd-resolved/

As I dig deeper, I'm a bit at a loss as to what my system actually does. Documentation is sparse, good-practice examples are limited, and there are a number of combinations to try out. I can either post more here or shift it to a new issue; just let me know what's preferred.

@freeekanayaka
Contributor

freeekanayaka commented Jun 7, 2018

@killua-eu thanks for the pointers, interesting read (for some definition of "interesting" :) ).

Of course I can't really speak for the systemd-resolved and netplan parts, but it looks like systemd-resolved indeed needs some change, which I hope will eventually get sorted, given the issues it caused. Probably that should be enough as long as you use systemd+netplan (I didn't dig too deep, but it seems that the cloud-init/netplan bug only applies to non-systemd now?).

Now, on the application side, there might be some work to do to improve robustness in snapd and lxd when things like this happen. So yes, please do file another issue with any additional detail you find, so we can properly evaluate whether we need an lxd-level fix (or maybe an lxd-pkg-snap one).

In this regard, reading the output of journalctl | grep snap that you pasted, the one thing that is not clear to me is what happened after snapd hit the DNS failure. Did snapd ignore it because it was safe to do so? Did it end up in some unintended state? Why did lxd fail to start (since the logs indicate there was no socket file)? It probably needs some reading of the snapd code (or some help from the snapd team).

If you have them, the LXD logs under /var/snap might help too.
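
For completeness, a quick sketch of pulling those logs for the snap package (paths as given earlier in this thread; the unit name is taken from the journal output above):

# Daemon log kept by the snap package.
sudo cat /var/snap/lxd/common/lxd/logs/lxd.log

# Messages from the snap-managed service around the failure window.
sudo journalctl -u snap.lxd.daemon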

Catramen pushed a commit to Catramen/lxd that referenced this issue Jun 18, 2018
@stgraber stgraber modified the milestones: lxd-3.2, lxd-3.3 Jun 19, 2018
@stgraber stgraber removed this from the lxd-3.3 milestone Jul 25, 2018