hugetlb cgroup controller does not set rsvd, leading to segfaults in postgres + initdb #769

Closed
bazaah opened this issue Apr 18, 2024 · 5 comments · Fixed by #774
@bazaah

bazaah commented Apr 18, 2024

Required information

  • Distribution: Archlinux
  • Distribution version: Linux <snip> 6.8.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 16 Mar 2024 17:15:35 +0000 x86_64 GNU/Linux
  • The output of "incus info":
Incus Info
config:
  core.https_address: <snip>
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
  addresses:
  - <snip>
  architectures:
  - x86_64
  - i686
  certificate: <snip>
  certificate_fingerprint: <snip>
  driver: lxc | qemu
  driver_version: 5.0.3 | 8.2.2
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "true"
    unpriv_fscaps: "true"
  kernel_version: 6.8.1-arch1-1
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Arch Linux
  os_version: ""
  project: default
  server: incus
  server_clustered: false
  server_event_mode: full-mesh
  server_name: <snip>
  server_pid: <snip>
  server_version: "0.6"
  storage: ceph
  storage_version: 18.2.2
  storage_supported_drivers:
  - name: btrfs
    version: 6.7.1
    remote: false
  - name: ceph
    version: 18.2.2
    remote: true
  - name: cephfs
    version: 18.2.2
    remote: true
  - name: cephobject
    version: 18.2.2
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.23(2) (2023-11-21) / 1.02.197 (2023-11-21) / 4.48.0
    remote: false

Issue description

While attempting to get hugepages working for a postgres database in an unprivileged container, I encountered repeated segfaults during the initdb sequence.

This was somewhat confusing, because by default postgres/initdb will attempt to use hugepages but gracefully fall back to normal memory if they are unavailable. So postgres had clearly been induced to believe that hugepages existed, yet when it went to use them the host kernel killed the process.
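For reference, this fallback is controlled by postgres's huge_pages setting, which defaults to "try":

# postgresql.conf (excerpt; "try" is the default)
# try -> attempt huge pages at startup, quietly fall back to normal pages if the mmap() fails
# on  -> refuse to start if huge pages cannot be mapped
# off -> never use huge pages
huge_pages = try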

Sometime later this evening, I think I figured it out.

At the end of the repro, you'll be greeted with an error like:

running bootstrap script ... 2024-04-18 00:00:00.000 UTC [1111] DEBUG: invoking IpcMemoryCreate(size=3891200)
Bus error (core dumped)
child process exited with exit code 135
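As a sanity check, exit code 135 is 128 + 7, i.e. the child was killed by SIGBUS:

kill -l 7
# BUS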

Googling this error brings up lots of related issues, particularly around Kubernetes deployments.

Eventually, however, you'll find opencontainers/runtime-spec#1050, which explains the problem:

The previous non-rsvd max/limit_in_bytes does not account for reserved
huge page memory, making it possible for a processes to reserve all the
huge page memory, without being able to allocate it (due to cgroup
restrictions).

In practice this makes it possible to successfully mmap more huge page
memory than allowed via the cgroup settings, but when using the memory
the process will get a SIGBUS and crash. This is bad for applications
trying to mmap at startup (and it succeeds), but the program crashes
when starting to use the memory. eg. postgres is doing this by default.

This was fixed in runc by opencontainers/runc#4073.

I'm not sure exactly how this translates to incus's codebase, but from what little digging I've done around the hugetlb controller, I can find no mention of the hugetlb.<pagesize>.rsvd cgroup limits being set, only the older hugetlb.<pagesize>.limit_in_bytes.
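For the curious: on cgroup2 the kernel exposes both knobs per page size, and as I read the runc change, the fix is to mirror the configured limit into the rsvd file. A sketch, with an illustrative <scope> path:

# Both limits exist per supported page size under the instance's cgroup:
cat /sys/fs/cgroup/<scope>/hugetlb.2MB.max        # caps page allocation (faulting) only
cat /sys/fs/cgroup/<scope>/hugetlb.2MB.rsvd.max   # also caps reservations made at mmap() time
# With only the first one set, an over-sized mmap() can still succeed, and the
# process only gets SIGBUS later, when it actually faults the pages in.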

Steps to reproduce

# == On the host ==

# Ensure hugepages support is enabled & we have some allocated on the system
ls -l /dev/hugepages
sysctl -w vm.nr_hugepages=1024

# Make a debian container for the demo, with what _should_ allow hugepage support
incus init images:debian/bookworm hugepages-demo
incus config set hugepages-demo limits.hugepages.2MB=512 security.syscalls.intercept.mount=true security.syscalls.intercept.mount.allowed=hugetlbfs
incus start hugepages-demo
incus exec -t hugepages-demo -- su -

# == In the container ==

# Install postgres, ignore the default db that debian happily creates for you, though do note the initdb core dumps...
apt update && apt install -y eatmydata && eatmydata -- apt install -y postgresql postgresql-contrib

# Get /dev/hugepages mounted inside the container, i.e. confirm that the mount interception configured above works
sed '/ConditionVirtualization/d' /usr/lib/systemd/system/dev-hugepages.mount > /etc/systemd/system/dev-hugepages.mount
systemctl daemon-reload && systemctl start dev-hugepages.mount && ls -lash /dev/hugepages

# Now for the failure.
#
# We rerun the initdb that debian tried previously, but with debugging turned on
pg_createcluster 15 main -- --debug

# Will print something like:
#
# running bootstrap script ... 2024-04-18 00:00:00.000 UTC [1111] DEBUG:  invoking IpcMemoryCreate(size=3891200)
# Bus error (core dumped)
# child process exited with exit code 135
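
# == Back on the host (optional sanity check) ==
#
# A sketch for confirming the diagnosis; the payload path below is an
# assumption and may differ between incus/LXC versions.
CG=/sys/fs/cgroup/lxc.payload.hugepages-demo
cat "$CG/hugetlb.2MB.max"       # the limit incus did set (512 x 2MB)
cat "$CG/hugetlb.2MB.rsvd.max"  # expected to still read "max", i.e. never set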

Information to attach

pg_createcluster log
root@hugepages-demo:~# pg_createcluster 15 main -- --debug
Creating new PostgreSQL cluster 15/main ...
/usr/lib/postgresql/15/bin/initdb -D /var/lib/postgresql/15/main --auth-local peer --auth-host scram-sha-256 --no-instructions --debug
Running in debug mode.
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

VERSION=15.6 (Debian 15.6-0+deb12u1)
PGDATA=/var/lib/postgresql/15/main
share_path=/usr/share/postgresql/15
PGPATH=/usr/lib/postgresql/15/bin
POSTGRES_SUPERUSERNAME=postgres
POSTGRES_BKI=/usr/share/postgresql/15/postgres.bki
POSTGRESQL_CONF_SAMPLE=/usr/share/postgresql/15/postgresql.conf.sample
PG_HBA_SAMPLE=/usr/share/postgresql/15/pg_hba.conf.sample
PG_IDENT_SAMPLE=/usr/share/postgresql/15/pg_ident.conf.sample
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/15/main ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 20
selecting default shared_buffers ... 400kB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... 2024-04-18 00:41:23.828 UTC [4400] DEBUG:  invoking IpcMemoryCreate(size=3891200)
Bus error (core dumped)
child process exited with exit code 135
initdb: removing contents of data directory "/var/lib/postgresql/15/main"
Error: initdb failed

Side note: congrats on the first stable release of incus. I was very happy to see the project back in the hands of linuxcontainers after the Canonical announcement.

@stgraber
Member

Is your system running cgroup1? (You can show ls -lh /sys/fs/cgroup if unsure.)

@bazaah
Author

bazaah commented Apr 18, 2024

cgroup2; /sys/fs/cgroup/cgroup.controllers exists.

Edit: from mount:

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

@stgraber
Member

Cool, thanks. I'll try to look into this one tomorrow or Friday; it looks pretty easy to sort out based on the runc change.

@stgraber stgraber self-assigned this Apr 19, 2024
@stgraber stgraber added Bug Easy Good for new contributors labels Apr 19, 2024
@stgraber stgraber added this to the incus-6.1 milestone Apr 19, 2024
@stgraber
Member

Got the issue reproduced. I'll try a quick fix now, but this may get postponed for a week or so as I'm about to leave on a trip :)

stgraber added a commit to stgraber/incus that referenced this issue Apr 20, 2024
@bazaah
Author

bazaah commented Apr 20, 2024

Thanks for the fast turnaround, I appreciate it.

stgraber added a commit that referenced this issue May 27, 2024