hugetlb cgroup controller does not set rsvd, leading to segfaults in postgres + initdb #769

Closed
bazaah opened this issue Apr 18, 2024 · 5 comments · Fixed by #774
@bazaah

bazaah commented Apr 18, 2024

Required information

  • Distribution: Archlinux
  • Distribution version: Linux <snip> 6.8.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 16 Mar 2024 17:15:35 +0000 x86_64 GNU/Linux
  • The output of "incus info":
Incus Info
config:
  core.https_address: <snip>
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
  addresses:
  - <snip>
  architectures:
  - x86_64
  - i686
  certificate: <snip>
  certificate_fingerprint: <snip>
  driver: lxc | qemu
  driver_version: 5.0.3 | 8.2.2
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "true"
    unpriv_fscaps: "true"
  kernel_version: 6.8.1-arch1-1
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Arch Linux
  os_version: ""
  project: default
  server: incus
  server_clustered: false
  server_event_mode: full-mesh
  server_name: <snip>
  server_pid: <snip>
  server_version: "0.6"
  storage: ceph
  storage_version: 18.2.2
  storage_supported_drivers:
  - name: btrfs
    version: 6.7.1
    remote: false
  - name: ceph
    version: 18.2.2
    remote: true
  - name: cephfs
    version: 18.2.2
    remote: true
  - name: cephobject
    version: 18.2.2
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.23(2) (2023-11-21) / 1.02.197 (2023-11-21) / 4.48.0
    remote: false

Issue description

While attempting to get hugepages working for a postgres database in an unprivileged container, I encountered repeated segfaults during the initdb sequence.

This was somewhat confusing, because by default postgres/initdb will attempt to use hugepages but gracefully fall back to normal memory if they are unavailable. So postgres had clearly been induced to believe that hugepages existed, yet when it went to use them the host kernel killed the process.
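For reference, this fallback is controlled by postgres's huge_pages setting, which defaults to "try":

# postgresql.conf (excerpt; "try" is the default)
# try -> attempt huge pages at startup, quietly fall back to normal pages if the mmap() fails
# on  -> refuse to start if huge pages cannot be mapped
# off -> never use huge pages
huge_pages = try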

Sometime later this evening, I think I figured it out.

At the end of the repro, you'll be greeted with an error like:

running bootstrap script ... 2024-04-18 00:00:00.000 UTC [1111] DEBUG: invoking IpcMemoryCreate(size=3891200)
Bus error (core dumped)
child process exited with exit code 135
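As a sanity check, exit code 135 is 128 + 7, i.e. the child was killed by SIGBUS:

kill -l 7
# BUS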

Googling this error brings up lots of related issues, particularly around Kubernetes deployments.

Eventually, however, you'll find opencontainers/runtime-spec#1050, which explains the problem:

The previous non-rsvd max/limit_in_bytes does not account for reserved
huge page memory, making it possible for a processes to reserve all the
huge page memory, without being able to allocate it (due to cgroup
restrictions).

In practice this makes it possible to successfully mmap more huge page
memory than allowed via the cgroup settings, but when using the memory
the process will get a SIGBUS and crash. This is bad for applications
trying to mmap at startup (and it succeeds), but the program crashes
when starting to use the memory. eg. postgres is doing this by default.

This was fixed in runc by opencontainers/runc#4073.

I'm not sure exactly how this translates to incus's codebase, but from what little digging I've done around the hugetlb controller, I can find no mention of the hugetlb.<pagesize>.rsvd cgroup limits being set, only the older hugetlb.<pagesize>.limit_in_bytes.
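For the curious: on cgroup2 the kernel exposes both knobs per page size, and as I read the runc change, the fix is to mirror the configured limit into the rsvd file. A sketch, with an illustrative <scope> path:

# Both limits exist per supported page size under the instance's cgroup:
cat /sys/fs/cgroup/<scope>/hugetlb.2MB.max        # caps page allocation (faulting) only
cat /sys/fs/cgroup/<scope>/hugetlb.2MB.rsvd.max   # also caps reservations made at mmap() time
# With only the first one set, an over-sized mmap() can still succeed, and the
# process only gets SIGBUS later, when it actually faults the pages in.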

Steps to reproduce

# == On the host ==

# Ensure hugepages support is enabled & we have some allocated on the system
ls -l /dev/hugepages
sysctl -w vm.nr_hugepages=1024

# Make a debian container for the demo, with what _should_ allow hugepage support
incus init images:debian/bookworm hugepages-demo
incus config set hugepages-demo limits.hugepages.2MB=512 security.syscalls.intercept.mount=true security.syscalls.intercept.mount.allowed=hugetlbfs
incus start hugepages-demo
incus exec -t hugepages-demo -- su -

# == In the container ==

# Install postgres, ignore the default db that debian happily creates for you, though do note the initdb core dumps...
apt update && apt install -y eatmydata && eatmydata -- apt install -y postgresql postgresql-contrib

# Get /dev/hugepages mounted inside the container, i.e. confirm that the mount interception configured above works
sed '/ConditionVirtualization/d' /usr/lib/systemd/system/dev-hugepages.mount > /etc/systemd/system/dev-hugepages.mount
systemctl daemon-reload && systemctl start dev-hugepages.mount && ls -lash /dev/hugepages

# Now for the failure.
#
# We rerun the initdb that debian tried previously, but with debugging turned on
pg_createcluster 15 main -- --debug

# Will print something like:
#
# running bootstrap script ... 2024-04-18 00:00:00.000 UTC [1111] DEBUG:  invoking IpcMemoryCreate(size=3891200)
# Bus error (core dumped)
# child process exited with exit code 135
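
# == Back on the host (optional sanity check) ==
#
# A sketch for confirming the diagnosis; the payload path below is an
# assumption and may differ between incus/LXC versions.
CG=/sys/fs/cgroup/lxc.payload.hugepages-demo
cat "$CG/hugetlb.2MB.max"       # the limit incus did set (512 x 2MB)
cat "$CG/hugetlb.2MB.rsvd.max"  # expected to still read "max", i.e. never set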

Information to attach

pg_createcluster log
root@hugepages-demo:~# pg_createcluster 15 main -- --debug
Creating new PostgreSQL cluster 15/main ...
/usr/lib/postgresql/15/bin/initdb -D /var/lib/postgresql/15/main --auth-local peer --auth-host scram-sha-256 --no-instructions --debug
Running in debug mode.
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

VERSION=15.6 (Debian 15.6-0+deb12u1)
PGDATA=/var/lib/postgresql/15/main
share_path=/usr/share/postgresql/15
PGPATH=/usr/lib/postgresql/15/bin
POSTGRES_SUPERUSERNAME=postgres
POSTGRES_BKI=/usr/share/postgresql/15/postgres.bki
POSTGRESQL_CONF_SAMPLE=/usr/share/postgresql/15/postgresql.conf.sample
PG_HBA_SAMPLE=/usr/share/postgresql/15/pg_hba.conf.sample
PG_IDENT_SAMPLE=/usr/share/postgresql/15/pg_ident.conf.sample
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/15/main ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 20
selecting default shared_buffers ... 400kB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... 2024-04-18 00:41:23.828 UTC [4400] DEBUG:  invoking IpcMemoryCreate(size=3891200)
Bus error (core dumped)
child process exited with exit code 135
initdb: removing contents of data directory "/var/lib/postgresql/15/main"
Error: initdb failed

Side note: congrats on the first stable release of incus. I was very happy to see the project back in the hands of linuxcontainers after the Canonical announcement.

@stgraber
Member

Is your system running cgroup1? (You can show ls -lh /sys/fs/cgroup if unsure.)

@bazaah
Author

bazaah commented Apr 18, 2024

cgroup2; /sys/fs/cgroup/cgroup.controllers exists.

Edit: from mount:

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

@stgraber
Member

Cool, thanks. I'll try to look into this one tomorrow or Friday; it looks pretty easy to sort out based on the runc change.

@stgraber stgraber self-assigned this Apr 19, 2024
@stgraber stgraber added Bug Easy Good for new contributors labels Apr 19, 2024
@stgraber stgraber added this to the incus-6.1 milestone Apr 19, 2024
@stgraber
Member

Got the issue reproduced. I'll try a quick fix now, but this may get postponed for a week or so as I'm about to leave on a trip :)

stgraber added a commit to stgraber/incus that referenced this issue Apr 20, 2024
@bazaah
Author

bazaah commented Apr 20, 2024

Thanks for the fast turnaround, I appreciate it.

stgraber added a commit that referenced this issue May 27, 2024