PL-132122 upgrade hardware to 24.11 #1122

Merged Dec 20, 2024 (1,112 commits)

@ctheune ctheune commented Oct 8, 2024

@flyingcircusio/release-managers

Release process

Impact:

Changelog:

  • Port changes to support our physical infrastructure from the 21.05 release branch. (PL-132122)

PR release workflow (internal)

  • PR has internal ticket
  • internal issue ID (PL-…) part of branch name
  • internal issue ID mentioned in PR description text
  • ticket is on Platform agile board
  • ticket state set to Pull request ready
  • if ticket is more urgent than within the next few days, directly contact a member of the Platform team

Design notes

  • Provide a feature toggle if the change might need to be adjusted/reverted quickly depending on context. Consider whether the default should be on or off. Example: rate limiting.

n/a

  • All customer-facing features and (NixOS) options need to be discoverable from documentation. Add or update relevant documentation such that hosted and guided customers can understand it as well.

n/a

Security implications

zagy and others added 30 commits August 21, 2024 13:23
Establish a regular sensu check on physical hosts that verifies certain
LUKS header parameters of all encrypted devices, to discover unexpected
deviations.

- tie together the individual condition checks
- integrate checks into fc-luks tooling with discovery of encrypted
  volumes
- add sensu check
- extend checks with plausibility checks for proper dump input
- add unit and integration tests for checks with proper mocking
This makes sense, as it exposes the check to potential changes in the
output format of `cryptsetup luksDump`.
Detailed unit tests are done against mock outputs only.
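
A minimal sketch of what such a header check against mock output could look like. This is not the fc-luks implementation: `MOCK_DUMP` is a hypothetical, heavily shortened stand-in for `cryptsetup luksDump` output, and `EXPECTED`, `parse_dump` and `check_header` are names and values assumed for illustration only.

```python
import re

# Hypothetical, shortened stand-in for `cryptsetup luksDump` output.
MOCK_DUMP = """\
LUKS header information
Version:        2
Keyslots:
  0: luks2
        Cipher:     aes-xts-plain64
        PBKDF:      argon2id
"""

# Assumed expectations; the real check defines its own parameters.
EXPECTED = {"Version": "2", "Cipher": "aes-xts-plain64", "PBKDF": "argon2id"}


def parse_dump(text):
    """Collect 'Key: value' lines; repeated keys accumulate in a list."""
    found = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([A-Za-z][\w ]*):\s+(\S.*)$", line)
        if match:
            found.setdefault(match.group(1), []).append(match.group(2).strip())
    return found


def check_header(text, expected=EXPECTED):
    """Return a list of problems; an empty list means the header looks fine."""
    # Plausibility check: refuse input that is not a dump at all.
    if "LUKS header information" not in text:
        return ["input does not look like a luksDump"]
    found = parse_dump(text)
    errors = []
    for key, want in expected.items():
        values = found.get(key)
        if not values:
            errors.append(f"{key}: missing")
        elif any(value != want for value in values):
            errors.append(f"{key}: expected {want}, got {sorted(set(values))}")
    return errors
```

Keeping the parser separate from the comparison lets unit tests feed in mock dumps, while an integration test can still pass real tool output through the same `check_header` path.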
Not all physical machines actually have encrypted volumes as of now,
e.g. KVM hosts. The check does not need to run on them.
…cleanup-unused-vms

devhost: cleanup unused VMs (PL-132737)
When fc-luks commands run as a sensu check, they no longer inherit
the full system PATH, which breaks access to required external
tools like `lvs`.
We can adopt the `fc-ceph.conf` approach already used by the `fc-ceph`
subsystems; for now, even the default PATH is sufficient.

Requires some extra mocking in unit tests; the NixOS integration test
now explicitly empties the PATH beforehand.
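
As a rough illustration of the idea (not the actual fc-luks code), external tools can be invoked with an explicitly set PATH instead of relying on whatever the calling service provides. `DEFAULT_PATH`, `tool_env` and `run_tool` are hypothetical names, and the fallback value is an assumption; the real tooling takes its setting from configuration.

```python
import os
import subprocess

# Assumed fallback; the real default may differ.
DEFAULT_PATH = "/run/current-system/sw/bin:/usr/bin:/bin"


def tool_env(configured_path=None, base_env=None):
    """Build a subprocess environment with an explicitly set PATH."""
    env = dict(os.environ if base_env is None else base_env)
    env["PATH"] = configured_path or DEFAULT_PATH
    return env


def run_tool(argv, configured_path=None):
    """Run an external tool with the explicit PATH and return its stdout."""
    result = subprocess.run(
        argv,
        env=tool_env(configured_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Because `tool_env` accepts a `base_env`, a test can start from an empty environment and still verify that tool lookup works.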
[21.05] full-disk encryption: finalisation, improvements, checks, routine tooling
This test artificially causes the FIB and RIB to get out of sync, and
checks that the sensu script correctly detects these problems.

PL-132595
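
The kind of comparison such a script has to make can be sketched as a set difference between the routes in the RIB and those in the FIB. How the sensu script actually collects and compares routes is not shown here; `diff_routes` and `check_sync` are hypothetical helpers, and the route strings are made-up examples.

```python
def diff_routes(rib, fib):
    """Compare two collections of route prefixes and report mismatches."""
    rib, fib = set(rib), set(fib)
    return {
        "missing_in_fib": sorted(rib - fib),  # known to routing, not installed
        "stale_in_fib": sorted(fib - rib),    # installed, no longer known
    }


def check_sync(rib, fib):
    """Return (exit_code, message): 0 for OK, 2 for critical."""
    diff = diff_routes(rib, fib)
    if diff["missing_in_fib"] or diff["stale_in_fib"]:
        return 2, f"CRITICAL: FIB and RIB out of sync: {diff}"
    return 0, "OK: FIB and RIB in sync"
```

A test can then remove or add a single entry on one side and assert that the check turns critical.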
…control-plane

[21.05] Sensu monitoring for EVPN control plane
…ng-restore

[21.05] restore-single-files: reduce chance of UUID collision
For performance reasons, we did not always change the passwords,
only those of new users.

This caused inconsistencies more often than we can tolerate.

This now fixes it properly by parallelizing the updates, limiting
parallelism to CPU_COUNT-1, and also deleting superfluous users.

Fixes PL-132945
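
The approach described above could be sketched roughly as follows. This is an illustration under assumptions, not the actual fix: `plan_user_sync` and `sync_users` are hypothetical names, and `set_password` and `delete_user` stand in for whatever calls really talk to RabbitMQ.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def plan_user_sync(desired, existing):
    """desired maps user name to password; existing is an iterable of names."""
    to_update = sorted(desired)
    to_delete = sorted(set(existing) - set(desired))
    return to_update, to_delete


def sync_users(desired, existing, set_password, delete_user):
    """Update every desired user in parallel, then delete superfluous users."""
    to_update, to_delete = plan_user_sync(desired, existing)
    # Leave one CPU free, but always keep at least one worker.
    workers = max(1, (os.cpu_count() or 2) - 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces evaluation so worker exceptions surface here.
        list(pool.map(lambda name: set_password(name, desired[name]), to_update))
    for name in to_delete:
        delete_user(name)
    return workers
```

Updating all users on every run, rather than only new ones, is what removes the inconsistency; the worker pool is there to keep that affordable.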
…ser-updates

[21.05] fix rabbitmq user updates for sensu
We've seen instances where the / endpoint was responding but the
radosgw was completely stuck.

We hope to notice those better now with checks that actually hit a
real object.

Re PL-132070
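
A check that fetches a real object, rather than only probing `/`, might look like the sketch below. It is an assumption-laden illustration: `check_object` is a hypothetical function, the URL of the probe object is left to the caller, and the actual check may use a different client and different criteria.

```python
import urllib.request


def check_object(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch one object and return (exit_code, message) in sensu style."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            body = response.read()
    except OSError as exc:
        # URLError, HTTPError and timeouts are all OSError subclasses.
        return 2, f"CRITICAL: {url}: {exc}"
    if status != 200:
        return 2, f"CRITICAL: {url}: unexpected status {status}"
    return 0, f"OK: {url}: fetched {len(body)} bytes"
```

Reading the body within a timeout matters here, because a stuck gateway may still accept connections without ever delivering data.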
…nitoring

ceph/radosgw: improve health monitoring
@osnyx osnyx force-pushed the PL-132122-upgrade-hardware-to-24.11 branch from 5970d8d to 1066610 on December 13, 2024 13:03
ctheune and others added 19 commits December 13, 2024 15:47
We cannot override the test harness node IDs as they are read-only.
AFAICT this isn't how we set things up in production and it causes
spurious/flaky issues in the FIB.
We do not enforce or guarantee a specific reboot time. Under
load, the test slightly missed the 10-second timeout. Over multiple
invocations I could not make out any structural problem with
NFS unmounts during shutdown, so let's just be a bit more generous here
with the timeouts.
Since 24.11, the innovation and lts names are deprecated aliases.
@osnyx osnyx force-pushed the PL-132122-upgrade-hardware-to-24.11 branch from 3847caa to c4653a7 on December 20, 2024 09:38
@osnyx osnyx merged commit 615dd10 into fc-24.11-dev Dec 20, 2024
0 of 2 checks passed
@osnyx osnyx deleted the PL-132122-upgrade-hardware-to-24.11 branch December 20, 2024 11:55
7 participants