nixos.gitlab: loosen the coupling of gitlab services to postgresql and redis #286532

Merged

@osnyx osnyx commented Feb 5, 2024

Description of changes

This reduces the strong degree of coupling between Gitlab and its supporting 3rd party services redis and postgresql.
It's a potential solution for #286530.

Background:
When Gitlab and its database postgresql are running on the same system, their systemd services are strongly coupled via binds-to relations to ensure that start/stop/restart of the individual services propagates to the other services.
We already had some slight issues with this in the past and fixed the reliability of coupled restarts in #245240. Since then, however, we have seen additional hard-to-reproduce race conditions, which show that the approach of strongly coupled service restarts is brittle here. In one case, when postgresql needed to restart, Gitlab was shut down while it was still trying to start, and it stayed inactive.
As a consequence, I propose to drop the approach of strong coupling between these services altogether and let the services postgresql or redis restart without forcing a restart of gitlab.service. This just causes the running service to temporarily lose its database connections, which it regains after a successful restart of the db services. Given the long startup times of gitlab.service, this even reduces the downtime caused by a database restart in many cases.
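
To make the difference concrete, here is a minimal sketch of the two dependency styles in a NixOS module. It is not the diff of this PR: `my-app` is a hypothetical service, and using `wants` as the weak replacement is an assumption made for illustration.

```nix
{ pkgs, ... }:
{
  # Hypothetical service, used only to illustrate the two coupling styles.
  systemd.services.my-app = {
    description = "Hypothetical service that talks to a local PostgreSQL";
    wantedBy = [ "multi-user.target" ];
    serviceConfig.ExecStart = "${pkgs.coreutils}/bin/sleep infinity";

    # Strong coupling (the approach being dropped): a stop or restart of
    # postgresql.service is propagated to this unit as well.
    # bindsTo = [ "postgresql.service" ];

    # Loose coupling: postgresql.service is pulled in when this unit starts,
    # but stopping or restarting it later does not touch this unit.
    wants = [ "postgresql.service" ];

    # Ordering only: if both units are started together, wait for the database.
    after = [ "postgresql.service" ];
  };
}
```

With the loose variant, `systemctl restart postgresql.service` leaves `my-app.service` running, so the application has to handle reconnecting on its own.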

Reasoning/ why this decoupling is fine:
By design, both redis and postgresql can be hosted on a different host than the gitlab service itself. The NixOS module actually supports this for postgresql, but not for redis (as a dependency of sidekiq). Still, Gitlab and Sidekiq are designed to cope with temporary unavailability of redis and postgresql and recover from such situations automatically.
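
For reference, pointing the module at a PostgreSQL server on another machine could look roughly like the fragment below. The host name and the secret path are made up, the fragment omits the other settings a working GitLab instance needs, and the option names are given as assumed from the module, so they should be checked against the option documentation.

```nix
{
  services.gitlab = {
    enable = true;

    # Assumption: a PostgreSQL server reachable under this hypothetical name.
    databaseHost = "db.example.internal";

    # Do not create a local database, since it lives on the other host.
    databaseCreateLocally = false;

    # Hypothetical path to a file that contains the database password.
    databasePasswordFile = "/run/secrets/gitlab-db-password";
  };
}
```

In such a setup there is no local postgresql.service to bind to at all, which is the situation the decoupled services are meant to behave like.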

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@github-actions github-actions bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` labels Feb 5, 2024

osnyx commented Feb 5, 2024

This currently makes the gitlab tests fail, I could use some help figuring out how to resolve this.

gitaly fails to restart due to a missing /var/gitlab/state/gitlab_shell_secret. That file is supposed to be created by gitlab-config.service. gitaly.service has both a bindsTo and an after dependency towards gitlab-config, so I don't see why that dependency is not setting things up correctly.
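
Spelled out as a NixOS module fragment, the dependency described above amounts to roughly the following. This is a sketch of the relationship, not a copy of the module source.

```nix
{
  systemd.services.gitaly = {
    # Pull in gitlab-config.service when gitaly starts, and stop gitaly
    # whenever gitlab-config.service stops.
    bindsTo = [ "gitlab-config.service" ];

    # Only start gitaly once gitlab-config.service has finished starting.
    after = [ "gitlab-config.service" ];
  };
}
```

These directives only bind and order the units. If gitlab-config.service is already active when gitaly is started, systemd has no reason to run it again.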

gitlab # [  178.450535] systemd[1]: Stopped GitLab static pages daemon.
gitlab # [  178.451116] systemd[1]: gitlab-pages.service: Consumed 33ms CPU time, 14.3M memory peak, read 10.2M from disk, written 0B to disk, no IP traffic.
gitlab # [  178.452526] systemd[1]: gitaly.service: Main process exited, code=exited, status=1/FAILURE
gitlab # [  178.453535] systemd[1]: gitaly.service: Failed with result 'exit-code'.
gitlab # [  178.456455] systemd[1]: Stopped gitaly.service.
gitlab # [  178.456989] systemd[1]: gitaly.service: Consumed 7.688s CPU time, 91.0M memory peak, read 39.6M from disk, written 0B to disk, no IP traffic.
gitlab # [  178.460064] bundle[1811]: {"timestamp":"2024-02-05T17:15:40.155Z","pid":1811,"message":"=== puma shutdown: 2024-02-05 17:15:40 +0000 ==="}
gitlab # [  178.461155] bundle[1811]: {"timestamp":"2024-02-05T17:15:40.155Z","pid":1811,"message":"- Goodbye!"}
gitlab # [  178.461859] bundle[1811]: {"timestamp":"2024-02-05T17:15:40.155Z","pid":1811,"message":"- Gracefully shutting down workers..."}
gitlab # [  178.462828] systemd[1]: Stopping gitlab.service...
gitlab # [  178.475856] dovecot[1083]: imap-login: Login: user=<alice>, method=PLAIN, rip=::1, lip=::1, mpid=3303, secured, session=<76FqnKUQNsAAAAAAAAAAAAAAAAAAAAAB>
gitlab # [  178.483864] dovecot[1083]: imap(alice)<1432><YBw1kqUQFM4AAAAAAAAAAAAAAAAAAAAB>: Disconnected: Connection closed (IDLE finished 38.030 secs ago) in=368 out=5299 deleted=1 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=1 body_bytes=3602
gitlab # [  178.485587] dovecot[1083]: imap(alice)<3303><76FqnKUQNsAAAAAAAAAAAAAAAAAAAAAB>: Disconnected: Connection closed (SELECT finished 0.005 secs ago) in=23 out=717 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=0 body_bytes=0
gitlab # [  178.487279] dovecot[1083]: imap-login: Disconnected: Connection closed (no auth attempts in 0 secs): user=<>, rip=::1, lip=::1, secured, session=<riJrnKUQMMAAAAAAAAAAAAAAAAAAAAAB>
gitlab # [  178.488630] systemd[1]: gitlab-mailroom.service: Deactivated successfully.
gitlab # [  178.500330] systemd[1]: Stopped GitLab incoming mail daemon.
gitlab # [  178.500881] systemd[1]: gitlab-mailroom.service: Consumed 2.133s CPU time, 16.4M memory peak, read 1.0M from disk, written 0B to disk, received 12.9K IP traffic, sent 6.7K IP traffic.
gitlab # [  178.943909] sidekiq[1810]: {"severity":"INFO","time":"2024-02-05T17:15:40.639Z","message":"Pausing to allow jobs to finish..."}
gitlab # [  179.091123] systemd[1]: gitlab.service: Deactivated successfully.
gitlab # [  179.098386] systemd[1]: Stopped gitlab.service.
gitlab # [  179.098685] systemd[1]: gitlab.service: Consumed 50.974s CPU time, 1.5G memory peak, read 460.0K from disk, written 0B to disk, no IP traffic.
gitlab # [  179.100376] systemd[1]: Stopping gitlab-workhorse.service...
gitlab # [  179.101309] workhorse[1166]: time="2024-02-05T17:15:40Z" level=info msg="shutdown initiated" shutdown_timeout_s=0 signal=terminated
gitlab # [  179.101994] workhorse[1166]: time="2024-02-05T17:15:40Z" level=info msg="keywatcher: shutting down"
gitlab # [  179.102267] workhorse[1166]: time="2024-02-05T17:15:40Z" level=fatal msg="shutting down" error="<nil>"
gitlab # [  179.106077] systemd[1]: gitlab-workhorse.service: Main process exited, code=exited, status=1/FAILURE
gitlab # [  179.106698] systemd[1]: gitlab-workhorse.service: Failed with result 'exit-code'.
gitlab # [  179.114450] systemd[1]: Stopped gitlab-workhorse.service.
gitlab # [  179.115290] systemd[1]: gitlab-workhorse.service: Consumed 317ms CPU time, 52.0M memory peak, read 39.7M from disk, written 0B to disk, no IP traffic.
gitlab # [  179.136337] postgres[1302]: [1302] LOG:  checkpoint complete: wrote 5792 buffers (35.4%); 0 WAL file(s) added, 0 removed, 3 recycled; write=0.137 s, sync=0.541 s, total=0.693 s; sync files=7056, longest=0.094 s, average=0.001 s; distance=51568 kB, estimate=51568 kB
gitlab # [  179.151198] postgres[1301]: [1301] LOG:  database system is shut down
gitlab # [  179.152956] systemd[1]: postgresql.service: Deactivated successfully.
gitlab # [  179.162281] systemd[1]: Stopped PostgreSQL Server.
gitlab # [  179.162816] systemd[1]: postgresql.service: Consumed 19.565s CPU time, no IP traffic.
gitlab: must succeed: find /var/gitlab/state -mindepth 1 -maxdepth 1 -not -name backup -execdir rm -r {} +
(finished: must succeed: find /var/gitlab/state -mindepth 1 -maxdepth 1 -not -name backup -execdir rm -r {} +, in 0.13 seconds)
gitlab: must succeed: systemd-tmpfiles --create
(finished: must succeed: systemd-tmpfiles --create, in 0.06 seconds)
gitlab: must succeed: rm -rf /var/lib/postgresql/15
(finished: must succeed: rm -rf /var/lib/postgresql/15, in 0.14 seconds)
gitlab # [  182.946773] sidekiq[1810]: {"severity":"INFO","time":"2024-02-05T17:15:44.642Z","message":"Bye!"}
gitlab # [  182.947571] sidekiq[1810]: {"severity":"INFO","time":"2024-02-05T17:15:44.643Z","message":"stopped","memwd_reason":"background task stopped","memwd_handler_class":"Gitlab::Memory::Watchdog::Handlers::SidekiqHandler","memwd_sleep_time_s":3,"pid":1810,"worker_id":"sidekiq","memwd_rss_bytes":979329024,"retry":0}
gitlab # [  183.122430] systemd[1]: gitlab-sidekiq.service: Deactivated successfully.
gitlab # [  183.131415] systemd[1]: Stopped gitlab-sidekiq.service.
gitlab # [  183.132115] systemd[1]: gitlab-sidekiq.service: Consumed 50.624s CPU time, 816.0M memory peak, read 10.9M from disk, written 0B to disk, received 3.5K IP traffic, sent 14.0K IP traffic.
gitlab # [  183.145506] systemd[1]: Started gitaly.service.
gitlab # [  183.146977] systemd[1]: Starting PostgreSQL Server...
gitlab # [  183.171908] postgresql-pre-start[3333]: The files belonging to this database system will be owned by user "postgres".
gitlab # [  183.172935] postgresql-pre-start[3333]: This user must also own the server process.
gitlab # [  183.173713] postgresql-pre-start[3333]: The database cluster will be initialized with locale "en_US.UTF-8".
gitlab # [  183.174639] postgresql-pre-start[3333]: The default database encoding has accordingly been set to "UTF8".
gitlab # [  183.175498] postgresql-pre-start[3333]: The default text search configuration will be set to "english".
gitlab # [  183.176278] postgresql-pre-start[3333]: Data page checksums are disabled.
gitlab # [  183.176820] postgresql-pre-start[3333]: fixing permissions on existing directory /var/lib/postgresql/15 ... ok
gitlab # [  183.177720] postgresql-pre-start[3333]: creating subdirectories ... ok
gitlab # [  183.178310] postgresql-pre-start[3333]: selecting dynamic shared memory implementation ... posix
gitlab # [  183.188888] postgresql-pre-start[3333]: selecting default max_connections ... 100
gitlab # [  183.209985] postgresql-pre-start[3333]: selecting default shared_buffers ... 128MB
gitlab # [  183.366517] gitaly[3324]: time="2024-02-05T17:15:45.062Z" level=info msg="grpc prometheus histograms enabled" latencies="[0.001 0.005 0.025 0.1 0.5 1 10 30 60 300 1500]" pid=3324
gitlab # [  183.367288] gitaly[3324]: time="2024-02-05T17:15:45.062Z" level=info msg="Starting Gitaly" pid=3324 version=16.8.1
gitlab # [  183.367924] gitaly[3324]: time="2024-02-05T17:15:45.062Z" level=info msg="finished initializing cgroups" duration_ms=0 pid=3324
gitlab # [  183.368368] postgresql-pre-start[3333]: selecting default time zone ... UTC
gitlab # [  183.370173] postgresql-pre-start[3333]: creating configuration files ... ok
gitlab # [  183.379379] gitaly[3324]: time="2024-02-05T17:15:45.075Z" level=info msg="finished unpacking auxiliary binaries" duration_ms=12 pid=3324
gitlab # [  183.380059] gitaly[3324]: time="2024-02-05T17:15:45.075Z" level=info msg="finished initializing bootstrap" duration_ms=0 pid=3324
gitlab # [  183.381492] gitaly[3324]: time="2024-02-05T17:15:45.077Z" level=info msg="finished initializing command factory" duration_ms=2 pid=3324
gitlab # [  183.384212] gitaly[3324]: time="2024-02-05T17:15:45.080Z" level=info msg="finished detecting git version" duration_ms=2 pid=3324
gitlab # [  183.391142] gitaly[3324]: unclean Gitaly shutdown: could not create GitLab API client: reading secret file: open /var/gitlab/state/gitlab_shell_secret: no such file or directory
gitlab # [  183.402600] systemd[1]: gitaly.service: Main process exited, code=exited, status=1/FAILURE
gitlab # [  183.403096] systemd[1]: gitaly.service: Failed with result 'exit-code'.
gitlab # [  183.490109] postgresql-pre-start[3333]: running bootstrap script ... ok
gitlab # [  183.834842] postgresql-pre-start[3333]: performing post-bootstrap initialization ... ok
gitlab # [  184.057452] postgresql-pre-start[3333]: syncing data to disk ... ok
gitlab # [  184.057926] postgresql-pre-start[3333]: initdb: warning: enabling "trust" authentication for local connections
gitlab # [  184.058312] postgresql-pre-start[3333]: initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
gitlab # [  184.058590] postgresql-pre-start[3333]: Success. You can now start the database server using:
gitlab # [  184.058818] postgresql-pre-start[3333]:     /nix/store/gz2m222hp7d3pnaf1xbk1r3ympmz752r-postgresql-and-plugins-15.5/bin/pg_ctl -D /var/lib/postgresql/15 -l logfile start
gitlab # [  184.094516] postgres[3349]: [3349] LOG:  starting PostgreSQL 15.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.0, 64-bit
gitlab # [  184.095270] postgres[3349]: [3349] LOG:  listening on IPv6 address "::1", port 5432
gitlab # [  184.095544] postgres[3349]: [3349] LOG:  listening on IPv4 address "127.0.0.1", port 5432
gitlab # [  184.096693] postgres[3349]: [3349] LOG:  listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
gitlab # [  184.100516] postgres[3352]: [3352] LOG:  database system was shut down at 2024-02-05 17:15:45 GMT
gitlab # [  184.103942] postgres[3349]: [3349] LOG:  database system is ready to accept connections
gitlab # [  184.143166] postgresql-post-start[3363]: CREATE ROLE
gitlab # [  184.148267] postgresql-post-start[3365]: ALTER ROLE
gitlab # [  184.149448] systemd[1]: Started PostgreSQL Server.
gitlab # [  184.162463] systemd[1]: Starting gitlab-postgresql.service...
gitlab # [  184.224067] gitlab-postgresql-start[3373]: CREATE DATABASE
gitlab # [  184.243067] gitlab-postgresql-start[3377]: CREATE EXTENSION
gitlab # [  184.280058] gitlab-postgresql-start[3379]: CREATE EXTENSION
gitlab # [  184.292402] systemd[1]: Finished gitlab-postgresql.service.
gitlab: waiting for file '/var/gitlab/state/tmp/sockets/gitaly.socket'

@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 labels Feb 5, 2024
@talyz talyz left a comment

Sounds reasonable. Sorry about the number of separate reviews. I have apparently forgotten how to do this properly on GitHub. Lastly, though:

The tests fail because gitlab-config.service isn't properly stopped before it's started again. The issue seems to be that stopping a target, in this case gitlab.target, is not a blocking operation. It probably had time to stop when more had to be done before postgresql.service could be stopped, and therefore this wasn't apparent. Adding gitlab-config.service to the list of units to stop before removing the database in the gitlab test solves this issue. Here's a patch:

--- a/nixos/tests/gitlab.nix
+++ b/nixos/tests/gitlab.nix
@@ -419,7 +419,7 @@ in {
       gitlab.systemctl("start gitlab-backup.service")
       gitlab.wait_for_unit("gitlab-backup.service")
       gitlab.wait_for_file("${nodes.gitlab.services.gitlab.statePath}/backup/dump_gitlab_backup.tar")
-      gitlab.systemctl("stop postgresql.service gitlab.target")
+      gitlab.systemctl("stop postgresql.service gitlab-config.service gitlab.target")
       gitlab.succeed(
           "find ${nodes.gitlab.services.gitlab.statePath} -mindepth 1 -maxdepth 1 -not -name backup -execdir rm -r {} +"
       )
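
If the test should make the "fully stopped" assumption explicit rather than rely on timing, one option is to poll the unit state before wiping the state directory. The sketch below is not part of the patch: the `wait_until_succeeds` line is a hypothetical extra guard built on the test driver's existing helper and on `systemctl show`, and only the relevant lines of the test script are shown.

```nix
{
  # Fragment of a NixOS test; the Python test script is embedded as a string.
  testScript = ''
    # As in the patch: stop the config unit explicitly together with the rest.
    gitlab.systemctl("stop postgresql.service gitlab-config.service gitlab.target")

    # Hypothetical guard: wait until systemd reports the unit as inactive,
    # so that the next start actually runs it again and recreates its files.
    gitlab.wait_until_succeeds(
        'test "$(systemctl show -p ActiveState --value gitlab-config.service)" = inactive'
    )
  '';
}
```

The explicit stop from the patch is what fixes the test; the polling step only documents the assumption that the unit has really reached the inactive state.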

@osnyx osnyx force-pushed the PL-131811-gitlab-loose-coupling-upstream branch from 45af7df to 85423d5 Compare February 7, 2024 13:52
@osnyx osnyx marked this pull request as ready for review February 7, 2024 13:55

osnyx commented Feb 7, 2024

@talyz Thanks for catching the bug in the test. The binding of gitlab-config to gitlab-db-config apparently was aimed in the right direction, as I did that to ensure gitlab-config is stopped when gitlab is stopped after backup and thus ready to be started again and initialise files at the next start.
But given the implications you explained well, adjusting the manual stop calls in the test is the better solution indeed.


osnyx commented Feb 7, 2024

Regarding release notes, I consider this to just be a change in the under-the-hood mechanics and not worth a mention there.

…s/ redis to avoid restarts and races

Gitlab stays running at redis and postgresql restarts as if these
components were on a different host anyways. Handling reconnections is
part of the application logic.

Co-authored-by: Kim Lindberger <[email protected]>
for formatting fixes and test failure debugging.
@osnyx osnyx force-pushed the PL-131811-gitlab-loose-coupling-upstream branch from 85423d5 to 13ba002 Compare February 7, 2024 17:19

osnyx commented Feb 7, 2024

Tests are now passing for me as well.

@talyz talyz merged commit debe2ca into NixOS:master Feb 7, 2024
21 checks passed
Development

Successfully merging this pull request may close these issues.

gitlab: race condition of restarting coupled DB services ends up in gitlab being inactive
2 participants