Historically, there may have been a problem in the DSS where a DSS instance would fail at least its first operation, due to database connectivity problems, after a long period of idleness and/or after its CRDB server restarted once core-service (then grpc-backend) had already started. The exact circumstances and triggers of this possible problem are not well understood, and it may have been due to factors unrelated to the DSS design. Around the time of this problem, a feature was added to core-service (then grpc-backend) that would ping the database every minute to attempt to avoid the issue: if the ping failed, the core-service instance would panic and kill itself so that the Kubernetes orchestrator would restart it (turning it off and back on again) to re-establish the database connection. More recently, we switched the periodic check from pinging the database to using connection statistics (#679), but it turned out (#691) that the connection count could drop to zero without actually indicating a bad database connection. So we changed the panic to a warning message (#692), which means the DSS no longer restarts itself when it has a bad database connection.
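For illustration only, a minimal Go sketch of the kind of periodic check described above (the package and function names, the one-minute interval handling, and the use of database/sql are illustrative assumptions, not the actual core-service code):

```go
package connwatch

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// watchConnectivity pings the database once per minute and panics on failure,
// relying on the orchestrator (e.g. Kubernetes) to restart the process and
// thereby re-establish the database connection. This mirrors the historical
// behavior described above; it is a sketch, not the DSS implementation.
func watchConnectivity(ctx context.Context, db *sql.DB) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pingCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
			err := db.PingContext(pingCtx)
			cancel()
			if err != nil {
				log.Panicf("database connectivity check failed: %v", err)
			}
		}
	}
}
```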
In the long term, this resilience should be built into the database client. If an initial attempt to interact with the database fails due to a now-invalid connection, the database client should transparently attempt to repair the connection and then retry the operation before returning to the caller. Importantly, we should also consider how system maintenance is to be performed. If CRDB nodes can be offline for short periods of time (e.g., when upgrading to new versions) while their corresponding core-service instances are still online, then core-service should be able to fail over to other CRDB nodes when attempting to fulfill requests.
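As a sketch of what that retry behavior could look like, here is a hypothetical Go wrapper (the names retryOnConnError, isConnectionError, and the reconnect/backoff details are assumptions, not an existing DSS API). A real implementation would classify connection-level errors using driver-specific error codes and would also need to handle failover to other CRDB nodes rather than simply reconnecting to the same one:

```go
package connretry

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errConnClosed stands in for whatever connection-level error the driver
// returns; real code would inspect driver-specific error codes instead.
var errConnClosed = errors.New("connection closed")

// isConnectionError reports whether err indicates a broken connection rather
// than an ordinary query failure. This predicate is a placeholder.
func isConnectionError(err error) bool {
	return errors.Is(err, errConnClosed)
}

// retryOnConnError runs op; if op fails with a connection-level error, it
// attempts to repair the connection via reconnect and retries, up to
// maxAttempts, before surfacing the error to the caller.
func retryOnConnError(ctx context.Context, maxAttempts int, reconnect, op func(context.Context) error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(ctx); err == nil || !isConnectionError(err) {
			return err
		}
		if rerr := reconnect(ctx); rerr != nil {
			return fmt.Errorf("reconnect after attempt %d failed: %w", attempt, rerr)
		}
		time.Sleep(time.Duration(attempt) * 100 * time.Millisecond) // simple linear backoff
	}
	return err
}
```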