Historically, there may have been a problem in the DSS where a DSS instance would fail at least its first operation, due to database connectivity problems, after a long period of idleness and/or after its CRDB server restarted once core-service (then grpc-backend) had already started. The exact circumstances and triggers of this possible problem are not well understood, and it may have been due to factors unrelated to the DSS design. Around the time of this problem, a feature was added to core-service (then grpc-backend) that would ping the database every minute to attempt to avoid the issue: if the ping failed, the core-service instance would panic and kill itself so that the Kubernetes orchestrator would restart it (turning it off and back on again) to re-establish the database connection. More recently, we switched the periodic check from pinging the database to using connection statistics (#679), but it turned out (#691) that the connection count could drop to zero without actually indicating a bad database connection. So we changed the panic to a warning message (#692), which means the DSS no longer restarts itself when it has a bad database connection.
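For illustration only, a minimal Go sketch of the kind of periodic check described above (the package and function names, the one-minute interval handling, and the use of database/sql are illustrative assumptions, not the actual core-service code):

```go
package connwatch

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// watchConnectivity pings the database once per minute and panics on failure,
// relying on the orchestrator (e.g. Kubernetes) to restart the process and
// thereby re-establish the database connection. This mirrors the historical
// behavior described above; it is a sketch, not the DSS implementation.
func watchConnectivity(ctx context.Context, db *sql.DB) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pingCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
			err := db.PingContext(pingCtx)
			cancel()
			if err != nil {
				log.Panicf("database connectivity check failed: %v", err)
			}
		}
	}
}
```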
In the long term, this resilience should be built into the database client. If an initial attempt to interact with the database fails due to a now-invalid connection, the database client should transparently attempt to repair the connection and then retry the operation before returning to the caller. Importantly, we should also consider how system maintenance is to be performed. If CRDB nodes can be offline for short periods of time (e.g., when upgrading to new versions) while their corresponding core-service instances are still online, then core-service should be able to fail over to other CRDB nodes when attempting to fulfill requests.
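As a sketch of what that retry behavior could look like, here is a hypothetical Go wrapper (the names retryOnConnError, isConnectionError, and the reconnect/backoff details are assumptions, not an existing DSS API). A real implementation would classify connection-level errors using driver-specific error codes and would also need to handle failover to other CRDB nodes rather than simply reconnecting to the same one:

```go
package connretry

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errConnClosed stands in for whatever connection-level error the driver
// returns; real code would inspect driver-specific error codes instead.
var errConnClosed = errors.New("connection closed")

// isConnectionError reports whether err indicates a broken connection rather
// than an ordinary query failure. This predicate is a placeholder.
func isConnectionError(err error) bool {
	return errors.Is(err, errConnClosed)
}

// retryOnConnError runs op; if op fails with a connection-level error, it
// attempts to repair the connection via reconnect and retries, up to
// maxAttempts, before surfacing the error to the caller.
func retryOnConnError(ctx context.Context, maxAttempts int, reconnect, op func(context.Context) error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(ctx); err == nil || !isConnectionError(err) {
			return err
		}
		if rerr := reconnect(ctx); rerr != nil {
			return fmt.Errorf("reconnect after attempt %d failed: %w", attempt, rerr)
		}
		time.Sleep(time.Duration(attempt) * 100 * time.Millisecond) // simple linear backoff
	}
	return err
}
```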