Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abort HA Realization Logic After Timeout #800

Merged
merged 6 commits into from
Dec 16, 2024
Merged

Conversation

oxzi
Copy link
Member

@oxzi oxzi commented Sep 5, 2024

  • icingadb: Unify select cases for derived contexts
    The main loop select cases for hactx.Done() and ctx.Done() were unified, as hactx is a derived ctx. A closed ctx case may be lost as the hactx case could have been chosen.
  • HA: Increase log level for heartbeats from the future
    Timing issues may be the root of future failures. Thus, it is important to be aware if the timing seems to be out of sync.
  • HA: Deferred SQL Transaction Rollback
    Each transaction is created within the retryable function, but this function may be exited prematurely before committing. A deferred rollback ensures that the transaction will be rolled back and cleaned up in this case, or will be a noop when performed after the commit.
  • HA: Insert environment within retryable function
    The HA.insertEnvironment() method was inlined into the retryable function to use the deadlined context. Otherwise, this might block afterwards, as it was used within HA.realize(), but without the passed context.
  • HA/Heartbeat: Use last message's timestamp
    Since the retryable HA function may be executed a few times before succeeding, the inserted heartbeat value will be directly outdated. The heartbeat logic was slightly altered to always use the latest heartbeat time value.
  • HA: Abort Transaction Commit after Timeout
    A strange HA behavior was reported in Competing HA takeover results in both instances becoming active #787, resulting in both instances being active.
    The logs contained an entry of the previous active instance exiting the HA.realize() method successfully after 1m9s. This, however, should not be possible as the method's context is deadlined to a minute after the heartbeat was received.
    However, as it turns out, executing COMMIT on a database transaction is not bound to the transaction's context, allowing to survive longer. To mitigate this, another context watch was introduced. Doing so allows directly handing over, while the other instance can now take over due to the expired heartbeat in the database.

Closes #787.

@oxzi oxzi added the bug Something isn't working label Sep 5, 2024
@oxzi oxzi added this to the 1.3.0 milestone Sep 5, 2024
@oxzi oxzi requested a review from julianbrost September 5, 2024 10:34
@cla-bot cla-bot bot added the cla/signed label Sep 5, 2024
pkg/icingadb/ha.go Outdated Show resolved Hide resolved
@oxzi oxzi force-pushed the ha-desperate-timing-fixes branch from 46a9b12 to a03246f Compare September 10, 2024 07:33
@oxzi oxzi requested a review from julianbrost September 10, 2024 07:33
@oxzi oxzi force-pushed the ha-desperate-timing-fixes branch from a03246f to c13752a Compare September 23, 2024 13:16
@oxzi
Copy link
Member Author

oxzi commented Sep 23, 2024

After going over this with @lippserd, I reworked this PR. The changes are, in a nutshell, that the potentially blocking COMMIT query will be guarded with a select watching at the context and that always the latest heartbeat timestamp will be used, as otherwise an already heartbeat will be retried. The description in the initial post was updated accordingly.

@oxzi oxzi marked this pull request as ready for review September 23, 2024 13:19
@oxzi oxzi requested a review from lippserd September 23, 2024 13:20
Copy link
Member

@lippserd lippserd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase and add a comment about COMMIT possibly not respecting context deadlines in ExecTx() and NamedExecTx().

pkg/icingadb/ha.go Outdated Show resolved Hide resolved
pkg/icingadb/ha.go Outdated Show resolved Hide resolved
@oxzi oxzi force-pushed the ha-desperate-timing-fixes branch from c13752a to 3e78b92 Compare September 24, 2024 11:59
@oxzi oxzi requested a review from lippserd September 24, 2024 12:00
oxzi added a commit to Icinga/icinga-go-library that referenced this pull request Sep 24, 2024
A key finding of Icinga/icingadb#800 was that committing a transaction
does not necessarily have to respect the context of the transaction,
depending on the database driver. As @lippserd suggested there, I have
added notes to the documentation of all relevant database functions.
lippserd
lippserd previously approved these changes Sep 30, 2024
oxzi added a commit to Icinga/icinga-go-library that referenced this pull request Oct 8, 2024
A key finding of Icinga/icingadb#800 was that committing a transaction
does not necessarily have to respect the context of the transaction,
depending on the database driver. As @lippserd suggested there, I have
added notes to the documentation of all relevant database functions.
pkg/icingaredis/heartbeat.go Outdated Show resolved Hide resolved
pkg/icingaredis/heartbeat.go Outdated Show resolved Hide resolved
Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the conflict, everything looks fine to me now. Sorry for that :)

Copy link
Member

@lippserd lippserd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I approved earlier, I now have something that I had previously forgotten. It's not about the code itself, but about the commit(s). The changes here are subject to multiple commits:

  • Changes in main
  • Increasing the "heartbeat from future" log to warning
  • Explicit rollback
  • Inserting the environment inside the transaction
  • Introduction and use of LastMessageTime() (plus the corresponding godoc paragraph of realize)
  • Executing commit in a separate Go routine (plus its godoc paragraph of realize)

The main loop select cases for hactx.Done() and ctx.Done() were unified,
as hactx is a derived ctx. A closed ctx case may be lost as the hactx
case could have been chosen.
oxzi added 5 commits October 25, 2024 10:29
Timing issues may be the root of future failures. Thus, it is important
to be aware if the timing seems to be out of sync.
Each transaction is created within the retryable function, but this
function may be exited prematurely before committing. A deferred
rollback ensures that the transaction will be rolled back and cleaned up
in this case, or will be a noop when performed after the commit.
The HA.insertEnvironment() method was inlined into the retryable
function to use the deadlined context. Otherwise, this might block
afterwards, as it was used within HA.realize(), but without the passed
context.
Since the retryable HA function may be executed a few times before
succeeding, the inserted heartbeat value will be directly outdated. The
heartbeat logic was slightly altered to always use the latest heartbeat
time value.
A strange HA behavior was reported in #787, resulting in both instances
being active.

The logs contained an entry of the previous active instance exiting the
HA.realize() method successfully after 1m9s. This, however, should not
be possible as the method's context is deadlined to a minute after the
heartbeat was received.

However, as it turns out, executing COMMIT on a database transaction is
not bound to the transaction's context, allowing to survive longer. To
mitigate this, another context watch was introduced. Doing so allows
directly handing over, while the other instance can now take over due to
the expired heartbeat in the database.
@oxzi oxzi force-pushed the ha-desperate-timing-fixes branch from 15f7be6 to 8b95d25 Compare October 25, 2024 08:49
@oxzi oxzi requested review from lippserd and yhabteab October 25, 2024 08:49
Copy link
Member

@lippserd lippserd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks a lot @oxzi!

@yhabteab
Copy link
Member

Shouldn't this pull request close issue #787?

@oxzi oxzi linked an issue Oct 25, 2024 that may be closed by this pull request
@oxzi
Copy link
Member Author

oxzi commented Oct 25, 2024

Shouldn't this pull request close issue #787?

This was the initial motivation, yes. However, as I was unable to reproduce the reported behavior, I cannot verify this. My intention was to make the observed bug impossible with this change, but maybe there is still another bug lurking around the corner.

However, with a bit more trust in this change, I have linked this PR to the issue. Otherwise, we can reopen it.

@lippserd lippserd removed the request for review from julianbrost December 16, 2024 14:09
@lippserd lippserd merged commit 63d51df into main Dec 16, 2024
32 checks passed
@lippserd lippserd deleted the ha-desperate-timing-fixes branch December 16, 2024 14:09
oxzi added a commit that referenced this pull request Dec 17, 2024
During #800, the commit dd0ca8f inlined
the HA.insertEnvironment method into the retryable realization function.
However, while doing so, I forgot to change the query execution context
from h.db to tx. This resulted in an error when being used together with
a single database connection, as introduced in #828.

Co-Authored-By: Yonas Habteab <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cla/signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Competing HA takeover results in both instances becoming active
4 participants