Cut-over should wait for heartbeat lag to be low enough to succeed #921

ccoffey · 2021-02-02T10:36:46Z

Description

Related issue: #799

In the above issue, we see migrations which fail at the cut-over phase with ERROR Timeout while waiting for events up to lock. These migrations fail cut-over many times and eventually exhaust all retries.

Root cause

Lag experienced by an external replica is not the same as lag experienced by gh-ost while processing the binlog.

Lag (external replica lag): Is normally computed by executing show slave status against an external replica and extracting the value of Seconds_Behind_Master.
Gh-ost Lag: Is how long it takes Gh-ost to intercept binlog events for updates (for example heartbeats) that it performed.

For example: Imagine that both of these lags were ~0 seconds. Then imagine that you throttle gh-ost for N minutes. At this point the external replica's lag will still be ~0 seconds, but gh-ost's lag will be N minutes.

This is important because its gh-ost's lag (not the external replica's lag) that determines if cut-over succeeds or times out.

More Detail

During cut-over:

A token AllEventsUpToLockProcessed:time.Now() is inserted into the changelog table
The table being migrated is locked
Gh-ost waits for up to --cut-over-lock-timeout-seconds (default: 3 seconds) for this token to appear on the binlog
If the token was not read within the timeout then gh-ost aborts the cutover

Problem: It's possible to enter this cut-over phase when gh-ost is so far behind on processing the binlog that it could not possibly catch-up during the timeout.

What this PR proposes

Maintain a new value CurrentHeartbeatLag in the MigrationContext
Update CurrentHeartbeatLagevery time we intercept a binlog event for the changelog table of type heartbeat
Before cut-over, wait until CurrentHeartbeatLag is less than --max-lag-millis

Note: This PR is best-reviewed commit by commit

An example

It's best to demonstrate the value of this change by example.

I am able to reliably reproduce the cut-over problem (40+ failed cut-over attempts) when running gh-ost against an Amazon RDS Aurora DB.

Test setup:

Create an Aurora cluster with a single writer and no readers
Create two "large" tables (200 million rows & 100 million rows respectively)

Test process:

Run two table migrations in parallel using gh-ost

Both migrations are run using the following params:

gh-ost \
  --host='<REDACTED>' \
  --user='<REDACTED>' \
  --password='<REDACTED>' \
  --database='<REDACTED>' \
  --table='<TABLE>' \
  --alter='<NOT_IMPORTANT>' \
  --allow-on-master \
  --max-load=Threads_running=30 \
  --assume-rbr \
  --skip-foreign-key-checks \
  --chunk-size=500 \
  --cut-over=default \
  --serve-socket-file=/tmp/gh-ost.<TABLE>.sock \
  --replica-server-id=<UNIQUE_ID> \
  --verbose \
  --execute

Note: <TABLE> and <UNIQUE_ID> must be different per migration.

The following logs came from one of the many experiments I ran.

This log was output by the smaller of the two migrations when it got to 13.9% for row copy:

I, [2021-02-01T15:20:19.572790 #1056173]  INFO -- : GhostWorker: Copy: 321000/2308807 13.9%; Applied: 0; Backlog: 0/1000; Time: 43s(total), 36s(copy); streamer: mysql-bin-changelog.046679:64796762; Lag: 0.01s, HeartbeatLag: 17.92s, State: migrating; ETA: 3m42s

Important: Notice that Lag is 0.01s but HeartbeatLag is 17.92s. The value of Lag is actually meaningless here because we are running with --allow-on-master so we are computing Lag by reading a heartbeat row directly from the table which we wrote it to. This explains the extremely low value of 0.01s.

A few minutes later, when row copy completed, Lag was 0.01s and HeartbeatLag was 100.79s:

I, [2021-02-01T15:24:18.335164 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:24:18 INFO Row copy complete
I, [2021-02-01T15:24:18.359951 #1056173]  INFO -- : GhostWorker: Copy: 2401160/2401160 100.0%; Applied: 0; Backlog: 0/1000; Time: 4m42s(total), 4m34s(copy); streamer: mysql-bin-changelog.046681:84235524; Lag: 0.01s, HeartbeatLag: 100.79s, State: migrating; ETA: due

This PR causes gh-ost to wait until the heartbeat lag is less than --max-lag-millis before continuing with the cut-over.

I, [2021-02-01T15:24:18.360111 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:24:18 INFO Waiting for heartbeat lag to be low enough to proceed
I, [2021-02-01T15:24:18.360140 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:24:18 INFO current HeartbeatLag (100.79s) is too high, it needs to be less than --max-lag-millis (1.50s)
I, [2021-02-01T15:24:19.360063 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:24:19 INFO current HeartbeatLag (100.93s) is too high, it needs to be less than --max-lag-millis (1.50s)
I, [2021-02-01T15:24:20.360161 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:24:20 INFO current HeartbeatLag (101.07s) is too high, it needs to be less than --max-lag-millis (1.50s)

Note: If we had tried to cut-over during this period where HeartbeatLag was greater than 100 seconds then we would have failed many times.

The heartbeat lag only started to reduce (a few minutes later) when the larger migration's row copy completed.

I, [2021-02-01T15:27:41.472574 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:41 INFO current HeartbeatLag (90.48s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue
I, [2021-02-01T15:27:42.473628 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:42 INFO current HeartbeatLag (67.26s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue
I, [2021-02-01T15:27:43.472813 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:43 INFO current HeartbeatLag (46.01s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue
I, [2021-02-01T15:27:44.472932 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:44 INFO current HeartbeatLag (22.47s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue

At this point the following message was outputted:

I, [2021-02-01T15:27:45.473009 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Heartbeat lag is low enough, proceeding

And then the table cut-over succeeded in a single attempt:

I, [2021-02-01T15:27:45.473642 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Grabbing voluntary lock: gh-ost.70930.lock
I, [2021-02-01T15:27:45.473918 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Setting LOCK timeout as 6 seconds
I, [2021-02-01T15:27:45.474239 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Looking for magic cut-over table
I, [2021-02-01T15:27:45.474734 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Creating magic cut-over table `***`.`_***_del`
I, [2021-02-01T15:27:45.517945 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Magic cut-over table created
I, [2021-02-01T15:27:45.518084 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Locking `***`.`***`, `***`.`_***_del`
I, [2021-02-01T15:27:45.518736 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Tables locked
I, [2021-02-01T15:27:45.518807 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Session locking original & magic tables is 70930
I, [2021-02-01T15:27:45.519266 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Writing changelog state: AllEventsUpToLockProcessed:1612193265518649636
I, [2021-02-01T15:27:45.527817 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Intercepted changelog state AllEventsUpToLockProcessed
I, [2021-02-01T15:27:45.527900 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Handled changelog state AllEventsUpToLockProcessed
I, [2021-02-01T15:27:45.531266 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:45 INFO Waiting for events up to lock
I, [2021-02-01T15:27:46.419921 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:46 INFO Waiting for events up to lock: got AllEventsUpToLockProcessed:1612193265518649636
I, [2021-02-01T15:27:46.420076 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:46 INFO Done waiting for events up to lock; duration=901.073705ms
I, [2021-02-01T15:27:46.420108 #1056173]  INFO -- : GhostWorker: # Migrating `***`.`***`; Ghost table is `***`.`_***_gho`
I, [2021-02-01T15:27:46.420131 #1056173]  INFO -- : GhostWorker: # Migrating ***; inspecting ***; executing on ***
I, [2021-02-01T15:27:46.420151 #1056173]  INFO -- : GhostWorker: # Migration started at Mon Feb 01 15:19:36 +0000 2021
I, [2021-02-01T15:27:46.420172 #1056173]  INFO -- : GhostWorker: # chunk-size: 500; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: Threads_running=30; critical-load: ; nice-ratio: 0.000000
I, [2021-02-01T15:27:46.420192 #1056173]  INFO -- : GhostWorker: # throttle-additional-flag-file: /tmp/gh-ost.throttle
I, [2021-02-01T15:27:46.420213 #1056173]  INFO -- : GhostWorker: # Serving on unix socket: /tmp/gh-ost.***.***.sock
I, [2021-02-01T15:27:46.423808 #1056173]  INFO -- : GhostWorker: Copy: 2401160/2401160 100.0%; Applied: 0; Backlog: 0/1000; Time: 8m10s(total), 4m34s(copy); streamer: mysql-bin-changelog.046683:133419216; Lag: 0.01s, HeartbeatLag: 0.00s, State: migrating; ETA: due
I, [2021-02-01T15:27:46.424814 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:46 INFO Setting RENAME timeout as 3 seconds
I, [2021-02-01T15:27:46.424897 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:46 INFO Session renaming tables is 70935
I, [2021-02-01T15:27:46.425570 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:46 INFO Issuing and expecting this to block: rename /* gh-ost */ table `***`.`***` to `***`.`_***_del`, `***`.`_***_gho` to `***`.`***`
I, [2021-02-01T15:27:46.560987 #1056173]  INFO -- : GhostWorker: Copy: 2401160/2401160 100.0%; Applied: 0; Backlog: 0/1000; Time: 8m10s(total), 4m34s(copy); streamer: mysql-bin-changelog.046683:133420865; Lag: 0.01s, HeartbeatLag: 0.00s, State: migrating; ETA: due 
I, [2021-02-01T15:27:47.426402 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
I, [2021-02-01T15:27:47.426543 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Checking session lock: gh-ost.70930.lock
I, [2021-02-01T15:27:47.427267 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Connection holding lock on original table still exists
I, [2021-02-01T15:27:47.427353 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Will now proceed to drop magic table and unlock tables
I, [2021-02-01T15:27:47.427377 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Dropping magic cut-over table
I, [2021-02-01T15:27:47.440892 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Releasing lock from `***`.`***`, `***`.`_***_del`
I, [2021-02-01T15:27:47.441353 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tables unlocked
I, [2021-02-01T15:27:47.476893 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tables renamed
I, [2021-02-01T15:27:47.477039 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Lock & rename duration: 1.958935784s. During this time, queries on `***` were blocked
I, [2021-02-01T15:27:47.477066 #1056173]  INFO -- : GhostWorker: [2021/02/01 15:27:47] [info] binlogsyncer.go:164 syncer is closing...
I, [2021-02-01T15:27:47.547745 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Closed streamer connection. err=<nil>
I, [2021-02-01T15:27:47.547891 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Dropping table `***`.`_***_ghc`
I, [2021-02-01T15:27:47.547917 #1056173]  INFO -- : GhostWorker: [2021/02/01 15:27:47] [error] binlogstreamer.go:77 close sync with err: sync is been closing...
I, [2021-02-01T15:27:47.547938 #1056173]  INFO -- : GhostWorker: [2021/02/01 15:27:47] [info] binlogsyncer.go:179 syncer is closed
I, [2021-02-01T15:27:47.562784 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Table dropped
I, [2021-02-01T15:27:47.562898 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue: ***
I, [2021-02-01T15:27:47.562939 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO -- drop table `***`.`_***_del`
I, [2021-02-01T15:27:47.563005 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Done migrating `***`.`***`
I, [2021-02-01T15:27:47.563025 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Removing socket file: /tmp/gh-ost.***.***.sock
I, [2021-02-01T15:27:47.563044 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tearing down inspector
I, [2021-02-01T15:27:47.563066 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tearing down applier
I, [2021-02-01T15:27:47.563085 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tearing down streamer
I, [2021-02-01T15:27:47.563102 #1056173]  INFO -- : GhostWorker: 2021-02-01 15:27:47 INFO Tearing down throttler
I, [2021-02-01T15:27:47.563120 #1056173]  INFO -- : GhostWorker: # Done

Final Thoughts

This problem is likely exacerbated by Aurora because readers in Aurora do not use binlog replication. This means there is no external replica lag that gh-ost can use to throttle itself so gh-ost ends up backing up the binlog. If gh-ost is the only consumer of the binlog (typical for applications that use Aurora) then only gh-ost's cut-over would suffer. Any observer looking at key health metrics on Aurora's dashboard would conclude that the DB was completely healthy.
The specific problem of backing up the binlog could be solved by getting gh-ost to copy rows much slower. However, gh-ost would still be susceptible to this problem in other ways, for example, if gh-ost was throttled heavily just before cut-over. Also, why should we artificially slow down if the DB is perfectly capable of handling the extra load without hurting our SLAs.

go/logic/migrator.go

ccoffey · 2021-02-03T16:40:54Z

Another day passed by and we learned something new about this problem. I added details to the original issue #799 (comment) about how the Aurora setting aurora_binlog_replication_max_yield_seconds is responsible for the abnormally high HeartbeatLag.

My team are not convinced that it's safe for us to change the Aurora default value of aurora_binlog_replication_max_yield_seconds in our production environment so we still believe that this PR is a good change to gh-ost which will make it safer by default.

shlomi-noach · 2021-02-03T16:50:03Z

I'm planning to look into this early next week. Thank you for the good writeup and followups! I don't have any Aurora feedback as I've never worked with Aurora.

shlomi-noach

While I don't own this repo, I'm curious to address this PR (also submitted as openark#14 downstream).

First, I'd like to point out that this PR and associated issue were presented so eloquently and with such detail that the review process was a pleasure. Hats off!

The issue/analysis/fix make perfect sense. This explains a lot, and Aurora users are not the only ones who have experienced this issue. Back at my previous workplace we were very aggressive about keeping replication lag low and so this wasn't something we saw.

Please see two suggestions inline on improving the logic at time of cut-over. Otherwise this is looking very good.

shlomi-noach · 2021-02-07T13:27:00Z

go/logic/migrator.go

+			heartbeatLag := this.migrationContext.TimeSinceLastHeartbeatOnChangelog()
+			maxLagMillisecondsThrottleThreshold := atomic.LoadInt64(&this.migrationContext.MaxLagMillisecondsThrottleThreshold)
+			if heartbeatLag > time.Duration(maxLagMillisecondsThrottleThreshold)*time.Millisecond {
+				this.migrationContext.Log.Debugf("current HeartbeatLag (%.2fs) is too high, it needs to be less than --max-lag-millis (%.2fs) to continue", heartbeatLag.Seconds(), (time.Duration(maxLagMillisecondsThrottleThreshold) * time.Millisecond).Seconds())
+				return true, nil


Running two distinct sleepWhileTrue loops poses a problematic scenario where possibly the conditions tested in the first loop can re-appear while sleeping on the 2nd loop. I can imagine a user running a echo "postpone" | socat - "/tmp/gh-ost.sock" while sleeping on the newly introduced heartbeat lag loop.

My suggestion is to consolidate the two tests inside one loop, and I think it's simple: remove this new sleepWhileTrue and move the above five rows to be the fist condition tested in the original loop. Overall, there will be one sleepWhileTrue loop that'll look like this:

this.migrationContext.MarkPointOfInterest() this.migrationContext.Log.Debugf("checking for cut-over postpone") this.sleepWhileTrue( func() (bool, error) { heartbeatLag := this.migrationContext.TimeSinceLastHeartbeatOnChangelog() maxLagMillisecondsThrottleThreshold := atomic.LoadInt64(&this.migrationContext.MaxLagMillisecondsThrottleThreshold) if heartbeatLag > time.Duration(maxLagMillisecondsThrottleThreshold)*time.Millisecond { this.migrationContext.Log.Debugf("current HeartbeatLag (%.2fs) is too high, it needs to be less than --max-lag-millis (%.2fs) to continue", heartbeatLag.Seconds(), (time.Duration(maxLagMillisecondsThrottleThreshold) * time.Millisecond).Seconds()) return true, nil } if this.migrationContext.PostponeCutOverFlagFile == "" { return false, nil } if atomic.LoadInt64(&this.migrationContext.UserCommandedUnpostponeFlag) > 0 { atomic.StoreInt64(&this.migrationContext.UserCommandedUnpostponeFlag, 0) return false, nil } if base.FileExists(this.migrationContext.PostponeCutOverFlagFile) { // Postpone file defined and exists! if atomic.LoadInt64(&this.migrationContext.IsPostponingCutOver) == 0 { if err := this.hooksExecutor.onBeginPostponed(); err != nil { return true, err } } atomic.StoreInt64(&this.migrationContext.IsPostponingCutOver, 1) return true, nil } return false, nil }, ) atomic.StoreInt64(&this.migrationContext.IsPostponingCutOver, 0) this.migrationContext.MarkPointOfInterest() this.migrationContext.Log.Debugf("checking for cut-over postpone: complete")

what do you think?

I'd possibly change:

if heartbeatLag > time.Duration(maxLagMillisecondsThrottleThreshold)*time.Millisecond {

to

if heartbeatLag > time.Duration(base.CutOverLockTimeoutSeconds)*time.Second {

It depends on whether a user's setting of maxLagMillisecondsThrottleThreshold is higher or lower than maxLagMillisecondsThrottleThreshold. In my previous use case, lag threshold was 1sec and cut-over timeout was 3sec, and it made more sense to initiate cut-over if lag was within bounds of 3sec.

Perhaps even the higher of the two values?

Running two distinct sleepWhileTrue loops poses a problematic scenario where possibly the conditions tested in the first loop can re-appear while sleeping on the 2nd loop

I completely missed this, you are absolutely correct. I have pushed a commit Consolidate the two sleepWhileTrue loops to address this.

In my previous use case, lag threshold was 1sec and cut-over timeout was 3sec, and it made more sense to initiate cut-over if lag was within bounds of 3sec.

Perhaps even the higher of the two values?

This makes a lot of sense to me, I think this is a good improvement. I have pushed a commit HeartbeatLag must be < than --max-lag-millis and --cut-over-lock-time to address this.

…out-seconds to continue

shlomi-noach

LGTM.

Could you kindly also submit the recent commits to openark#14 ?

@timvaillancourt I do believe the issue may have happened (once) at GitHub, otherwise community has faced it multiple times. Seems like a popular fix!

shlomi-noach · 2021-02-07T14:36:33Z

Could you kindly also submit the recent commits to openark#14 ?

Heh, that just happened automatically. Clearly I don't know how GitHub works 😜

timvaillancourt · 2021-02-08T12:24:05Z

First, I'd like to point out that this PR and associated issue were presented so eloquently and with such detail that the review process was a pleasure. Hats off!

Echoing this, great writeup @ccoffey 👍!

@timvaillancourt I do believe the issue may have happened (once) at GitHub, otherwise community has faced it multiple times. Seems like a popular fix!

@shlomi-noach yes, this sounds familiar. This should make the cut-over more robust for many use cases 🎉!

Heh, that just happened automatically. Clearly I don't know how GitHub works 😜

Same here 😄

We can probably get this PR pushed to our automated testing replica this week. cc @rashiq

ccoffey · 2021-02-08T17:03:39Z

We can probably get this PR pushed to our automated testing replica this week

I would love to learn more about this automated testing replica. What will be covered by this that was not already covered by the migration tests in CI?

I ask because I am interested, and also because I am trying to figure out if I should wait before using my branch in production at my company 😉

timvaillancourt · 2021-02-08T18:10:14Z

We can probably get this PR pushed to our automated testing replica this week

I would love to learn more about this automated testing replica. What will be covered by this that was not already covered by the migration tests in CI?

I ask because I am interested, and also because I am trying to figure out if I should wait before using my branch in production at my company 😉

@ccoffey we have dedicated replicas for testing github/gh-ost on some clusters in Production. Our automation loops through real production tables, perform a gh-ost migration on them and compares the results of the migration after (with replication paused)

The test case is much simpler than the CI testing that covers more of the functionality

shlomi-noach · 2021-02-08T19:46:30Z

I would love to learn more about this

See https://github.blog/2017-07-06-mysql-testing-automation-at-github/ and https://speakerdeck.com/shlominoach/mysql-infrastructure-testing-automation-at-github

ccoffey · 2021-03-08T11:50:14Z

@timvaillancourt any update on how the testing has been going?

I see that this branch is now out of date, would you like me to rebase?

zdannar · 2021-03-16T22:25:44Z

I am pretty sure I have the same issue on GCP. Hopefully this will be merged soon.

shlomi-noach · 2021-03-17T05:54:43Z

@ccoffey @zdannar in the meantime, and until this is merged, consider using this forked release: https://github.com/openark/gh-ost/releases/tag/v1.1.2

michaelglass · 2021-05-03T10:29:04Z

just used @shlomi-noach's fork to run a migration that wouldn't run with baseline gh-ost. Thank you so much!!

shlomi-noach · 2021-05-03T10:44:54Z

@michaelglass that particular fix was contributed by @ccoffey

timvaillancourt · 2021-05-03T13:49:23Z

@ccoffey / @shlomi-noach: apologies for the delay. We've recently cleared a backlog of integration testing

This PR is now being tested on GitHub's testing replicas in #964 (required for testing)

ccoffey added 5 commits January 31, 2021 18:23

Make it easier to handle different onChangelogEvents

7207bc1

Handle onChangelogHeartbeatEvent and update CurrentHeartbeatLag

8aee288

Progress should print HeartbeatLag

a4218cd

Don't cut-over until it is safe to do so

8a26c9e

Move 'heartbeat is too high' to Debug logs

4efd156

ccoffey commented Feb 2, 2021

View reviewed changes

go/logic/migrator.go Outdated Show resolved Hide resolved

Store lastHeartbeatOnChangelogTime instead of CurrentHeartbeatLag

48ce087

ccoffey mentioned this pull request Feb 3, 2021

ERROR Timeout while waiting for events up to lock #799

Open

Move 'heartbeat is too high' back to Debug logs again

d5c2414

shlomi-noach reviewed Feb 7, 2021

View reviewed changes

ccoffey added 2 commits February 7, 2021 13:52

Consolidate the two sleepWhileTrue loops

503b7b0

HeartbeatLag must be < than --max-lag-millis and --cut-over-lock-time…

3135a25

…out-seconds to continue

ccoffey force-pushed the cathal/safer_cut_over branch from f05baf2 to 3135a25 Compare February 7, 2021 14:32

shlomi-noach approved these changes Feb 7, 2021

View reviewed changes

timvaillancourt added this to the v1.1.1 milestone Feb 8, 2021

timvaillancourt added enhancement needs-testing labels Feb 8, 2021

timvaillancourt modified the milestones: v1.1.1, v1.1.2 Feb 8, 2021

Merge branch 'master' into cathal/safer_cut_over

d3bf3cd

timvaillancourt mentioned this pull request May 3, 2021

Merge PR: Cut-over should wait for heartbeat lag to be low enough to succeed #964

Closed

2 tasks

timvaillancourt added the bug label May 3, 2021

Merge branch 'master' into cathal/safer_cut_over

0dc6475

timvaillancourt merged commit 29b8cfa into github:master May 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cut-over should wait for heartbeat lag to be low enough to succeed #921

Cut-over should wait for heartbeat lag to be low enough to succeed #921

ccoffey commented Feb 2, 2021

ccoffey commented Feb 3, 2021

shlomi-noach commented Feb 3, 2021

shlomi-noach left a comment

shlomi-noach Feb 7, 2021 •

edited

Loading

shlomi-noach Feb 7, 2021 •

edited

Loading

ccoffey Feb 7, 2021

ccoffey Feb 7, 2021

shlomi-noach left a comment •

edited

Loading

shlomi-noach commented Feb 7, 2021

timvaillancourt commented Feb 8, 2021

ccoffey commented Feb 8, 2021

timvaillancourt commented Feb 8, 2021 •

edited

Loading

shlomi-noach commented Feb 8, 2021

ccoffey commented Mar 8, 2021

zdannar commented Mar 16, 2021

shlomi-noach commented Mar 17, 2021

michaelglass commented May 3, 2021

shlomi-noach commented May 3, 2021

timvaillancourt commented May 3, 2021

Cut-over should wait for heartbeat lag to be low enough to succeed #921

Cut-over should wait for heartbeat lag to be low enough to succeed #921

Conversation

ccoffey commented Feb 2, 2021

Description

Root cause

More Detail

What this PR proposes

An example

Final Thoughts

ccoffey commented Feb 3, 2021

shlomi-noach commented Feb 3, 2021

shlomi-noach left a comment

Choose a reason for hiding this comment

shlomi-noach Feb 7, 2021 • edited Loading

Choose a reason for hiding this comment

shlomi-noach Feb 7, 2021 • edited Loading

Choose a reason for hiding this comment

ccoffey Feb 7, 2021

Choose a reason for hiding this comment

ccoffey Feb 7, 2021

Choose a reason for hiding this comment

shlomi-noach left a comment • edited Loading

Choose a reason for hiding this comment

shlomi-noach commented Feb 7, 2021

timvaillancourt commented Feb 8, 2021

ccoffey commented Feb 8, 2021

timvaillancourt commented Feb 8, 2021 • edited Loading

shlomi-noach commented Feb 8, 2021

ccoffey commented Mar 8, 2021

zdannar commented Mar 16, 2021

shlomi-noach commented Mar 17, 2021

michaelglass commented May 3, 2021

shlomi-noach commented May 3, 2021

timvaillancourt commented May 3, 2021

shlomi-noach Feb 7, 2021 •

edited

Loading

shlomi-noach Feb 7, 2021 •

edited

Loading

shlomi-noach left a comment •

edited

Loading

timvaillancourt commented Feb 8, 2021 •

edited

Loading