changefeedccl: stop sending messages for webhook sink upon receiving single error #67772

spiffyy99 · 2021-07-19T21:10:01Z

The concurrency implementation for the webhook sink has an edge case that can result in out-of-order messages being delivered to the HTTP endpoint.

Consider these two rows sent via the HTTP sink (with the same primary key) and their responses:

{"after":{"col1":"val1","rowid":1000},"key":[1001],"topic:":"foo"} -> 500 Internal Server Error
{"after":{"col1":"val1","rowid":1002},"key":[1001],"topic:":"foo"} -> 200 OK

Assuming the first 500 is just a transient error, it will not be propagated to the changefeed until Flush() is called, which happens after the second message has already been sent. The second message is therefore delivered before the first one is retried. The solution here is to check for errors before sending each message, terminate upon finding one, and allow the changefeed to restart and send the messages in the proper order.
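The check-before-send idea can be sketched in Go as follows. This is a minimal illustration, not the changefeedccl code: `sketchSink`, `newSketchSink`, `post`, and `runDemo` are hypothetical names, the HTTP request is replaced by a plain function, and a single worker goroutine stands in for the sink's concurrency. It also assumes `EmitRow` and `Flush` are called from the same goroutine.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// sketchSink is a hypothetical stand-in for the webhook sink; none of these
// names come from changefeedccl.
type sketchSink struct {
	post func(msg string) error // stand-in for the HTTP POST to the endpoint
	msgs chan string
	wg   sync.WaitGroup // messages queued but not yet handled by the worker

	mu       sync.Mutex
	firstErr error // first send error observed, if any
}

func newSketchSink(post func(string) error) *sketchSink {
	s := &sketchSink{post: post, msgs: make(chan string, 16)}
	go s.worker()
	return s
}

func (s *sketchSink) err() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.firstErr
}

func (s *sketchSink) worker() {
	for msg := range s.msgs {
		// Check before every send: once any request has failed, later
		// messages are dropped instead of overtaking the failed one.
		if s.err() == nil {
			if err := s.post(msg); err != nil {
				s.mu.Lock()
				s.firstErr = err
				s.mu.Unlock()
			}
		}
		s.wg.Done()
	}
}

// EmitRow queues a message, or fails fast if an earlier send already failed.
func (s *sketchSink) EmitRow(msg string) error {
	if err := s.err(); err != nil {
		return err
	}
	s.wg.Add(1)
	s.msgs <- msg
	return nil
}

// Flush waits for all queued messages and reports the first error seen.
func (s *sketchSink) Flush() error {
	s.wg.Wait()
	return s.err()
}

// Close stops the worker goroutine.
func (s *sketchSink) Close() { close(s.msgs) }

// runDemo replays the scenario above: the first POST fails with a 500,
// and the second would have succeeded had it been sent.
func runDemo() (delivered []string, flushErr error) {
	calls := 0
	post := func(msg string) error {
		calls++
		if calls == 1 {
			return errors.New("500 Internal Server Error")
		}
		delivered = append(delivered, msg)
		return nil
	}
	s := newSketchSink(post)
	defer s.Close()
	_ = s.EmitRow(`{"after":{"col1":"val1","rowid":1000},"key":[1001]}`)
	_ = s.EmitRow(`{"after":{"col1":"val1","rowid":1002},"key":[1001]}`)
	flushErr = s.Flush()
	return delivered, flushErr
}

func main() {
	delivered, err := runDemo()
	fmt.Println("delivered:", len(delivered), "flush error:", err)
}
```

In this sketch the second row is never posted after the first fails, and `Flush` surfaces the 500. A caller would then be expected to restart and resend both rows in order, which is the behavior proposed above.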


Previously, the sink waited until flushing to acknowledge HTTP errors, leaving any messages between the initial error and flush to potentially be out of order. Now, errors are checked before each message is sent and the sink is restarted if one is detected to maintain ordering.

Resolves cockroachdb#67772

Release note: None

@dt

67526: roachtest: make timeout obvious in posted issues r=stevendanna a=tbg

When a test times out, roachtest will rip the cluster out from under it to try to force it to terminate. This is essentially guaranteed to produce a posted issue that sweeps the original reason of the failure (the timeout) under the rug.

Instead, such issues now plainly state that there was a timeout and refer the readers to the artifacts.

See here for an example issue without this fix: #67464

cc @dt, who pointed this out [internally]

[internally]: https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1626098863019500

Release note: None

67824: dev: teach `dev` how to do cross builds r=rail a=rickystewart

Closes #67709.

Release note: None

67825: changefeedccl: immediately stop sending webhook sink rows upon error r=spiffyyeng a=spiffyyeng

Previously, the sink waited until flushing to acknowledge HTTP errors, leaving any messages between the initial error and flush to potentially be out of order. Now, errors are checked before each message is sent and the sink is restarted if one is detected to maintain ordering.

Resolves #67772

Release note: None

67894: sql: add support for unique expression indexes r=mgartner a=mgartner

Release note: None

67916: roachtest: fix replicagc-changed-peers r=aliher1911 a=tbg

The test ends up in the following situation:

n1: down, no replicas
n2: down, no replicas
n3: alive, with constraint that wants all replicas to move, and there may be a few ranges still on n3
n4-n6: alive

where the ranges predominantly 3x-replicated.

The test is then verifying that the replica count (as in, replicas on n3, in contrast to replicas assigned via the meta ranges) on n3 drops to zero.

However, system ranges cannot move in this configuration. The number of cluster nodes is six (decommission{ing,ed} nodes would be excluded, but no nodes are decommission{ing,ed} here) and so the system ranges operate at a replication factor of five. There are only four live nodes here, so if n3 is still a member of any system ranges, they will stay there and the test fails.

This commit attempts to rectify that by making sure that while n3 is down earlier in the test, all replicas are moved from it. That was always the intent of the test, which is concerned with n3 realizing that replicas have moved elsewhere and initiating replicaGC; however prior to this commit it was always left to chance whether n3 would or would not have replicas assigned to it by the time the test moved to the stage above. The reason the test wasn't previously waiting for all replicas to be moved off n3 while it was down was that it required checking the meta ranges, which wasn't necessary for the other two nodes.

This commit passed all five runs of replicagc-changed-peers/restart=false, so I think it reliably addresses the problem.

There is still the lingering question of why this is failing only now (note that both flavors of the test failed on master last night, so I doubt it is rare). We just merged #67319 which is likely somehow related.

Fixes #67910.
Fixes #67914.

Release note: None

67961: bazel: use `action_config`s over `tool_path`s in cross toolchains r=rail a=rickystewart

This doesn't change much in practice, but does allow us to use the actual `g++` compiler for C++ compilation, which wasn't the case before. The `tool_path` constructor is actually [deprecated](https://github.com/bazelbuild/bazel/blob/203aa773d7109a0bcd9777ba6270bd4fd0edb69f/tools/cpp/cc_toolchain_config_lib.bzl#L419) in favor of `action_config`s, so this is future-proofing.

Release note: None

67962: bazel: start building geos in ci r=rail a=rickystewart

Only the most recent commit applies for this review -- the other is from #67961.

Closes #66388.

Release note: None

68065: cli: skip TestRemoveDeadReplicas r=irfansharif a=tbg

Refs: #50977

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Ryan Min <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>

spiffyy99 added the C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), A-cdc (Change Data Capture), and T-cdc labels Jul 19, 2021

spiffyy99 self-assigned this Jul 19, 2021

spiffyy99 mentioned this issue Jul 20, 2021

changefeedccl: immediately stop sending webhook sink rows upon error #67825

Merged

craig bot closed this as completed in 7659eb2 Jul 26, 2021
