changefeedccl: stop sending messages for webhook sink upon receiving single error #67772

Closed
spiffyy99 opened this issue Jul 19, 2021 · 0 comments · Fixed by #67825
Labels: A-cdc (Change Data Capture), C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), T-cdc

Comments

spiffyy99 (Contributor) commented Jul 19, 2021

The concurrency implementation for the webhook sink has an edge case that can cause messages to be delivered to the HTTP endpoint out of order.

Consider these two rows with the same primary key, sent through the webhook sink, along with the endpoint's responses:

{"after":{"col1":"val1","rowid":1000},"key":[1001],"topic":"foo"} -> 500 Internal Server Error
{"after":{"col1":"val1","rowid":1002},"key":[1001],"topic":"foo"} -> 200 OK

If the first 500 is a transient error, it is not propagated to the changefeed until Flush() is called, which happens only after the second message has already been sent. The second message is therefore delivered before the first one is successfully retried. The solution is to check for errors before sending each message and terminate upon finding one, so that the changefeed restarts and sends the messages in the proper order.

spiffyy99 added the C-bug, A-cdc, and T-cdc labels Jul 19, 2021
spiffyy99 self-assigned this Jul 19, 2021
spiffyy99 added a commit to spiffyy99/cockroach that referenced this issue Jul 21, 2021
Previously, the sink waited until flushing to acknowledge HTTP errors, leaving
any messages between the initial error and flush to potentially be out of
order. Now, errors are checked before each message is sent and the sink is
restarted if one is detected to maintain ordering.

Resolves cockroachdb#67772

Release note: None
craig bot pushed a commit that referenced this issue Jul 26, 2021
67526: roachtest: make timeout obvious in posted issues r=stevendanna a=tbg

When a test times out, roachtest will rip the cluster out from under it
to try to force it to terminate. This is essentially guaranteed to
produce a posted issue that sweeps the original reason of the failure
(the timeout) under the rug. Instead, such issues now plainly state
that there was a timeout and refer the readers to the artifacts.

See here for an example issue without this fix: #67464

cc @dt, who pointed this out [internally]

[internally]: https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1626098863019500

Release note: None


67824: dev: teach `dev` how to do cross builds r=rail a=rickystewart

Closes #67709.

Release note: None

67825: changefeedccl: immediately stop sending webhook sink rows upon error r=spiffyyeng a=spiffyyeng

Previously, the sink waited until flushing to acknowledge HTTP errors, leaving
any messages between the initial error and flush to potentially be out of
order. Now, errors are checked before each message is sent and the sink is
restarted if one is detected to maintain ordering.

Resolves #67772

Release note: None

67894: sql: add support for unique expression indexes r=mgartner a=mgartner

Release note: None

67916: roachtest: fix replicagc-changed-peers r=aliher1911 a=tbg

The test ends up in the following situation:

n1: down, no replicas
n2: down, no replicas
n3: alive, with constraint that wants all replicas to move,
    and there may be a few ranges still on n3
n4-n6: alive

where the ranges are predominantly 3x-replicated.

The test is then verifying that the replica count (as in, replicas on
n3, in contrast to replicas assigned via the meta ranges) on n3 drops to
zero.

However, system ranges cannot move in this configuration. The number of
cluster nodes is six (decommission{ing,ed} nodes would be excluded, but
no nodes are decommission{ing,ed} here) and so the system ranges operate
at a replication factor of five. There are only four live nodes here, so
if n3 is still a member of any system ranges, they will stay there and
the test fails.

This commit attempts to rectify that by making sure that while n3 is
down earlier in the test, all replicas are moved from it. That was
always the intent of the test, which is concerned with n3 realizing
that replicas have moved elsewhere and initiating replicaGC; however
prior to this commit it was always left to chance whether n3 would
or would not have replicas assigned to it by the time the test moved
to the stage above. The reason the test wasn't previously waiting
for all replicas to be moved off n3 while it was down was that it
required checking the meta ranges, which wasn't necessary for the
other two nodes.

This commit passed all five runs of
replicagc-changed-peers/restart=false, so I think it reliably addresses
the problem.

There is still the lingering question of why this is failing only now
(note that both flavors of the test failed on master last night, so
I doubt it is rare). We just merged
#67319 which is likely
somehow related.

Fixes #67910.
Fixes #67914.

Release note: None


67961: bazel: use `action_config`s over `tool_path`s in cross toolchains r=rail a=rickystewart

This doesn't change much in practice, but does allow us to use the
actual `g++` compiler for C++ compilation, which wasn't the case
before.

The `tool_path` constructor is actually [deprecated](https://github.com/bazelbuild/bazel/blob/203aa773d7109a0bcd9777ba6270bd4fd0edb69f/tools/cpp/cc_toolchain_config_lib.bzl#L419)
in favor of `action_config`s, so this is future-proofing.

Release note: None

67962: bazel: start building geos in ci r=rail a=rickystewart

Only the most recent commit applies for this review --
the other is from #67961.

Closes #66388.

Release note: None

68065: cli: skip TestRemoveDeadReplicas r=irfansharif a=tbg

Refs: #50977

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: Tobias Grieger
Co-authored-by: Ricky Stewart
Co-authored-by: Ryan Min
Co-authored-by: Marcus Gartner
craig bot closed this as completed in 7659eb2 Jul 26, 2021