View: expose view and connection state #248
I agree, we can easily enhance the view state to provide more details about the current state, keeping what we have already and adding a Reconnect/Connecting state that is entered whenever the view is reconnecting. A state for "lagging behind" is harder to achieve, because we don't have a criterion for deciding at which point you're lagging. Currently it works like this: before a partition recovers, it retrieves the high watermark, and as soon as it reaches that high watermark it is considered "recovered", even if it falls behind again later. Technically it is always lagging, because by the time it reaches the hwm, the actual value has been incremented anyway. So the question is what counts as "being behind". Is it a fixed value? Then the stats should be used to detect that, as you described.
So we'd add a Recovering step before Running in case of reconnection. Would that seem more logical?
My main priority is definitely the Connecting state: a single boolean that lets me fail a healthcheck and/or set a gauge metric for alerting and graphing. If I'm successfully consuming all partitions, I don't care so much whether I'm behind (Recovering vs Running) for my use case (just a low-volume compacted topic at the moment). That said, I can see a case for being able to detect lagging behind. If I have offset lag and a Connecting state, I can already determine that myself either way.
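A rough, self-contained sketch of the "offset lag from stats" idea mentioned above. The struct and field names (partitionStats, Hwm, Offset) are made up here for illustration, since the real shape of goka's view stats differs between versions; the point is only the hwm-minus-offset calculation that could feed a gauge metric or a lag threshold.

```go
package main

import "fmt"

// partitionStats captures the two numbers the thread is talking about:
// the partition's high watermark and the view's current offset. In goka
// these would come from the view's stats; the field names there differ
// between versions, so this sketch keeps its own minimal struct.
type partitionStats struct {
	Hwm    int64
	Offset int64
}

// maxLag returns the largest offset lag across partitions, which could
// feed a gauge metric or a "lagging behind" threshold check.
func maxLag(parts map[int32]partitionStats) int64 {
	var lag int64
	for _, p := range parts {
		if d := p.Hwm - p.Offset; d > lag {
			lag = d
		}
	}
	return lag
}

func main() {
	parts := map[int32]partitionStats{
		0: {Hwm: 120, Offset: 118},
		1: {Hwm: 300, Offset: 250},
	}
	fmt.Println("max lag:", maxLag(parts)) // prints: max lag: 50
}
```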
Awesome, agreed. So let's add a Connecting state which is entered when connecting/reconnecting. As you said, the rest can be achieved with the stats for now.
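To illustrate the "single boolean for a healthcheck" idea, here is a minimal Go sketch. The stateReporter interface and its Running() method are assumptions standing in for whatever accessor the view ends up exposing; this is not goka's actual API, just the shape of the check being discussed.

```go
package main

import (
	"log"
	"net/http"
)

// stateReporter stands in for the accessor the thread asks for: something
// that says whether the view is currently connected and running. The name
// and method are assumptions, not goka's actual API.
type stateReporter interface {
	Running() bool
}

// healthHandler fails the healthcheck (HTTP 503) while the view is still
// connecting or recovering, and returns 200 once it is running. The same
// boolean could drive a gauge metric instead.
func healthHandler(v stateReporter) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !v.Running() {
			http.Error(w, "view not running", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

// fakeView is a placeholder so the sketch compiles; in practice this would
// wrap the goka view and inspect its state.
type fakeView struct{ running bool }

func (f fakeView) Running() bool { return f.running }

func main() {
	http.Handle("/healthz", healthHandler(fakeView{running: true}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```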
Hey @frairon, I would love to see this implemented; I think I would struggle to add it myself. Is it likely to be implemented soon? I'm looking forward to the retries along with access to the connected status. Once they're in a release, I'm fairly happy with this new version and we can integrate it without any regressions in our metrics and availability. I don't need the Recovering status, just Connecting.
Sorry, our goka engagement comes in bursts :)
No worries, I understand that :) Thank you for all of your engagement so far, it's really encouraging for our team to see :)
First though (because I had already started it earlier): I added more docs on the behavior change of view.Run when using restartable/autoreconnect.
@mattburman there's progress; it turned out to be trickier than I thought. I tried to solve it by observing the partition states and merging them into a view state. Kind of reactive, maybe we should use rxgo at some point :). I also realized that an autoreconnecting view fails when the Kafka cluster is not available at startup. This might be an issue and I'll have a look at whether I can solve it somehow. Would that be critical for your use case? In the old style (backoff around view.Run), I guess a view would even tolerate a missing Kafka at startup and just serve the local cache. I hadn't thought of that before.
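A small, self-contained sketch of the state-merging idea described above: the view state is taken to be the state of its least-advanced partition. The state type and constants are illustrative only; goka's real constants and merge rules may differ.

```go
package main

import "fmt"

// Illustrative partition states, ordered from "least ready" to "most ready".
// Goka's real state constants and merge rules may differ; this only shows
// the merging idea: the view is only Running once every partition is.
type state int

const (
	stateConnecting state = iota
	stateRecovering
	stateRunning
)

// mergeViewState reduces the per-partition states to a single view state
// by taking the minimum, i.e. the least-advanced partition wins.
func mergeViewState(partitions []state) state {
	merged := stateRunning
	for _, s := range partitions {
		if s < merged {
			merged = s
		}
	}
	return merged
}

func main() {
	// One partition is still connecting, so the view as a whole reports 0 (connecting).
	fmt.Println(mergeViewState([]state{stateRunning, stateConnecting, stateRunning}))
}
```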
Thank you @frairon!
I've tested it out and
I just tested this out, and it looks like NewView construction blocks until it's connected? I don't want NewView to block. We are currently pinned to
Hi @mattburman, sorry for the long delay. But let's solve the NewView issue. When I start the example without a running Kafka cluster, it does not block but fails with "kafka: client has run out of available brokers to talk to". Is that call really blocking for you? Could you give me a code snippet to reproduce it? This fail-fast behavior when calling the constructor is the same as in the old version.
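For reference, a minimal reproduction along these lines could look like the snippet below: construct the view against a broker that isn't running and time how long NewView takes to return. The broker address and table name are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/lovoo/goka"
	"github.com/lovoo/goka/codec"
)

func main() {
	// Point at a broker that is not running to check whether the
	// constructor blocks or fails fast. Broker address and table name
	// are placeholders for this sketch.
	brokers := []string{"localhost:9092"}
	table := goka.Table("example-group-table")

	start := time.Now()
	view, err := goka.NewView(brokers, table, new(codec.String))
	log.Printf("NewView returned after %s (err: %v)", time.Since(start), err)
	if err != nil {
		return
	}

	// Run the view briefly to observe connect/reconnect behavior.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	log.Printf("view.Run returned: %v", view.Run(ctx))
}
```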
Hey @frairon :) Thanks for all your work so far
I have tested this manually and it works well :)
Tested it again. I didn't realise this before, but it does return after 2 minutes. Is this configurable? I'd need apps to exit within seconds if it can't connect. This is just running
These are the logs from an example from our wrapper library pinned to
Unfortunately our wrapper is coupled to internal tooling/libraries at the moment, so it's not so easy to share.
I've also replicated
i.e. commented out the emitter and processor and added some logs, yielding the following output:
Hi @mattburman, sorry again (I guess it's time to stop apologizing for being late). Glad the status updates are working now, so let's consider this done. We did a little brainstorming on how to make the View tolerate a startup without a running cluster. Since that would be too confusing and inconsistent to integrate, we decided it's best to leave it as it is.

So let's concentrate on how to make the view fail faster. If I understand correctly, the delay only occurs when the internal library is used, because above you mentioned that the plain examples run fast. Is it possible that the library modifies sarama.Config and increases some timeouts that are lower by default in the example configuration? Maybe the network behaves differently by just blocking the request if you have no Kafka cluster running?

Another idea: the sarama implementation first tries to get the metadata by querying all brokers. If a broker cannot be connected to, it is removed from the list of available brokers. On other errors, however, the client just retries with another random broker. If some deadlines/timeouts are sufficiently high, that could also take a while, because it keeps trying the nonexistent brokers over and over again. You could get some insight by activating sarama's logger, which is disabled by default.
Let me know if that helps.
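A sketch of the two suggestions: enabling sarama's logger and tightening the dial/metadata-retry settings so a missing cluster fails within seconds. The concrete values are only an illustration, and how the config is handed over to goka depends on the goka version in use.

```go
package main

import (
	"log"
	"os"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	// Turn on sarama's internal logging (disabled by default) to see
	// which brokers are tried and where the time is spent.
	sarama.Logger = log.New(os.Stdout, "[sarama] ", log.LstdFlags)

	// Tighten the dial and metadata-retry settings so a missing cluster
	// fails in seconds rather than minutes. The values here are only an
	// illustration; pass the resulting config to goka/sarama in whatever
	// way your goka version supports.
	cfg := sarama.NewConfig()
	cfg.Net.DialTimeout = 5 * time.Second
	cfg.Metadata.Retry.Max = 2
	cfg.Metadata.Retry.Backoff = 500 * time.Millisecond

	_ = cfg // hand cfg over to the goka view / sarama clients here
}
```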
@frairon My bad, I wasn't clear at all. I've narrowed it down much further now. I'm running this on a Mac: I modified the broker in
I've also re-tested my example in our wrapper (which connects to a broker in the docker network at host
I've also worked out I was confused about the behaviour of
I think my concerns are resolved, to be honest. Sorry for the confusion. I think I am happy with this version. I should be able to set metrics using this view state functionality once this is merged and a new version is pinned.
Awesome, glad we figured it out :) So the new behavior would be
I'll close this issue and create another one where we can discuss details if we want to change that. We'll try to release goka to get rid of the huge PR, and we can fix such things later...
* Co-authored-by: frairon <[email protected]> Co-authored-by: R053NR07 <[email protected]>
* bugfix in shutdown/rebalance: correctly closing joins
* run update/request/response stats in own goroutine
* fix rebalancing by adding a copartitioning rebalance strategy
* updated readme for configuration, added changelog
* Open Storage in PartitionTable when performing Setup
* return trackOutput if stats are nil
* view.get fixed for uninitialized view; added lots of godoc; fixed many linter errors; added Open call when creating storage
* context stats tracking: use queueing mechanism to avoid race conditions
* Add simpleBackoff and add backoff options for processor and view
* added strings to streams helper
* #249 view example
* issue #249: migration guide, #241 panic documentation of context
* #248 exposing partition table's connection state to view
* Migration: describe offsetbug
* partition_table: implement autoreconnect in recovery mode
* bugfix goroutine-leak in statsloop in partition table
* #248: refactored state merger, bugfix race condition when shutting down the signal/observers, created example
* bugfix partition state notification
* remove change in example, updated changelog
* fix readme example and add readme as example
* restore 1-simplest fix, remove readme-example

Co-authored-by: Jan Bickel <[email protected]>