Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

splunk nozzle now doesn't block, won't disconnect for being slow #284

Closed
wants to merge 1 commit into from

Conversation

Benjamintf1
Copy link
Member

No description provided.

@Benjamintf1
Copy link
Member Author

(I may be able to cleanup some of the code with respect to the eventrouter in testing)

@Benjamintf1
Copy link
Member Author

(I think generally you'll be safer if the dropping buffer is as close to and with as little code in between it and the read call, but IIRC it looked fine, but made the requested changes)

As for "having it be an option", if the nozzle blocks, those envelopes are getting dropped one way or another, there's no "not dropping" option, you're only forcing those envelopes to drop at the doppler due to the nozzle being unable to deliver the envelopes fast enough.

@Benjamintf1
Copy link
Member Author

#285

@Benjamintf1
Copy link
Member Author

Benjamintf1 commented Sep 8, 2021

Spent some time looking at this a bit more, this pr is actually not sufficient. Blocking could occur here

event.AnnotateWithEnvelopeData(msg)
event.AnnotateWithCFMetaData()
.

(In this case, I'm going to define blocking as "a step that can take on the order of milliseconds rather then nanoseconds"

I suspect most of the blocking we see in cases is actually downstream blocking and not capi blocking, I think the code there should be mostly performant, but it's something I think is going to be hard to determine with the code instrumented as such.

@kashyap-splunk
Copy link
Collaborator

kashyap-splunk commented Sep 9, 2021

Hi @Benjamintf1 I get your point. I am trying to think of way to free the main thread to receive the next event immediately after receiving the current event from the doppler. For example, starting a separate Goroutine to route the event in this line and then either limit the Goroutines count (Bounded parallelism) or fix wait time before dropping the events.

https://github.com/cloudfoundry-community/splunk-firehose-nozzle/blob/develop/nozzle/nozzle.go#L69

Let me know your thoughts.

@Benjamintf1
Copy link
Member Author

Benjamintf1 commented Sep 9, 2021

No, both of those will still allow promulgation of backpressure at not 100% cpu. The only way to prevent this is to either balloon(very bad) memory, or drop(less bad).

@Benjamintf1
Copy link
Member Author

Benjamintf1 commented Sep 9, 2021

I can leave more context in the issue(#285), I think we may have alternative or changed prs.

@Benjamintf1
Copy link
Member Author

Benjamintf1 commented Sep 9, 2021

(as a short note on shortening timeouts. 1000 logs per second is 1 log per millisecond. If the processing is too slow, eventually the buffer will fill up, and a even a extremely slow 10ms timeout is 10 lost logs, 5 seconds is 5,000, that's half a doppler buffer. Even if we reduce that by a factor of 5, a slow nozzle will have dropped 1000 logs on doppler for being unable to read before it reports the one metric it tried to send downstream)

@Benjamintf1
Copy link
Member Author

Actually, based on concerns I'm hearing about capi. I'm going to close this in favor of saying I'd highly suggest a solution that looks more like the first pr. Or a merging of the two. I can make a new pr if wanted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants