Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[agent] Make timeout for reporting configurable #1034

Merged
merged 2 commits into from
Nov 17, 2018

Conversation

gouthamve
Copy link
Contributor

@gouthamve gouthamve commented Aug 30, 2018

So, this is less a fix for #1026 but more to kick off the discussion around: #927

While the original proposal has:

reporters:
    - kind: tchannel
      # tchannel-specific parameters, e.g.
      collector:
        hostPort: ...
      discovery:
        minPeers: 10
    - kind: grps
      # grpc-specific parameters

I kind of like:

tchannel:
# tchannel-config

grpc:
# grpc-config

I took a stab at implementing my version and turns out it's super straight forward and is very little changed code. But if the preference is to have the originally proposed format, I'll make the changes. But now I'm curious, can we 2 different tchannel reporters?

Note: This will be a breaking change.

@gouthamve gouthamve changed the title agent: Make timeout for reporting flags configurable agent: Make timeout for reporting configurable Aug 30, 2018
@codecov
Copy link

codecov bot commented Aug 30, 2018

Codecov Report

Merging #1034 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff           @@
##           master   #1034   +/-   ##
======================================
  Coverage     100%    100%           
======================================
  Files         159     159           
  Lines        7126    7136   +10     
======================================
+ Hits         7126    7136   +10
Impacted Files Coverage Δ
cmd/agent/app/reporter/tchannel/flags.go 100% <100%> (ø) ⬆️
cmd/agent/app/reporter/tchannel/builder.go 100% <100%> (ø) ⬆️
cmd/agent/app/reporter/tchannel/reporter.go 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f8d324...34d298d. Read the comment docs.

@yurishkuro yurishkuro changed the title agent: Make timeout for reporting configurable [agent] Make timeout for reporting configurable Sep 3, 2018
@yurishkuro
Copy link
Member

Thanks for looking into this one.

can we 2 different tchannel reporters

We have some deployments at Uber where we're using two reporters of the same type (well, only tchannel reporter is supported anyway today) to send the data to two locations. However, it is done using the otherReporters feature:

func (b *Builder) WithReporter(r reporter.Reporter) *Builder {
b.otherReporters = append(b.otherReporters, r)
return b
}

I believe we have a custom config file format that we parse in a custom main. We debated that we could achieve that functionality by forwarding spans to a dedicated splitter, but that seems like more complicated deployment vs. a small code change in the agent. Ideally, we would like to preserve that forking functionality in the agent.

Have you looked into going with the list approach as proposed? Another alternative is to use nested union-like config, instead of kind: tchannel qualifier, e.g.

  reporters:
    - collectorHostPort: ...
      tchannel:
        # tchannel-specific parameters
        discovery:
          minPeers: 10
    - collectorHostPort: ...
      grpc:
        # grpc-specific parameters

@gouthamve
Copy link
Contributor Author

I think this nested union-like config approach better :) I'll make the changes.

@yurishkuro
Copy link
Member

The tricky part is to decide how cli flags will support a list. If that proves to be too ugly, we could draw a line and say we only support one reporter by default. At Uber we can continue using the custom config and main for multiple reporters.

@gouthamve
Copy link
Contributor Author

Yeah, spent an hour+ trying to make the flags and config match up and not fail tests, couldn't get it to work.

Prefer if there is only reporter being supported.

@yurishkuro
Copy link
Member

let's go with just one reporter, as long as programmatically we can add more (as we can today). Could you post how the CLI / config will look ?

@gouthamve
Copy link
Contributor Author

➜  jaeger git:(add-agent-timeout) ✗ ./agent --help
Jaeger agent is a daemon program that runs on every host and receives tracing data submitted by Jaeger client libraries.

.....

Flags:
      --collector.host-port string                            comma-separated string representing host:ports of a static list of collectors to connect to directly (e.g. when not using service discovery)
      --collector.timeout duration                            sets the timeout used when establishing reporting spans (default 1s)
      --config-file string                                    Configuration file in JSON, TOML, YAML, HCL, or Java properties formats (default none). See spf13/viper for precedence.
      --discovery.conn-check-timeout duration                 sets the timeout used when establishing new connections (default 250ms)
      --discovery.min-peers int                               if using service discovery, the min number of connections to maintain to the backend (default 3)
  -h, --help                                                  help for jaeger-agent
      --http-server.host-port string                          host:port of the http server (e.g. for /sampling point and /baggage endpoint) (default ":5778")
.....

So, now this just adds a simple --collector.timeout flag. But I can make the flags to be: --collector.tchannel.* if you want.

This is how the config will look like:

tchannel:
collectorHostPorts:
- 127.0.0.1:14267
- 127.0.0.1:14268
- 127.0.0.1:14269
minPeers: 4
collectorServiceName: some-collector-service

@pavolloffay
Copy link
Member

The next week I will have time to look at #927

@gouthamve
Copy link
Contributor Author

I'm running this in prod for the past 10 days and everything works great, we used to drop a ton of batches due to timeouts, but right now those occurrences are super rare when running with 5s timeout.

@pavolloffay
Copy link
Member

@gouthamve sorry for a longer delay. If i understand it correctly tchannel flags are prefixed with tchannel.

If we cannot make it work like it's in #1034 (comment). I think this could go in as there are no other options. cc @yurishkuro

@gouthamve
Copy link
Contributor Author

They are not prefixed right now. Changing flags is a breaking change. I could do it if that is preferred though.

@pavolloffay
Copy link
Member

@gouthamve #1092 addresses agent refactoring. Once that is in this will have to be rebased.

@pavolloffay
Copy link
Member

@gouthamve could you please rebase this on the current master? We have refactored agent flags now it should be easy to add new ones.

@gouthamve
Copy link
Contributor Author

@pavolloffay Rebased :) I'll run this internally and report on how it performs.

@gouthamve
Copy link
Contributor Author

gouthamve commented Nov 16, 2018

https://github.com/jaegertracing/jaeger/blob/master/cmd/agent/app/reporter/tchannel/flags.go#L82-L83

would always overwrite this:
https://github.com/jaegertracing/jaeger/blob/master/cmd/agent/app/reporter/tchannel/flags.go#L70-L77
which means people using the deprecated flags are essentially getting the wrong values.

Fix is in the last commit. It wasn't caught before as flag values from the previous iteration are carried forward to the next iteration.

@gouthamve
Copy link
Contributor Author

Hmm, the test failure looks like a flake.

@pavolloffay
Copy link
Member

@gouthamve are you also fixing #1192 in this PR?

@gouthamve
Copy link
Contributor Author

Yes, I need to re-init the flags for my tests to pass

Copy link
Member

@pavolloffay pavolloffay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks also for fixing the broken flags. A bonus would be a test verifying the timeout.

@yurishkuro could you please also have a look? I might lack some tchannel exprience.

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm overall, a couple minor nits

@@ -30,6 +30,8 @@ const (
hostPort = "host-port"
discoveryMinPeers = "discovery.min-peers"
discoveryConnCheckTimeout = "discovery.conn-check-timeout"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove blank line

cmd/agent/app/reporter/tchannel/flags.go Show resolved Hide resolved
}
if v.GetDuration(tchannelPrefix+discoveryConnCheckTimeout) != defaultConnCheckTimeout {
b.ConnCheckTimeout = v.GetDuration(tchannelPrefix + discoveryConnCheckTimeout)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking this does not solve the overriding problem if the manually provided value was the same as the default value, but it's still an improvement. The key thing is that we just need to remove the legacy flags soon.

cmd/agent/app/reporter/tchannel/flags.go Show resolved Hide resolved
@gouthamve
Copy link
Contributor Author

Rebased and feedback incorporated

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@yurishkuro yurishkuro merged commit 7d664e6 into jaegertracing:master Nov 17, 2018
pavolloffay pushed a commit that referenced this pull request Nov 23, 2018
* agent: Make timeout for reporting configurable

Signed-off-by: Goutham Veeramachaneni <[email protected]>

* Fix tests and make sure deprecated flags work

Signed-off-by: Goutham Veeramachaneni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants