[sourcemaps] Support multiple fleet addresses when requesting a sourcemap #5770
Conversation
💚 Build Succeeded
This pull request is now in conflicts. Could you fix it @estolfo? 🙏
Code looks good overall. I've left a few comments, let me know if you want to sync to discuss.
Force-pushed from ea2126d to 29990c8
sourcemap/fleet_store.go (Outdated)

```go
respReceived := make(chan string, 1)
fetchErrReturned := make(chan error, 1)

for idx := range idxs {
```
Since we're making requests to all the URLs in parallel, is there a reason to use a random number generator?
I think it's still necessary to use a random number generator: though the go routines are executed concurrently, they are started serially. Over time, I would bet that the tiny difference in start times would become significant and we'd see the go routines started for hosts towards the beginning of the slice finish before those towards the end. I need to research go routines more, but if they are terminated when the surrounding function returns, some fleet hosts with higher slice indexes might never receive a request.
I've also read that go programs don't necessarily use multiple os threads and that the `GOMAXPROCS` default value is 1. So if this were running using a single os thread, the go routines wouldn't be running in parallel and only the fleet hosts with lower index values (or perhaps just the first one) would serve a request.
Let me know if I'm mistaken; I'm still just a Go beginner 😃
> though the go routines are executed concurrently, they are started serially. Over time, I would bet that the tiny difference in start times would become significant and we'd see the go routines started for hosts towards the beginning of the slice finish before those towards the end.
This could be true! In general, I would assume that the time to start a go routine is a lot lower than the time to finish a network request. I can't find any numbers atm for how long it takes to spawn a go routine, but I would assume a network request takes somewhere on the order of milliseconds, while spawning a go routine is probably somewhere in the microseconds realm.
> I need to research go routines more, but if they are terminated when the surrounding function returns, some fleet hosts with higher slice indexes might never receive a request.
Without coordination, go routines exist outside of the lifecycle of the function that starts them (except for the main goroutine, goroutines have no parents or children, in contrast to an os thread or process). In this case, each index will create a goroutine and execute a network request. Since the code below, which reads the result, is exiting as soon as the first response is received (successful or not), the next network request will send to the buffered channel, and any remaining go routines will be stuck forever attempting to send on the full channel.
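A standalone sketch of the leak just described (not the PR's code, and the names here are illustrative): the channel has capacity 1 and the caller returns after the first receive, so one more send fills the buffer and every remaining goroutine blocks forever on its send.

```go
package main

import (
	"fmt"
	"time"
)

// fetchFirst fans out to every host but only ever reads one result.
func fetchFirst(hosts []string) string {
	results := make(chan string, 1) // buffer holds exactly one value

	for _, h := range hosts {
		go func(host string) {
			time.Sleep(10 * time.Millisecond) // stand-in for a network request
			// The first send is received below, one more send fits the
			// buffer, and every remaining goroutine blocks here forever.
			results <- host
		}(h)
	}

	return <-results // return on the first result, abandoning the rest
}

func main() {
	fmt.Println(fetchFirst([]string{"ts0", "ts1", "ts2"}))
}
```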
> I've also read that go programs don't necessarily use multiple os threads and that the GOMAXPROCS default value is 1. So if this were running using a single os thread, the go routines wouldn't be running in parallel and only the fleet hosts with lower index values (or perhaps just the first one) would serve a request.
There may be a single OS thread that performs CPU bound work, but for network i/o the requests will all be started and then the go runtime will most likely context switch to a different, non-waiting go routine (there are a few well-defined places where the go runtime will context switch). Depending on which web request completes first, that go routine will be "woken up" and continue processing.
The default value for `GOMAXPROCS` is the number of CPUs on the system: https://pkg.go.dev/runtime#GOMAXPROCS
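A quick way to verify this locally (a tiny standalone snippet, not part of the PR):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current setting without changing it;
	// since Go 1.5 it defaults to the number of CPUs.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println("NumCPU:    ", runtime.NumCPU())
}
```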
FYI, this is ranging over the index slice indices, rather than the index slice values:

```go
for idx := range idxs {
```

i.e. even if `idxs` contains the values `[3, 2, 1]`, that range statement will produce the values `0`, `1`, `2`. To range over the values, use `for _, idx := range idxs`.
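A standalone demonstration of the difference (not the PR's code):

```go
package main

import "fmt"

func main() {
	idxs := []int{3, 2, 1}

	for idx := range idxs {
		fmt.Print(idx, " ") // prints 0 1 2: the positions, not the stored values
	}
	fmt.Println()

	for _, idx := range idxs {
		fmt.Print(idx, " ") // prints 3 2 1: the values
	}
	fmt.Println()
}
```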
Anyway, I would prefer we keep it simple and just fire them off in parallel without any randomisation. Later on, if necessary, we can make it smarter.
> I think it's still necessary to use a random number generator: though the go routines are executed concurrently, they are started serially. Over time, I would bet that the tiny difference in start times would become significant and we'd see the go routines started for hosts towards the beginning of the slice finish before those towards the end. I need to research go routines more, but if they are terminated when the surrounding function returns, some fleet hosts with higher slice indexes might never receive a request.
Echoing what @stuartnelson3 mentioned above: it is necessary to coordinate cancellation, and starting goroutines is very cheap and fast. Given that it's fast to start them, the order in which they are started is going to be insignificant. We should cancel subsequent requests on the first successful response, but that's only likely to cancel a request mid-flight rather than before it starts.
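A minimal sketch of that cancel-on-first-success shape, assuming plain GET requests and hypothetical names (`fetchFirstSuccess` is illustrative; this is not the code merged in the PR):

```go
package sketch

import (
	"context"
	"fmt"
	"io"
	"net/http"
)

// fetchFirstSuccess issues the request to every URL in parallel, returns
// the first successful body, and cancels whatever is still in flight.
func fetchFirstSuccess(ctx context.Context, urls []string) ([]byte, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // aborts any requests still in flight once we return

	type result struct {
		body []byte
		err  error
	}
	// Buffered for every goroutine, so none of them can block on send
	// after this function has returned (avoiding the leak noted above).
	results := make(chan result, len(urls))

	for _, u := range urls {
		go func(u string) {
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
			if err != nil {
				results <- result{err: err}
				return
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				results <- result{err: err}
				return
			}
			defer resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				results <- result{err: fmt.Errorf("%s: status %d", u, resp.StatusCode)}
				return
			}
			body, err := io.ReadAll(resp.Body)
			results <- result{body: body, err: err}
		}(u)
	}

	var lastErr error
	for range urls {
		r := <-results
		if r.err == nil {
			return r.body, nil // first success wins; the deferred cancel aborts the rest
		}
		lastErr = r.err
	}
	return nil, lastErr
}
```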
GitHub doesn't want to allow me to continue the thread, so responding to #5770 (comment) here:
I'm not clear on why it would be necessary to test that a host is selected randomly. I'm using a package function `rand.Perm`, which I can assume works correctly. So if we get a successful response, we can assume we are selecting a host randomly. Let me know if there's something important I'm missing about the way I've written the code.
Agreed - we shouldn't be testing that `rand.Perm` is doing its job. Rather we should test the external expectations of this code. Now that we're making the requests in parallel it's not so relevant to test randomisation. If we were issuing them sequentially but in random order, then a test (not necessarily a unit test, could be a functional test) should have picked up the `range` issue I noted, i.e. there was no random ordering at all.
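For reference, a tiny standalone illustration of visiting hosts in random order with `rand.Perm`, ranging over its values rather than its indices (not the PR's code):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	hosts := []string{"ts0", "ts1", "ts2"}
	// rand.Perm(n) returns a random permutation of [0, n);
	// note the `_, idx` form, per the range note above.
	for _, idx := range rand.Perm(len(hosts)) {
		fmt.Println(hosts[idx])
	}
}
```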
This pull request is now in conflicts. Could you fix it @estolfo? 🙏
Force-pushed from 5a7dd1c to e48c2c0
Looking good, thanks!
sourcemap/fleet_store_test.go (Outdated)

```go
select {
case <-requestReceived:
default:
	t.Fatal("No second request received by handler")
}
```
I think this is maybe a little bit misleading, and IIANM it's racy:

- since we make the requests in parallel, a second request may be received by the handler before the request (context) is cancelled
- even though we wait for `<-fetchReturned` above, we don't wait for all requests to complete in the fetch method if we get a success; so it is possible (if unlikely) that the second request will send to `requestReceived`, and this assertion will fail
I'd suggest changing all the above (line 137+) to:
```go
resp, err := f.fetch(context.Background(), name, version, path)
require.NoError(t, err)
assert.Contains(t, resp, "webpack:///bundle.js")
```
> even though we wait for <-fetchReturned above, we don't wait for all requests to complete in the fetch method if we get a success

Since the successful handler blocks until a read from `requestReceived`, I read this test as verifying that `fetch` returns after it receives its first successful response, even if there are other requests that haven't returned yet. I think the handler is supposed to receive both requests.

> so it is possible (if unlikely) that the second request will send to requestReceived, and this assertion will fail

I think this is what is being tested, that the second request has sent to `requestReceived`. The successful handler is blocked by a send to `requestReceived`, so a single receive on that and then a receive from `requestReceived` means `fetch()` returned after its first successful response, with the second request still waiting within the handler? And the final receive is verifying that both requests were sent to the handler.

The `ctx` in the handler is not the same as the `ctx` in the request, right? So a successful response + `cancel()` of the request ctx won't affect the `ctx` in the server.
I agree that this check is unreliable, because the request following the successful request might not reach the handler:

```go
select {
case <-requestReceived:
default:
	t.Fatal("No second request received by handler")
}
```

But I still don't understand why this is not relevant for testing that, the first time the handler is reached, the fetch method has returned successfully?
```go
select {
case resp := <-fetchReturned:
	assert.Contains(t, resp, "webpack:///bundle.js")
	assert.EqualValues(t, 3, called)
case <-time.After(10 * time.Second):
	t.Fatal("timed out waiting for Fetch to return")
}
```
> I think this is what is being tested, that the second request has sent to requestReceived. The successful handler is blocked by a send to requestReceived, so a single receive on that and then a receive from requestReceived means fetch() returned after its first successful response, with the second request still waiting within the handler? And the final receive is verifying that both requests were sent to the handler.
>
> The ctx in the handler is not the same as the ctx in the request, right? So a successful response + cancel() of the request ctx won't affect the ctx in the server.
@stuartnelson3 right, but the request may never even make it to the 2nd+ servers. Hypothetical ordering:
- client issues request to ts0, ts1, ts2
- request is transmitted to ts0
- ts0 processes request, responds with success
- client processes response, cancels context
- because context is cancelled, requests to ts1 and ts2 are never transmitted
- test fails on the second `case <-requestReceived`
It's unlikely, but run it enough times on a slow CI box and it will eventually happen.
> But I still don't understand why this is not relevant for testing that, the first time the handler is reached, the fetch method has returned successfully?
@estolfo you're referring to the fetchReturned/timeout bit specifically? The existing `TestFetchContext` test which uses this pattern is doing so because it's testing that asynchronous context cancellation works -- we're not doing that here (at the moment).
I was suggesting the simplification to expedite merging the PR, but I think it would be good to test at this level of detail - it just might take a bit longer. Something like the steps below (sketched in code after the list):
- call fetch in a goroutine (like you're doing)
- block until all servers have received their request
- have ts0 respond with failure
- have ts1 respond with success
- block until ts2's request is cancelled
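A rough sketch of that test shape, reusing the hypothetical `fetchFirstSuccess` from the sketch earlier in this thread; the channel names and wiring here are illustrative, not the test that was merged.

```go
import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestFetchFirstSuccess(t *testing.T) {
	received := make(chan int, 3) // which server index got a request
	respond := [3]chan int{make(chan int), make(chan int), make(chan int)}
	ts2Cancelled := make(chan struct{})

	var urls []string
	for i := 0; i < 3; i++ {
		i := i
		ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			received <- i // signal that this server has received its request
			select {
			case code := <-respond[i]: // the test decides how we respond
				w.WriteHeader(code)
				fmt.Fprint(w, `{"sourceRoot":"webpack:///bundle.js"}`)
			case <-r.Context().Done(): // the client cancelled this request
				if i == 2 {
					close(ts2Cancelled)
				}
			}
		}))
		defer ts.Close()
		urls = append(urls, ts.URL)
	}

	// 1. call fetch in a goroutine
	type fetchResult struct {
		body []byte
		err  error
	}
	fetchReturned := make(chan fetchResult, 1)
	go func() {
		body, err := fetchFirstSuccess(context.Background(), urls)
		fetchReturned <- fetchResult{body, err}
	}()

	// 2. block until all servers have received their request
	for i := 0; i < 3; i++ {
		<-received
	}

	respond[0] <- http.StatusInternalServerError // 3. ts0 responds with failure
	respond[1] <- http.StatusOK                  // 4. ts1 responds with success

	// 5. block until ts2's request is cancelled
	select {
	case <-ts2Cancelled:
	case <-time.After(10 * time.Second):
		t.Fatal("timed out waiting for ts2's request to be cancelled")
	}

	res := <-fetchReturned
	require.NoError(t, res.err)
	assert.Contains(t, string(res.body), "webpack:///bundle.js")
}
```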
sourcemap/fleet_store_test.go (Outdated)

```go
select {
case resp := <-fetchReturned:
	assert.Contains(t, resp, "webpack:///bundle.js")
	assert.EqualValues(t, 3, called)
```
there's no guarantee that all requests have been received by the time fetch returns, since we return early if we see a successful response from one server
…he results channel
Co-authored-by: Andrew Wilkins <[email protected]>
…emap (#5770)

* Send request to multiple fleet servers when fetching sourcemap
* Handle case when context is canceled before any results are sent to the results channel
* Update sourcemap/fleet_store.go (Co-authored-by: Andrew Wilkins <[email protected]>)
* Combine test for single and multiple fleet servers
* Use atomic.AddInt32 to increment shared request counter var
* Add test for a mix of successful and failed fleet server sourcemap requests
* Simplify test for failed and successful fleet hosts fetch
* Context should be first argument
* Lock setting shared variable in test
* Remove unnecessary called var

Co-authored-by: Andrew Wilkins <[email protected]>
(cherry picked from commit 212f099)

Conflicts: sourcemap/fleet_store_test.go
@Mergifyio backport 7.x
…emap (#5770) (#5849)

* Send request to multiple fleet servers when fetching sourcemap
* Handle case when context is canceled before any results are sent to the results channel
* Update sourcemap/fleet_store.go (Co-authored-by: Andrew Wilkins <[email protected]>)
* Combine test for single and multiple fleet servers
* Use atomic.AddInt32 to increment shared request counter var
* Add test for a mix of successful and failed fleet server sourcemap requests
* Simplify test for failed and successful fleet hosts fetch
* Context should be first argument
* Lock setting shared variable in test
* Remove unnecessary called var

Co-authored-by: Andrew Wilkins <[email protected]>
Co-authored-by: Emily S <[email protected]>
Tested with BC1, looks good.
These changes support fetching sourcemaps from one of many fleet servers: the request is sent to all of the fleet servers in parallel, and the first successful response is returned.
I've made some adjustments to the `fleetStore` struct to keep a map of the `(service.name, service.version, bundle.filepath)` combinations to the `sourcemap.url`. I've also added a separate slice to store the list of base fleet server addresses, which represent the configured protocol with the host and port of a given fleet server. The full URL is constructed when the sourcemap is fetched, using a base fleet server URL and the sourcemap URL (path).
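A hypothetical sketch of that shape (the type and field names here are illustrative, not necessarily those in the merged code):

```go
// key identifies a sourcemap by service and bundle path.
type key struct {
	serviceName    string
	serviceVersion string
	bundleFilepath string
}

type fleetStore struct {
	// fleetBaseURLs holds one scheme://host:port entry per configured
	// fleet server.
	fleetBaseURLs []string
	// sourceMapURLs maps (service.name, service.version, bundle.filepath)
	// to the sourcemap.url path; the full URL for a fetch is a base URL
	// joined with this path.
	sourceMapURLs map[key]string
}
```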
Closes #5514
How to test these changes
Taking the test plan from #5410:
Run elasticsearch and multiple elastic-agents in fleet mode. 2 or 3 should be sufficient for this test.
Start kibana and apm-server with the fleet-server sourcemap code.