[sourcemaps] Support multiple fleet addresses when requesting a sourcemap #5770
Conversation
💚 Build Succeeded
This pull request is now in conflicts. Could you fix it @estolfo? 🙏
Code looks good overall. I've left a few comments, let me know if you want to sync to discuss.
Force-pushed from ea2126d to 29990c8
sourcemap/fleet_store.go (Outdated)

```go
respReceived := make(chan string, 1)
fetchErrReturned := make(chan error, 1)

for idx := range idxs {
```
Since we're making requests to all the URLs in parallel, is there a reason to use a random number generator?
I think it's still necessary to use a random number generator: though the go routines are executed concurrently, they are started serially. Over time, I would bet that the tiny difference in start times would become significant and we'd see the go routines started for hosts towards the beginning of the slice finish before those towards the end. I need to research go routines more, but if they are terminated when the surrounding function returns, some fleet hosts with higher slice indexes might never receive a request.
I've also read that go programs don't necessarily use multiple os threads and that the `GOMAXPROCS` default value is 1. So if this were running using a single os thread, the go routines wouldn't be running in parallel and only the fleet hosts with lower index values (or perhaps just the first one) would serve a request.
Let me know if I'm mistaken; I'm still just a Go beginner 😃
> though the go routines are executed concurrently, they are started serially. Over time, I would bet that the tiny difference in start times would become significant and we'd see the go routines started for hosts towards the beginning of the slice finish before those towards the end.
This could be true! In general, I would assume that the time to start a go routine is a lot lower than the time to finish a network request. I can't find any numbers atm for how long it takes to spawn a go routine, but I would assume a network request takes somewhere on the order of milliseconds, while spawning a go routine is probably somewhere in the microseconds realm.
> I need to research go routines more, but if they are terminated when the surrounding function returns, some fleet hosts with higher slice indexes might never receive a request.
Without coordination, go routines exist outside of the lifecycle of the function that starts them (except for the main goroutine, goroutines have no parents or children, in contrast to an os thread or process). In this case, each index will create a goroutine and execute a network request. Since the code below, which reads the result, is exiting as soon as the first response is received (successful or not), the next network request will send to the buffered channel, and any remaining go routines will be stuck forever attempting to send on the full channel.
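A standalone sketch of the leak just described (not the PR's code, and the names here are illustrative): the channel has capacity 1 and the caller returns after the first receive, so one more send fills the buffer and every remaining goroutine blocks forever on its send.

```go
package main

import (
	"fmt"
	"time"
)

// fetchFirst fans out to every host but only ever reads one result.
func fetchFirst(hosts []string) string {
	results := make(chan string, 1) // buffer holds exactly one value

	for _, h := range hosts {
		go func(host string) {
			time.Sleep(10 * time.Millisecond) // stand-in for a network request
			// The first send is received below, one more send fits the
			// buffer, and every remaining goroutine blocks here forever.
			results <- host
		}(h)
	}

	return <-results // return on the first result, abandoning the rest
}

func main() {
	fmt.Println(fetchFirst([]string{"ts0", "ts1", "ts2"}))
}
```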
> I've also read that go programs don't necessarily use multiple os threads and that the GOMAXPROCS default value is 1. So if this were running using a single os thread, the go routines wouldn't be running in parallel and only the fleet hosts with lower index values (or perhaps just the first one) would serve a request.
There may be a single OS thread that performs CPU bound work, but for network i/o the requests will all be started and then the go runtime will most likely context switch to a different, non-waiting go routine (there are a few well-defined places where the go runtime will context switch). Depending on which web request completes first, that go routine will be "woken up" and continue processing.
The default value for `GOMAXPROCS` is the number of CPUs on the system: https://pkg.go.dev/runtime#GOMAXPROCS
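A quick way to verify this locally (a tiny standalone snippet, not part of the PR):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current setting without changing it;
	// since Go 1.5 it defaults to the number of CPUs.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println("NumCPU:    ", runtime.NumCPU())
}
```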
FYI, this is ranging over the index slice indices, rather than the index slice values:

```go
for idx := range idxs {
```

i.e. even if `idxs` contains the values `[3, 2, 1]`, that range statement will produce the values `0`, `1`, `2`. To range over the values, use `for _, idx := range idxs`.
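A standalone demonstration of the difference (not the PR's code):

```go
package main

import "fmt"

func main() {
	idxs := []int{3, 2, 1}

	for idx := range idxs {
		fmt.Print(idx, " ") // prints 0 1 2: the positions, not the stored values
	}
	fmt.Println()

	for _, idx := range idxs {
		fmt.Print(idx, " ") // prints 3 2 1: the values
	}
	fmt.Println()
}
```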
Anyway, I would prefer we keep it simple and just fire them off in parallel without any randomisation. Later on, if necessary, we can make it smarter.
> I think it's still necessary to use a random number generator: though the go routines are executed concurrently, they are started serially. Over time, I would bet that the tiny difference in start times would become significant and we'd see the go routines started for hosts towards the beginning of the slice finish before those towards the end. I need to research go routines more, but if they are terminated when the surrounding function returns, some fleet hosts with higher slice indexes might never receive a request.
Echoing what @stuartnelson3 mentioned above: it is necessary to coordinate cancellation, and starting goroutines is very cheap and fast. Given that it's fast to start them, the order in which they are started is going to be insignificant. We should cancel subsequent requests on the first successful response, but that's only likely to cancel a request mid-flight rather than before it starts.
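A minimal sketch of that cancel-on-first-success shape, assuming plain GET requests and hypothetical names (`fetchFirstSuccess` is illustrative; this is not the code merged in the PR):

```go
package sketch

import (
	"context"
	"fmt"
	"io"
	"net/http"
)

// fetchFirstSuccess issues the request to every URL in parallel, returns
// the first successful body, and cancels whatever is still in flight.
func fetchFirstSuccess(ctx context.Context, urls []string) ([]byte, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // aborts any requests still in flight once we return

	type result struct {
		body []byte
		err  error
	}
	// Buffered for every goroutine, so none of them can block on send
	// after this function has returned (avoiding the leak noted above).
	results := make(chan result, len(urls))

	for _, u := range urls {
		go func(u string) {
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
			if err != nil {
				results <- result{err: err}
				return
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				results <- result{err: err}
				return
			}
			defer resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				results <- result{err: fmt.Errorf("%s: status %d", u, resp.StatusCode)}
				return
			}
			body, err := io.ReadAll(resp.Body)
			results <- result{body: body, err: err}
		}(u)
	}

	var lastErr error
	for range urls {
		r := <-results
		if r.err == nil {
			return r.body, nil // first success wins; the deferred cancel aborts the rest
		}
		lastErr = r.err
	}
	return nil, lastErr
}
```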
GitHub doesn't want to allow me to continue the thread, so responding to #5770 (comment) here:
I'm not clear on why it would be necessary to test that a host is selected randomly. I'm using a package function `rand.Perm`, which I can assume works correctly. So if we get a successful response, we can assume we are selecting a host randomly. Let me know if there's something important I'm missing about the way I've written the code.
Agreed - we shouldn't be testing that `rand.Perm` is doing its job. Rather we should test the external expectations of this code. Now that we're making the requests in parallel it's not so relevant to test randomisation. If we were issuing them sequentially but in random order, then a test (not necessarily a unit test, could be a functional test) should have picked up the `range` issue I noted, i.e. there was no random ordering at all.
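For reference, a tiny standalone illustration of visiting hosts in random order with `rand.Perm`, ranging over its values rather than its indices (not the PR's code):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	hosts := []string{"ts0", "ts1", "ts2"}
	// rand.Perm(n) returns a random permutation of [0, n);
	// note the `_, idx` form, per the range note above.
	for _, idx := range rand.Perm(len(hosts)) {
		fmt.Println(hosts[idx])
	}
}
```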
This pull request is now in conflicts. Could you fix it @estolfo? 🙏
Force-pushed from 5a7dd1c to e48c2c0
Looking good, thanks!
sourcemap/fleet_store_test.go (Outdated)

```go
select {
case <-requestReceived:
default:
	t.Fatal("No second request received by handler")
}
```
I think this is maybe a little bit misleading, and IIANM it's racy:

- since we make the requests in parallel, a second request may be received by the handler before the request (context) is cancelled
- even though we wait for `<-fetchReturned` above, we don't wait for all requests to complete in the fetch method if we get a success; so it is possible (if unlikely) that the second request will send to `requestReceived`, and this assertion will fail
I'd suggest changing all the above (line 137+) to:
```go
resp, err := f.fetch(context.Background(), name, version, path)
require.NoError(t, err)
assert.Contains(t, resp, "webpack:///bundle.js")
```
> even though we wait for <-fetchReturned above, we don't wait for all requests to complete in the fetch method if we get a success

Since the successful handler blocks until a read from `requestReceived`, I read this test as verifying that `fetch` returns after it receives its first successful response, even if there are other requests that haven't returned yet. I think the handler is supposed to receive both requests.

> so it is possible (if unlikely) that the second request will send to requestReceived, and this assertion will fail

I think this is what is being tested, that the second request has sent to `requestReceived`. The successful handler is blocked by a send to `requestReceived`, so a single receive on that and then a receive from `requestReceived` means `fetch()` returned after its first successful response, with the second request still waiting within the handler? And the final receive is verifying that both requests were sent to the handler.

The `ctx` in the handler is not the same as the `ctx` in the request, right? So a successful response + `cancel()` of the request ctx won't affect the `ctx` in the server.
I agree that this check is unreliable, because the request following the successful request might not reach the handler:

```go
select {
case <-requestReceived:
default:
	t.Fatal("No second request received by handler")
}
```

But I still don't understand why this is not relevant for testing that, the first time the handler is reached, the fetch method has returned successfully?
```go
select {
case resp := <-fetchReturned:
	assert.Contains(t, resp, "webpack:///bundle.js")
	assert.EqualValues(t, 3, called)
case <-time.After(10 * time.Second):
	t.Fatal("timed out waiting for Fetch to return")
}
```
> I think this is what is being tested, that the second request has sent to requestReceived. The successful handler is blocked by a send to requestReceived, so a single receive on that and then a receive from requestReceived means fetch() returned after its first successful response, with the second request still waiting within the handler? And the final receive is verifying that both requests were sent to the handler.
>
> The ctx in the handler is not the same as the ctx in the request, right? So a successful response + cancel() of the request ctx won't affect the ctx in the server.
@stuartnelson3 right, but the request may never even make it to the 2nd+ servers. Hypothetical ordering:
- client issues request to ts0, ts1, ts2
- request is transmitted to ts0
- ts0 processes request, responds with success
- client processes response, cancels context
- because context is cancelled, requests to ts1 and ts2 are never transmitted
- test fails on the second `case <-requestReceived`
It's unlikely, but run it enough times on a slow CI box and it will eventually happen.
> But I still don't understand why this is not relevant for testing that, the first time the handler is reached, the fetch method has returned successfully?
@estolfo you're referring to the fetchReturned/timeout bit specifically? The existing `TestFetchContext` test which uses this pattern is doing so because it's testing that asynchronous context cancellation works -- we're not doing that here (at the moment).
I was suggesting the simplification to expedite merging the PR, but I think it would be good to test at this level of detail - it just might take a bit longer. Something like the steps below (sketched in code after the list):
- call fetch in a goroutine (like you're doing)
- block until all servers have received their request
- have ts0 respond with failure
- have ts1 respond with success
- block until ts2's request is cancelled
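A rough sketch of that test shape, reusing the hypothetical `fetchFirstSuccess` from the sketch earlier in this thread; the channel names and wiring here are illustrative, not the test that was merged.

```go
import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestFetchFirstSuccess(t *testing.T) {
	received := make(chan int, 3) // which server index got a request
	respond := [3]chan int{make(chan int), make(chan int), make(chan int)}
	ts2Cancelled := make(chan struct{})

	var urls []string
	for i := 0; i < 3; i++ {
		i := i
		ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			received <- i // signal that this server has received its request
			select {
			case code := <-respond[i]: // the test decides how we respond
				w.WriteHeader(code)
				fmt.Fprint(w, `{"sourceRoot":"webpack:///bundle.js"}`)
			case <-r.Context().Done(): // the client cancelled this request
				if i == 2 {
					close(ts2Cancelled)
				}
			}
		}))
		defer ts.Close()
		urls = append(urls, ts.URL)
	}

	// 1. call fetch in a goroutine
	type fetchResult struct {
		body []byte
		err  error
	}
	fetchReturned := make(chan fetchResult, 1)
	go func() {
		body, err := fetchFirstSuccess(context.Background(), urls)
		fetchReturned <- fetchResult{body, err}
	}()

	// 2. block until all servers have received their request
	for i := 0; i < 3; i++ {
		<-received
	}

	respond[0] <- http.StatusInternalServerError // 3. ts0 responds with failure
	respond[1] <- http.StatusOK                  // 4. ts1 responds with success

	// 5. block until ts2's request is cancelled
	select {
	case <-ts2Cancelled:
	case <-time.After(10 * time.Second):
		t.Fatal("timed out waiting for ts2's request to be cancelled")
	}

	res := <-fetchReturned
	require.NoError(t, res.err)
	assert.Contains(t, string(res.body), "webpack:///bundle.js")
}
```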
sourcemap/fleet_store_test.go (Outdated)

```go
select {
case resp := <-fetchReturned:
	assert.Contains(t, resp, "webpack:///bundle.js")
	assert.EqualValues(t, 3, called)
```
there's no guarantee that all requests have been received by the time fetch returns, since we return early if we see a successful response from one server
…he results channel
Co-authored-by: Andrew Wilkins <[email protected]>
…emap (#5770)

* Send request to multiple fleet servers when fetching sourcemap
* Handle case when context is canceled before any results are sent to the results channel
* Update sourcemap/fleet_store.go (Co-authored-by: Andrew Wilkins <[email protected]>)
* Combine test for single and multiple fleet servers
* Use atomic.AddInt32 to increment shared request counter var
* Add test for a mix of successful and failed fleet server sourcemap requests
* Simplify test for failed and successful fleet hosts fetch
* Context should be first argument
* Lock setting shared variable in test
* Remove unnecessary called var

Co-authored-by: Andrew Wilkins <[email protected]>
(cherry picked from commit 212f099)

Conflicts: sourcemap/fleet_store_test.go
@Mergifyio backport 7.x
…emap (#5770) (#5849)

* Send request to multiple fleet servers when fetching sourcemap
* Handle case when context is canceled before any results are sent to the results channel
* Update sourcemap/fleet_store.go (Co-authored-by: Andrew Wilkins <[email protected]>)
* Combine test for single and multiple fleet servers
* Use atomic.AddInt32 to increment shared request counter var
* Add test for a mix of successful and failed fleet server sourcemap requests
* Simplify test for failed and successful fleet hosts fetch
* Context should be first argument
* Lock setting shared variable in test
* Remove unnecessary called var

Co-authored-by: Andrew Wilkins <[email protected]>
Co-authored-by: Emily S <[email protected]>
Tested with BC1, looks good.
These changes support fetching sourcemaps from one of many fleet servers: the request is sent to all of the fleet servers in parallel, and the first successful response is returned.
I've made some adjustments to the `fleetStore` struct to keep a map of the `(service.name, service.version, bundle.filepath)` combinations to the `sourcemap.url`. I've also added a separate slice to store the list of base fleet server addresses, which represent the configured protocol with the host and port of a given fleet server. The full URL is constructed when the sourcemap is fetched, using a base fleet server URL and the sourcemap URL (path).
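A hypothetical sketch of that shape (the type and field names here are illustrative, not necessarily those in the merged code):

```go
// key identifies a sourcemap by service and bundle path.
type key struct {
	serviceName    string
	serviceVersion string
	bundleFilepath string
}

type fleetStore struct {
	// fleetBaseURLs holds one scheme://host:port entry per configured
	// fleet server.
	fleetBaseURLs []string
	// sourceMapURLs maps (service.name, service.version, bundle.filepath)
	// to the sourcemap.url path; the full URL for a fetch is a base URL
	// joined with this path.
	sourceMapURLs map[key]string
}
```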
Closes #5514
How to test these changes
Taking the test plan from #5410:
Run elasticsearch and multiple elastic-agents in fleet mode. 2 or 3 should be sufficient for this test.
Start kibana and apm-server with the fleet-server sourcemap code.