Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JS client closes connection and refuses to accept new identify calls #181

Closed
edvinerikson opened this issue Oct 14, 2019 · 15 comments
Closed

Comments

@edvinerikson
Copy link
Contributor

edvinerikson commented Oct 14, 2019

Is this a support request?
No

Describe the bug
LaunchDarkly closes stream connection and ignores user updates when beforeunload event is fired.
Our code uses window.location = "some-url-to-trigger-native-mobile-app-launch" to start a native app installed on the phone to authenticate the web session. When we return back to the web app with the session, the launchdarkly connection is closed and the call to identify() is ignored.

To reproduce
Steps to reproduce the behavior.
Call client.identify({key: '1'})
Check that client.getUser() returns expected user.
Call window.location = "bankid:///test"
Call client.identify({key: '2'}).
Check client.getUser() return the new user.
The last step won't work and will return user 1 instead.

Expected behavior
I expect that client.identify() will call start the client again or that the client wasn't closed to begin with.

SDK version
launchdarkly-js-client-sdk JSClient/2.13.0

OS/platform
Google chrome

@eli-darkly
Copy link
Contributor

I'm not sure I completely understand the use case. When you say "we return back to the web app with the session"... at that point, is it loading a new page/reloading the same page, or is the same page still active?

It sounds like you're saying it's the latter, but in that case 1. I'm not clear on how your JS code is receiving information about the new session, 2. I would not have expected to see a beforeunload event, and 3. I'm not sure if there is a defined behavior for how JS code would detect that the JS environment for the page did not actually get torn down following a beforeunload event.

The way the client is currently implemented, if it gets a beforeunload event it behaves as if you called client.close(). It does this for two reasons: if there are any pending analytics events, we need to try to deliver them (synchronously) right away before it's too late, and also, we don't want to leave any timer tasks active. So it's true that trying to use the client after it's received that event will not work. This is, I think, a fairly common pattern with JS libraries that need to close connections or do other kinds of cleanup on page unload. If, in this kind of mobile situation, it's not actually going to unload the page and you want it to keep executing JS code as if nothing had happened, then maybe we need to be listening to unload instead (although I don't know if that event would get fired too in this situation).

@edvinerikson
Copy link
Contributor Author

Yes, the browser keeps the page active.

The whole session thing isn't that relevant, but essentially we receive the session through our servers and not directly from the native app. Both the web app and native app exchange tokens server side.

I am not clear on how to solve it either. But perhaps you can make it a softer close when it's implicit through the beforeunload event, eg only flush events but not closing the client. Or closing it, but make it possible that calls to other sdk methods will restart it again.

Another thing is that we can try to change our code to not rely on location.. and instead use the open method on the window object. But perhaps that has the same problem.

@edvinerikson
Copy link
Contributor Author

Would it be possible to have a option to disable the shutdown and only flush the events?

We have been trying to use alternative apis to launch the native app, but none seems to work as good as window.location.

@eli-darkly
Copy link
Contributor

We're looking at a couple of possible fixes. There's also the possibility that another beforeunload handler (in application code) could deliberately block navigating away from the page.

[tracked internally as 55109]

@eli-darkly
Copy link
Contributor

We've changed this in the 2.15.0 release so that it will flush events, but not close the client. Shutting down the client prior to closing the page was not accomplishing much, since nothing in the current JS environment would persist anyway.

We've tested this change in several ways but may not have covered your exact use case, so please let me know when you've had a chance to retest with the new release. Thanks for bringing this to our attention.

LaunchDarklyCI pushed a commit that referenced this issue Nov 6, 2019
refactor to isolate the sync XHR logic; expose close() method
@eli-darkly
Copy link
Contributor

We're closing this issue since we believe it has been fixed as of 2.15.0. Please feel free to reopen it if you find that there is still a problem.

@edvinerikson
Copy link
Contributor Author

Sorry for not coming back! We have a pending change to update the sdk version. However we are pretty confident that this fix will work! Thank you! :)

@edvinerikson
Copy link
Contributor Author

We finally updated the SDK, and it turns out that this did not solve our issue. But it made it more obvious.

It seems that all Chromium browsers cancel pending requests when someone mutates window.location. Ever since we updated the SDK we have received thousands of errors like this: LD: [error] Expected application/json content type but got "".

I tried to debug this in my browser and I couldn't wrap my head around how it happened.. I couldn't see the SDK doing any requests (devtools didn't display anything).. although the SDK called the http request method, and the response handler in the SDK somehow got a mystery "200 OK" response with nothing but the status code (headers and body were empty).

I would call this a Chromium bug/feature.. and I am wonder what you think is a appropriate action here..

So far I have thought of these:

  1. We investigate and see if we can catch the error in the promise returned by identify() and somehow re-run it all until it works.
  2. The SDK itself adds a layer of resiliency and re-runs the calls if it fails for this reason.
  3. File a bug on Chromium tracker.

  1. Is probably the safest one if its possible to catch the error as a promise rejection.
  2. Is the easiest from our end since we don't really need to deal with it in product code. But it might change the SDK behavior in unexpected ways for other consumers.
  3. seems unlikely to have any effect in the short term.

@edvinerikson
Copy link
Contributor Author

I think I managed to get a relatively simple reproduction case.
https://codesandbox.io/s/distracted-zhukovsky-uzljh

@eli-darkly
Copy link
Contributor

@edvinerikson To be clear, when you say "this did not solve our issue" do you mean that it didn't solve every issue, since you are now having unwanted error messages— or that you are actually still experiencing the issue that you originally reported, where the LD client permanently lost the connection and refused to acknowledge further identify() calls after the user had cancelled a navigation attempt?

@edvinerikson
Copy link
Contributor Author

Sorry, it's a additional issue that got revealed after the patch.

@eli-darkly
Copy link
Contributor

I think there are several things we could do to address this, and I'm filing it internally to be investigated soon. How soon, and what we do first, would depend on what the current impact of this is. If the biggest issue is that the extra error output is overwhelming your monitoring, then we would focus on making it able to identify when the error is this particular kind of Chrome error and having it log at a lower level in that case. There might or might not be other things we could do to avoid it happening in the first place— for instance, if the beforeunload event happens before Chrome cuts off the connections, then maybe it would make sense for the SDK to go into a special state at that point where it suppresses error logging and retries requests.

@edvinerikson
Copy link
Contributor Author

To elaborate a bit more. What I meant with "this did not solve our issue" was that it did not make any impact on the "end issue" that we have. That LaunchDarkly runs with stale/invalid data and our users sees the wrong flags.

However the patch in .15 got us one step closer to fixing that issue. From our end this issue is blocking our development of a experimentation platform that runs on top of LaunchDarkly. Since a large portion of our user base runs with invalid flag values the data export becomes very unpredictable.

Perhaps it's worth opening a new issue on GitHub to track this as I think it might grow even larger.
So far I suspect these cases to play a role in this:

  • SDK shutsdown on beforeunload event - fixed in .15
  • identify api call fails (causes the SDK to run on stale data) - verified by repro case, not fixed.
  • identify api call sets the user regardless if it fails or not (causes the data export to associate events with the wrong user) - not verified yet.
  • Transmission of analytical events fail due to chrome killing of the connection (We're seeing user count mismatch and that would indicate that some events does not reach your ingestion endpoint at all). - not verified yet.

@eli-darkly
Copy link
Contributor

@edvinerikson I do think a new Github issue would be a good idea, since this is a somewhat different problem that doesn't have the same fix.

I'm still slightly unclear on a couple of your observations. If the result is that "LaunchDarkly runs with stale/invalid data" and that this is affecting a large number of users - I'm not seeing exactly how one would get to that result from the specific behavior you observed.

That is, my understanding so far has been that if the user starts to navigate away from the page, Chrome kills any pending request that is happening at that moment. And it sounds like the pending request you're talking about is from an identify() call that presumably happened very shortly before the navigation attempt. And it would only matter if the user then cancelled the navigation attempt so that the same JS code remained running on the page. Is that sequence of events a very common occurrence? I wouldn't expect it to be, but you said "a large portion of our user base runs with invalid flag values" so I'm wondering if there is a more frequent scenario involved that I'm not understanding.

Also: "identify api call sets the user regardless if it fails or not (causes the data export to associate events with the wrong user)" - that is not the current behavior as far as I know. You previously identified that issue and submitted a PR for it, we accepted the PR, and the fix has been in the SDK since the 2.16.0 release.

If the desired behavior is for identify() to retry on a failed request (regardless of what caused the request to fail) then we would have to think about the rules and limits for retries. I believe it doesn't silently fail, though— the promise returned by identify() should be rejected in this case.

@eli-darkly
Copy link
Contributor

But, again, let's take this to a new issue. If I reopened this one, the title would be pretty misleading at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants