-
Notifications
You must be signed in to change notification settings - Fork 361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
After recent changes in bbs selection process, passive nodes randomly causing errors #3109
Comments
Hi there - author of #3048 here. We're investigating this too, and also suspect some kind of link to this PR, though we haven't pinned down the exact issue yet. The method the Cloud Controller's |
That's good to know, thanks @will-gant I've no idea why changing We don't get any push errors when this logic is introduced and that's even when fiddling with diego-api to simulate bbs outage. I've used bosh to stop a diego-api node to see if the new logic cloudcontroller_ng still works when it fails over. It worked nicely, and then I brought the downed diego-api node back up, cloudcontroller_ng didn't notice or go back to using the original diego-api so I used bosh to stop the newly active one to force failback. App pushes worked throughout this process but I could just be lucky. This feature is quite intermittent. Hopefully have more evidence/logs by Monday |
Feedback, I was just very lucky. Pushes still randomly error with same sorts of error messages Thanks for the regression warning on capi-release |
One of my more knowledgeable colleagues has suggested doing the following
instead of the simpler
I'm re-testing in my env |
got a new error
|
Looks like the original change has been reverted and new cut of capi-release has been released @will-gant , you ok for me to close this? |
Yes, feel free. Many thanks indeed for your efforts, and apologies for the hassle! We expect to revisit this on our side in the new year. |
Issue
Recently moved to cf-deployment 23.5.0 with an ops-file to bump to capi-release 1.142.0. We found that some of our jobs which test regular app pushes were reporting errors that we haven't seen before.
Things like
and
Context
We think it's related to #3048
We looked into the cloud_controller_ng logs and inserted more logging. When the request goes is made to diego-api (BBS), occasionally it now goes to the passive BBS node which seems to cause these errors.
The passive BBS node looks like it's rejecting the connection. Here's the cloudcontroller logs for the request when it attempts to go to the passive node.
Steps to Reproduce
Push apps frequently and review the errors. you'll need to be using capi-release 1.142.0, bosh v276.1.1 and bosh-dns using the
/features/use_dns_addresses
flag set to true. Failure rate is v intermittent. You could have 20 successful pushes before the errors start coming in.Expected result
Successful pushes 100% of the time (I know I expect a lot)
Current result
Staging failures during app pushes.
and
Possible Fix
I managed to workaround this different behaviour by adding a
.sort
to https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib/diego/client.rb#L143 making it readips_remaining.sort.first
This may just be working because my primary happens to be the lower IP so it's probably not a good fix.
Not sure if it's related but it looks like it's relying on Bosh DNS (bbs.service.cf.internal:8889) to report the active diego-api (BBS) node as the
.first
entry. This isn't always the case. Bosh DNS just gives a list of possibles, not necessarily active nodes and not necessarily in the "right" order.The text was updated successfully, but these errors were encountered: