
After recent changes in bbs selection process, passive nodes randomly causing errors #3109

Closed
thelangley opened this issue Dec 16, 2022 · 7 comments

Comments

@thelangley

Issue

We recently moved to cf-deployment 23.5.0 with an ops-file to bump to capi-release 1.142.0. We found that some of our jobs that test regular app pushes were reporting errors we hadn't seen before.

Things like

Staging app and tracing logs...

Stager is unavailable: ipaddr value changed, but session already started

FAILED

and

Waiting for app to start...

Runner is unavailable: Process Guid: f4c41835-eaa0-44c9-9c96-026ae08e295e-7ff13535-d702-4997-b0c4-4a063c426a1a: ipaddr value changed, but session already started

FAILED

Context

We think it's related to #3048

We looked into the cloud_controller_ng logs and inserted more logging. When the request is made to diego-api (BBS), it now occasionally goes to the passive BBS node, which seems to cause these errors.

The passive BBS node looks like it's rejecting the connection. Here are the cloud_controller_ng logs for the request when it attempts to go to the passive node.

ERROR: #<Errno::ECONNREFUSED: Failed to open TCP connection to passive_node_ip:8889 (Connection refused - connect(2) for passive_node_ip:8889)>

Steps to Reproduce

Push apps frequently and review the errors. You'll need to be using capi-release 1.142.0, bosh v276.1.1, and bosh-dns with the /features/use_dns_addresses flag set to true. The failure rate is very intermittent; you could have 20 successful pushes before the errors start coming in.

Expected result

Successful pushes 100% of the time (I know I expect a lot)

Current result

Staging failures during app pushes.

Staging app and tracing logs...

Stager is unavailable: ipaddr value changed, but session already started

FAILED

and

Waiting for app to start...

Runner is unavailable: Process Guid: f4c41835-eaa0-44c9-9c96-026ae08e295e-7ff13535-d702-4997-b0c4-4a063c426a1a: ipaddr value changed, but session already started

FAILED

Possible Fix

I managed to work around this change in behaviour by adding a .sort at https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib/diego/client.rb#L143, making it read ips_remaining.sort.first

This may just be working because my primary happens to be the lower IP so it's probably not a good fix.

Not sure if it's related, but it looks like the code is relying on Bosh DNS (bbs.service.cf.internal:8889) to report the active diego-api (BBS) node as the .first entry. That isn't always the case: Bosh DNS just gives a list of candidates, not necessarily active nodes, and not necessarily in the "right" order.
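
To illustrate (a minimal standalone Ruby sketch, not the actual client.rb code; the IPs are made up):

require 'resolv'

# Bosh DNS returns the A records for the BBS domain in no guaranteed order,
# and it doesn't know which node currently holds the active lock.
ips = Resolv.getaddresses('bbs.service.cf.internal')
# e.g. ["10.0.1.12", "10.0.0.9"] on one lookup, reversed on the next

ips.first       # whichever record happened to come back first
ips.sort.first  # always the lexicographically lowest IP - only the active node if you're lucky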

@will-gant
Contributor

will-gant commented Dec 16, 2022

Hi there - author of #3048 here. We're investigating this too, and also suspect some kind of link to this PR, though we haven't pinned down the exact issue yet.

The method the Cloud Controller's Diego::Client uses to reach the BBS already resulted in it regularly hitting the inactive node. The previous behaviour was that, when trying to reach the active server, it would hit the BBS servers via their domain up to three times, re-resolving DNS on each attempt. The intended new behaviour differs in that it resolves DNS once ahead of the attempts and tries each returned IP in turn (the idea being to avoid a situation where we hit the wrong one three times in a row and then error).
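
In rough pseudocode (a simplified sketch, not the real Diego::Client code; try_request stands in for the actual request and error handling):

require 'resolv'

bbs_domain = 'bbs.service.cf.internal'

# Previous behaviour (simplified): re-resolve DNS on every attempt,
# so every retry can land on the same inactive node again.
3.times do
  ip = Resolv.getaddresses(bbs_domain).first
  break if try_request(ip)
end

# New behaviour (simplified): resolve once up front, then try each
# returned IP in turn, so an inactive node is only tried once.
ips_remaining = Resolv.getaddresses(bbs_domain)
until ips_remaining.empty?
  break if try_request(ips_remaining.shift)
end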

@thelangley
Author

thelangley commented Dec 16, 2022

That's good to know, thanks @will-gant

I've no idea why changing ips_remaining.first to ips_remaining.sort.first makes it work more cleanly

We don't get any push errors with this logic in place, even when fiddling with diego-api to simulate a BBS outage.

I used bosh to stop a diego-api node to see whether cloud_controller_ng with the new logic still works when it fails over. It worked nicely. I then brought the downed diego-api node back up; cloud_controller_ng didn't notice or go back to using the original diego-api, so I used bosh to stop the newly active one to force failback.

App pushes worked throughout this process, but I could just be lucky; the failure is quite intermittent. Hopefully I'll have more evidence/logs by Monday.
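
For reference, the failover exercise was driven with the BOSH CLI, roughly like this (the deployment name cf and the instance indices are placeholders, not taken from this environment):

bosh -d cf stop diego-api/0   # take one BBS node down; pushes should fail over
bosh -d cf start diego-api/0  # bring it back; cloud_controller_ng stayed on the other node
bosh -d cf stop diego-api/1   # stop the newly active node to force failback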

@thelangley
Author

Feedback: I was just very lucky. Pushes still randomly error with the same sorts of error messages.

Thanks for the regression warning on capi-release

@thelangley
Author

One of my more knowledgeable colleagues has suggested doing the following:

if http_client.ipaddr != bbs_ip
  http_client.finish if http_client.started? # close any open sessions
  http_client.ipaddr = bbs_ip # tell the HTTP client which exact IP to target
end

instead of the simpler

http_client.ipaddr = bbs_ip # tell the HTTP client which exact IP to target
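
For anyone following along: the "ipaddr value changed, but session already started" text in the errors above looks like exactly the message Net::HTTP#ipaddr= raises when it's called while a session is still open, so the point of the extra guard is to finish any open session before retargeting the client at a different IP.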

I'm re-testing in my env

@thelangley
Author

Got a new error:

Runner is unavailable: Process Guid: 34edc69e-9c2b-445a-82a8-f9fbc1784ee7-87fee70d-1ef7-475a-a216-e808b44df9b3: undefined method `peeraddr' for #<Socket:fd 31>


        to_io.peeraddr
             ^^^^^^^^^

@thelangley
Author

Looks like the original change has been reverted and a new cut of capi-release has been released.

@will-gant, you OK for me to close this?

@will-gant
Contributor

Yes, feel free. Many thanks indeed for your efforts, and apologies for the hassle! We expect to revisit this on our side in the new year.
