Refresh tries to connect to old NSE name and fails indefinitely #1357
Comments
@yuraxdrumz Thanks for the report! We'll have a look. |
@denis-tingaikin Any idea why healing isn't simply getting a DOWN event and trying for a different NSE instance? |
My guesses are:
To reduce the area we need to get more information Questions:
|
@yuraxdrumz We would be very grateful for additional information:
|
Hey, sorry for the late reply. The attached issue regarding healing is also one I am experiencing while using multiple NSCs. Regarding the current one, it looks like I ran a I will create a new cluster and try everything again just to be sure. I am using NSM 1.5.0 without the interdomain scenario. If the problem persists, I will add logs here. Thanks |
Hi, For the specified scenario of having only 1 broken component, like NSE, for instance, However, when having multiple broken components, my NSC attempts to request an already deleted Pod and it does not reference the new one. If you need logs, let me know. |
Thanks guys! @or-adar |
So the current status: @yuraxdrumz
In any case, it will become more clear only with the logs :) |
Your error is slightly different, but I think I was able to reproduce it under load. The reason is that under load, refreshes can return an error (for example, the The next refresh will happen later (usually in a couple of minutes). And if during these 2 minutes the NSE dies, the client will not know about it, and will give an error on requests: |
So we have several possible solutions:
Thoughts? |
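To make the window described above concrete, here is a minimal sketch of the timing only. It is not NSM code: the function name and the 120-second interval are assumptions for illustration, taken from the "couple of minutes" estimate in the comment.

```python
def stale_window(refresh_failed_at: float,
                 refresh_interval: float,
                 nse_died_at: float) -> float:
    """Seconds during which the client still targets a dead NSE.

    Hypothetical model: after a failed refresh the client learns nothing
    new until the next scheduled refresh, so an NSE death inside that
    gap goes unnoticed until the gap ends.
    """
    next_refresh = refresh_failed_at + refresh_interval
    # A death outside the gap is assumed to be noticed by a regular refresh.
    if not (refresh_failed_at <= nse_died_at < next_refresh):
        return 0.0
    return next_refresh - nse_died_at


# Assumed numbers: refresh fails at t=0, next attempt 120 s later,
# NSE dies at t=30. The client keeps using the dead NSE for 90 s.
print(stale_window(0.0, 120.0, 30.0))
```

Shortening the interval shrinks this window but does not remove it, which is why the comment lists other possible solutions.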
Around 09:54 I deleted Also, sorry for the confusion but @yuraxdrumz and I work together on the same setup, |
@or-adar It is very interesting. A few questions:
|
|
Interesting... Is the fork public? Is it possible to share? It'd be nice to use for reproducing |
Also values for |
Hey @denis-tingaikin @glazychev-art We use a slightly modified forwarder-vpp v1.5.0 that uses this PR I opened, not v1.3.0 as @or-adar mentioned. Regarding liveness
|
@denis-tingaikin @glazychev-art
NSM_REQUEST_TIMEOUT=60s
NSM_LIVENESSCHECKINTERVAL=10s
NSM_LIVENESSCHECKTIMEOUT=40s |
@yuraxdrumz, @or-adar Thanks! Question: why don't you use the default values? Is something going wrong? |
@yuraxdrumz @or-adar Please check whether everything is fine after 80 seconds. That said, I would recommend using the default values (or, if they are not enough, increasing them slightly, by 1-5 seconds) |
@glazychev-art Thanks for the information! We ran various commands manually during development, and the pinger failed so fast that we could not check anything, so we increased the limits. If I understand correctly, we are hitting a scenario in between the deaths of NSMGR and NSE, and it should resolve itself after Decreasing the values only narrows the window for this error to occur, but it does not resolve it completely, no? |
@yuraxdrumz But in the latest logs we see another error - |
@denis-tingaikin @glazychev-art |
We seem to have found a way to reproduce this issue. Steps to reproduce:
Attached the fix above |
Should be fixed. Feel free to reopen if problem is still reproducible :) |
Expected Behavior
The NSC should be able to connect to an NSE according to the NS definition after the NSE is rescheduled on k8s.
Current Behavior
I have 3 NSCs running. After the k8s node running the NSE was terminated and the NSE was rescheduled to a different node, the NSCs keep trying to connect to the old NSE even though it no longer exists.
Failure Information (for bugs)
The forwarder that tries to connect to the NSE always fails with "network service candidate for nse name XXX was not found". It looks like when the node went down, healing should have worked when the NSE got rescheduled on another node.
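A minimal sketch of how this error can arise, and of the behavior the report expects instead. This is not the NSM implementation: the registry shape and the names `discover_pinned` and `discover_with_fallback` are hypothetical, and the fallback shown is only one possible way to recover, not the fix that was merged.

```python
from typing import Dict, List, Optional


class CandidateNotFound(Exception):
    """Raised when a pinned NSE name is missing from the registry."""


def discover_pinned(registry: Dict[str, str], service: str,
                    pinned_name: Optional[str]) -> List[str]:
    """registry maps NSE name -> network service name (assumed shape)."""
    if pinned_name is not None:
        # The request already carries an NSE name, so only that one is tried.
        if registry.get(pinned_name) == service:
            return [pinned_name]
        raise CandidateNotFound(
            f"network service candidate for nse name {pinned_name} was not found")
    # No pinned name: any NSE offering the service is a candidate.
    return sorted(name for name, svc in registry.items() if svc == service)


def discover_with_fallback(registry: Dict[str, str], service: str,
                           pinned_name: Optional[str]) -> List[str]:
    """Hypothetical recovery: drop the stale name and select by service."""
    try:
        return discover_pinned(registry, service, pinned_name)
    except CandidateNotFound:
        return discover_pinned(registry, service, None)


# The old NSE is gone; a rescheduled one serves the same network service.
registry = {"nse-new": "my-service"}
print(discover_with_fallback(registry, "my-service", "nse-old"))
```

With the pinned lookup alone, every retry for the old name fails the same way, which matches the indefinite failure described here.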
Looking at the discoverCandidatesServer code, I see that if the NSE name already exists, we try to find it and continue, but in this case it does not exist anymore.
Steps to Reproduce
Context
Failure Logs