
Healing with two NSEs could keep previous IPContext values #9888

Closed
denis-tingaikin opened this issue Sep 25, 2023 · 20 comments · Fixed by networkservicemesh/sdk#1531 or networkservicemesh/sdk#1582
Assignees
Labels
ASAP The issue should be completed as soon as possible bug Something isn't working

Comments

@denis-tingaikin
Member

denis-tingaikin commented Sep 25, 2023

Steps to reproduce

  1. Deploy NSM
  2. Deploy NSE-1 with subnet 172.16.1.100 for ns-1
  3. Deploy NSE-2 with subnet 172.16.2.100 for ns-1
  4. Deploy NSC for ns-1 (ns-1 should match only nse-1)
  5. Check connection
  6. Update ns-1 (it should now match only nse-2)

Actual: reconnect is slow, and nse-2 and the NSC keep old IP context values
Expected: reconnect is quick, and nse-2 and the NSC don't keep old IP context values

Additional information

What was done:

After test install:

Network service (connects client to nse-1):

apiVersion: v1
items:
- apiVersion: networkservicemesh.io/v1
  kind: NetworkService
  metadata:
  ...
  spec:
    matches:
    - routes:
      - destination_selector:
          app: nse-kernel-1
      source_selector: null
    payload: ETHERNET
  ...
client:
Defaulted container "app" out of: app, cmd-nsc, cmd-nsc-init (init)
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: nsm-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:ff:7f:5e:ab brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.101/32 scope global nsm-1
       valid_lft forever preferred_lft forever
    inet6 fe80::fe:ffff:fe7f:5eab/64 scope link
       valid_lft forever preferred_lft forever
972: eth0@if973: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ae:7a:b7:36:bc:ae brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.0.14/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ac7a:b7ff:fe36:bcae/64 scope link
       valid_lft forever preferred_lft forever

nse-1:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: kernel2ker-d2c2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:23:55:9d:52 brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.100/32 scope global kernel2ker-d2c2
       valid_lft forever preferred_lft forever
    inet6 fe80::fe:23ff:fe55:9d52/64 scope link
       valid_lft forever preferred_lft forever
970: eth0@if971: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 4e:3e:c4:21:9d:1a brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.0.145/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::4c3e:c4ff:fe21:9d1a/64 scope link
       valid_lft forever preferred_lft forever

nse-2:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
974: eth0@if975: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether da:46:1f:8a:dc:92 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.0.98/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::d846:1fff:fe8a:dc92/64 scope link
       valid_lft forever preferred_lft forever
All is OK

Now edit network service so client connects to nse-2:

apiVersion: v1
items:
- apiVersion: networkservicemesh.io/v1
  kind: NetworkService
  metadata:
  ...
  spec:
    matches:
    - routes:
      - destination_selector:
          app: nse-kernel-2
      source_selector: null
    payload: ETHERNET
  ...
After ~40 sec:

client:
Defaulted container "app" out of: app, cmd-nsc, cmd-nsc-init (init)
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: nsm-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:0d:f4:0e:26 brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.101/32 scope global nsm-1
       valid_lft forever preferred_lft forever
    inet 172.16.2.101/32 scope global nsm-1
       valid_lft forever preferred_lft forever
    inet6 fe80::fe:dff:fef4:e26/64 scope link
       valid_lft forever preferred_lft forever
972: eth0@if973: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ae:7a:b7:36:bc:ae brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.0.14/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ac7a:b7ff:fe36:bcae/64 scope link
       valid_lft forever preferred_lft forever

nse-1:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
970: eth0@if971: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 4e:3e:c4:21:9d:1a brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.0.145/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::4c3e:c4ff:fe21:9d1a/64 scope link
       valid_lft forever preferred_lft forever

nse-2:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: kernel2ker-828b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:e9:e4:26:7c brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.100/32 scope global kernel2ker-828b
       valid_lft forever preferred_lft forever
    inet 172.16.2.100/32 scope global kernel2ker-828b
       valid_lft forever preferred_lft forever
    inet6 fe80::fe:e9ff:fee4:267c/64 scope link
       valid_lft forever preferred_lft forever
974: eth0@if975: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether da:46:1f:8a:dc:92 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.0.98/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::d846:1fff:fe8a:dc92/64 scope link
       valid_lft forever preferred_lft forever

This is not what was expected:

  • The client and nse-2 have two IPs (from both nse-1 and nse-2); only the nse-2 IP was expected
  • It takes too much time to switch

@denis-tingaikin denis-tingaikin self-assigned this Sep 25, 2023
@denis-tingaikin denis-tingaikin changed the title Healing of two NSEs could keep previous IPContext values Healing with two NSEs could keep previous IPContext values Sep 25, 2023
@dualBreath dualBreath moved this to In Progress in Release v1.11.0 Sep 26, 2023
@denis-tingaikin denis-tingaikin moved this from In Progress to Moved to next release in Release v1.11.0 Oct 8, 2023
@denis-tingaikin denis-tingaikin moved this to In Progress in Release v1.12.0 Oct 9, 2023
@denis-tingaikin denis-tingaikin moved this from In Progress to Under review in Release v1.12.0 Oct 16, 2023
@github-project-automation github-project-automation bot moved this from Moved to next release to Done in Release v1.11.0 Nov 22, 2023
@github-project-automation github-project-automation bot moved this from Under review to Done in Release v1.12.0 Nov 22, 2023
@denis-tingaikin
Member Author

Hello @barby1138

The problem with re-connecting has been fixed. It'd be perfect if you could test it with NSM v1.11.2-rc.1. 
Let us know about any issues.

@denis-tingaikin
Member Author

It's done

@LionelJouin
Member

Hi @denis-tingaikin, as part of v1.12.0-rc, I experienced some issues with the solution.
I think the NSC should be allowed to set the Context by itself, and then the NSE should verify everything is correct and modify it if needed.
So in the example, nse-2 should modify the IPs of the connection when ns-1 gets moved to nse-2.
If it is done that way, the NSC should not have to set the context to nil on reselect/healing.

@LionelJouin LionelJouin reopened this Jan 19, 2024
@denis-tingaikin
Member Author

@LionelJouin Maybe so. Let's consider this a bit deeper.

Could you say more about the issue that you caught?

@LionelJouin
Member

I have a custom NSC requesting VIP addresses; the internal addresses (for communication between NSC and NSE) are handled by the NSE. The NSE detects the VIPs and keeps them in the context, and resets the internal IPs if they are incorrect.
With the fix, the VIPs are removed, so I get only the internal IPs injected into my pod.

@denis-tingaikin
Member Author

Hmm.. I suspect that the VIP address was set by the NSC on the first request, right?

@denis-tingaikin
Member Author

If so, I think we should always do reselect with the initial request parameters.

What do you think about this solution?

@LionelJouin
Member

No, the VIPs can be set during the first request, but they can also be updated with an additional request (mostly that case).

@denis-tingaikin
Member Author

Got it. Am I correct that the VIP address could also be updated by the NSC at runtime?

@LionelJouin
Member

Yes, this is correct.

@denis-tingaikin
Member Author

denis-tingaikin commented Jan 19, 2024

OK, good. I feel that fixing it on the NSE side might not be effective, because NSEs are not under our control and we don't know whether an NSE will keep the VIP address or not.

So the solution, from my point of view, should be in the NSC. For example, we can do reselection like this:
Reselect request = (latest request from the NSC) - (diff received from the NSE in the last response)
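A stdlib-only sketch of that subtraction, restricted to plain lists of address strings (the helper name diffStrings is hypothetical and not part of the sdk):

```go
package main

import "fmt"

// diffStrings models "(latest request from the NSC) - (diff received from the
// NSE)": it keeps only the entries of latest that the NSE did not introduce.
// Illustrative helper only; the real reselect would diff the whole IP context.
func diffStrings(latest, nseDiff []string) []string {
	fromNSE := make(map[string]bool, len(nseDiff))
	for _, s := range nseDiff {
		fromNSE[s] = true
	}
	var kept []string
	for _, s := range latest {
		if !fromNSE[s] {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// The NSC asked for a VIP itself; the previous NSE added 172.16.1.101/32.
	latest := []string{"20.0.0.1/32", "172.16.1.101/32"}
	nseAdded := []string{"172.16.1.101/32"}
	fmt.Println(diffStrings(latest, nseAdded)) // the client-set VIP survives reselect
}
```

Applied on reselect, the NSC would resend only what it asked for itself, dropping the addresses the previous NSE handed out.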

@denis-tingaikin denis-tingaikin added bug Something isn't working ASAP The issue should be completed as soon as possible labels Jan 19, 2024
@LionelJouin
Member

In my case, the NSE is a custom one too. But in the case of your example, nse-2 should remove the IPs since they are not assigned to the NSC in its IPAM and they don't belong to the same subnet.
The NSE should control what the NSC sets in the context and either modify the context or refuse the connection.

@denis-tingaikin
Member Author

Currently, I see these options for how to resolve the problem:

  1. Quick-fix solution: add an option for the NSC to not delete the previous IP context on reselection, e.g., begin.WithReselectionStrategy(...). Then we would be able to keep the previous IP context for some NSCs, as in this case.
  2. Add the diff calculation as described at Healing with two NSEs could keep previous IPContext values #9888 (comment).
  3. Add some criteria for how to zero IP parameters on the NSC/NSE side, as suggested at Healing with two NSEs could keep previous IPContext values #9888 (comment). For this option, @LionelJouin, do you see any way to check whether an incoming IP address is a VIP or whether it was allocated by the previous NSE?

I think we can start with option 1 and implement option 2/option 3 in release v1.13.0.

Thoughts?
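Option 1 could look like a standard Go functional option. A minimal stdlib-only sketch; the names reselectionStrategy and WithReselectionStrategy mirror the proposal above and are hypothetical, not existing sdk API:

```go
package main

import "fmt"

// reselectionStrategy models the NSC-side choice proposed in option 1.
type reselectionStrategy int

const (
	clearIPContext reselectionStrategy = iota // today's behavior: zero the IP context
	keepIPContext                             // proposed: keep the previous IP context
)

type beginOptions struct {
	strategy reselectionStrategy
}

type option func(*beginOptions)

// WithReselectionStrategy lets an NSC opt out of clearing the IP context.
func WithReselectionStrategy(s reselectionStrategy) option {
	return func(o *beginOptions) { o.strategy = s }
}

func newOptions(opts ...option) *beginOptions {
	o := &beginOptions{strategy: clearIPContext} // default keeps the current fix
	for _, apply := range opts {
		apply(o)
	}
	return o
}

func main() {
	o := newOptions(WithReselectionStrategy(keepIPContext))
	fmt.Println(o.strategy == keepIPContext) // prints "true"
}
```

The default preserves the current behavior, so only NSCs that opt in (like the VIP use case) would keep the old context.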

@LionelJouin
Member

For the 3rd solution, I see it more like a policy for the IPAM. In the current NSM NSE and IPAM, the default behavior should just be to remove all IPs that do not belong to the correct subnet or that have not been allocated for the particular NSC.
Otherwise, in my opinion, it should be up to the user to decide and implement their own IPAM and policy on how the properties in the context should be accepted/modified.

@denis-tingaikin denis-tingaikin moved this from Done to Blocked in Release v1.12.0 Jan 23, 2024
@denis-tingaikin
Member Author

In the current NSM NSE and IPAM, the default behavior should just be to remove all IPs that do not belong to the correct subnet or that have not been allocated for the particular NSC.

Question: how can we detect that an address from the IPContext is a VIP and should not be replaced?

@denis-tingaikin
Member Author

denis-tingaikin commented Jan 23, 2024

Also, a question: is it possible for you to add a chain element that appends the VIP address to the request?

For example, we can add a new chain element that simply appends the VIP address to the request:

package addvipaddress

import (
	"context"

	"github.com/golang/protobuf/ptypes/empty"
	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"github.com/networkservicemesh/sdk/pkg/networkservice/core/next"
	"google.golang.org/grpc"
)

type client struct{}

// NewClient returns a chain element that appends the VIP address to the request.
func NewClient() networkservice.NetworkServiceClient {
	return &client{}
}

func (c *client) Request(ctx context.Context, request *networkservice.NetworkServiceRequest, opts ...grpc.CallOption) (*networkservice.Connection, error) {
	ipContext := request.GetConnection().GetContext().GetIpContext()
	ipContext.SrcIpAddrs = append(ipContext.SrcIpAddrs /* , VIP address */) // TODO: check contains
	return next.Client(ctx).Request(ctx, request, opts...)
}

func (c *client) Close(ctx context.Context, conn *networkservice.Connection, opts ...grpc.CallOption) (*empty.Empty, error) {
	return next.Client(ctx).Close(ctx, conn, opts...)
}

nsmClient := client.NewClient(ctx, client.WithAdditionalFunctionality(addvipaddress.NewClient()))

UPD: Also, if we go with this solution, we'll be able to update the VIP address immediately using the begin.FromContext(ctx).Request() function. Something like this:

go func() {
	<-vipAddressChangeCh
	begin.FromContext(ctx).Request()
}()

@LionelJouin
Member

In my case, the NSE takes care only of the IPs that belong to the same subnet and checks whether they have been allocated for that particular requesting NSC.
If not allocated, the IPs are removed, and new ones are allocated and added.
Other IPs (ones that don't belong to the NSE subnet) are considered VIPs.

VIPs are one example, but this would also apply to MAC addresses and routes. Isn't the NSE supposed to control these properties and correct them if they contain incorrect information?

Yes, a chain element could be a workaround.

@edwarnicke
Member

edwarnicke commented Jan 23, 2024

@denis-tingaikin Is it possible that, when we fixed this initial problem by setting the ConnectionContext to nil on reselect in the NSC, we should instead have had the ipam module in the NSE check whether the provided IP address was either:

a) understood to be associated with the connection id, or
b) available

before permitting it, and reassign it if not?

@LionelJouin what are your thoughts?

@barby1138 what are your thoughts?
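The (a)/(b) check proposed above could be sketched with a toy in-memory IPAM (type and method names are illustrative, not sdk API):

```go
package main

import "fmt"

// toyIPAM models the proposed check: permit a client-supplied address only if
// it is (a) already associated with this connection id or (b) still available;
// otherwise reassign a fresh address from the pool.
type toyIPAM struct {
	owner map[string]string // address -> connection id that holds it
	pool  []string          // free addresses
}

func (m *toyIPAM) assign(connID, requested string) string {
	if requested != "" {
		owner, taken := m.owner[requested]
		if taken && owner == connID {
			return requested // (a) associated with this connection id
		}
		if !taken {
			m.owner[requested] = connID
			return requested // (b) available: permit it
		}
	}
	addr := m.pool[0] // neither: reassign a fresh address
	m.pool = m.pool[1:]
	m.owner[addr] = connID
	return addr
}

func main() {
	m := &toyIPAM{owner: map[string]string{}, pool: []string{"172.16.2.102/32"}}
	fmt.Println(m.assign("conn-1", "172.16.2.101/32")) // available: permitted
	fmt.Println(m.assign("conn-1", "172.16.2.101/32")) // associated: permitted on heal
	fmt.Println(m.assign("conn-2", "172.16.2.101/32")) // taken by conn-1: reassigned
}
```

With this check in the NSE's ipam chain element, a healing NSC could resend its old address safely: it is either re-confirmed or transparently replaced.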

@denis-tingaikin
Member Author

denis-tingaikin commented Jan 23, 2024

@edwarnicke Sure, it's possible, and if you feel that it's a better way, we can do that. 

The problem that I am seeing here is that NSM users will need to manage the IPAM module, which adds complexity.

@NikitaSkrynnik NikitaSkrynnik moved this to In Progress in Release v1.13.0 Jan 25, 2024
@denis-tingaikin denis-tingaikin moved this from Blocked to Moved to next release in Release v1.12.0 Jan 30, 2024
@denis-tingaikin denis-tingaikin moved this from In Progress to Under review in Release v1.13.0 Feb 2, 2024
@github-project-automation github-project-automation bot moved this from Moved to next release to Done in Release v1.12.0 Feb 5, 2024
@github-project-automation github-project-automation bot moved this from Under review to Done in Release v1.13.0 Feb 5, 2024
@denis-tingaikin
Member Author

Seems to be done.
