Does danm work with calico? #106

Closed
hymgg opened this issue Jul 11, 2019 · 20 comments
Labels
support How? And why?

Comments

@hymgg

hymgg commented Jul 11, 2019

Hello,

First attempt to try out DANM. Followed readme to build, started netwatcher. Modified / simplified example from project, but test pod failed to start.

apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: cali-mgmt
  namespace: example-sriov
spec:
  NetworkID: calico-mgmt
  NetworkType: calico

$ kubectl get dn -n example-sriov
NAME AGE
cali-mgmt 29m

[root@mtx-bld08 net.d]# cat calico-mgmt.conf
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "datastore_type": "kubernetes",
      "nodename": "mtx-huawei2-bld08",
      "mtu": 1440,
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    }
  ]
}

apiVersion: v1
kind: Pod
metadata:
  name: sriov-pod
  namespace: example-sriov
  labels:
    env: test
  annotations:
    danm.k8s.io/interfaces: |
      [
        {"network":"calico-mgmt", "ip":"dynamic"}
      ]
spec:
  containers:
  - name: sriov-pod
    image: busybox:latest
    args:
    - sleep
    - "1000"

Events:
Type Reason Age From Message


Normal Scheduled 3s default-scheduler Successfully assigned example-sriov/sriov-pod to mtx-huawei2-bld04
Warning FailedCreatePodSandBox 2s kubelet, mtx-huawei2-bld04 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "6831107ccc88762a8d717bbcaf0ee5cd62c83576076adcfd62cb156d4ac31732" network for pod "sriov-pod": NetworkPlugin cni failed to set up pod "sriov-pod_example-sriov" network: CNI network could not be set up: CNI operation for network: failed with:failed to get network object for Pod:sriov-pod's connection no.:0 due to:requested network:calico-mgmt of type:DanmNet in namespace:example-sriov does not exist

Environment:

  • danm version

Where to find this?

  • Kubernetes version (use kubectl version):

[mtx@mtx-bld08 danm]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}

  • danm configuration:

[root@mtx-bld08 net.d]# cat 00-danm.conf
{
  "name": "meta_cni",
  "name_comment": "Mandatory parameter, but can be anything",
  "type": "danm",
  "type_comment": "Mandatory parameter according to CNI spec, MUST be set to danm",
  "kubeconfig": "/etc/cni/net.d/danm-kubeconfig",
  "kubeconfig_comment": "Mandatory parameter, must point to a valid kubeconfig file containing the necessary RBAC setting for DANM's user",
  "cniDir": "/etc/cni/net.d",
  "cniDir_comment": "Optional parameter, if defined CNI config files for static delegates are searched here. Default value is /etc/cni/net.d",
  "namingScheme": "awesome",
  "namingScheme_comment": "Optional parameter, if it is set to legacy container network interface names are set exactly to DanmNet.Spec.Options.container_prefix, otherwise prefix simply behaves as a prefix and is suffixed with a sequence ID. Default value is empty (e.g. not legacy)"
}


kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: caas:danm
rules:
- apiGroups:
  - danm.k8s.io
  resources:
  - danmnets
  - danmeps
  - tenantnetworks
  - clusternetworks
  verbs: [ "*" ]
- apiGroups: [ "" ]
  resources: [ "pods" ]
  verbs: [ "get","watch","list"]

kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: caas:danm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: caas:danm
subjects:
- kind: ServiceAccount
  namespace: kube-system
  name: danm

  • OS (e.g. from /etc/os-release):

NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.6:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.6"

  • Kernel (e.g. uname -a):

Linux mtx-huawei2-bld08 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Thanks. -Jessica

@hymgg
Author

hymgg commented Jul 11, 2019

Updated dn so metadata.name matches spec.NetworkID -- didn't know they have to match

apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: calico-mgmt
  namespace: example-sriov
spec:
  NetworkID: calico-mgmt
  NetworkType: calico

Pod still failed to start but with new error:

Warning FailedCreatePodSandBox 7s kubelet, mtx-huawei2-bld02 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "834f6e9a1d195a6d410a3e39d1ddb8333d71874801414ce84ba2c04b492086bf" network for pod "sriov-pod": NetworkPlugin cni failed to set up pod "sriov-pod_example-sriov" network: CNI network could not be set up: CNI operation for network:calico-mgmt failed with:CNI delegation failed due to error:Error delegating ADD to CNI plugin:calico because:OS exec call failed:no etcd endpoints specified

@Levovar
Collaborator

Levovar commented Jul 12, 2019

glad you really decided to try DANM :)
yes, generally speaking Calico should work, I think we had multiple users successfully using it in the past
in your case it's simply a typo:
name: cali-mgmt

danm.k8s.io/interfaces: |
[
{"network":"cali_co_-mgmt", "ip":"dynamic"}
]

@Levovar
Collaborator

Levovar commented Jul 12, 2019

ah sorry, only saw the update now. I'm still on my morning coffee :)
NetworkID and name: they don't need to match, but you need to provide the name of the network in the connection definition section of your Pod manifest. However you can name your networks anything!
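
To make the name lookup concrete, here is a minimal sketch of the check you can do by hand before creating the Pod. It is not DANM code: the function name `unresolved_networks` is made up for illustration, and it assumes the only rule is that each "network" value in the annotation equals the metadata.name of a DanmNet in the Pod's namespace.

```python
import json


def unresolved_networks(interfaces_annotation: str, danmnet_names: set) -> list:
    """Return the annotation's network names that match no DanmNet metadata.name.

    Hypothetical helper: it only compares names, it does not talk to the cluster.
    """
    requested = [entry["network"] for entry in json.loads(interfaces_annotation)]
    return [name for name in requested if name not in danmnet_names]


annotation = '[{"network": "calico-mgmt", "ip": "dynamic"}]'

# DanmNet created as metadata.name "cali-mgmt": the lookup fails.
print(unresolved_networks(annotation, {"cali-mgmt"}))

# DanmNet created as metadata.name "calico-mgmt": nothing is left unresolved.
print(unresolved_networks(annotation, {"calico-mgmt"}))
```

The set of names would come from something like the output of `kubectl get dn -n example-sriov`.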

and for the error: it is thrown by the Calico code, after DANM has delegated the operation. I guess Calico expects some configuration to be present in its backend which is missing. But I confess I'm not that big of a Calico expert, so not sure exactly what's missing.
But for sure the error is not coming from DANM.
summoning @rospring and @clivez , AFAIK they have some Calico experience: guys, any idea what could be the issue here?

@Levovar Levovar added the support How? And why? label Jul 12, 2019
@Levovar
Collaborator

Levovar commented Jul 12, 2019

after some doc reading:
https://docs.projectcalico.org/v3.5/usage/calicoctl/configure/etcd
I guess you are missing the ETCD_ENDPOINTS environment variable, or config file option so the Calico CNI cannot find its own backend

@clivez
Contributor

clivez commented Jul 12, 2019

Quite agree with Levo, it seems the problem comes from the configuration file for Calico. In my opinion, at least "etcd_endpoints", "etcd_key_file", "etcd_cert_file" and "etcd_ca_cert_file" are needed.

@hymgg
Author

hymgg commented Jul 12, 2019

Thank you guys. Without DANM, Calico has been working on the k8s cluster as the overlay network, I just reused / renamed its config file to calico-mgmt.conf for danm, so wasn't sure why / where to add the additional config info when it's used as a delegate?

(btw, I've used calico with multus, reusing the same config file, didn't have this kind of issue...)

@Levovar
Collaborator

Levovar commented Jul 12, 2019

Hmm, interesting. We need to go deeper then :)
Two things come to my mind:

  • can you share with us how the etcd store is configured for Calico in your cluster? Is it through environment variables, or via config file / ConfigMap?
  • can you try it with a CNI config file which purely contains Calico's config? the current one has plugin chaining which we don't really do, as we have a 1:1 mapping of interfaces and CNI delegation operations.
    That might be the root cause

@hymgg
Author

hymgg commented Jul 15, 2019

I followed kubeadm doc to apply calico on k8s,
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/

Back then it used 2 files,
https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/rbac-kdd.yaml
https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml

In calico daemonset it specified k8s for datastore, not etcd.

        # Use Kubernetes API as the backing datastore.
        - name: DATASTORE_TYPE
          value: "kubernetes"

When running multus with calico, I used the same option, "datastore_type": "kubernetes",

cat /etc/cni/net.d/05-multus.conf

{
  "name": "multus-cni-network",
  "type": "multus",
  "delegates": [
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.0",
      "plugins": [
        {
          "type": "calico",
          "log_level": "info",
          "datastore_type": "kubernetes",
          "nodename": "mtx-huawei2-bld08",
          "mtu": 1440,
          "ipam": {
            "type": "host-local",
            "subnet": "usePodCidr"
          },
          "policy": {
            "type": "k8s"
          },
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
          }
        },
        {
          "type": "portmap",
          "snat": true,
          "capabilities": {"portMappings": true}
        }
      ]
    }
  ],
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig"
}

So DANM's way of delegating with calico is more restricted?

Thanks. -Jessica

@Levovar
Collaborator

Levovar commented Jul 15, 2019

well, when handling static delegates you can say it like that. we don't support chaining together plugins, because chaining is usually simply not needed.
so, the question arises: what is the "portmap" CNI even used for? :)
Until now we never had a customer who needed the "standard" plugin chaining CNI feature to get something done, simply because we can configure all the features required by a user through our user friendly management API. So, because we have something better, we don't do the less flexible approach of customizing interface provisioning.
If you tell me what the portmapping CNI is required for, I might give you an alternative which you only need to configure into the dynamic network management API, and not into static files.
Alternatively we can also support chaining if required.

When it comes to dynamic delegates everything is configured through the same dynamic, centralized REST API. Therefore I would say these delegates are actually way less restrictive than sticking to the component specific static CNI files.

So, trying to come up with some takeaways, and next steps:

  • do you really need chaining, or is this just the default provisioning and "portmapping" is not really required?
  • if it is required, maybe we already have a dynamically configurable feature substituting it in a friendlier way
  • if not, maybe we can develop one :)
  • or support chaining, if absolutely required for your use-case!
  • but please try it out first with a CNI config which is not a chained one (i.e. without "plugins", only containing the Calico CNI config), because it is still just a hunch that the chaining is the root cause of your issue
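
A rough sketch of what "a CNI config which is not a chained one" means in practice, in case it helps with the trial. This is only an illustration, not a DANM tool: `flatten_chained_config` is a hypothetical helper, and it assumes the plugin you want to keep is the first entry under "plugins". Everything after it (here "portmap") is dropped, so hostPort support would go away with it.

```python
import json


def flatten_chained_config(chained: dict) -> dict:
    """Return a single-plugin CNI config built from a chained one.

    Keeps the top-level keys (name, cniVersion, ...) and merges in the
    first entry of "plugins". Every later plugin in the chain is dropped.
    """
    plugins = chained.get("plugins")
    if not plugins:
        return dict(chained)  # no chain present, return a copy unchanged
    flat = {key: value for key, value in chained.items() if key != "plugins"}
    flat.update(plugins[0])
    return flat


# Shortened version of a chained Calico + portmap config.
chained = {
    "name": "k8s-pod-network",
    "cniVersion": "0.3.0",
    "plugins": [
        {"type": "calico", "datastore_type": "kubernetes"},
        {"type": "portmap", "snat": True},
    ],
}

flat = flatten_chained_config(chained)
print(json.dumps(flat, indent=2))
```

The result has "type" and the Calico settings at the top level and no "plugins" key, which is the shape a static delegate file needs if the chaining hunch is right.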

@hymgg
Author

hymgg commented Jul 15, 2019

The portmap cni was there by default in calico-config, not sure why, k8s doc just says it's required to support hostPort. our apps don't use that. Gonna remove and see.

Thanks. -Jessica

@hymgg
Author

hymgg commented Jul 15, 2019

Thank you sir, worked w/o cni chaining,

cat calico-mgmt.conf

{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "type": "calico",
  "log_level": "info",
  "datastore_type": "kubernetes",
  "nodename": "mtx-huawei2-bld08",
  "mtu": 1440,
  "ipam": {
    "type": "host-local",
    "subnet": "usePodCidr"
  },
  "policy": {
    "type": "k8s"
  },
  "kubernetes": {
    "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
  }
}

Gonna move on to add sriov network.

Thanks. -Jessica

ps. will soon be away for 2 weeks

@Levovar
Collaborator

Levovar commented Jul 15, 2019

Cool!
We are not running anywhere, no worries :) Feel free to open follow-up issues if you encounter anything out of the ordinary during your SR-IOV trial!

@Levovar Levovar closed this as completed Jul 15, 2019
@hymgg
Author

hymgg commented Jul 15, 2019

Please let me know if I should put this in a new issue.

continue to follow example/device_plugin_demo

$ cat sriov_net.yaml
apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: calico-mgmt
  namespace: example-sriov
spec:
  NetworkID: calico-mgmt
  NetworkType: calico
---
apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: sriov-a
  namespace: example-sriov
spec:
  NetworkID: sriov-a
  NetworkType: sriov
  Options:
    device_pool: "intel.com/sriov_net_A"
    container_prefix: data_net
    rt_tables: 250
    vlan: 300
    cidr: 10.100.20.0/24
    allocation_pool:
      start: 10.100.20.10
      end: 10.100.20.100

$ cat sriov_pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriov-pod
  namespace: example-sriov
  labels:
    env: test
  annotations:
    danm.k8s.io/interfaces: |
      [
        {"network":"calico-mgmt", "ip":"dynamic"},
        {"network":"sriov-a", "ip":"none"}
      ]
spec:
  containers:
  - name: sriov-pod
    image: busybox:latest
    args:
    - sleep
    - "1000"
    resources:
      requests:
        intel.com/sriov_net_A: '1'
      limits:
        intel.com/sriov_net_A: '1'
  nodeSelector:
    sriov: enabled

Events:
Type Reason Age From Message


Normal Scheduled 4s default-scheduler Successfully assigned example-sriov/sriov-pod to mtx-huawei2-bld03
Warning FailedCreatePodSandBox 1s kubelet, mtx-huawei2-bld03 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b91e13fddfdcfeb7a421efbb1b592f24fe2ec5ebdf2862a25ddcff6a78c139af" network for pod "sriov-pod": NetworkPlugin cni failed to set up pod "sriov-pod_example-sriov" network: CNI network could not be set up: CNI operation for network:sriov-a failed with:CNI delegation failed due to error:Error delegating ADD to CNI plugin:sriov because:OS exec call failed:failed to set up IPAM plugin type "fakeipam" from the device "eno31": No IP was passed to fake IPAM
Normal SandboxChanged 1s kubelet, mtx-huawei2-bld03 Pod sandbox changed, it will be killed and re-created.

$ kubectl get node mtx-huawei2-bld03 -o json | jq '.status.allocatable'
{
"cpu": "64",
"ephemeral-storage": "48294789041",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"intel.com/sriov_net_A": "16",
"intel.com/sriov_net_B": "0",
"memory": "196389160Ki",
"pods": "110"
}

Tried dynamic, instead of none, {"network":"sriov-a", "ip":"none"}
The error was "IPv4 address cannot be dynamically allocated for an L2 network!"

Should it be static? how could the example have worked?

Thanks. -Jessica

@Levovar
Collaborator

Levovar commented Jul 16, 2019

not picky when it comes to number of issues, no KPIs for it :) so we can continue it in this thread if you want!

so, two issues.
the first one is a regression we have introduced recently: "none" type IP allocation does not currently work with SR-IOV. See related issue: #107
It is scheduled to be corrected in DANM 4.1

The second is a config issue in the manifest, but it is actually the desired result: CIDR is not defined in the network manifest, meaning that the network represents an L2 network. So, if you want L3 VFs (with IP), add the "cidr" attribute to the manifest to define the subnet from which IPs can be allocated to a Pod.
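
As a small sketch of the L2 vs. L3 rule described here: this is not DANM's validation code, just the rule as stated, written out. The helper name `ip_request_problem` is made up, it only looks at Options.cidr, and it ignores IPv6 and everything else the real validation may check.

```python
from typing import Optional


def ip_request_problem(danmnet_spec: dict, ip_request: str) -> Optional[str]:
    """Return an error string if a dynamic IP is requested from an L2 network.

    Assumption: a network without Options.cidr is treated as L2, so it has
    no subnet to allocate a dynamic IPv4 address from.
    """
    options = danmnet_spec.get("Options") or {}
    if ip_request == "dynamic" and not options.get("cidr"):
        return "IPv4 address cannot be dynamically allocated for an L2 network!"
    return None


l2_net = {"NetworkID": "sriov-a", "NetworkType": "sriov",
          "Options": {"vlan": 300}}
l3_net = {"NetworkID": "sriov-a", "NetworkType": "sriov",
          "Options": {"vlan": 300, "cidr": "10.100.20.0/24"}}

print(ip_request_problem(l2_net, "dynamic"))  # error string, no cidr defined
print(ip_request_problem(l3_net, "dynamic"))  # None, cidr is defined
```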

@Levovar Levovar reopened this Jul 16, 2019
@hymgg
Author

hymgg commented Jul 16, 2019

Is this line not enough in above dn?
cidr: 10.100.20.0/24

Could you find a complete example of sriov with dynamic? Even better if it also has routing across nodes...

Thanks. -Jessica

@Levovar
Collaborator

Levovar commented Jul 16, 2019

ah my bad, did not notice yours already has a CIDR! yes it should be enough.
are you running 3.3, or 4.0? In 3.3 the networks were only validated after their creation, so it can happen that the validation failed. In 4.0 we validate them already at the time of their creation with the "webhook" component.
However, if you run 4.0 "webhook" is a mandatory component. If you run 4.0, but without the webhook, that would explain this behaviour.

if you are running 3.3: can you send me the exact output of "kubectl describe dn sriov-a -n example-sriov", and the output of kubectl logs of any netwatcher Pod?
then I can tell you more

regarding routing: well, with SR-IOV you are basically building a good, old-fashioned L2 domain. so assuming you have configured the VLAN tag in the DanmNet for all of your PFs of all of your computes in your switch, connectivity between nodes is achieved by the simple in-subnet switching
if you want to connect to other IPs belonging to other subnets, you can provision IP routes via the "routes" parameter in the DanmNet, or policy-based IP routes via the "proutes" parameter in the connection annotation

@Levovar
Collaborator

Levovar commented Jul 16, 2019

Meanwhile: if you do use 4.0 I corrected the "none" type issue in #110
If you change the CNI binary on your cluster to the new one you could give it a go

@Levovar
Collaborator

Levovar commented Jul 22, 2019

Let's consider this thread closed from the perspective of the original issue, but if you still have any questions related to SR-IOV feel free to open a new one!

@Levovar Levovar closed this as completed Jul 22, 2019
@tcnieh

tcnieh commented Oct 8, 2019

Quite agree with Levo, seems the problem come from the configuration file for calico, in my opinion, at least "etcd_endpoints" "etcd_key_file" "etcd_cert_file" and "etcd_ca_cert_file" are needed.

@clivez Hello there, I am now using DANM to create Calico networks, but I am facing the same error: "CNI operation for network:calico-1 failed with:CNI delegation failed due to error:Error delegating ADD to CNI plugin:calico because:OS exec call failed:no etcd endpoints specified".
You mentioned above that "etcd_endpoints", "etcd_key_file", "etcd_cert_file" and "etcd_ca_cert_file" are the minimum needed. In which config file should I set these parameters: /etc/cni/net.d/calico-1.conf or /etc/cni/net.d/calico-kubeconfig?

In the meantime I tried setting the etcd_endpoints IP, taken from etcd_pod_kube_system, in both /etc/cni/net.d/calico-1.conf and /etc/cni/net.d/calico-kubeconfig, but it does not seem to work.

Sorry if I should not reply to a closed issue; I can open a new one or ask on Slack.

@Levovar
Collaborator

Levovar commented Oct 8, 2019

I think the problem here was similar to what you have experienced with your Flannel config, i.e. the Calico config in this case was also in "chained" format
have you verified it yet?
