Added Sysdig agent support and testing automations #315

Merged
merged 25 commits into from
Nov 26, 2024

Conversation

manuelbcd
Contributor

Sysdig Agent for EKS anywhere / EKS hybrid / EKS baremetal
Automation tasks for QA

Contributor

@elamaran11 elamaran11 left a comment

@manuelbcd It looks like the namespace is incorrect, and our testing is failing. Please test it fully and push updates to the PR. Also, share your license keys via email or Slack.

eks-anywhere-common/Addons/Partner/Sysdig/sysdig.yaml (outdated, resolved)
@manuelbcd
Contributor Author

Namespace fixed. Thanks for the heads up @elamaran11

Contributor

@elamaran11 elamaran11 left a comment

@manuelbcd I have more comments. Please also test it locally; one of the pods is failing:

ubuntu@ip-10-226-86-188:~$ kubectl logs -n sysdig-agent sysdig-agent-sysdig-node-analyzer-wsg4c
Defaulted container "sysdig-runtime-scanner" out of: sysdig-runtime-scanner, sysdig-host-scanner
{"level":"info","version":"v1.8.1","time":"2024-11-15T14:05:15Z","message":"Starting Runtime Scanner"}
{"level":"info","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","nodeInfo":{"RuntimeName":"containerd","RuntimeVersion":"1.7.12","Architecture":"amd64","KernelVersion":"5.15.0-122-generic","KubeletVersion":"v1.30.0-eks-036c24b","KubeProxyVersion":"v1.30.0-eks-036c24b","OSImage":"Ubuntu 22.04.4 LTS","OS":"linux","ServerGitVersion":"v1.30.5-eks-ce1d5eb","ServerGoVersion":"go1.22.6"},"platform":"linux/amd64","time":"2024-11-15T14:05:16Z","message":"node info detected"}
{"level":"info","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","ContainerRuntime":"containerd","time":"2024-11-15T14:05:16Z","message":"container runtime client built successfully"}
{"level":"info","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","containerRuntimeName":"containerd","time":"2024-11-15T14:05:16Z","message":"using default vulnerability db version: V2"}
{"level":"info","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","containerRuntimeName":"containerd","time":"2024-11-15T14:05:16Z","message":"starting probes server on :7002"}
{"level":"info","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","containerRuntimeName":"containerd","time":"2024-11-15T14:05:17Z","message":"Starting metrics server on :25001 exposing /metrics ..."}
{"level":"error","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","containerRuntimeName":"containerd","error":"failed to perform keepalive request: agents conf API returned invalid http status: 401","time":"2024-11-15T14:05:17Z","message":"failed to send keepalive"}
{"level":"info","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","containerRuntimeName":"containerd","seconds":106,"time":"2024-11-15T14:05:17Z","message":"startup sleep"}
{"level":"info","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","containerRuntimeName":"containerd","seconds":106,"time":"2024-11-15T14:07:03Z","message":"sleep finished"}
{"level":"info","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","containerRuntimeName":"containerd","time":"2024-11-15T14:07:03Z","message":"getting vulnerabilities DB"}
{"level":"info","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","containerRuntimeName":"containerd","time":"2024-11-15T14:07:03Z","message":"retrieving presigned url for DB"}
{"level":"error","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","containerRuntimeName":"containerd","error":"error refreshing db: failed refreshing vuln DB: failed to retrieve download url for the main db: vulns API returned invalid http status: 401","time":"2024-11-15T14:07:03Z","message":"error during startup"}
{"level":"info","version":"v1.8.1","scannerId":"myclusterName:f0b94697-9379-41ef-a215-4db9fabc57d0:node02","containerRuntimeName":"containerd","scheduler":"keepalive","time":"2024-11-15T14:07:03Z","message":"root context done. CtxErr: context canceled"}

Your test job is failing too; see the logs below. In addition, your job should be a CronJob that runs on a schedule:

ubuntu@ip-10-226-86-188:~$ kubectl logs sysdig-agent-test-6t9vt -n sysdig-agent

 # Validation process started #
timed out waiting for the condition on pods/sysdig-agent-sysdig-bmvv2
timed out waiting for the condition on pods/sysdig-agent-sysdig-d2t4v
timed out waiting for the condition on pods/sysdig-agent-sysdig-ng7dx
timed out waiting for the condition on pods/sysdig-agent-sysdig-zzbqr
error: expected 'logs [-f] [-p] (POD | TYPE/NAME) [-c CONTAINER]'.
POD or TYPE/NAME is a required argument for the logs command
See 'kubectl logs -h' for help and examples

 # Error: Sysdig Agent couldn't connect with the server. Please check egress, region and token #aws-cloudsoft-sleek-collaboration 

@elamaran11
Contributor

@manuelbcd The readiness probe is failing for all sysdig-agent pods, and the node-analyzer pod is crashing as well. Please see below. I recommend testing locally with the local instructions in our README and then getting back to us.

ubuntu@ip-10-226-86-188:~$ kubectl get all -n sysdig-agent
NAME                                          READY   STATUS             RESTARTS        AGE
pod/sysdig-agent-sysdig-5np5r                 0/1     Init:0/1           0               2m44s
pod/sysdig-agent-sysdig-jmzcm                 0/1     Running            0               2m48s
pod/sysdig-agent-sysdig-lw2r9                 0/1     Running            0               2m50s
pod/sysdig-agent-sysdig-node-analyzer-76drd   2/2     Running            4 (3m50s ago)   43m
pod/sysdig-agent-sysdig-node-analyzer-t9bjc   1/2     Running            6 (40s ago)     43m
pod/sysdig-agent-sysdig-node-analyzer-wsg4c   1/2     CrashLoopBackOff   7 (59s ago)     43m
pod/sysdig-agent-sysdig-node-analyzer-wxbm2   2/2     Running            5 (11m ago)     43m
pod/sysdig-agent-sysdig-vmqm4                 0/1     Pending            0               2m53s
pod/sysdig-agent-test-k9b88                   0/1     Error              0               8m32s
pod/sysdig-agent-test-lvjl4                   1/1     Running            0               18s

NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/sysdig-agent-sysdig   ClusterIP   172.20.14.156   <none>        7765/TCP   43m

NAME                                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/sysdig-agent-sysdig                 4         4         0       4            0           <none>          43m
daemonset.apps/sysdig-agent-sysdig-node-analyzer   4         4         2       4            2           <none>          43m

NAME                          STATUS    COMPLETIONS   DURATION   AGE
job.batch/sysdig-agent-test   Running   0/1           8m32s      8m32s
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  2m44s             default-scheduler  Successfully assigned sysdig-agent/sysdig-agent-sysdig-lw2r9 to node03
  Normal   Pulled     2m43s             kubelet            Container image "quay.io/sysdig/agent-kmodule:13.5.0" already present on machine
  Normal   Created    2m43s             kubelet            Created container sysdig-agent-kmodule
  Normal   Started    2m43s             kubelet            Started container sysdig-agent-kmodule
  Normal   Pulled     2m28s             kubelet            Container image "quay.io/sysdig/agent-slim:13.5.0" already present on machine
  Normal   Created    2m28s             kubelet            Created container sysdig
  Normal   Started    2m28s             kubelet            Started container sysdig
  Warning  Unhealthy  3s (x7 over 53s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

@manuelbcd
Contributor Author

Hi. Test results were fine after some tuning. Thanks for your patience.
FYI @elamaran11

 # Validation process started #
pod/sysdig-sysdig-agent-cvhk2 condition met
pod/sysdig-sysdig-agent-jw9fg condition met
Defaulted container "sysdig" out of: sysdig, sysdig-agent-kmodule (init)
2024-11-21 13:20:29.556, 463858.463886, Information, k8s_parser:255: cointerface[463864]: Communication with server successful: v1.30.6-eks-7f9249a
2024-11-21 13:20:42.686, 463987.464015, Information, k8s_parser:255: cointerface[463993]: Communication with server successful: v1.30.6-eks-7f9249a

 # Sysdig Agent connection with server was success #
Defaulted container "sysdig" out of: sysdig, sysdig-agent-kmodule (init)
2024-11-21 13:23:59.031, 463987.464073, Information, endpoint:cm_collector_endpoint:429: Sent msgtype=31 (SECURE_NETSEC_SUMMARY) len=1985 to collector at host=ingest-us2.app.sysdig.com port=6443

 # Sysdig Agent successfully captured the event #

@elamaran11
Contributor

@manuelbcd The test ran well this time, but you have not addressed my feedback about changing the Job to a CronJob that runs once a day.

ubuntu@ip-10-226-86-188:~/eks-anywhere-conformance-testing$ kubectl get all -n sysdig
NAME                                    READY   STATUS      RESTARTS   AGE
pod/sysdig-agent-test-vbvq7             0/1     Completed   0          7m11s
pod/sysdig-sysdig-agent-dhq22           1/1     Running     0          28m
pod/sysdig-sysdig-agent-fdwqd           1/1     Running     0          28m
pod/sysdig-sysdig-agent-rzw9v           1/1     Running     0          28m
pod/sysdig-sysdig-agent-sxwrh           1/1     Running     0          28m
pod/sysdig-sysdig-node-analyzer-bdqw4   2/2     Running     0          28m
pod/sysdig-sysdig-node-analyzer-gwjw6   2/2     Running     0          28m
pod/sysdig-sysdig-node-analyzer-mllll   2/2     Running     0          28m
pod/sysdig-sysdig-node-analyzer-nnqns   2/2     Running     0          28m

NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/sysdig-sysdig-agent   ClusterIP   172.20.161.155   <none>        7765/TCP   28m

NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/sysdig-sysdig-agent           4         4         4       4            4           <none>          28m
daemonset.apps/sysdig-sysdig-node-analyzer   4         4         4       4            4           <none>          28m

NAME                          STATUS     COMPLETIONS   DURATION   AGE
job.batch/sysdig-agent-test   Complete   1/1           6m45s      7m11s

@elamaran11
Contributor

Once you make that change, I will merge.

@elamaran11
Contributor

@manuelbcd The CronJob update is incorrect. Please fix it, run it once, and let me know.

flux-system   testers                 85m    False   CronJob/sysdig/sysdig-agent-test dry-run failed (Invalid): CronJob.batch "sysdig-agent-test" is invalid: [spec.jobTemplate.spec.template.spec.containers: Required value, spec.jobTemplate.spec.template.spec.restartPolicy: Required value: valid values: "OnFailure", "Never"]...
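
For reference, the dry-run error above names two fields that are required under spec.jobTemplate.spec.template.spec: a non-empty containers list and a restartPolicy of "OnFailure" or "Never". A minimal sketch of a CronJob that would satisfy both might look like the following. The name and namespace come from the error message and the schedule is the daily one requested in this thread; the service account, image, label selector, and command are hypothetical placeholders, not the contents of this PR.

```yaml
# Illustrative sketch only; not the manifest from this PR.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sysdig-agent-test
  namespace: sysdig
spec:
  schedule: "0 1 * * *"          # once a day at 01:00
  concurrencyPolicy: Forbid      # do not start a new run while one is active
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          serviceAccountName: sysdig-agent-tester   # hypothetical; needs RBAC to read pods
          restartPolicy: Never                      # required: "OnFailure" or "Never"
          containers:                               # required: at least one container
            - name: tester
              image: bitnami/kubectl:latest         # placeholder image that ships kubectl
              command: ["/bin/sh", "-c"]
              args:
                # Hypothetical check: wait for the agent pods to become Ready.
                - kubectl wait pods --namespace sysdig --selector app=sysdig-agent --for=condition=Ready --timeout=300s
```

A manifest like this can be checked before committing with `kubectl apply --dry-run=server -f <file>`, which runs the same API validation that produced the error above.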

@manuelbcd
Contributor Author

Updated. Please check again, @elamaran11

@elamaran11
Contributor

elamaran11 commented Nov 21, 2024

@manuelbcd I appreciate you working tirelessly on this. The job fails now:


 # Validation process started #
pod/sysdig-sysdig-agent-dhq22 condition met
pod/sysdig-sysdig-agent-fdwqd condition met
pod/sysdig-sysdig-agent-rzw9v condition met
pod/sysdig-sysdig-agent-sxwrh condition met
Defaulted container "sysdig" out of: sysdig, sysdig-agent-kmodule (init)

 # Error: Sysdig Agent couldn't connect with the server. Please check egress, region and token 
ubuntu@ip-10-226-86-188:~$ kubectl get all -n sysdig
NAME                                    READY   STATUS    RESTARTS   AGE
pod/sysdig-agent-test-2-59kks           1/1     Running   0          58s
pod/sysdig-agent-test-2-m59rn           0/1     Error     0          2m58s
pod/sysdig-agent-test-2-rxtnd           0/1     Error     0          4m52s
pod/sysdig-sysdig-agent-dhq22           1/1     Running   0          126m
pod/sysdig-sysdig-agent-fdwqd           1/1     Running   0          126m
pod/sysdig-sysdig-agent-rzw9v           1/1     Running   0          126m
pod/sysdig-sysdig-agent-sxwrh           1/1     Running   0          126m
pod/sysdig-sysdig-node-analyzer-bdqw4   2/2     Running   0          126m
pod/sysdig-sysdig-node-analyzer-gwjw6   2/2     Running   0          126m
pod/sysdig-sysdig-node-analyzer-mllll   2/2     Running   0          126m
pod/sysdig-sysdig-node-analyzer-nnqns   2/2     Running   0          126m

NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/sysdig-sysdig-agent   ClusterIP   172.20.161.155   <none>        7765/TCP   126m

NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/sysdig-sysdig-agent           4         4         4       4            4           <none>          126m
daemonset.apps/sysdig-sysdig-node-analyzer   4         4         4       4            4           <none>          126m

NAME                              SCHEDULE    TIMEZONE   SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/sysdig-agent-test   0 1 * * *   <none>     False     0        <none>          5m27s

NAME                            STATUS    COMPLETIONS   DURATION   AGE
job.batch/sysdig-agent-test-2   Running   0/1           4m53s      4m53s

Contributor

@mikemcd3912 mikemcd3912 left a comment

All pods reach a ready state, the testing pod is a 24-hour CronJob with the requested focus on functionality, and the testers complete successfully in our environments. LGTM

@manuelbcd
Contributor Author

This should be good to merge, @elamaran11 FYI

Contributor

@mikemcd3912 mikemcd3912 left a comment

All pods reach a ready state, the testing pod is a 24-hour CronJob with the requested focus on functionality, and the testers complete successfully in our environments after the requested update to work with long-running pods. Deployment comments have been remediated. LGTM

@mikemcd3912 mikemcd3912 requested review from elamaran11 and removed request for elamaran11 November 26, 2024 20:12
@mikemcd3912 mikemcd3912 dismissed elamaran11’s stale review November 26, 2024 20:14

Requested edits have been made, reviewed, and tested.

@mikemcd3912 mikemcd3912 merged commit db0091a into aws-samples:main Nov 26, 2024
1 check passed