
Test and verify that Elastic-Agent with k8s and system integrations run on Openshift #2065

Closed · ChrsMark opened this issue Oct 29, 2021 · 15 comments
Labels: release-pending, Team:Integrations

@ChrsMark (Member) commented Oct 29, 2021

We need to verify that both standalone and managed Agent can run properly on Openshift with the proposed manifests. The system and kubernetes packages should run without problems.

Please take as an example the work done for Metricbeat/Filebeat: elastic/beats#17516
We can verify this on minishift, but running on an actual Openshift deployment is recommended.

ChrsMark added the release-pending and Team:Integrations labels Oct 29, 2021
@elasticmachine commented:

Pinging @elastic/integrations (Team:Integrations)

ChrsMark changed the title from "Test and verify that Elastic-Agent with k8s and system integrations run on OCP" to "Test and verify that Elastic-Agent with k8s and system integrations run on Openshift" Oct 29, 2021
tetianakravchenko self-assigned this Nov 4, 2021
@tetianakravchenko (Contributor) commented Dec 28, 2021

Testing environments:

Minishift - supports only openshift version 3

minishift version:

minishift version
minishift v1.34.3+4b58f89

minishift started with the virtualbox driver (issue: minishift/minishift#3494):

minishift start --vm-driver virtualbox

(works on virtualbox version 6.1.26)
openshift version:

minishift openshift version
openshift v3.11.0+32a500f-598

openshift client (installed from https://mirror.openshift.com/pub/openshift-v3/clients/):

oc version
oc v3.11.587
kubernetes v1.11.0+d4cacc0
features: Basic-Auth

Standalone elastic-agent

How-to: https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-standalone.html

elastic-agent errors:

6 leaderelection.go:329] error initially creating leader election record: the server could not find the requested resource (post leases.coordination.k8s.io)

Reason: to start the agent we use the following Role:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: elastic-agent
  # should be the namespace where elastic-agent is running
  namespace: kube-system
  labels:
    k8s-app: elastic-agent
rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs: ["get", "create", "update"]
---

(this change was introduced in elastic/beats#24958).

there is no coordination API in the cluster (the command below returns no output):

oc api-resources | grep coordination

I could only find that the Lease API was promoted to v1 in Kubernetes 1.14 - https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.14.md:

The Lease API type in the coordination.k8s.io API group is promoted to v1

Openshift 3.11 actually uses Kubernetes v1.11.

Question here: do we want to support this k8s version? This might be related to elastic/beats#29604.
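If supporting such an old k8s version is needed, one possible workaround (a sketch, not verified in this thread; it assumes the agent version exposes the kubernetes_leaderelection provider toggle described in the standalone docs) is to disable leader election in the agent policy:

providers:
  kubernetes_leaderelection:
    # Assumption: with the provider off, the agent stops calling
    # leases.coordination.k8s.io, which does not exist on k8s v1.11;
    # cluster-scope datasets then need another way to elect a single
    # collecting agent.
    enabled: false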

@tetianakravchenko (Contributor) commented Dec 30, 2021

CRC - openshift version 4

crc version:

crc version
CodeReady Containers version: 1.37.0+3876d27d
OpenShift version: 4.9.10 (bundle installed at /Applications/CodeReady Containers.app/Contents/Resources/crc_hyperkit_4.9.10.crcbundle)

openshift & k8s version:

crc start
eval $(crc oc-env)
% oc version
Client Version: 4.9.10
Server Version: 4.9.10
Kubernetes Version: v1.22.3+ffbb954

Standalone elastic-agent

How-to: https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-standalone.html

oc login -u kubeadmin
oc apply -f elastic-agent-standalone-kubernetes.yaml

kubernetes-node-metrics input:

✅ works:

  • kubernetes.container
  • kubernetes.node
  • kubernetes.pod
  • kubernetes.system
  • kubernetes.volume

❌ doesn't work with the default config:

  • kubernetes.proxy

error getting processed metrics: error making http request: Get "http://localhost:10249/metrics": dial tcp 127.0.0.1:10249: connect: connection refused

Resolved - ✅ elastic/beats#17863, change the hosts setting (the default Kubernetes entry must be commented out):

hosts:
  # Kubernetes
  # - 'localhost:10249'
  # Openshift
  - 'localhost:29101'

  • kubernetes.controllermanager
    Reason: the default condition doesn't match on Openshift.
    Resolved - ✅ change the condition (the default one must be commented out):

# condition: ${kubernetes.labels.component} == 'kube-controller-manager'
condition: ${kubernetes.labels.app} == 'kube-controller-manager'

  • kubernetes.scheduler
    Reason: the default condition doesn't match on Openshift.
    Resolved - ✅ change the condition (the default one must be commented out):

# condition: ${kubernetes.labels.component} == 'kube-scheduler'
condition: ${kubernetes.labels.app} == 'openshift-kube-scheduler'
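To find which label values to match on a given Openshift version, the control-plane pods can be inspected directly (a quick check added here for reference; the namespace names below are from OCP 4):

oc get pods -n openshift-kube-scheduler --show-labels
oc get pods -n openshift-kube-controller-manager --show-labels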

Resource usage for the kubernetes-node-metrics input (kubernetes.proxy is commented out):

oc adm top pod
NAME                                  CPU(cores)   MEMORY(bytes)
elastic-agent-8vdns                   740m         475Mi

NOTE: the CPU usage is much higher than expected.

system-metrics input:

all datastreams work (system.core, system.cpu, system.diskio, system.filesystem, system.fsstat, system.load, system.memory, system.network, system.process, system.process_summary, system.socket_summary)

kubernetes-cluster-metrics input:

✅ works:

  • kubernetes.apiserver
  • kubernetes.event
  • kubernetes.state_container
  • kubernetes.state_cronjob
  • kubernetes.state_daemonset
  • kubernetes.state_deployment
  • kubernetes.state_job
  • kubernetes.state_node
  • kubernetes.state_persistentvolume
  • kubernetes.state_persistentvolumeclaim
  • kubernetes.state_pod
  • kubernetes.state_replicaset
  • kubernetes.state_resourcequota
  • kubernetes.state_service
  • kubernetes.state_statefulset
  • kubernetes.state_storageclass

The following settings were added to all kubernetes.state_* datasets to access https://kube-state-metrics.openshift-monitoring.svc:8443:

bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
ssl.certificate_authorities:
  - /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
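Put together, one kubernetes.state_* stream ends up looking roughly like this (a sketch; the data_stream/metricsets layout is assumed from the standalone manifest, with kubernetes.state_pod as the example):

- data_stream:
    dataset: kubernetes.state_pod
    type: metrics
  metricsets:
    - state_pod
  hosts:
    # kube-state-metrics shipped with Openshift's cluster-monitoring stack
    - 'https://kube-state-metrics.openshift-monitoring.svc:8443'
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  ssl.certificate_authorities:
    - /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt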

✅ kubernetes.container_logs

❌ kubernetes.audit_logs

Resolved - ✅ customized the log path: /var/log/kube-apiserver/audit.log
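A sketch of that override in the audit-logs stream (the surrounding stream layout is assumed from the standalone manifest; only the path is the actual fix):

- data_stream:
    dataset: kubernetes.audit_logs
    type: logs
  paths:
    # Openshift writes the API server audit log here, not at the path
    # the default manifest expects
    - /var/log/kube-apiserver/audit.log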

system-logs - no sense enabling it, as the log paths don't exist inside the elastic-agent pod: /var/log/auth.log, /var/log/secure*, /var/log/messages*, /var/log/syslog*

❗ For some reason, when everything is enabled, metrics are not received:

2022-01-14T14:48:32.631Z	ERROR	log/reporter.go:36	2022-01-14T14:48:32Z - message: Application: metricbeat--7.16.2[6d04e5ac-1076-4992-afe9-01393daed484]: State changed to FAILED: Missed two check-ins - type: 'ERROR' - sub_type: 'FAILED'

metricbeat errors:

2022-01-27T12:06:06.189Z	ERROR	module/wrapper.go:259	Error fetching data for metricset beat.stats: error making http request: Get "http://unix/stats": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2022-01-27T12:06:06.208Z	ERROR	module/wrapper.go:259	Error fetching data for metricset http.json: error making http request: Get "http://unix/stats": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-01-27T12:06:06.461Z	ERROR	module/wrapper.go:259	Error fetching data for metricset beat.state: error making http request: Get "http://unix/state": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Managed elastic-agent

How-to: https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-managed-by-fleet.html#running-on-kubernetes-managed-by-fleet

  1. With all inputs enabled (and the corrected conditions for scheduler and controller-manager in place), from time to time the agent gets OOMKilled:
Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

Errors:

ERROR	status/reporter.go:236	Elastic Agent status changed to: 'error'
2022-01-14T15:04:33.405Z	ERROR	log/reporter.go:36	2022-01-14T15:04:33Z - message: Application: filebeat--7.16.2[dc7c110f-3828-4aec-bbfa-777dffad1527]: State changed to FAILED: Missed two check-ins - type: 'ERROR' - sub_type: 'FAILED'
2022-01-14T15:04:33.456Z	ERROR	log/reporter.go:36	2022-01-14T15:04:33Z - message: Application: metricbeat--7.16.2[dc7c110f-3828-4aec-bbfa-777dffad1527]: State changed to FAILED: Missed two check-ins - type: 'ERROR' - sub_type: 'FAILED'
oc adm top pod
NAME                                  CPU(cores)   MEMORY(bytes)
elastic-agent-764rs                   1499m        448Mi

and event processing seems to just get stuck; the dashboards are empty.
  2. With container logs disabled:

 oc adm top pod
NAME                                  CPU(cores)   MEMORY(bytes)
elastic-agent-764rs                   533m         439Mi

metrics are not consistent:
[screenshot: inconsistent metrics]
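A possible mitigation sketch for the OOMKilled case above (not part of the original thread): raise the agent container's resources in the DaemonSet spec; the numbers below are purely illustrative and depend on cluster size and enabled inputs:

containers:
  - name: elastic-agent
    resources:
      limits:
        memory: 1Gi   # illustrative; pick based on observed usage
      requests:
        cpu: 200m
        memory: 512Mi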

@ChrsMark (Member, Author) commented:

Regarding kube-proxy, there is some feedback at elastic/beats#17863 which shows that the kube-proxy metricsets/data streams can be made to work with the proper configuration on the k8s side. If we verify this, we can document the required k8s-side steps and make kube-proxy available on Openshift for both Metricbeat and Agent.

@tetianakravchenko (Contributor) commented Jan 25, 2022

I've already checked kube-proxy with the provided config and it works; I will add it to the list.

@ChrsMark (Member, Author) commented:

Thank you @tetianakravchenko! Do you also plan to update the documentation accordingly? We have this section -> running-on-kubernetes.html#_red_hat_openshift_configuration about Openshift specifics, so this would be a good fit there I think.

@tetianakravchenko (Contributor) commented Jan 26, 2022

Openshift on GCP:
Errors:

  1. The full certificate name must be used:

error making http request: Get "https://name-openshift-v48hb-worker-b-km2m8:10250/stats/summary": x509: certificate is valid for name-openshift-v48hb-worker-b-km2m8.c.project_name.internal, not name-openshift-v48hb-worker-b-km2m8

-> used 'https://${env.NODE_NAME}.c.project_name.internal:10250', but I think this is specific to GCP.
  2. When "https://kube-state-metrics.openshift-monitoring.svc:8443" is used as the host for the kubernetes.state_* datasets, we might miss some prometheus metrics we rely on due to the metric deny list - https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/kube-state-metrics/deployment.yaml#L39-L51
Options here: either patch the kube-state-metrics deployment of cluster-monitoring-operator, or install kube-state-metrics alongside elastic-agent (note: the resource names shouldn't overlap/override those of the cluster-monitoring-operator kube-state-metrics).

  3. kubernetes.state_* metrics are collected with gaps:

[screenshot: gaps in kubernetes.state_* metrics]

the only errors I see in metricbeat logs:

2022-01-26T17:55:24.433Z	ERROR	module/wrapper.go:259	Error fetching data for metricset beat.stats: error making http request: Get "http://unix/stats": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2022-01-26T17:55:24.433Z	ERROR	module/wrapper.go:259	Error fetching data for metricset beat.state: error making http request: Get "http://unix/state": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2022-01-26T17:54:06.328Z	ERROR	module/wrapper.go:259	Error fetching data for metricset http.json: error making http request: Get "http://unix/stats": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2022-01-26T17:55:24.567Z	ERROR	module/wrapper.go:259	Error fetching data for metricset beat.stats: error making http request: Get "http://unix/stats": dial unix /usr/share/elastic-agent/state/data/tmp/default/metricbeat/metricbeat.sock: connect: connection refused
2022-01-26T17:55:24.574Z	ERROR	module/wrapper.go:259	Error fetching data for metricset http.json: error making http request: Get "http://unix/stats": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-01-26T17:55:24.575Z	ERROR	module/wrapper.go:259	Error fetching data for metricset beat.state: error making http request: Get "http://unix/state": dial unix /usr/share/elastic-agent/state/data/tmp/default/metricbeat/metricbeat.sock: connect: connection refused
2022-01-26T17:55:24.575Z	ERROR	module/wrapper.go:259	Error fetching data for metricset http.json: error making http request: Get "http://unix/stats": dial unix /usr/share/elastic-agent/state/data/tmp/default/metricbeat/metricbeat.sock: connect: connection refused

created an issue for that - elastic/beats#30033

After raising the memory limit I don't see this error anymore.
Resource usage:

oc adm top pods | grep elastic
elastic-agent-b5hxm                                  38m          540Mi
elastic-agent-bjnrz                                  85m          575Mi
elastic-agent-cnlg4                                  133m         600Mi
elastic-agent-dg9zq                                  277m         684Mi <- this one is the leader (scrapes kube-state-metrics)
elastic-agent-j57tg                                  28m          530Mi
elastic-agent-kls6z                                  33m          597Mi

Started documentation PR - https://github.com/elastic/beats/compare/master...tetianakravchenko:openshift-documentation?expand=1

@mtojek (Contributor) commented Jan 27, 2022

Hey folks! How about adding support for Openshift in elastic-package, similarly to kind? Will it help or not really?

@ChrsMark (Member, Author) commented:

> Hey folks! How about adding support for Openshift in elastic-package, similarly to kind? Will it help or not really?

That would be great! In the past we had thought about supporting Openshift in our CI, but in the Beats CI it was a bit more complicated. Now with elastic-package it should be more straightforward. I'm wondering whether it would make sense to use a local "cluster" or a cloud one on GCP for a more official approach. The second one would be more expensive if the cluster is always up, but we can verify this with the infra team. I think the ECK team does the same to test the operator.

@mtojek (Contributor) commented Jan 27, 2022

I created the relevant issue and we can take it from there instead of polluting this thread.

@tetianakravchenko (Contributor) commented Jan 27, 2022

@ChrsMark

> Do you also plan to update the documentation accordingly? We have this section -> running-on-kubernetes.html#_red_hat_openshift_configuration about Openshift specifics so this would be a good fit there I think.

elastic/beats#30054
It seems the elastic-agent documentation lives in a different repo; would this be the proper place - https://github.com/elastic/observability-docs/blob/7.16/docs/en/ingest-management/elastic-agent/running-on-kubernetes-standalone.asciidoc ? Any specific process here?

@ChrsMark (Member, Author) commented:

> elastic/beats#30054 it seems the elastic-agent documentation is in a different repo, would it be the proper place - https://github.com/elastic/observability-docs/blob/7.16/docs/en/ingest-management/elastic-agent/running-on-kubernetes-standalone.asciidoc ? any specific process here?

Yes, that's the correct place for the docs. Nothing special there; we are the ones who mostly update these specific docs, so you can just go ahead, open a PR there, and ask for a review from our team.

@tetianakravchenko (Contributor) commented Feb 4, 2022

For future reference, I will keep the installation process for crc and Openshift on GCP in this issue:

CRC:

  1. Create a Red Hat account.
  2. The CRC installer can be downloaded from https://mirror.openshift.com/pub/openshift-v4/clients/crc/ or from the Red Hat account.
  3. crc setup
  4. crc start
    Adjust the cpu/memory configuration for crc as needed (see the sketch below).
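For example (illustrative values; cpus and memory are standard crc config keys, memory is in MiB):

crc config set cpus 6
crc config set memory 16384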

To enable openshift cluster monitoring:

crc config set enable-cluster-monitoring true

This installs the monitoring stack in the openshift-monitoring namespace; use "https://kube-state-metrics.openshift-monitoring.svc:8443" to access kube-state-metrics.
Note: you might miss some metrics (https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/kube-state-metrics/deployment.yaml#L39-L51) if you use the kube-state-metrics provided by openshift/cluster-monitoring-operator; a sketch of installing a separate copy follows.
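A rough sketch of installing a separate kube-state-metrics (assumes the upstream project's standard example manifests; rename the resources/namespace first so they don't clash with the openshift-monitoring copy):

git clone https://github.com/kubernetes/kube-state-metrics
# adjust names/namespace before applying to avoid collisions
oc apply -f kube-state-metrics/examples/standard/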

Openshift on GCP:

Use https://github.com/openshift/installer
NOTE: use the official openshift-install binary, not one built from master - in my case the worker nodes were not created due to an issue when bootstrapping the master nodes (some daemonset was crashing).
The official openshift-install binary can be found in your account at https://console.redhat.com/openshift/create, where the pull secret can be found as well.

  1. Create a google service account with all the required permissions - https://github.com/openshift/installer/blob/master/docs/user/gcp/iam.md - and export its key:
export GCLOUD_KEYFILE_JSON=$(pwd)/elastic-obs-integrations-dev-*.json
  2. Create a folder with the install-config.yml:
    $ mkdir GCP
    $ cat GCP/install-config.yml
apiVersion: v1
baseDomain: cf-obs.elastic.dev
compute:
- hyperthreading: Enabled
  name: worker
  platform:
    gcp:
      type: n2-standard-4
  replicas: 4
controlPlane:
  hyperthreading: Enabled
  name: master
  platform:
    gcp:
      type: n2-standard-4
  replicas: 3
metadata:
  name: tetiana-openshift
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  gcp:
    projectID: elastic-obs-integrations-dev
    region: us-central1
sshKey: ...
pullSecret: ...

Note: add the pullSecret in the format '{"auths":{"cloud.openshift.com":{"auth":...' and the sshKey in the format 'ssh-... ...' - it is your public ssh key for accessing the created nodes.
The full list of settings can be found in the installer documentation.

  3. Create the cluster:
    ./openshift-install create cluster --dir=./GCP --log-level=debug

This creates 1 bootstrap node plus the master and worker nodes configured above. If for some reason you need to ssh into the nodes, use:
ssh core@'bootstrap-public-ip'
  4. Check installation progress:

oc --kubeconfig=$(pwd)/GCP/auth/kubeconfig get clusterversion
  5. Destroy the cluster:
./openshift-install destroy cluster --dir=./GCP --log-level=info

@ChrsMark (Member, Author) commented Feb 7, 2022

Thanks for working on this @tetianakravchenko! It will help a lot to enhance our coverage and supportability for what's around the corner.

@Myasnik2000 commented:

Hey folks, can anyone help me and provide an elastic-agent configmap for the atlassian integration?
