[Meta] Investigate resource consumption of Elastic Agent with K8s Integration #3801
Comments
Once we've resolved the issues (or earlier, if resolving them is not straightforward and we need to iterate): I think we should also figure out how to reliably reproduce the issues in an ephemeral cluster, ideally with some automation in place to create the cluster and whatever workload is necessary to trigger the issues (e.g. create a bunch of deployments/pods/whatever). Then we can:
|
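The kind of automation suggested above (an ephemeral cluster plus a generated workload) could start from something like the following sketch. It only generates manifests; the function name `generate_deployments`, the `load` name prefix and the choice of image are illustrative assumptions, not existing Elastic tooling.

```shell
# Sketch: emit N small Deployment manifests on stdout so they can be piped
# to `kubectl apply -f -`. generate_deployments and the "load" prefix are
# hypothetical names chosen for this example.
generate_deployments () {
    count="$1"            # number of Deployments to emit
    replicas="$2"         # replicas per Deployment
    prefix="${3:-load}"   # name prefix (assumed default)
    i=1
    while [ "$i" -le "$count" ]; do
        cat <<EOF
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${prefix}-${i}
  labels:
    app: ${prefix}-${i}
spec:
  replicas: ${replicas}
  selector:
    matchLabels:
      app: ${prefix}-${i}
  template:
    metadata:
      labels:
        app: ${prefix}-${i}
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
EOF
        i=$((i + 1))
    done
}
```

For example, after `kind create cluster`, running `generate_deployments 10 5 | kubectl apply -f -` would create 10 Deployments of 5 replicas each in whatever cluster the current kubectl context points to.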
Thanks @axw, I have updated the section a bit |
As a short-term measure, can we somehow document the known issues / limitations we have faced so far? |
Is there progress in the latest version, or is it still destroying the k8s master? I disabled elastic in our cluster a while ago and am checking whether there has been any progress since. I can't really tell whether things should have improved if I upgrade. |
We have tracked down the source of the high memory usage on k8s and are working to fix it. #4729 is the tracking issue. |
And what about rate-limiting the k8s apiserver requests? Is any work going on for that? |
Regarding rate limiting, the main issue is this which is not yet prioritised in the next iterations, but it is certainly in our backlog. Somewhat related, we have already merged 3625 in order to minimise any possible effect of leader election API calls. Additionally, since 8.14.0 we have done a major refactoring in 37243, which we showed helps the overall resource consumption |
Test setup
I have run a script to evaluate the performance of our K8s integration. I evaluated all 8.x.0 versions between 8.5.0 and 8.15.0. The test increases the number of pods in a one-node cluster in these steps: 12, 61, 111, 161, 211, 311, 411, and 511. I recorded the following results after 5 min for each cycle:
Once the EA restarts, I stop recording results for the subsequent pod increases, since the performance is no longer stable.
This is the script I am running for the tests.
setup_cluster () {
    kind delete cluster
    kind create cluster
    # Install metrics-server so that we can execute kubectl top
    kubectl apply -f https://raw.githubusercontent.com/pythianarora/total-practice/master/sample-kubernetes-code/metrics-server.yaml
}
test_n_pods () {
    # $1 - EA manifest filename to use in kubectl apply
    # $2 - filename for the results
    # Prepare cluster with EA using kubernetes + system policy
    setup_cluster
    kubectl apply -f "$1"
    echo "| Pods | CPU | Memory | EA pod restarts |" > "$2"
    echo "|------|-----|--------|-----------------|" >> "$2"
    for replicas in 1 50 100 150 200 300 400 500 ;
    do
        # The deployment does not exist on the first iteration, so ignore "not found"
        kubectl delete -f nginx-pod.yaml --ignore-not-found
        sed -i -e "s/ replicas: .*/ replicas: $replicas/g" nginx-pod.yaml
        kubectl apply -f nginx-pod.yaml
        sleep 5m
        # Quote the pattern so the shell cannot expand it against local filenames
        top=$(kubectl top pods -n kube-system | grep '^elastic')
        pods=$(kubectl get pods --no-headers --all-namespaces | wc -l)
        line=$(kubectl get pods -o wide --all-namespaces | awk '$2 ~ /^elastic/')
        # With --all-namespaces, the fifth column is RESTARTS
        restarts=$(echo "$line" | awk '{print $5}')
        print_results_to_file "$pods" "$top" "$restarts" "$2"
    done
}
print_results_to_file () {
    # Gets arguments:
    # $1 = number of pods
    # $2 = kubectl top result
    # $3 = number of EA restarts
    # $4 = results filename
    # Parse result of kubectl top (example 'elastic-agent-985zk 16m 583Mi')
    cpu=$(echo "$2" | awk '{print $2}')
    memory=$(echo "$2" | awk '{print $3}')
    echo "| $1 | $cpu | $memory | $3 |" >> "$4"
}
# Test the performance by running test_n_pods. Change the arguments to your own.
test_n_pods <DEPLOYMENT EA FILE GOES HERE> <RESULTS FILENAME GOES HERE>
This is the NGINX pod deployment I use in the script.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 500
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
8.5
Using the default configuration from the agent:
resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
Results:
8.6
Using the default configuration from the agent:
resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
Results:
No longer works from the 61-pod test onward.
8.7
Using the default configuration from the agent:
resources:
  limits:
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
Results:
No longer works from the 61-pod test onward.
8.8 - default agent configuration changes
Using the default configuration from the agent:
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
Results:
No longer works from the 161-pod test onward.
8.9
Using the default configuration from the agent:
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
Results:
No longer works from the 161-pod test onward.
8.10
Using the default configuration from the agent:
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
Results:
No longer works from the 111-pod test onward.
8.11
Using the default configuration from the agent:
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
Results:
No longer works from the 111-pod test onward.
8.12
Using the default configuration from the agent:
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
Results:
No longer works from the 111-pod test onward.
8.13
Using the default configuration from the agent:
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
Results:
No longer works from the 111-pod test onward.
8.14
Using the default configuration from the agent:
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
Results:
No longer works from the 61-pod test onward.
8.15
Using the default configuration from the agent:
resources:
  limits:
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 400Mi
Results:
No longer works from the 61-pod test onward.
Notes
From 8.5 to 8.6, something changed that caused a large memory increase in the Kubernetes integration, to the point that increasing the number of pods made the agent stop and restart over and over again.
From 8.8, the number of pods needed to make the agent stop increased. This is a good sign, but note that the default memory limits and requests also increased in that version, which likely explains much of this seemingly better performance.
From 8.9 to 8.10, the number of pods that caused the EA to stop and restart decreased again. Something changed again in the Kubernetes integration that affected the agent performance.
From 8.13 to 8.14, the number of pods that caused the EA to stop and restart decreased once more. Something changed again in the Kubernetes integration that affected the agent performance. Also, from @gizas: 8.13 vs 8.14 shows a 140Mi difference even with only 12 pods.
It seems Kubernetes memory usage has been getting higher since 8.5, with notable increases in 8.6, 8.10 and 8.14 (setting aside the increase of the default memory resources in EA at 8.8, which helped hide possible issues in the Kubernetes integration). |
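To compare versions as in the notes above, the point at which each run breaks can be read directly from the results files. The following is a minimal sketch, assuming the table layout written by `print_results_to_file`; `first_restart_row` is a hypothetical helper name and not part of the original script.

```shell
# Sketch: read a results table in the format written by print_results_to_file
# ("| Pods | CPU | Memory | EA pod restarts |") and print the pod count of the
# first row whose restart counter is greater than zero, or "none".
# first_restart_row is a hypothetical helper, not part of the original script.
first_restart_row () {
    awk -F'|' '
        NR <= 2 { next }                      # skip header and separator rows
        {
            pods = $2; restarts = $5
            gsub(/ /, "", pods)               # trim the padding around the pod count
            sub(/^ */, "", restarts)          # drop leading spaces
            sub(/[^0-9].*$/, "", restarts)    # keep only the leading number
            if (restarts + 0 > 0) { print pods; found = 1; exit }
        }
        END { if (!found) print "none" }
    ' "$1"
}
```

Calling `first_restart_row` on the results file of each version gives one number per version, which makes regressions between releases easier to spot than scanning the full tables.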
@constanca-m Have you also tested whether the data is actually sent to Elastic? My setup had ~15 pods with 8.15 and the memory ran high; even though the pod itself didn't restart, the K8s data didn't come in (or was very spotty). I think one of the processes itself was crashing. |
In my case, I can see data in Discover (I am filtering by I did not analyze the logs to know if everything is being sent there, or we are losing data. These are the logs from running all the tests in 8.15, including the pod restarts. @EvelienSchellekens |
Really useful @constanca-m ! Adding some notes here:
A general comment is that all the above tests only measure memory consumption under the same k8s load. Identifying a memory leak requires watching the trend of memory over time. An increase by itself is neither good nor bad if we are observing more k8s resources. Additionally:
|
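Watching the trend over time, as suggested above, can be reduced to a single number by fitting a line through periodic memory samples. This is a minimal sketch, assuming samples are collected as lines of `<seconds> <MiB>`; that input format and the function name `memory_slope` are assumptions made for illustration.

```shell
# Sketch: least-squares slope of memory usage over time.
# Input (stdin): one sample per line, "<seconds> <MiB>" (assumed format).
# Output: slope in MiB per minute, printed with two decimals.
memory_slope () {
    awk '
        NF >= 2 { n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2 }
        END {
            d = n * sxx - sx * sx
            # A slope needs at least two samples taken at different times
            if (n < 2 || d == 0) { print "insufficient data"; exit 1 }
            printf "%.2f\n", 60 * (n * sxy - sx * sy) / d
        }
    '
}
```

A slope that stays clearly positive while the number of watched k8s resources is constant would point towards a leak; a one-off step up after adding pods would not. Samples could be gathered by appending `date +%s` and the memory column of `kubectl top pods` (with the `Mi` suffix stripped) to a file at a fixed interval.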
Thank you @gizas. I think this issue and the scripts to run these tests should be placed somewhere more accessible to the team. Maybe in the future repository you mentioned in Thursday's meeting to help with identifying issues.
It says the limits memory is 700Mi for 8.12.
It looks like it... This was just one test, and values always vary a bit between runs. We could run a test with smaller increases in pods to better capture the differences between these latest versions. Edit: but since 8.15 has more or less the same values as 8.14, I believe that we do have a significant difference between 8.13 and 8.14 like you pointed out. Thanks, I will include it in the notes of the original comment as well!
Yes. You are correct, we don't include tests running with just the System integration, unfortunately. I agree, it would be good to also have an idea of that, but I don't believe the System integration is causing any issues here.
This is the hard part! With the agent starting and restarting over and over again... It is very hard, and downloading the diagnostics gets stuck in a loop, and the zip never gets ready. Not sure what is going on there, but I have not paid much attention to it.
Correct. Only the default pods, EA, metrics server and the NGINX pod. I believe the best would be to look at the changelog, see what big changes we had. I can remember the watchers issue, but since that PR has the memory tests there, I don't believe that could cause any influence on the degraded performance, but I could of course be wrong (and biased 😄 ). |
@constanca-m the https://github.com/elastic/k8s-integration-infra?tab=readme-ov-file#put-load-on-the-cluster script mentioned in the call. (public repo) |
I used a different one @gizas, it is local and more simplified (in the comment of the tests results). I think it should be enough for these tests, and that script for more complex tests. |
I also performed some scale tests. I created a one-node cluster in GKE with ~95 pods running. TBH the 700Mi memory limit suffices in both versions. Only with kube-state-metrics enabled did I get one restart, which means that in big clusters (note that 110 pods per node is the limit in Kubernetes) the memory limit needs some adjustment. I don't know why @constanca-m got different results. |
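To judge whether a limit "needs some adjustment", it can help to express the observed usage as a percentage of the configured limit. The sketch below assumes quantities with `Ki`, `Mi` or `Gi` suffixes, as printed by `kubectl top` and used in the manifests above; `to_mib` and `limit_usage_percent` are hypothetical helper names.

```shell
# Sketch: express observed memory usage as a percentage of the memory limit.
# to_mib and limit_usage_percent are hypothetical helpers for this example.

# Convert a Kubernetes memory quantity with a Ki/Mi/Gi suffix to whole MiB.
to_mib () {
    case "$1" in
        *Ki) echo $(( ${1%Ki} / 1024 )) ;;
        *Mi) echo "${1%Mi}" ;;
        *Gi) echo $(( ${1%Gi} * 1024 )) ;;
        *)   echo "unsupported unit: $1" >&2; return 1 ;;
    esac
}

# $1 = observed usage (e.g. 583Mi), $2 = configured limit (e.g. 700Mi)
limit_usage_percent () {
    used=$(to_mib "$1") || return 1
    limit=$(to_mib "$2") || return 1
    [ "$limit" -gt 0 ] || return 1    # avoid dividing by zero
    echo $(( used * 100 / limit ))    # integer percentage, rounded down
}
```

For instance, `limit_usage_percent 583Mi 700Mi` uses the example value from the script comment above against the default limit since 8.8. Which percentage should count as too close to the limit is a judgment call that this sketch does not make.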
…cts (#109) We only use metadata from Jobs and ReplicaSets, but require that full resources are supplied. This change relaxes this requirement, allowing PartialObjectMetadata resources to be used. This allows callers to use metadata informers and avoid having to receive and deserialize non-metadata updates from the API Server. See elastic/elastic-agent#5580 for an example of how this could be used. I'm planning to add the metadata informer from that PR to this library as well. Together, these will allow us to greatly reduce memory used for processing and storing ReplicaSets and Jobs in beats and elastic-agent. This will help elastic/elastic-agent#5580 and elastic/elastic-agent#4729 specifically, and elastic/elastic-agent#3801 in general.
Background
Recent issues like 3863, 3991 and 4081 showed that installing the default configuration of Elastic Agent with our Kubernetes Integration can leave our customers in unfortunate circumstances (sometimes even with broken k8s clusters). Many details and variables affect the final setup and installation of our observability solution, and we try to summarise and list them here.
Goals
This issue tries to summarise the next actions we need in order to investigate:
Actions
Current Actions
We have observed until now that:
a) Memory consumption of Elastic Agent has increased from version 8.8 to 8.9 and later (Relevant: https://github.com/elastic/sdh-beats/issues/3863#issuecomment-1733750863)
b) The number of API calls towards the Kubernetes Control API has increased since version 8.9 (See Salesforce 01507229 regarding Elastic Agent overloading the Kubernetes API server: https://github.com/elastic/sdh-beats/issues/3991#issuecomment-1787648161)
c) CPU consumption (although not such a big issue at the moment and not first priority) has been referred here as a concern.
Until now:
Next Planned Actions
Future Plans/Actions