
Client Mount Rate #186

Open
behlendorf opened this issue Jul 30, 2024 · 7 comments · Fixed by NearNodeFlash/nnf-sos#391

Comments

@behlendorf
Collaborator

behlendorf commented Jul 30, 2024

When performing an allocation involving a large number of compute nodes, the workflow can spend the majority of its time in the "Setup" phase mounting clients. Based on the contents of the nnf-controller-manager logs, the mounts appear to be requested sequentially, and according to the timing information in the log they are created at a rate of 20-25 mounts/second.

Could this be sped up by issuing the requests asynchronously? The kube-apiserver is probably not the limiting factor and should be able to handle the increased load.

@matthew-richerson
Contributor

The nnf-sos code creates each ClientMount resource in a separate goroutine, so the creates should already be running in parallel at some level:
https://github.com/NearNodeFlash/nnf-sos/blob/master/internal/controller/nnf_access_controller.go#L929

There might be something in the k8s client library that's serializing the requests underneath our controller, though.
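
A minimal sketch of that fan-out pattern, assuming a shared controller-runtime client; newClientMountFor is a hypothetical stand-in for the resource construction done in nnf_access_controller.go, not the actual nnf-sos code:

```go
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createClientMounts issues one Create per compute node, each in its own
// goroutine, and waits for all of them to finish.
func createClientMounts(ctx context.Context, c client.Client, nodes []string,
	newClientMountFor func(node string) client.Object) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, node := range nodes {
		node := node // capture the loop variable for the goroutine
		g.Go(func() error {
			// Each Create is an independent POST to the kube-apiserver, but
			// every goroutine shares the controller's client and therefore
			// its client-side rate limiter.
			return c.Create(gctx, newClientMountFor(node))
		})
	}
	return g.Wait()
}
```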

matthew-richerson self-assigned this Jul 31, 2024
@matthew-richerson
Contributor

I'll see what we can do here. We might be able to open multiple client connections to the server, send the create requests from multiple worker nodes, or something else. 20-25 creates/second is too slow.

@matthew-richerson
Contributor

2024-07-29T21:15:06.842Z        INFO    controllers.NnfAccess   Created ClientMount     {"NnfAccess": {"name":"fluxjob-494649641938190336-0-computes","namespace":"default"}, "name": "elcap7790/default-fluxjob-494649641938190336-0-computes"}
2024-07-29T21:15:06.892Z        INFO    controllers.NnfAccess   Created ClientMount     {"NnfAccess": {"name":"fluxjob-494649641938190336-0-computes","namespace":"default"}, "name": "elcap8444/default-fluxjob-494649641938190336-0-computes"}
2024-07-29T21:15:06.941Z        INFO    controllers.NnfAccess   Created ClientMount     {"NnfAccess": {"name":"fluxjob-494649641938190336-0-computes","namespace":"default"}, "name": "elcap8452/default-fluxjob-494649641938190336-0-computes"}
2024-07-29T21:15:06.991Z        INFO    controllers.NnfAccess   Created ClientMount     {"NnfAccess": {"name":"fluxjob-494649641938190336-0-computes","namespace":"default"}, "name": "elcap8937/default-fluxjob-494649641938190336-0-computes"}
I0729 21:15:07.036854       1 request.go:697] Waited for 1.049968388s due to client-side throttling, not priority and fairness, request: POST:https://10.96.0.1:443/apis/dataworkflowservices.github.io/v1alpha2/namespaces/elcap8951/clientmounts

@matthew-richerson
Contributor

I think the first issue to solve here is the client-side throttling. There are QPS and burst settings configured on the controllers, and that's why we're only seeing 20-25 creates per second; on our internal system I'm seeing the same rate. I bumped QPS from 20 (default) to 500 and burst from 30 (default) to 1000, which gave me 300 creates per second when creating 300 ClientMounts.

I'll put out a change to expose some environment variables that will let us change those values so we can tune it.
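
A minimal sketch of that kind of tuning, assuming a controller-runtime based manager (the actual nnf-sos wiring may differ); the QPS and burst values are the ones quoted above:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// The rest.Config used by the manager's clients carries the client-side
	// rate limiter settings (QPS/Burst) that were throttling the creates.
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 500    // sustained requests per second allowed by the limiter
	cfg.Burst = 1000 // short-term burst allowance

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// Register controllers and start the manager as usual.
	_ = mgr
}
```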

@matthew-richerson
Contributor

The environment variables are available in master now: NearNodeFlash/nnf-sos@7cd399d
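
A sketch of how such environment variables could feed the rest.Config; the variable names NNF_REST_CONFIG_QPS and NNF_REST_CONFIG_BURST are hypothetical placeholders for illustration, not necessarily the names used in that commit:

```go
package main

import (
	"os"
	"strconv"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	cfg := ctrl.GetConfigOrDie()

	// Hypothetical names used for illustration; check the linked commit for
	// the real ones exposed by nnf-sos.
	if v := os.Getenv("NNF_REST_CONFIG_QPS"); v != "" {
		if qps, err := strconv.ParseFloat(v, 32); err == nil {
			cfg.QPS = float32(qps)
		}
	}
	if v := os.Getenv("NNF_REST_CONFIG_BURST"); v != "" {
		if burst, err := strconv.Atoi(v); err == nil {
			cfg.Burst = burst
		}
	}

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}
	_ = mgr
}
```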

ajfloeder linked a pull request Sep 19, 2024 that will close this issue
@ajfloeder
Contributor

@behlendorf Do the environment variables solve this issue?

@behlendorf
Collaborator Author

@ajfloeder we'll need to retest this. I don't believe we've done any similar scale testing since this was merged.

Labels: None yet
Projects: Status: 📋 Open
3 participants