Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New component: cgroup aware go runtime extension #30289

Open
2 tasks
tomershafir opened this issue Jan 4, 2024 · 14 comments
Open
2 tasks

New component: cgroup aware go runtime extension #30289

tomershafir opened this issue Jan 4, 2024 · 14 comments
Labels
Accepted Component New component has been sponsored

Comments

@tomershafir
Copy link

tomershafir commented Jan 4, 2024

The purpose and use-cases of the new component

Set go runtime variables based on linux cgroupfs automatically, or let the user set a cgroup relative value. For example, set GOMAXPROCS and GOMEMLIMIT by importing https://github.com/uber-go/automaxprocs and https://github.com/KimMachineGun/automemlimit.

Example configuration for the component

Proxy config for https://github.com/uber-go/automaxprocs and https://github.com/KimMachineGun/automemlimit

Telemetry data types supported

It is data type independent

Is this a vendor-specific component?

  • This is a vendor-specific component
  • If this is a vendor-specific component, I am proposing to contribute and support it as a representative of the vendor.

Code Owner(s)

No response

Sponsor (optional)

@mx-psi

Additional context

I think about a single extension that should theoretically cover the entire go runtime, starting with the 2 variables mentioned above.

Core issue ref: open-telemetry/opentelemetry-collector#9203

@mx-psi
Copy link
Member

mx-psi commented Jan 4, 2024

cc @open-telemetry/helm-maintainers would love to know what you think (is this something we would want on the Helm chart? If so, anything specific about the design we should take into account?)

@TylerHelmuth
Copy link
Member

Seems reasonable to me. In the helm chart an extension like this would allow us to remove some templates like https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/_helpers.tpl#L169. A solution like this also helps users running the collector outside of kubernetes.

I'd like to see some example configs of how the extension would be configured.

@mx-psi mx-psi added Accepted Component New component has been sponsored and removed Sponsor Needed New component seeking sponsor needs triage New item requiring triage labels Jan 9, 2024
@tomershafir
Copy link
Author

maybe something like:

extensions:
  cgroupgoruntime:
    gomaxprocs: # hardcoded percentage 100
      enable: true
    gomemlimit:
      percentage: 85 # default 90, enabled by value > 0

@TylerHelmuth
Copy link
Member

Would gomaxprocs or gomemlimit ever have additional fields in their sections?

@tomershafir
Copy link
Author

I assume that the linux and go relevant interfaces should be stable for long, so I guess no. There may be additional sections.

@cforce
Copy link

cforce commented Mar 2, 2024

Setting gomaxprocs as env var does not reduce core usage as expected .

@iblancasa
Copy link
Contributor

It would be interesting to have this extension. Recently, Prometheus added automatic memory limit handling based on the same mechanism.

Copy link
Contributor

github-actions bot commented May 6, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label May 6, 2024
@tomershafir
Copy link
Author

I wont be able to work on this issue. @mx-psi do you have someone to assign?

@mx-psi mx-psi removed the Stale label May 6, 2024
@mx-psi
Copy link
Member

mx-psi commented May 6, 2024

@tomershafir no, but I can remove the Stale marker and leave it open for now in case somebody else volunteers

Copy link
Contributor

github-actions bot commented Jul 8, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Jul 8, 2024
@mx-psi mx-psi removed the Stale label Jul 8, 2024
Copy link
Contributor

github-actions bot commented Sep 9, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@rogercoll
Copy link
Contributor

@tomershafir @mx-psi I can help with the implementation of this extension. I have started a PoC that is working as expected, but it completely relies on the used packages: #35472

My main concern is regarding the testing, after taking a look at the packages, they seem not to provide an easy way to mock the cgroups file system. A mock of the cgroup file system would be needed for a full coverage unit test. Another option would be an integration test by manually creating a cgroup and launch the test in it, we could leverage the https://github.com/containerd/cgroups package to do so. This last option requires root privileges, do you know if we have any tests in the opentelemetry-collector that uses/require a root user?

@mx-psi
Copy link
Member

mx-psi commented Nov 6, 2024

Just to clarify: I am still willing to sponsor this component :)

mx-psi added a commit that referenced this issue Nov 26, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

This PR adds the initial implementation of a new component to
dynamically set the values of `GOMEMLIMIT` and `GOMAXPROCS` used by the
Go runtime. Those values are normally manually aligned with the cgroup
resource limit to prevent cpu throttling or out of memory scenarios.

The component would ease the manual steps of configuring these
environment variables in K8s deployments (e.g Helm
[templates](https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/_helpers.tpl#L169))
in addition to have fine-grained values (e.g. 90% of the resource memory
limits).

**Link to tracking Issue:** <Issue number if applicable>
#30289

**Testing:** <Describe what testing was performed and which tests were
added.> Unit testing for the component has been added (config and
extension start/stop). But ideally, an integration test that actually
asserts the runtime modifications should be added as well. The extension
relies on "github.com/KimMachineGun/automemlimit/memlimit" and
"go.uber.org/automaxprocs/maxprocs" packages for the runtime
modifications, but they don't provide a way to mock the "cgroups" file
system which is the one they read to get the resource quota limits.

- Automemlimit package tests expect to run in a cgroup environment:
https://github.com/KimMachineGun/automemlimit/blob/main/memlimit/cgroups_test.go#L18
- Automaxprocs does not expose the cpu quota retrieval
https://github.com/uber-go/automaxprocs/blob/master/maxprocs/maxprocs.go#L41

Any suggestion on how to perform this integration tests in the contrib
repository? One possibility is to use the
https://github.com/containerd/cgroups package to set the quota, but this
requires privileged permissions (also in the GHA)


**Documentation:** <Describe the documentation added.>

---------

Co-authored-by: Pablo Baeyens <[email protected]>
shivanthzen pushed a commit to shivanthzen/opentelemetry-collector-contrib that referenced this issue Dec 5, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

This PR adds the initial implementation of a new component to
dynamically set the values of `GOMEMLIMIT` and `GOMAXPROCS` used by the
Go runtime. Those values are normally manually aligned with the cgroup
resource limit to prevent cpu throttling or out of memory scenarios.

The component would ease the manual steps of configuring these
environment variables in K8s deployments (e.g Helm
[templates](https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/_helpers.tpl#L169))
in addition to have fine-grained values (e.g. 90% of the resource memory
limits).

**Link to tracking Issue:** <Issue number if applicable>
open-telemetry#30289

**Testing:** <Describe what testing was performed and which tests were
added.> Unit testing for the component has been added (config and
extension start/stop). But ideally, an integration test that actually
asserts the runtime modifications should be added as well. The extension
relies on "github.com/KimMachineGun/automemlimit/memlimit" and
"go.uber.org/automaxprocs/maxprocs" packages for the runtime
modifications, but they don't provide a way to mock the "cgroups" file
system which is the one they read to get the resource quota limits.

- Automemlimit package tests expect to run in a cgroup environment:
https://github.com/KimMachineGun/automemlimit/blob/main/memlimit/cgroups_test.go#L18
- Automaxprocs does not expose the cpu quota retrieval
https://github.com/uber-go/automaxprocs/blob/master/maxprocs/maxprocs.go#L41

Any suggestion on how to perform this integration tests in the contrib
repository? One possibility is to use the
https://github.com/containerd/cgroups package to set the quota, but this
requires privileged permissions (also in the GHA)


**Documentation:** <Describe the documentation added.>

---------

Co-authored-by: Pablo Baeyens <[email protected]>
ZenoCC-Peng pushed a commit to ZenoCC-Peng/opentelemetry-collector-contrib that referenced this issue Dec 6, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

This PR adds the initial implementation of a new component to
dynamically set the values of `GOMEMLIMIT` and `GOMAXPROCS` used by the
Go runtime. Those values are normally manually aligned with the cgroup
resource limit to prevent cpu throttling or out of memory scenarios.

The component would ease the manual steps of configuring these
environment variables in K8s deployments (e.g Helm
[templates](https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/_helpers.tpl#L169))
in addition to have fine-grained values (e.g. 90% of the resource memory
limits).

**Link to tracking Issue:** <Issue number if applicable>
open-telemetry#30289

**Testing:** <Describe what testing was performed and which tests were
added.> Unit testing for the component has been added (config and
extension start/stop). But ideally, an integration test that actually
asserts the runtime modifications should be added as well. The extension
relies on "github.com/KimMachineGun/automemlimit/memlimit" and
"go.uber.org/automaxprocs/maxprocs" packages for the runtime
modifications, but they don't provide a way to mock the "cgroups" file
system which is the one they read to get the resource quota limits.

- Automemlimit package tests expect to run in a cgroup environment:
https://github.com/KimMachineGun/automemlimit/blob/main/memlimit/cgroups_test.go#L18
- Automaxprocs does not expose the cpu quota retrieval
https://github.com/uber-go/automaxprocs/blob/master/maxprocs/maxprocs.go#L41

Any suggestion on how to perform this integration tests in the contrib
repository? One possibility is to use the
https://github.com/containerd/cgroups package to set the quota, but this
requires privileged permissions (also in the GHA)


**Documentation:** <Describe the documentation added.>

---------

Co-authored-by: Pablo Baeyens <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Component New component has been sponsored
Projects
None yet
Development

No branches or pull requests

6 participants