-
-
Notifications
You must be signed in to change notification settings - Fork 540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Investigate and fix issue with wrong CPU count for containers #623
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
MaxymVlasov
changed the title
Feat/parallelizm debug logs
feat: Investigate and fix issue with wrong CPU count for containers
Feb 14, 2024
4 tasks
MaxymVlasov
commented
Feb 14, 2024
yermulnik
reviewed
Feb 14, 2024
Co-authored-by: George L. Yermulnik <[email protected]>
Co-authored-by: George L. Yermulnik <[email protected]>
MaxymVlasov
commented
Feb 15, 2024
Co-authored-by: George L. Yermulnik <[email protected]>
yermulnik
approved these changes
Feb 15, 2024
MaxymVlasov
added a commit
that referenced
this pull request
Feb 17, 2024
…ction in README (#620) ### Reasoning We have a GH workflow that runs lockflies updates every week (implementation and reasoning [here](https://grem1.in/post/terraform-lockfiles-maxymvlasov/)). It usually takes from 2h 30min to 3h 15min. That was fine for us, till we found that our GH runners, based on AWS EC2s, started silently failing after 30min "without recent logs", and that was fixed by crutch which sends a dummy log every 10min. However, during the debugging, I spent some time describing why hooks were not utilizing all the provided resources. And that means a waste of time and money, not only for that corner case but for every huge commit, which can cause opting out by `git commit -n` of using hooks locally for changes that affect many directories. ### Description of your changes * Add per-hook `--parallelism-limit` setting to `--hook-config`. Defaults to `number of logical CPUs - 1` * As quick tests show, ~5% of stacks face race condition problem, no matter if any locking mechanism exists or dirs try to init in parallel. I suppose the lock failed as it uses disk when hooks run in memory, so the creation of the lock can take some time as there bunch of caches between Mem and Disk. These milliseconds are enough to allow running a few `t init` in parallel. * Final implementation uses a retry mechanism for cases when race condition failed to `t init` directory. In quick tests, I can say that on big changes: * Up to 2000% speed increase for `terraform_validate`, and up to 500% - for other affected hooks. * When `--parallelism-limit=1` I observed an insignificant increase in time (about 5-10%) compared to v1.86.0 which has no parallelism at all. This may be the cost of maintaining parallelism or the result of external factors since the tests were not conducted in a vacuum. For small changes, improvements are less significant. ----- Other significant findings/solutions included to this PR: * feat: Investigate and fix issue with wrong CPU count for containers (#623) So, I found that `nproc` always shows how many CPUs available is. K8s "limits" and docker `--cpus` are throttling mechanisms, which do not hide the visibility of all cores. There are a few workarounds, but IMO, it is better to implement checks for that than do them >Workaround for docker - set `--cpuset-cpus` >Workaraund for K8s - somehow deal with [kubelet static CPU management policy](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#cpu-management-policies), as [recommend in Reddit](https://news.ycombinator.com/item?id=25224714) * Send all "colorify" logs through stderr, as make able to add user-facing-logs in functions that also need to return same value to the function-caller. Needed for `common::get_cpu_num` err_msg show up ------ * Count --parallelism-ci-cpu-cores only in edge-cases Details: #620 (review) --------- Co-authored-by: George L. Yermulnik <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Put an
x
into the box if that apply:Description of your changes
So, I found that
nproc
always shows how many CPUs available is. K8s "limits" and docker--cpus
are throttling mechanisms, which do not hide the visibility of all cores.There are a few workarounds, but IMO, it is better to implement checks for that than do them