
feat: Investigate and fix issue with wrong CPU count for containers #623

Merged: 26 commits merged from feat/parallelizm-debug-logs into feat/parallelizm on Feb 15, 2024

Conversation

@MaxymVlasov (Collaborator) commented Feb 14, 2024

Put an x into the box if it applies:

  • This PR introduces a breaking change.
  • This PR fixes a bug.
  • This PR adds new functionality.
  • This PR enhances existing functionality.

Description of your changes

So, I found that `nproc` always shows how many CPUs are available on the host. K8s "limits" and Docker `--cpus` are throttling mechanisms that do not hide any cores from view.
There are a few workarounds, but IMO it is better to implement checks for this than to rely on them:

Workaround for Docker - set `--cpuset-cpus`
Workaround for K8s - somehow deal with the kubelet static CPU management policy, as recommended on Reddit
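
A minimal sketch of how a cgroup-aware CPU count could be derived instead of relying on those workarounds. The function name and the fallback behaviour are illustrative assumptions, not necessarily what the hook implements:

```bash
#!/usr/bin/env bash
# Sketch: prefer the cgroup CPU quota (set by `docker --cpus` / K8s limits)
# over `nproc`, which always reports all host cores.
function get_effective_cpus {
  local quota period
  if [[ -f /sys/fs/cgroup/cpu.max ]]; then                  # cgroup v2
    read -r quota period < /sys/fs/cgroup/cpu.max
  elif [[ -f /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]]; then   # cgroup v1
    quota=$(< /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
    period=$(< /sys/fs/cgroup/cpu/cpu.cfs_period_us)
  fi
  if [[ -n ${quota:-} && $quota != max && $quota -gt 0 ]]; then
    # Round up so a fractional quota (e.g. 0.5 CPU) still reports at least 1
    echo $(( (quota + period - 1) / period ))
  else
    nproc  # no quota set (or detection failed) - fall back to all visible cores
  fi
}
```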

  • Send all "colorify" logs through stderr, which makes it possible to emit user-facing logs from functions that also need to return a value to the caller

@MaxymVlasov changed the title from "Feat/parallelizm debug logs" to "feat: Investigate and fix issue with wrong CPU count for containers" on Feb 14, 2024
@MaxymVlasov merged commit ec22d70 into feat/parallelizm on Feb 15, 2024
6 checks passed
@MaxymVlasov deleted the feat/parallelizm-debug-logs branch on February 15, 2024 at 16:25
MaxymVlasov added a commit that referenced this pull request Feb 17, 2024
…ction in README (#620)

### Reasoning 

We have a GH workflow that runs lockfile updates every week (implementation and reasoning [here](https://grem1.in/post/terraform-lockfiles-maxymvlasov/)).
It usually takes from 2h 30min to 3h 15min. That was fine for us until we found that our GH runners, based on AWS EC2s, started silently failing after 30min "without recent logs"; that was fixed by a crutch that sends a dummy log every 10min (sketched below).
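
A minimal sketch of that keepalive crutch, assuming a plain bash job step (names and interval handling are illustrative):

```bash
# Print a dummy log line every 10 minutes in the background so the runner
# keeps seeing "recent logs", then stop the loop once the real job finishes.
( while true; do echo "keepalive: still running at $(date -u +%H:%M:%S)"; sleep 600; done ) &
KEEPALIVE_PID=$!

# ... the actual long-running lockfile-update job goes here ...

kill "$KEEPALIVE_PID"
```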

However, during the debugging I spent some time investigating why the hooks were not utilizing all of the provided resources.
That means wasted time and money, not only for that corner case but for every huge commit, and it can push people into opting out of the hooks locally via `git commit -n` for changes that affect many directories.

### Description of your changes

* Add a per-hook `--parallelism-limit` setting to `--hook-config`. It defaults to `number of logical CPUs - 1`
* As quick tests show, ~5% of stacks hit a race condition when directories are initialized in parallel, regardless of whether any locking mechanism exists. I suppose the lock fails because it lives on disk while the hooks run in memory, so creating the lock can take some time given the layers of caches between memory and disk. Those milliseconds are enough to let a few `t init` runs proceed in parallel.
* The final implementation uses a retry mechanism for cases where the race condition causes `t init` to fail for a directory (see the sketch after this list).
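
A minimal sketch of that retry idea, assuming a hypothetical `init_with_retry` helper (not the hook's actual function name or flags):

```bash
# Re-run `terraform init` a few times when a parallel run hits a race condition.
function init_with_retry {
  local dir=$1 attempt
  for attempt in 1 2 3; do
    # `-backend=false` keeps the example self-contained; the real hooks may pass other flags.
    terraform -chdir="$dir" init -backend=false && return 0
    echo "terraform init failed in ${dir} (attempt ${attempt}), retrying..." >&2
    sleep 1
  done
  return 1
}
```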

Based on quick tests, on big changes:
* Up to a 2000% speed increase for `terraform_validate`, and up to 500% for the other affected hooks.
* With `--parallelism-limit=1` I observed an insignificant increase in run time (about 5-10%) compared to v1.86.0, which has no parallelism at all. This may be the cost of maintaining parallelism or the result of external factors, since the tests were not conducted in a vacuum.

For small changes, improvements are less significant.

-----
Other significant findings/solutions included in this PR:


* feat: Investigate and fix issue with wrong CPU count for containers (#623)

So, I found that `nproc` always shows how many CPUs are available on the host. K8s "limits" and Docker `--cpus` are throttling mechanisms that do not hide any cores from view.
There are a few workarounds, but IMO it is better to implement checks for this than to rely on them:

>Workaround for Docker - set `--cpuset-cpus`
>Workaround for K8s - somehow deal with the [kubelet static CPU management policy](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#cpu-management-policies), as [recommended on Reddit](https://news.ycombinator.com/item?id=25224714)

* Send all "colorify" logs through stderr, as make able to add user-facing-logs in functions that also need to return same value to the function-caller. Needed for `common::get_cpu_num` err_msg show up

------

* Count `--parallelism-ci-cpu-cores` only in edge cases

Details:
#620 (review)

---------

Co-authored-by: George L. Yermulnik <[email protected]>