Could we increase disk space of runners? #47

Closed
jeongseok-meta opened this issue Nov 2, 2024 · 9 comments · Fixed by #48

Comments

@jeongseok-meta
Contributor

I wonder if we could increase the disk space of runners for building large packages, such as pytorch (conda-forge/pytorch-cpu-feedstock#277). One job failed due to insufficient disk space (conda-forge/pytorch-cpu-feedstock#277 (comment)) on the cirun-openstack-gpu-2xlarge instance.

I have enabled the free_disk_space option, but it has caused other build errors due to missing packages (conda-forge/pytorch-cpu-feedstock#277 (comment)). It is unclear whether the conda-forge build scripts would need modifications to handle this case, though.
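For reference, the option is set in the feedstock's conda-forge.yml, roughly as sketched below (assuming the current conda-smithy schema, where free_disk_space lives under the github_actions key; worth double-checking against the conda-smithy docs):

```yaml
# conda-forge.yml (feedstock root) -- sketch, assuming the github_actions.free_disk_space key
github_actions:
  # Frees space by removing preinstalled software; the cleanup paths assume
  # GitHub-hosted runner images, so it may not behave well on custom Cirun VMs.
  free_disk_space: true
```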

cc: @Tobias-Fischer, @hmaarrfk

@jaimergp
Collaborator

jaimergp commented Nov 3, 2024

Thank you! I think we might be able to bump the disk a bit in the 2x and 4x runners. Do you think 80 and 100GB respectively should suffice? cc @aktech

The free_disk_space stuff is designed for Github-hosted runners, not our custom VMs, so I'm not surprised it doesn't work here. Maybe conda-smithy should have some checks there...

@jeongseok-meta
Contributor Author

I'm not sure TBH, but +20 and +40 GB sounds reasonable, since only one CI job (linux64 + mkl + cuda 11.8) failed while the others are still fine! cc: @hmaarrfk

@hmaarrfk
Contributor

hmaarrfk commented Nov 3, 2024

The question is, what do the build logs actually tell us? @jeongseok-meta ran the last problematic build, and these were the results from the log:

####################################################################################
Resource usage summary:

Total time: 1:48:31.4
CPU usage: sys=2:03:35.3, user=44:56:57.7
Maximum memory usage observed: 131.1G
Total disk usage observed (not including envs): 12.1G

So the RAM usage is quite high, but the disk usage isn't... this is puzzling me even more.

@jeongseok-meta
Contributor Author

I'm not sure how much meaningful data can be obtained from the log of a successful build, as it may only show disk and RAM usage for a single job (although I'm not certain). However, the log of a failed build explicitly states that it was unable to write to the disk ("No space left on device"):

2024-10-31T04:21:06.1360921Z   copying build/lib.linux-x86_64-cpython-311/torch/lib/libtorch_cpu.so -> build/bdist.linux-x86_64/wheel/./torch/lib
2024-10-31T04:21:07.4883778Z   error: could not write to 'build/bdist.linux-x86_64/wheel/./torch/lib/libtorch_cpu.so': No space left on device
2024-10-31T04:21:07.5493798Z   error: subprocess-exited-with-error
2024-10-31T04:21:07.5494775Z   
2024-10-31T04:21:07.5505565Z   × python setup.py bdist_wheel did not run successfully.
2024-10-31T04:21:07.5511696Z   │ exit code: 1
2024-10-31T04:21:07.5513539Z   ╰─> See above for output.

https://github.com/conda-forge/pytorch-cpu-feedstock/actions/runs/11600382138/job/32300680847?pr=277

@jeongseok-meta
Contributor Author

Could we go with a 20+/40+ GB increase, as suggested by @jaimergp? Thank you!

@jaimergp
Collaborator

jaimergp commented Nov 6, 2024

We are currently facing an issue in OpenStack and I can't apply the config change. @aktech will look into it once he's back from PTO.

@aktech
Member

aktech commented Nov 7, 2024

I have now bumped the disk for the gpu_2xlarge and gpu_4xlarge runners to 80 GB and 100 GB, respectively.

@jeongseok-meta
Contributor Author

Awesome! Thank you for updating the runners! Sorry for a n00b question, but where can I check the increased disk size?

@hmaarrfk, in the next build, let's verify whether all pytorch versions (including CUDA 11.8) can build on the new runners without any free-disk-space issues.
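One quick way to check the runner's disk size is to run `df -h` on the machine, for example via a temporary debug step in the workflow. The step below is a hypothetical sketch, not something generated by conda-smithy:

```yaml
# Hypothetical one-off debug step for a workflow job running on the Cirun runner:
# prints total/used/available space on the root filesystem.
- name: Show available disk space
  run: df -h /
```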

@hmaarrfk
Contributor

hmaarrfk commented Nov 7, 2024

I think you now have the power to do it all yourself!
