Github Actions randomly fail with "Error: No space left on device" #1860

Closed
1 of 6 tasks
f-squirrel opened this issue Oct 20, 2020 · 10 comments
Assignees
Labels
Area: Image administration investigate Collect additional information, like space on disk, other tool incompatibilities etc. OS: Ubuntu

Comments

@f-squirrel

Description
Github Actions randomly fail with Error: No space left on device

Area for Triage:

Bug:

Virtual environments affected

  • macOS 10.15
  • Ubuntu 16.04 LTS
  • Ubuntu 18.04 LTS
  • Ubuntu 20.04 LTS
  • Windows Server 2016 R2
  • Windows Server 2019

Expected behavior
The jobs have to pass.

Actual behavior
Jobs randomly fail even though disk space usage is normal.
I added sudo df -h after pulling the Docker image but before building the project.
Once the project is built, the build directory takes up another ~2.5GB.

Filesystem      Size  Used Avail Use% Mounted on
udev            3.4G     0  3.4G   0% /dev
tmpfs           696M  680K  695M   1% /run
/dev/sda1        84G   65G   19G  78% /
tmpfs           3.4G  8.0K  3.4G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.4G     0  3.4G   0% /sys/fs/cgroup
/dev/sda15      105M  3.6M  101M   4% /boot/efi
/dev/sdb1        14G  4.1G  9.0G  32% /mnt

As you can see, disk usage is below the 14GB available to the runners.
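
A workflow step can also check the free space explicitly before the build starts, instead of reading the table by eye. The following is a minimal sketch, not part of the original workflow; the function names `avail_kib` and `require_free_gib` are hypothetical.

```shell
# Print the free space, in KiB, of the filesystem holding the given path.
avail_kib() {
  # -P forces one line per filesystem; -k reports 1024-byte blocks.
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Fail when fewer than MIN_GIB gibibytes are free on the given path.
require_free_gib() {
  local path=$1 min_gib=$2
  local avail
  avail=$(avail_kib "$path")
  if [ "$avail" -lt $((min_gib * 1024 * 1024)) ]; then
    echo "Only ${avail} KiB free on ${path}, need ${min_gib} GiB" >&2
    return 1
  fi
  echo "${avail} KiB free on ${path}"
}
```

For example, `require_free_gib / 5` would fail the step early if less than 5 GiB is free on the root filesystem; the threshold is an arbitrary illustration.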

Note:

  1. Usually, after the job is re-triggered the problem does not occur.

Repro steps

  1. The link to the failing workflow: https://github.com/vmware/concord-bft/blob/master/.github/workflows/build_and_test.yml
  2. Unfortunately, the failing job was accidentally restarted so I don't have the logs.
@f-squirrel f-squirrel changed the title from `Github Actions randomly fail with Error: No space left on device` to `Github Actions randomly fail with "Error: No space left on device"` on Oct 20, 2020
@miketimofeev
Contributor

Hi @f-squirrel!
As far as I can see from your snippet, there is 19GB free on /. Is that not enough for your needs? Please note that /mnt is used for swap, so it has less than 14GB available.

@miketimofeev miketimofeev added Area: Image administration investigate Collect additional information, like space on disk, other tool incompatibilities etc. OS: Ubuntu and removed needs triage Area: Deployment/Release labels Oct 20, 2020
@miketimofeev miketimofeev self-assigned this Oct 20, 2020
@f-squirrel
Author

@miketimofeev , 19GB is more than enough for me, because the only additional space I need is 2.4GB for the build artifacts.
I tried to print the disk space usage after the problem occurs, but that did not work because the runner itself fails.
Could you advise how to debug this issue?
Another question: where does the 65GB of usage come from? I do not install anything on the machine except one 1.64GB Docker image.
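
One way to check how much space a step really consumes is to compare the used space before and after it runs. This is a rough sketch under the assumption that nothing else writes to the same filesystem at the same time; `used_kib` and `measure_growth` are hypothetical helper names, not part of the workflow.

```shell
# Print the used space, in KiB, of the filesystem holding the given path.
used_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $3 }'
}

# Run a command and report how much the used space on PATH changed meanwhile.
measure_growth() {
  local path=$1
  shift
  local before after status=0
  before=$(used_kib "$path")
  # Keep the command's exit status so a failing build still fails the step.
  "$@" || status=$?
  after=$(used_kib "$path")
  echo "Disk usage on ${path} changed by $(( (after - before) / 1024 )) MiB" >&2
  return "$status"
}
```

A call such as `measure_growth / make build` would print the change to stderr and return the exit status of the wrapped command.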

@miketimofeev
Contributor

@f-squirrel to answer your last question first: the 65GB is used by Ubuntu itself plus a large list of preinstalled software, which can be found here https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu1804-README.md
Could you please share your workflow and failed run so we can check what else could consume all the remaining disk space?

@miketimofeev
Contributor

miketimofeev commented Oct 20, 2020

@f-squirrel actually, the build consumes more than 2.5GB. I forked the repo and changed the step to output the disk space every 15 seconds:

        - name: Build and test
          run: |
              (while true; do 
              df -h
              sleep 15
              done) &
              script -q -e -c "make pull"
              sudo df -h
              script -q -e -c "make build \
                              ${{ matrix.compiler}} \
                              CONCORD_BFT_CMAKE_FLAGS=\"\
                              ${{ matrix.ci_build_type }} \
                              -DBUILD_TESTING=ON \
                              -DBUILD_COMM_TCP_PLAIN=FALSE \
                              -DBUILD_COMM_TCP_TLS=FALSE \
                              -DCMAKE_CXX_FLAGS_RELEASE=-O3 -g \
                              -DUSE_LOG4CPP=TRUE \
                              -DBUILD_ROCKSDB_STORAGE=TRUE \
                              ${{ matrix.use_s3_obj_store }} \
                              -DUSE_OPENTRACING=ON \
                              -DOMIT_TEST_OUTPUT=OFF\
                              -DKEEP_APOLLO_LOGS=TRUE\" "\
              && script -q -e -c "make test"

At the end of the build, I had 12GB free on /.

Filesystem      Size  Used Avail Use% Mounted on
udev            3.4G     0  3.4G   0% /dev
tmpfs           696M  756K  695M   1% /run
/dev/sda1        84G   73G   12G  87% /
tmpfs           3.4G  8.0K  3.4G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.4G     0  3.4G   0% /sys/fs/cgroup
/dev/sda15      105M  3.6M  101M   4% /boot/efi
/dev/sdb1        14G  4.1G  9.0G  32% /mnt
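
The polling loop in the step above keeps running in the background after the build command returns. A variant that stops the monitor together with the wrapped command could look like the sketch below; `with_disk_monitor` is a hypothetical helper, not something the runner provides.

```shell
# Run a command while printing `df -h` for PATH every INTERVAL seconds.
with_disk_monitor() {
  local interval=$1 path=$2
  shift 2
  (
    while true; do
      df -h "$path"
      sleep "$interval"
    done
  ) &
  local monitor_pid=$!
  local status=0
  "$@" || status=$?
  # Stop the background loop; an in-flight sleep may linger up to INTERVAL seconds.
  kill "$monitor_pid" 2>/dev/null || true
  wait "$monitor_pid" 2>/dev/null || true
  return "$status"
}
```

For instance, `with_disk_monitor 15 / make build` would log the usage of / every 15 seconds while the build runs and return the build's exit status.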

I wonder if some core dumps were created during your failed run because I saw this step in your yaml:

        - name: Configure core dump location
          run: |
            echo '/cores/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
            mkdir -p ${{ github.workspace }}/artifact/cores/
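
With the pattern `/cores/core.%e.%p`, each dump is written as `core.<executable>.<pid>`, so the space taken by dumps can be summed by file name. A small sketch of such a check follows, assuming the dumps land in a directory the step can read; `core_dump_kib` is a hypothetical helper name.

```shell
# Print the total disk usage, in KiB, of core files (core.*) under a directory.
core_dump_kib() {
  local dir=$1
  # Report 0 when the directory does not exist.
  [ -d "$dir" ] || { echo 0; return 0; }
  # du -k prints usage in KiB per file; awk sums the first column.
  find "$dir" -type f -name 'core.*' -exec du -k {} + |
    awk '{ total += $1 } END { print total + 0 }'
}
```

Running `core_dump_kib /cores` at the end of a job would show whether core dumps account for the missing space.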

@f-squirrel
Author

@miketimofeev , good point!
I'll update my workflow with some additional prints, run a few tests, and get back with the results!
Thank you!

@miketimofeev
Contributor

@f-squirrel did it help?

@f-squirrel
Author

@miketimofeev , I haven't seen the issue again.
I'll continue monitoring!

@miketimofeev
Contributor

@f-squirrel I'm going to close the issue for now, but please feel free to contact us if the problem still occurs.
Thank you!
