Disk cache garbage collection fails due to thread creation error #24098

Closed
rsalvador opened this issue Oct 27, 2024 · 7 comments
Labels
team-Core Skyframe, bazel query, BEP, options parsing, bazelrc type: bug untriaged

Comments

@rsalvador
Contributor

Description of the bug:

After enabling the new disk cache garbage collection in 7.4.0 (#23833), we noticed it was deleting only a few files instead of trimming the cache down to the specified maximum size. The java.log shows the DiskCacheGarbageCollectorIdleTask starting but never finishing, e.g.:

$ grep -i "Disk cache garbage collection" java.log*
java.log:241025 15:47:02.888:I 478 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
java.log:241025 15:57:47.635:I 18026 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started

The task fails due to this error:

java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.base/java.lang.Thread.start0(Native Method)
    at java.base/java.lang.Thread.start(Thread.java:1553)
    at java.base/java.lang.System$2.start(System.java:2577)
    at java.base/jdk.internal.vm.SharedThreadContainer.start(SharedThreadContainer.java:152)
    at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:953)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1375)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor.executeWrappedRunnable(AbstractQueueVisitor.java:322)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor.executeWithExecutorService(AbstractQueueVisitor.java:309)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor.execute(AbstractQueueVisitor.java:296)
    at com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollector$EntryDeleter.delete(DiskCacheGarbageCollector.java:271)
    at com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollector.runUnderLock(DiskCacheGarbageCollector.java:190)
    at com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollector.run(DiskCacheGarbageCollector.java:178)
    at com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run(DiskCacheGarbageCollectorIdleTask.java:94)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)

The newCachedThreadPool in

creates over 10,000 threads, leading to the above error. Monitoring the created and active threads shows:

...
Thread created: disk-cache-gc-thread-16299, Active Threads: 16347
Thread created: disk-cache-gc-thread-16300, Active Threads: 16348
...
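
To illustrate the behaviour described above, here is a minimal standalone sketch (not Bazel code, and not the actual fix, which this thread does not show). It assumes every deletion task blocks for a while, as slow file deletions might, and compares how many threads a cached pool and a fixed-size pool create for the same number of tasks. The class name, the `sketch-gc-thread-` prefix, the pool size of 4, and the task count of 200 are all made up for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: counts the threads a pool creates when every task blocks. */
final class BoundedPoolSketch {

  /** Thread factory that counts how many threads the pool requests. */
  static final class CountingThreadFactory implements ThreadFactory {
    final AtomicInteger created = new AtomicInteger();

    @Override
    public Thread newThread(Runnable r) {
      Thread t = new Thread(r, "sketch-gc-thread-" + created.incrementAndGet());
      t.setDaemon(true);
      return t;
    }
  }

  /** Submits {@code tasks} blocking tasks and returns the number of threads created. */
  static int countThreads(Function<ThreadFactory, ExecutorService> poolMaker, int tasks) {
    CountingThreadFactory factory = new CountingThreadFactory();
    ExecutorService pool = poolMaker.apply(factory);
    // All tasks wait on this latch, standing in for slow file deletions.
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < tasks; i++) {
        pool.execute(() -> {
          try {
            release.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      release.countDown(); // let every task finish
      pool.shutdown();
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("pool did not terminate");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return factory.created.get();
  }

  public static void main(String[] args) {
    int tasks = 200;
    // A cached pool has no idle worker to reuse, so it starts one thread per task.
    int cached = countThreads(Executors::newCachedThreadPool, tasks);
    // A fixed pool starts at most 4 threads and queues the remaining tasks.
    int fixed = countThreads(f -> Executors.newFixedThreadPool(4, f), tasks);
    System.out.println("cached pool threads: " + cached); // prints 200
    System.out.println("fixed pool threads: " + fixed); // prints 4
  }
}
```

With hundreds of thousands of cache entries instead of 200 tasks, the cached pool's one-thread-per-pending-task behaviour is consistent with the thread counts above 16,000 reported here, whereas a bounded pool keeps the thread count constant and lets the work queue grow instead.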

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

It needs a big disk cache. It reproduces on a MacBook Pro M1 with a 20G disk cache and --experimental_disk_cache_gc_max_size=10G.

Which operating system are you running Bazel on?

macOS 14.7

What is the output of bazel info release?

release 7.4.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@fmeum
Collaborator

fmeum commented Oct 27, 2024

@bazel-io fork 7.4.1

@tjgq
Contributor

tjgq commented Oct 27, 2024

@bazel-io fork 8.0.0

bazel-io pushed a commit to bazel-io/bazel that referenced this issue Oct 28, 2024
Fixes bazelbuild#24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes bazelbuild#24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24
iancha1992 pushed a commit that referenced this issue Oct 28, 2024
…ction (#24114)

Fixes #24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes #24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24

Commit
3746583

Co-authored-by: Roman Salvador <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Oct 28, 2024
…ction (#24113)

Fixes #24098

With this change the disk cache garbage collection works correctly:
```
241027 09:06:34.732:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241027 09:07:06.123:I 681 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 190243 of 446229 files, reclaimed 5.4 GiB of 15.4 GiB
```

Closes #24099.

PiperOrigin-RevId: 690652512
Change-Id: Ie8d1fa6b2afb0bd5bd85fdb6835871023a64ad24

Commit
3746583

Co-authored-by: Roman Salvador <[email protected]>
@iancha1992
Member

A fix for this issue has been included in Bazel 8.0.0 RC2. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=8.0.0rc2. Thanks!

@rsalvador
Contributor Author

> A fix for this issue has been included in Bazel 8.0.0 RC2. Please test out the release candidate and report any issues as soon as possible.

Our code base doesn't work with the 8.0.0 RC. Could we test it with a 7.4.1 RC?

@tjgq
Contributor

tjgq commented Oct 29, 2024

We don't have an RC for 7.4.1 yet, but the change has been cherry-picked into the release-7.4.1 branch, so you could either manually build it from there, or use Bazelisk with USE_BAZEL_VERSION=commit_id.

@rsalvador
Contributor Author

7.4.1rc1 works ok:

241029 21:10:07.116:I 487 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Disk cache garbage collection started
241029 21:11:42.703:I 487 [com.google.devtools.build.lib.remote.disk.DiskCacheGarbageCollectorIdleTask.run] Deleted 588266 of 778688 files, reclaimed 32.5 GiB of 42.5 GiB in 35.57 seconds (16539 files/s, 937 MB/s)

@iancha1992
Member

A fix for this issue has been included in Bazel 7.4.1 RC1. Please test out the release candidate and report any issues as soon as possible.
If you're using Bazelisk, you can point to the latest RC by setting USE_BAZEL_VERSION=7.4.1rc1. Thanks!
