cmd/go: trim build cache incrementally #38939
Interesting. Those backtraces are normal and just show a thread that has no goroutines to run. The next time this happens can you get backtraces from the other threads?
Also, …
I haven’t been able to reproduce since. I’ll keep trying.
This happened again. Some new information:
The last line is from zsh, via REPORTTIME, with TIMEFMT set. Observe that the test took ~2s to run, but according to the shell the process was alive for well over 2 minutes.
Something is weird about thread #9's backtrace.
Speculation: Could this just be cmd/go deleting a very large number of expired cache entries? If so, maybe we need to bound the number of entries deleted during one auto-initiated run, and have cleanup happen incrementally.
Based on the stack and other comments, this seems likely. If I remember correctly, the build cache expiration policy is purely time-based; there are some comments in src/cmd/go/internal/cache/cache.go. I don't believe there are any limits on the number or total size of files in the cache. It may not make sense to limit the size of the cache, but we might want to trim files more aggressively when it's very large (#29561). Trimming incrementally also seems like a good solution. We could limit the time spent trimming for any command to, say, 500 ms. We could save where we left off so the next command could restart from that point.
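To make that idea concrete, here is a minimal sketch of time-budgeted, resumable trimming. The 500 ms budget matches the number floated above, but the package name, the resume-marker file, and the helper names are assumptions for illustration, not the actual cmd/go implementation:

```go
// Hypothetical sketch: trim expired cache entries under a time budget,
// recording where we stopped so the next command can resume from that point.
package cachetrim

import (
	"os"
	"path/filepath"
	"time"
)

const (
	trimLimit  = 5 * 24 * time.Hour     // entries unused for this long are deleted
	trimBudget = 500 * time.Millisecond // stop trimming after roughly this much work
)

// trimSome deletes expired entries in dir, starting from the entry named in a
// resume marker, and stops once the time budget is exhausted so a single
// command never pays the full cost of a huge backlog.
func trimSome(dir string) error {
	resumeFile := filepath.Join(dir, "trim-resume.txt") // hypothetical marker file
	last, _ := os.ReadFile(resumeFile)

	// os.ReadDir returns entries sorted by name, so resuming by name works.
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}

	start := time.Now()
	for _, e := range entries {
		if e.Name() < string(last) {
			continue // already handled on a previous run
		}
		if time.Since(start) > trimBudget {
			// Out of budget: remember where we stopped so the next
			// command can pick up from here.
			return os.WriteFile(resumeFile, []byte(e.Name()), 0o666)
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		if time.Since(info.ModTime()) > trimLimit {
			os.Remove(filepath.Join(dir, e.Name()))
		}
	}
	// Completed a full pass; clear the resume marker.
	os.Remove(resumeFile)
	return nil
}
```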
I don't think we should limit the trimming time to any fixed duration, because that could lead to a state where the trimming can't keep pace with the rate of creation. That said, I would expect deleting a file to almost always be faster than creating it. Perhaps we could limit the cleaning time to some multiple of the rest of the running time of the go command.
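A proportional budget like that could be as small as the following sketch; the floor and the 1/4 factor are made-up values, not anything from cmd/go:

```go
package cachetrim

import "time"

// trimBudgetFor scales the trimming budget with the command's own running
// time, with a small floor so that short commands still make some progress.
func trimBudgetFor(commandElapsed time.Duration) time.Duration {
	const floor = 100 * time.Millisecond
	budget := commandElapsed / 4 // spend at most ~25% extra time on cleanup
	if budget < floor {
		budget = floor
	}
	return budget
}
```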
Another option might be to kick off cleaning the cache at the beginning of the cmd/go run, and let it run concurrently with the rest of the command, except when running benchmarks.
I like that! But I worry it could be a bit racy — is it possible for part of the command to touch part of the cache and make it non-stale again?
It’s not that terrible to have to recalculate something that you just deleted—that would have happened anyway with slightly different timing of when you ran the commands. I think the only problem is if you see something is in the cache, decide to use it, and then it gets cleaned up before you can use it. But if you can freshen it in the cache, and if the deletion goroutine always rechecks freshness before deletion, and if there are enough synchronization points, it should be ok. But that is a lot of ifs. :)
Yes, that's exactly the race I'm worried about. 🙂 I guess we could decide which portions of the cache to use, then freshen them, then kick off the cleaner while we do the remaining work. (But we intentionally don't rely on file-locking in the cache, and even if we did there is no atomic lock-check-and-delete operation, so there is in general no way to prevent the cleaner from racing with the consumer. So I don't think we can avoid the need to at least finish freshening the entries before we start the cleaner.)
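Putting those pieces together, the ordering being described might look roughly like this sketch, where `freshen`, `trimExpired`, and the `entry` type are hypothetical stand-ins rather than real cmd/go API:

```go
package cachetrim

import (
	"os"
	"sync"
	"time"
)

type entry struct{ path string }

// freshen bumps an entry's mtime so a time-based cleaner will treat it as
// recently used and leave it alone.
func freshen(e entry) {
	now := time.Now()
	os.Chtimes(e.path, now, now)
}

// runWithCleanup freshens every entry the command has decided to use, and
// only then starts the cleaner concurrently with the remaining work, so the
// cleaner never races with entries we intend to read.
func runWithCleanup(needed []entry, cacheDir string, work func()) {
	for _, e := range needed {
		freshen(e)
	}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		trimExpired(cacheDir) // assumed to delete only entries with old mtimes
	}()

	work()
	wg.Wait()
}

func trimExpired(dir string) {
	// Placeholder: see the time-budgeted trimming sketch above.
	_ = dir
}
```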
Since this is all in one process, could we do our synchronization internally, in a thin intermediate layer?
Internally: yes. Thin: that's subjective. 😉
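For what it's worth, such an in-process layer could be little more than a pin count that readers hold while using an entry and that the cleaner consults under the same lock before deleting. A sketch with hypothetical names (readers would still need to pin an entry before deciding to rely on it):

```go
package cachetrim

import "sync"

// guard is a thin in-process coordination layer between cache readers and
// the cleaner: readers pin entries while using them, and the cleaner skips
// anything that is currently pinned.
type guard struct {
	mu    sync.Mutex
	inUse map[string]int // entry ID -> number of active readers
}

func newGuard() *guard {
	return &guard{inUse: make(map[string]int)}
}

// Acquire pins an entry; the returned func releases the pin.
func (g *guard) Acquire(id string) (release func()) {
	g.mu.Lock()
	g.inUse[id]++
	g.mu.Unlock()
	return func() {
		g.mu.Lock()
		g.inUse[id]--
		if g.inUse[id] == 0 {
			delete(g.inUse, id)
		}
		g.mu.Unlock()
	}
}

// DeleteIfUnused runs del only if no reader has the entry pinned, holding the
// lock across both the check and the deletion so the two cannot race. This is
// the in-process "atomic lock-check-and-delete" the filesystem cannot give us.
func (g *guard) DeleteIfUnused(id string, del func() error) (bool, error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inUse[id] > 0 {
		return false, nil
	}
	if err := del(); err != nil {
		return false, err
	}
	return true, nil
}
```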
Fair enough. Happy to leave all these choices to you. :) I'd be happy enough with a simple 500ms time limit for automatically initiated cache cleanup.
@josharian, I noticed a similar issue tracking down #39183. I started with a test that ran lots of goroutines to try to provoke failures, and found that the test with N=10000 ran very quickly (~1s) but the go command as a whole took much longer. I wonder if there is a bottleneck in producing the cache key itself, rather than merely cleaning the stale entries.
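The build cache keys entries by a cryptographic hash over an action's inputs, so slow hashing of large inputs would stall the command before any trimming even starts. A simplified illustration of that kind of keying (not the real cmd/go key computation):

```go
package cachetrim

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// actionKey computes a content hash over a tool identifier and the input
// files of an action. If the inputs are large or the disk is slow, this step
// alone can dominate the command's running time.
func actionKey(toolID string, inputFiles []string) ([sha256.Size]byte, error) {
	var key [sha256.Size]byte
	h := sha256.New()
	fmt.Fprintf(h, "tool %s\n", toolID)
	for _, name := range inputFiles {
		f, err := os.Open(name)
		if err != nil {
			return key, err
		}
		_, err = io.Copy(h, f)
		f.Close()
		if err != nil {
			return key, err
		}
	}
	copy(key[:], h.Sum(nil))
	return key, nil
}
```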
@josharian, are you on a Mac?
Yep, I'm on a Mac.
This happened to me again, and I found that even rm -rf $(go env GOCACHE) was taking a very long time. By very long I mean it took 20 minutes to delete ~1,250,000 files. On my Mac, fseventsd was writing down all the deleted file names in its own little databases, and then santa (https://github.com/google/santa) was writing down all the file accesses by fseventsd. Without those I think rm would be faster. I tried telling Spotlight preferences that ~/Library/Caches is private and should not be indexed, but that seems to have no effect on fseventsd.
One small comment on this approach. As I understand it, cache trimming currently happens at the end of the command? That means a cache older than 5 days can still be used once before it is trimmed? This actually helps a lot with Continuous Integration systems, which may cache things for more than 5 days (CircleCI, for example, keeps caches for 15). If the cache were invalidated up front, purely based on age, CI would not be able to use it and builds might slow down unnecessarily. I'm not saying design decisions should be based on this, but it's something I came across while optimising CI build/test times, and it may be worth being aware of.
This sometimes (quite often, in fact) happens to me as well. It seems to have something to do with the Go build cache. See:
|
@opennota perhaps Russ's generational cache approach would help here, by leading to a more incremental deletion pattern: #29561 (comment). If you try it, want to report back whether it also helps here? |
@josharian Why would … (I believe that Go users who have their gocache directory on an SSD do not notice any delay, but my gocache resides on a slow spinning disk.)
I'll try the patch out and let you know if it helps.
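One way to picture a generational scheme like the one mentioned above (not necessarily the exact design discussed in #29561): writes go to the current generation directory, reads fall back to the previous one, and expiry is just dropping an old generation rather than examining the age of every file. A very rough sketch, with directory naming and rotation period as assumptions:

```go
package cachetrim

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"time"
)

// generationPeriod is an assumed rotation interval for illustration only.
const generationPeriod = 24 * time.Hour

// generationDir returns the directory for the generation containing t.
// Writes go to the current generation; reads would check the current
// generation first and then the previous one.
func generationDir(root string, t time.Time) string {
	gen := t.Unix() / int64(generationPeriod/time.Second)
	return filepath.Join(root, "gen-"+strconv.FormatInt(gen, 10))
}

// dropOldGenerations removes entire out-of-date generation directories, so
// expiry never has to check the age of individual cache entries.
func dropOldGenerations(root string, keep int64) error {
	cur := time.Now().Unix() / int64(generationPeriod/time.Second)
	entries, err := os.ReadDir(root)
	if err != nil {
		return err
	}
	for _, e := range entries {
		var gen int64
		if _, err := fmt.Sscanf(e.Name(), "gen-%d", &gen); err != nil {
			continue
		}
		if gen < cur-keep {
			os.RemoveAll(filepath.Join(root, e.Name()))
		}
	}
	return nil
}
```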
Can't reproduce reliably, but...

After running `go test`, I am sometimes seeing the test complete successfully (prints "ok ... pkg ... time"), and then it appears to hang for an indefinite period of time. Sending SIGINT does not terminate the process. Sending SIGTERM does. This happens with std lib tests (strings, crypto/x509). Each time I passed the `-v` flag, although I'm not sure that that is necessary. Using commit 5c13cab. This behavior is new, although this is the first time I've run tests in a few days or a week. And it doesn't reproduce 100%, so it's a bit muddled.

I attached lldb to two hung binaries and got similar backtraces from each.
First backtrace
```
frame #0: libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007fff71ea6185 libsystem_pthread.dylib`_pthread_cond_wait + 701
frame #2: 0x000000000106ceb0 go`runtime.pthread_cond_wait_trampoline + 16
frame #3: 0x000000000106ad8d go`runtime.asmcgocall + 173
frame #4: 0x0000000001054c0c go`runtime.libcCall + 108
frame #5: 0x00000000010557d9 go`runtime.pthread_cond_wait + 57
frame #6: 0x0000000001032dcd go`runtime.semasleep + 141
frame #7: 0x000000000100dca7 go`runtime.notesleep + 231
frame #8: 0x000000000103c9a5 go`runtime.stopm + 197
frame #9: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #10: 0x000000000103edd7 go`runtime.schedule + 727
frame #11: 0x000000000103f37d go`runtime.park_m + 157
frame #12: 0x000000000106933b go`runtime.mcall + 91
frame #13: 0x00000000010557d9 go`runtime.pthread_cond_wait + 57
frame #14: 0x0000000001032dcd go`runtime.semasleep + 141
frame #15: 0x000000000100dca7 go`runtime.notesleep + 231
frame #16: 0x000000000103c9a5 go`runtime.stopm + 197
frame #17: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #18: 0x000000000103edd7 go`runtime.schedule + 727
frame #19: 0x000000000103f37d go`runtime.park_m + 157
frame #20: 0x000000000106933b go`runtime.mcall + 91
frame #21: 0x0000000001032dcd go`runtime.semasleep + 141
frame #22: 0x000000000100dca7 go`runtime.notesleep + 231
frame #23: 0x000000000103c9a5 go`runtime.stopm + 197
frame #24: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #25: 0x000000000103edd7 go`runtime.schedule + 727
frame #26: 0x000000000103f37d go`runtime.park_m + 157
frame #27: 0x000000000106933b go`runtime.mcall + 91
frame #28: 0x000000000100dca7 go`runtime.notesleep + 231
frame #29: 0x000000000103c9a5 go`runtime.stopm + 197
frame #30: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #31: 0x000000000103edd7 go`runtime.schedule + 727
frame #32: 0x000000000103f37d go`runtime.park_m + 157
frame #33: 0x000000000106933b go`runtime.mcall + 91
frame #34: 0x000000000103c9a5 go`runtime.stopm + 197
frame #35: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #36: 0x000000000103edd7 go`runtime.schedule + 727
frame #37: 0x000000000103f37d go`runtime.park_m + 157
frame #38: 0x000000000106933b go`runtime.mcall + 91
frame #39: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #40: 0x000000000103edd7 go`runtime.schedule + 727
frame #41: 0x000000000103f37d go`runtime.park_m + 157
frame #42: 0x000000000106933b go`runtime.mcall + 91
frame #43: 0x000000000103edd7 go`runtime.schedule + 727
frame #44: 0x000000000103f37d go`runtime.park_m + 157
frame #45: 0x000000000106933b go`runtime.mcall + 91
frame #46: 0x000000000103f37d go`runtime.park_m + 157
frame #47: 0x000000000106933b go`runtime.mcall + 91
frame #48: 0x000000000106933b go`runtime.mcall + 91
frame #49: 0x000000000101edff go`runtime.gcBgMarkWorker + 255
frame #50: 0x000000000106b121 go`runtime.goexit + 1
frame #51: 0x000000000106b121 go`runtime.goexit + 1
(lldb) ^D
```
Second backtrace
```
frame #0: libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007fff71ea6185 libsystem_pthread.dylib`_pthread_cond_wait + 701
frame #2: 0x000000000106ceb0 go`runtime.pthread_cond_wait_trampoline + 16
frame #3: 0x000000000106ad8d go`runtime.asmcgocall + 173
frame #4: 0x0000000001054c0c go`runtime.libcCall + 108
frame #5: 0x00000000010557d9 go`runtime.pthread_cond_wait + 57
frame #6: 0x0000000001032dcd go`runtime.semasleep + 141
frame #7: 0x000000000100dca7 go`runtime.notesleep + 231
frame #8: 0x000000000103c9a5 go`runtime.stopm + 197
frame #9: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #10: 0x000000000103edd7 go`runtime.schedule + 727
frame #11: 0x000000000103f37d go`runtime.park_m + 157
frame #12: 0x000000000106933b go`runtime.mcall + 91
frame #13: 0x00000000010557d9 go`runtime.pthread_cond_wait + 57
frame #14: 0x0000000001032dcd go`runtime.semasleep + 141
frame #15: 0x000000000100dca7 go`runtime.notesleep + 231
frame #16: 0x000000000103c9a5 go`runtime.stopm + 197
frame #17: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #18: 0x000000000103edd7 go`runtime.schedule + 727
frame #19: 0x000000000103f37d go`runtime.park_m + 157
frame #20: 0x000000000106933b go`runtime.mcall + 91
frame #21: 0x0000000001032dcd go`runtime.semasleep + 141
frame #22: 0x000000000100dca7 go`runtime.notesleep + 231
frame #23: 0x000000000103c9a5 go`runtime.stopm + 197
frame #24: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #25: 0x000000000103edd7 go`runtime.schedule + 727
frame #26: 0x000000000103f37d go`runtime.park_m + 157
frame #27: 0x000000000106933b go`runtime.mcall + 91
frame #28: 0x000000000100dca7 go`runtime.notesleep + 231
frame #29: 0x000000000103c9a5 go`runtime.stopm + 197
frame #30: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #31: 0x000000000103edd7 go`runtime.schedule + 727
frame #32: 0x000000000103f37d go`runtime.park_m + 157
frame #33: 0x000000000106933b go`runtime.mcall + 91
frame #34: 0x000000000103c9a5 go`runtime.stopm + 197
frame #35: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #36: 0x000000000103edd7 go`runtime.schedule + 727
frame #37: 0x000000000103f37d go`runtime.park_m + 157
frame #38: 0x000000000106933b go`runtime.mcall + 91
frame #39: 0x000000000103e15f go`runtime.findrunnable + 2687
frame #40: 0x000000000103edd7 go`runtime.schedule + 727
frame #41: 0x000000000103f37d go`runtime.park_m + 157
frame #42: 0x000000000106933b go`runtime.mcall + 91
frame #43: 0x000000000103edd7 go`runtime.schedule + 727
frame #44: 0x000000000103f37d go`runtime.park_m + 157
frame #45: 0x000000000106933b go`runtime.mcall + 91
frame #46: 0x000000000103f37d go`runtime.park_m + 157
frame #47: 0x000000000106933b go`runtime.mcall + 91
frame #48: 0x000000000106933b go`runtime.mcall + 91
frame #49: 0x000000000101edff go`runtime.gcBgMarkWorker + 255
frame #50: 0x000000000106b121 go`runtime.goexit + 1
frame #51: 0x000000000106b121 go`runtime.goexit + 1
(lldb) ^D
```