db ignores file handle limit (during shutdown) #21877
cc @rjl493456442 any ideas?
Regarding the fdlimit, I'm not sure that Geth fails to respect the limit; we do request a very high allowance for it:

```go
// makeDatabaseHandles raises out the number of allowed file handles per process
// for Geth and returns half of the allowance to assign to the database.
func makeDatabaseHandles() int {
	limit, err := fdlimit.Maximum()
	if err != nil {
		Fatalf("Failed to retrieve file descriptor allowance: %v", err)
	}
	raised, err := fdlimit.Raise(uint64(limit))
	if err != nil {
		Fatalf("Failed to raise file descriptor allowance: %v", err)
	}
	return int(raised / 2) // Leave half for networking and other stuff
}
```
As for leveldb, it keeps handles open for every "referenced" sstable. That's why we end up with 3M open files: the database has ~3M ldb files. First, should we set a hard cap on the open-file cache? The fd cache is really useful for reads, because it caches table metadata and avoids reopening the same files over and over. The downside is that during shutdown we have to close all 3M of them, which I'd guess takes a long time.
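For reference, goleveldb already exposes a cap on that fd cache via `opt.Options.OpenFilesCacheCapacity` (this is the knob Geth feeds its handle allowance into). A minimal sketch, with 500 as a purely illustrative value:

```go
package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

func main() {
	// Keep at most ~500 sstable handles open; older cache entries are
	// evicted and their descriptors closed. 500 is illustrative, not
	// what Geth actually computes from the fdlimit.
	db, err := leveldb.OpenFile("/tmp/testdb", &opt.Options{
		OpenFilesCacheCapacity: 500,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```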
https://github.com/syndtr/goleveldb/blob/master/leveldb/db.go#L1175 Obviously it's waiting for all the background threads to finish (e.g. closing the fd cache).
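That blocking behaviour is easy to see by timing the shutdown directly; a minimal sketch, assuming a local test database:

```go
package main

import (
	"log"
	"time"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	db, err := leveldb.OpenFile("/tmp/testdb", nil)
	if err != nil {
		log.Fatal(err)
	}
	start := time.Now()
	// Close blocks until the background compaction goroutines exit and
	// the table-file cache is drained, i.e. every cached fd is closed.
	if err := db.Close(); err != nil {
		log.Printf("close error: %v", err)
	}
	log.Printf("leveldb close took %v", time.Since(start))
}
```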
More info
That message is printed in
Some more info... The file descriptor count grows until the OS won't give out any more. That does not cause leveldb to crash, but the memory consumption grows to ~24GB. The initial boot said:
So we're apparently at 2.4x the memory that we gave it
Monitoring script:

```sh
pid=6861
while :
do
    echo "Open ldb files:"
    lsof -p $pid | grep ldb | wc -l
    echo "MB memory used:"
    ps -o rss= $pid | awk '{printf "%.0f\n", $1 / 1024}'
    sleep 5 # added here so the loop doesn't spin; the original polled as fast as lsof allowed
done
```

Again: it has not imported a single block, just start + stop, and the shutdown sequence has been going on for at least 15 minutes now.
Shutdown did eventually finish. So then the question might be: can we prevent the memory blowup during compaction?
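If tuning is an answer, goleveldb does expose knobs for the write/compaction path. A hedged sketch with illustrative values only (whether these actually bound the blowup observed here is exactly the open question):

```go
package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

func main() {
	// All values below are illustrative, not Geth's defaults.
	db, err := leveldb.OpenFile("/tmp/testdb", &opt.Options{
		OpenFilesCacheCapacity: 500,           // cap the sstable fd cache
		BlockCacheCapacity:     512 * opt.MiB, // cap the block cache
		WriteBuffer:            256 * opt.MiB, // cap the memtable
		CompactionTableSize:    8 * opt.MiB,   // target size of compacted tables
	})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```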
On the archive node:
Compared with #21648 (comment): on our aws node,
archive node,
Closing this; leveldb is on the way out and no longer the default.
This happens while shutting down the node, as the database is being closed:
The db keeps opening ldb files:
To clarify: despite geth's file handles having been limited to ~500K, the process has 1M+ file handles open!
This is during shutdown, so no block processing is happening. There hasn't been any at all, because I told it to shut down before it even started getting blocks to import.
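For anyone who wants to reproduce the measurement from inside the process rather than via lsof, a minimal Linux-only sketch:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// openFDCount returns how many file descriptors the current process
// holds open, by listing /proc/self/fd (Linux only).
func openFDCount() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := openFDCount()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("open fds:", n)
}
```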
A tangent about networking
Eventually, this causes networking to fail, which incidentally explains why ssh sessions get torn down (some other ticket). It also causes a "temporary error" in p2p, which causes a spin-loop:
I've added a change to fix the spin-loop effect (on https://github.com/holiman/go-ethereum/tree/archivefix, commit fd53776).
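For context, the usual guard against this kind of accept spin-loop is to back off on temporary errors (EMFILE, "too many open files", counts as temporary in Go's syscall package, and net/http applies the same pattern). A sketch of the pattern, not the actual commit:

```go
package main

import (
	"net"
	"time"
)

// acceptLoop accepts connections, sleeping briefly on temporary errors
// such as "too many open files" instead of retrying in a tight loop.
func acceptLoop(l net.Listener, handle func(net.Conn)) error {
	for {
		conn, err := l.Accept()
		if err != nil {
			if ne, ok := err.(net.Error); ok && ne.Temporary() {
				time.Sleep(50 * time.Millisecond) // back off instead of spinning
				continue
			}
			return err
		}
		go handle(conn)
	}
}

func main() {
	// Usage sketch: listen on an ephemeral port and close connections.
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	_ = acceptLoop(l, func(c net.Conn) { c.Close() })
}
```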
Back to the root problem
Essentially the problem is: shutdown triggers something (compaction?) which does not respect the given limit on handles, and which eventually causes OOM. The actual number of ldb files is close to 3M.
I eventually killed it. Probably related parts of the stack: