sql: bulk index backfill getting stuck #34212
This problem is occurring because the test uses a small chunk size (100), thereby triggering a large number of SSTs (40) all written at once. We know that RocksDB locks up when this happens. Increasing the chunk size fixes the problem. Leaving this issue open as the main issue tracking throttling of SST writes. |
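Back-of-the-envelope on those numbers (a sketch only; the row count is inferred from the 40-SST figure above, not taken from the test):

package main

import "fmt"

func main() {
	// Illustrative arithmetic only: the backfill emits roughly one SST per
	// chunk, so a small chunk size multiplies the number of SSTs that hit
	// RocksDB at once. The row count is back-derived from the numbers above.
	const rowsToBackfill = 4000 // implied by 40 SSTs x 100 rows per chunk
	const chunkSize = 100       // the test's setting
	fmt.Println(rowsToBackfill/chunkSize, "SSTs written at nearly the same time")
}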
I used […] so that […]. I tried to […] (I'm on latest […]). Anyway, I tried the next best thing: […] Dump attached. It's funny, the dump shows the […] |
Oh, and I also had to put […] |
Maybe we can blame this on Rocks somehow? From the aborted dump: […]
Mind you, this is far from the only goroutine stuck (for 7min), but I think most of the others are waiting for a stopper. I'm now also recalling that on my gceworker I was hitting the 10s fatal error on in-memory stores -- which doesn't sound right -- so it's fairly likely that Rocks is doing something bad at this point. I'll get a core dump. |
BTW, we just need one process. It's really happening all the time. |
$ dlv attach 31579 /tmp/go-build848480181/b001/sql.test
Boo. go-delve/delve#1409. |
Let's try […] and […]. Hooray! Now let's see. |
Hmm, 1833 is probably not the thread, that's the PID... yeah, it's giving me that thread for every goroutine. Well. Maybe better if I take a dump and terminate this thing with […]. Fast forward (I found the thread by picking one that wasn't in […]): […]
There are ~10 more Rocks threads, but they all look like threadpools and background tasks waiting to be pinged. PS I have the core dump (1.3g) and sql.test file around if anyone wants to take a look. |
This does look worrying. The top goroutine is the ingestion, and it's blocked on flushing the memtable because that flush would stall writes due to an overly large L0. That would resolve if compactions took care of L0, but the compaction thread (or one of them, at least) is stuck waiting for IngestExternalFile 💀 🔒 |
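A minimal Go sketch of that wait cycle, purely as a model of the dependency described above (the channel names are made up; this is not RocksDB code):

package main

// Model of the cycle: ingestion needs a memtable flush, the flush is stalled
// until L0 shrinks, L0 only shrinks if compaction runs, and the compaction
// thread is waiting for the ingestion to finish.
func main() {
	ingestDone := make(chan struct{})  // closed when IngestExternalFile returns
	l0Compacted := make(chan struct{}) // closed when compaction has shrunk L0

	// "Ingestion": its memtable flush is stalled behind the overly large L0,
	// so it cannot finish until the compactor makes progress.
	go func() {
		<-l0Compacted
		close(ingestDone) // never reached
	}()

	// "Compaction": waiting for the in-flight ingestion before touching L0.
	go func() {
		<-ingestDone
		close(l0Compacted) // never reached
	}()

	// Running this aborts with "all goroutines are asleep - deadlock!".
	<-ingestDone
}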
cc @petermattis. This deadlock looks.. very obvious. Am I missing something or are we just looking at an obvious bug? |
I haven't looked closely at this and obviously wouldn't say a lurking deadlock is OK, but for unrelated reasons (i.e. foreground traffic p99) we should probably try the no-flush ingest -> non-blocking memtable flush fallback trick anyway, and that might also mean we end up just avoiding this situation? I was going to take a stab at that this week anyway — happy to try it today to see if it is indeed a workaround. |
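Roughly the shape of that fallback, as a hedged sketch: ingestNoFlush, triggerAsyncFlush, and errFlushNeeded are hypothetical stand-ins, not the real engine API. The point is that the flush runs outside the ingestion path, so compactions and foreground writes aren't held up while it happens.

package main

import (
	"errors"
	"fmt"
)

// errFlushNeeded stands in for "the SST overlaps the memtable, so ingestion
// would have to flush first".
var errFlushNeeded = errors.New("memtable flush needed")

var memtableOverlapsSST = true // pretend the memtable holds overlapping keys

// ingestNoFlush is a hypothetical engine call: ingest the SST but refuse to
// block on a memtable flush.
func ingestNoFlush(path string) error {
	if memtableOverlapsSST {
		return errFlushNeeded
	}
	return nil
}

// triggerAsyncFlush is a hypothetical non-blocking memtable flush; it runs
// outside the ingestion path, so background compactions keep making progress.
func triggerAsyncFlush() <-chan struct{} {
	done := make(chan struct{})
	go func() {
		memtableOverlapsSST = false // the flush empties the memtable
		close(done)
	}()
	return done
}

// addSSTable tries the cheap no-flush ingest first and only falls back to a
// background flush when the ingestion actually needs one.
func addSSTable(path string) error {
	err := ingestNoFlush(path)
	if err == nil || !errors.Is(err, errFlushNeeded) {
		return err
	}
	flushed := triggerAsyncFlush()
	<-flushed // a real implementation could retry with backoff instead
	return ingestNoFlush(path)
}

func main() {
	fmt.Println(addSSTable("/tmp/example.sst")) // <nil>
}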
@awoods187 observed a seemingly similar situation doing a restore of TPCC data onto a 30-node cluster. The node was most likely running […]. I've attached a goroutine trace. Unfortunately I failed to grab cores because I didn't point gcore at a larger storage device, and when gcore fails to write the core dump it tears down the process. |
@vivekmenezes I took a stab at the background flush attempt in #34800 if you want to try again with that |
Another wrinkle here is that it doesn't seem like Andy had enabled […] |
FWIW, that setting wouldn't affect a RESTORE since it only controls the index backfill. However, RESTORE does call AddSSTable; it just typically calls it with SSTs that overlap nothing, so I wouldn't expect it to be forced to flush most of the time. |
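In other words, whether an ingestion is forced to flush boils down to a key-range overlap check against what's sitting in the memtable. A tiny sketch with made-up types:

package main

import "fmt"

// span is a half-open [start, end) key range.
type span struct{ start, end string }

// needsFlush mirrors the check being discussed: an ingested SST only forces
// a memtable flush when its key range overlaps keys buffered in the memtable.
func needsFlush(sst, memtable span) bool {
	return sst.start < memtable.end && memtable.start < sst.end
}

func main() {
	memtable := span{"c", "f"} // keys recently written and not yet flushed
	// RESTORE-style SST into a disjoint key space: no flush forced.
	fmt.Println(needsFlush(span{"x", "z"}, memtable)) // false
	// An SST overlapping keys already in the memtable: flush forced.
	fmt.Println(needsFlush(span{"a", "d"}, memtable)) // true
}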
@ajwerner my expectation is that they're the same still, the setting just makes it much more likely for sstable ingestion to block on a compaction. |
Definitely looks like a deadlock in RocksDB. Also looks like this has already been fixed upstream: facebook/rocksdb@9be3e6b

Our version of the code looks like:

// Figure out if we need to flush the memtable first
if (status.ok()) {
  bool need_flush = false;
  status = ingestion_job.NeedsFlush(&need_flush, cfd->GetSuperVersion());
  TEST_SYNC_POINT_CALLBACK("DBImpl::IngestExternalFile:NeedFlush",
                           &need_flush);
  if (status.ok() && need_flush) {
    mutex_.Unlock();
    status = FlushMemTable(cfd, FlushOptions(),
                           FlushReason::kExternalFileIngestion,
                           true /* writes_stopped */);
    mutex_.Lock();
  }
}

Upstream master looks like:

// Figure out if we need to flush the memtable first
if (status.ok()) {
  bool need_flush = false;
  status = ingestion_job.NeedsFlush(&need_flush, cfd->GetSuperVersion());
  TEST_SYNC_POINT_CALLBACK("DBImpl::IngestExternalFile:NeedFlush",
                           &need_flush);
  if (status.ok() && need_flush) {
    FlushOptions flush_opts;
    flush_opts.allow_write_stall = true;
    if (immutable_db_options_.atomic_flush) {
      autovector<ColumnFamilyData*> cfds;
      SelectColumnFamiliesForAtomicFlush(&cfds);
      mutex_.Unlock();
      status = AtomicFlushMemTables(cfds, flush_opts,
                                    FlushReason::kExternalFileIngestion,
                                    true /* writes_stopped */);
    } else {
      mutex_.Unlock();
      status = FlushMemTable(cfd, flush_opts,
                             FlushReason::kExternalFileIngestion,
                             true /* writes_stopped */);
    }
    mutex_.Lock();
  }
}

The atomic flush stuff can be ignored. The important part is allow_write_stall = true: with it, the flush proceeds even when L0 is already large enough to stall writes, instead of waiting inside the ingestion for compactions that can't run. Let me see about backporting this change to our RocksDB branch. |
Fixes cockroachdb#34212

Release note (bug fix): Fix a deadlock that could occur during IMPORT and RESTORE, causing all writes on a node to be stopped.
34818: sql/pgwire: use OIDs when encoding some datum types r=mjibson a=mjibson

Teach pgwire how to encode int and float with the various widths. Do this by saving the oids during SetColumns. Teach cmd/generate-binary to also record oids and use those in tests. Fix varbits to expect the same OID as postgres produces. Sadly, our int2 and int4 types don't yet propagate all the way down and so we still encode them as an int8. This commit is a precursor to supporting that.

Release note: None

34830: c-deps: bump RocksDB to pick up ingest-external-file deadlock r=tbg a=petermattis

Fixes #34212

Release note (bug fix): Fix a deadlock that could occur during IMPORT and RESTORE, causing all writes on a node to be stopped.

Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Peter Mattis <[email protected]>
confirmed that this is indeed fixed! |
👍 btw @vivekmenezes if you're seeing L0 fill up like that in this test it's a good indication that doing real-world workloads with it is going to potentially really mess up RocksDB. Just a heads up! |
The compactor kicks off at 2 L0 files, so this deadlock was easy to hit long before L0 "fills up" (which I'd say is at 20 files, since that's when the write slowdown kicks in). |
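For reference, a rough model of the thresholds being cited (numbers taken from this comment, not an actual RocksDB options binding):

package main

import "fmt"

// Thresholds as described above: compaction is wanted at 2 L0 files, and the
// write slowdown ("full" for practical purposes) starts at 20.
const (
	l0CompactionTrigger = 2
	l0SlowdownTrigger   = 20
)

// l0State classifies an L0 file count against those thresholds.
func l0State(files int) string {
	switch {
	case files >= l0SlowdownTrigger:
		return "writes slowed"
	case files >= l0CompactionTrigger:
		return "compaction wanted"
	default:
		return "healthy"
	}
}

func main() {
	// The stuck test was ingesting ~40 small SSTs at nearly the same time,
	// far past both thresholds.
	for _, n := range []int{1, 2, 20, 40} {
		fmt.Println(n, "L0 files:", l0State(n))
	}
}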
make stress PKG=./pkg/sql TESTS=TestRaceWithBackfill
gets stuck after patching https://github.com/cockroachdb/cockroach/compare/master...vivekmenezes:schemachanger?expand=1 on master