roachtest: clearrange/checks=true failed #34860
This is odd. We start the cluster, successfully import the large fixture (which takes slightly north of 3h), and stop the cluster. Then we restart the cluster; it takes the node ~4s to connect to Gossip,
but then it takes another 11s for node startup to complete:
The log in between has lots of messages of this type
and looking at n10 I can see that it connected to Gossip a little later:
Now we're already clocking in at 13s past
It's only around that time that it gets its feet back on the ground:
I will say that I was convinced that these tests would be flaky when they were written. They basically run an update-heavy workload with a tight GC TTL and expect to make do with 200MB of disk space. I'm only planning to make sure that there isn't anything going catastrophically wrong here. That said, they became flaky suddenly, so it seems worth figuring out what caused it. PS: the logs of all nodes that I looked at had this sort of log spam:
Not sure what that's about, perhaps an artifact of the situation or the test, but seems better to avoid. The warning originates in #18364 as far as I can tell. |
I uploaded the artifacts here. |
SHA: https://github.com/cockroachdb/cockroach/commits/acba091f04f3d8ecabf51009bf394951fbd3643c Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1137872&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/d7c56dcb87c8f187e50303c8e32a64836c42515c Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1139797&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/8e9a1e310e3e8e37f091b7ca8bd204084ad9e2e5 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1142461&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/2e16b7357d5a15d87cd284d5f2c12e424ed29ca1 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1146277&tab=buildLog
|
@ajkr do you see anything in the stats that hints at a problem? This is our main range deletion tombstone test (it drops a large table, which lays down lots of individual range deletions), though it also exercises range merges as the now-empty ranges are all merged away. The node crashed because it took >10s to commit a virtually empty batch (it might've gotten lumped in with something else in our Go-side commit pipeline, but there shouldn't be any large batches at that stage of the test). |
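Roughly what that looks like on the storage side, as a sketch (the `engine` interface and key strings below are illustrative stand-ins, not CockroachDB's actual API): dropping the table issues one range deletion per Raft range that the table spans, and each of those shows up in RocksDB as a range tombstone.

```go
package main

import "fmt"

// engine stands in for the storage engine; ClearRange is assumed to lay down
// a single range deletion tombstone covering [start, end).
type engine interface {
	ClearRange(start, end string) error
}

type loggingEngine struct{}

func (loggingEngine) ClearRange(start, end string) error {
	fmt.Printf("range tombstone over [%s, %s)\n", start, end)
	return nil
}

// dropTable sketches how dropping a large table turns into many individual
// range deletions: one per Raft range the table spans. The now-empty ranges
// then become candidates for merging.
func dropTable(eng engine, rangeBounds []string) error {
	for i := 0; i+1 < len(rangeBounds); i++ {
		if err := eng.ClearRange(rangeBounds[i], rangeBounds[i+1]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = dropTable(loggingEngine{}, []string{
		"/Table/53", "/Table/53/1/1000", "/Table/53/1/2000", "/Table/54",
	})
}
```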
SHA: https://github.com/cockroachdb/cockroach/commits/eaad50f0ea64985a0d7da05abb00cc9f321c5fa9 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149743&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/71681f60031c101f17339236e2ba75f7a684d1a1 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1155904&tab=buildLog
|
My guess would be that range merging sometimes doesn't respect the ContainsEstimates flag. That is, we're merging two ranges (one of which contains estimates) but then the merged range pretends that it doesn't. |
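To make that suspicion concrete, a minimal sketch of the invariant (my own illustration; the real MVCCStats type is more involved than this): when the two ranges' stats are combined during a merge, ContainsEstimates has to survive if it was set on either side, otherwise the consistency checker ends up treating estimated numbers as exact.

```go
package main

import "fmt"

// stats is a cut-down stand-in for per-range MVCC stats.
type stats struct {
	LiveBytes         int64
	ContainsEstimates bool
}

// merge combines the stats of the left- and right-hand ranges of a merge.
// If either side only has estimated stats, the merged range must keep
// ContainsEstimates set; silently dropping the flag is the kind of bug
// suspected above.
func merge(lhs, rhs stats) stats {
	return stats{
		LiveBytes:         lhs.LiveBytes + rhs.LiveBytes,
		ContainsEstimates: lhs.ContainsEstimates || rhs.ContainsEstimates,
	}
}

func main() {
	fmt.Printf("%+v\n", merge(
		stats{LiveBytes: 100, ContainsEstimates: true},
		stats{LiveBytes: 200},
	))
}
```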
SHA: https://github.com/cockroachdb/cockroach/commits/71681f60031c101f17339236e2ba75f7a684d1a1 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1155867&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/de1793532332fb64fca27cafe92d2481d900a5a0 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160394&tab=buildLog
|
The stats do look like we're putting great strain on the device: write-amp is 238.1. Another notable thing is that 1,315 bottom-level files were being compacted at the time those stats were printed. Since cockroach runs only a few compactions in parallel, that means something (usually range tombstones) is causing individual compactions to be extremely wide. I will try repro'ing to investigate more. |
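For context on the write-amp figure: write amplification is roughly the total bytes the engine writes to disk (flushes plus compaction output) divided by the user bytes it ingests, so 238.1 means each logical byte is being rewritten on the order of 238 times. A tiny sketch of that arithmetic, with invented numbers (not taken from the logs):

```go
package main

import "fmt"

func main() {
	// Write amplification ~= total bytes written to disk by the engine
	// divided by user bytes written. The numbers below are made up purely
	// to show how a figure like 238.1 comes about.
	const userGB = 10.0      // user data flushed into the LSM
	const writtenGB = 2381.0 // total bytes written by flushes + compactions
	fmt.Printf("write-amp ≈ %.1f\n", writtenGB/userGB)
}
```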
Reproducing and understanding this would be super helpful. I filed #26693 last summer when we were investigating problems with range tombstones. See also facebook/rocksdb#3977 (which I know you're aware of). |
Oh, that's what that means! Yikes, that does sound pretty bad. Thank you for taking a look. |
(btw, in case this isn't already clear, |
The import is hitting an interesting edge case. I'm not sure whether it's the root cause, as it hasn't failed yet. But this is what's happening:
|
In order to make sure that range deletions are processed in a timely fashion, we mark any sstable containing at least one range tombstone for compaction (roughly the heuristic sketched below). Perhaps this is having a bad effect and should be disabled. See
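The heuristic amounts to roughly the following (a conceptual Go sketch; the real hook is a table-properties collector on the RocksDB/C++ side, and the names here are made up):

```go
package main

import "fmt"

// tableProps is a stand-in for the per-SSTable properties a table-properties
// collector gets to inspect when the file is written.
type tableProps struct {
	NumEntries        uint64
	NumRangeDeletions uint64
}

// needCompaction mirrors the heuristic described above: any SSTable containing
// at least one range tombstone is flagged so the compaction picker rewrites it
// promptly, which is what actually drops the data the tombstone covers.
func needCompaction(p tableProps) bool {
	return p.NumRangeDeletions > 0
}

func main() {
	fmt.Println(needCompaction(tableProps{NumEntries: 1000, NumRangeDeletions: 1})) // true
	fmt.Println(needCompaction(tableProps{NumEntries: 1000, NumRangeDeletions: 0})) // false
}
```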
I would expect the range tombstone to actually cover a large amount of data. That's why we're trying to push them through compactions. Are you seeing evidence that the range tombstones cover few keys? |
35936: server,log: increase the max sync durations r=bdarnell a=tbg

We know that the engine check fires during some tests ([clearrange], for example). This puts us in an awkward position: on the one hand, not being able to sync an engine in 10s is certainly going to cause lots of terrible snowball effects which then eat up troubleshooting time, but on the other hand we're not likely to fix all of the problems in 19.1. For now, up the limit significantly. Also up the corresponding log partition time limit, though we've seen that fire only in rare cases that likely really highlighted some I/O problem (or a severe case of being CPU bound).

[clearrange]: #34860 (comment)

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/b5768aecd39461ab9a54e2e7db059a3fe8b00459 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1191957&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/9399d559ae196e5cf2ad122195048ff9115ab56a Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194326&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/c59f5347d5424edb90575fb0fd50bad677953752 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195732&tab=buildLog
|
Node 9 had a larger log file than any other node, so I took a quick glance. It has all sorts of gRPC connection issues in the logs, though I don't immediately see why.
|
Wait, connection refused? Let me double-check that the nodes really didn't die... |
Pretty sure n4 OOMed or something along those lines. |
Yep, here we go:
|
Hmm. No
These nodes have 15GiB of memory, so this isn't even close. Something must've allocated quite suddenly. |
SHA: https://github.com/cockroachdb/cockroach/commits/5921cf0dcc76548931cc85500c0fa2186a82142f Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212185&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/668162cc99e4f3198b663b1abfa51858eeb3ccb8 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212251&tab=buildLog
|
F190401 09:57:08.220353 152 storage/replica_raft.go:923 [n6,s6,r19943/3:/Table/53/1/5506{6922-8668}] while committing batch: while committing batch: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/026458.log: No space left on device

Before that, pages and pages of slow commits, some 20s in duration. The last compaction stats in the logs say
cc @ajkr |
SHA: https://github.com/cockroachdb/cockroach/commits/509c5b130fb1ad0042beb74e083817aa68e4fc92 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1237002&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/7109d291e3b9edfa38264361f832cec14fff66ee Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1259219&tab=buildLog
|
37132: storage: remove spammy log about failed RemoveTarget simulation r=nvanbenschoten a=nvanbenschoten

We repeatedly see in the logs of clusters undergoing chaos:

```
W190425 06:17:19.438001 251 storage/allocator.go:639 [n9,replicate,s9,r2946/3:/Table/53/1/96{2/4/2…-4/4/7…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.638430 289 storage/allocator.go:639 [n9,replicate,s9,r6893/5:/Table/59/1/158{6/9/-…-7/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:19.837870 256 storage/allocator.go:639 [n9,replicate,s9,r6472/3:/Table/59/1/129{0/2/-…-2/1/-…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
W190425 06:17:20.238586 276 storage/allocator.go:639 [n9,replicate,s9,r1555/3:/Table/54/1/"\xc8\x{das…-f5\…}] simulating RemoveTarget failed: must supply at least one candidate replica to allocator.RemoveTarget()
```

For instance, it dominates the logs in #36720. This was first picked up in #34860 (comment). This seems relatively new, so it might have been impacted by 3c76d77.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
I still feel like there's more to be done for optimizing temp store compaction. I'll close these and open a separate focused issue for that. That should help import finish faster and make this kind of test less likely to time out. |
SHA: https://github.com/cockroachdb/cockroach/commits/10f8010fa5778e740c057905e2d7664b5fd5d647
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1135549&tab=buildLog