
HBASE-26938 Compaction failures after StoreFileTracker integration #4338

Closed

Conversation

apurtell
Contributor

@apurtell apurtell commented Apr 12, 2022

Compactions might be concurrent against a given store and the Compactor is shared among them. Do not put mutable state into shared class fields. All Compactor class fields should be final. At the moment 'keepSeqIdPeriod' is an exception to this rule because some unit tests change it.

Compactor#getProgress and Compactor#getCompactionTargets now return union results of all compactions in progress against the store.
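
For illustration, a minimal sketch of the union behavior for targets, assuming each in-progress compaction registers its writer in a concurrent set (field and helper names here are placeholders, not the literal patch):

// Hypothetical Compactor fragment; every compaction running against this store
// adds its writer here while it is in progress.
private final Set<StoreFileWriter> activeWriters =
  Collections.newSetFromMap(new ConcurrentHashMap<>());

// Union of the files being written by all in-progress compactions.
public List<Path> getCompactionTargets() {
  List<Path> targets = new ArrayList<>();
  for (StoreFileWriter writer : activeWriters) {
    targets.add(writer.getPath());
  }
  return targets;
}
// getProgress() aggregates the per-compaction CompactionProgress instances the same way.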

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 48s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+1 💚 mvninstall 2m 41s master passed
+1 💚 compile 2m 17s master passed
+1 💚 checkstyle 0m 35s master passed
+1 💚 spotbugs 1m 18s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 10s the patch passed
+1 💚 compile 2m 13s the patch passed
+1 💚 javac 2m 13s the patch passed
+1 💚 checkstyle 0m 34s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 12m 2s Patch does not cause any errors with Hadoop 3.1.2 3.2.2 3.3.1.
+1 💚 spotbugs 1m 20s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 10s The patch does not generate ASF License warnings.
31m 9s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #4338
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti checkstyle compile
uname Linux 73b794091bc1 5.4.0-1043-aws #45~18.04.1-Ubuntu SMP Fri Apr 9 23:32:25 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 1d8a5bf
Default Java AdoptOpenJDK-1.8.0_282-b08
Max. process+thread count 69 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/1/console
versions git=2.17.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 3m 42s Docker mode activated.
-0 ⚠️ yetus 0m 2s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 2m 24s master passed
+1 💚 compile 0m 35s master passed
+1 💚 shadedjars 3m 55s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 23s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 9s the patch passed
+1 💚 compile 0m 34s the patch passed
+1 💚 javac 0m 34s the patch passed
+1 💚 shadedjars 3m 53s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 20s the patch passed
_ Other Tests _
-1 ❌ unit 187m 22s hbase-server in the patch failed.
206m 24s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR #4338
Optional Tests javac javadoc unit shadedjars compile
uname Linux 251a03d19c4c 5.4.0-1071-aws #76~18.04.1-Ubuntu SMP Mon Mar 28 17:49:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 1d8a5bf
Default Java AdoptOpenJDK-1.8.0_282-b08
unit https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/1/artifact/yetus-jdk8-hadoop3-check/output/patch-unit-hbase-server.txt
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/1/testReport/
Max. process+thread count 2738 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/1/console
versions git=2.17.1 maven=3.6.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

// This step is necessary for the correctness of BrokenStoreFileCleanerChore. It lets the
// CleanerChore know that the compaction is done and the file can be cleaned up if the
// compaction has failed.
storeEngine.resetCompactionWriter();
Contributor

This is not needed any more?

Contributor Author

@apurtell apurtell Apr 12, 2022

We can leave it in the API, but the current implementation only sets the writer field to null. Once that field is converted into a parameter, the method bodies become empty and have nothing left to do, so I removed it.

If we are going to keep it, we need to use a Map instead of a Set to track the writers, and would somehow need to pass a key as a parameter to abort and remove a StoreFileWriter instance. What should the key be? The CompactionRequestImpl? I do not think there is a requirement to abort compaction writers in this way. We abort compactions today by interrupting the thread.

If BrokenStoreFileCleanerChore will not function correctly without this, then it will need modification.

I think BrokenStoreFileCleanerChore sees the same results from getCompactionTargets after these changes. When the compaction is finished, the StoreFileWriter is removed from the set in the finally block of compact, so getCompactionTargets will not include the files being written by that writer after that point. That is the same thing that happened in the previous implementation when resetCompactionWriter set the writer field to null and the files being written by that writer stopped appearing in getCompactionTargets results. But the timing has changed, that is true.
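
Roughly, the lifecycle being described, as a simplified sketch (hypothetical helper names and signatures, not the exact diff):

public List<Path> compact(CompactionRequestImpl request) throws IOException {
  StoreFileWriter writer = createTmpWriter(request);  // hypothetical helper
  activeWriters.add(writer);  // now reported by getCompactionTargets
  try {
    performCompaction(request, writer);  // hypothetical per-compaction signature
    return commitWriter(writer, request);  // append metadata + close (simplified)
  } finally {
    activeWriters.remove(writer);  // removed here, i.e. before the SFT commit --
                                   // this is the timing question discussed below
  }
}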

Contributor

Logically we need this to keep correctness. IIRC, the problem here is that we can only clean up the writer instance after we successfully commit the store files to the SFT, i.e., after the replaceStoreFile method. That's why we cannot simply remove the writer instance in commitWriter; otherwise there could be data loss, i.e., the BrokenStoreFileCleanerChore may delete store files which have just been written but have not been added to the SFT yet...
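
In other words, the ordering that has to hold on the store side looks roughly like this (a sketch only; the cleanup call at the end is hypothetical):

// 1. Compactor writes and commits the new files (append metadata + close writers).
List<Path> newFiles = compactor.compact(request, throughputController, user);
// 2. The new files are recorded in the StoreFileTracker.
storeEngine.replaceStoreFile(compactedFiles, newFiles);
// 3. Only now is it safe to stop reporting the files as compaction targets;
//    dropping them earlier would let BrokenStoreFileCleanerChore delete files
//    that are written but not yet tracked by the SFT.
compactor.cleanupCompaction(request);  // hypothetical cleanup call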

Let me check again if the new implementation can solve the problem.

Contributor Author

Thanks for that, you know BrokenStoreFileCleanerChore best.

Contributor Author

Ah actually there is even a problem here in the current code, let me fix it...

Contributor Author

Fixed, at least now we do not remove the writer from the set until after commit.

@apurtell
Contributor Author

Pushed updates responding to review feedback.

@apurtell apurtell force-pushed the HBASE-26938 branch 2 times, most recently from ce237a9 to 55206be on April 12, 2022 16:18
@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 28s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+1 💚 mvninstall 3m 10s master passed
+1 💚 compile 2m 43s master passed
+1 💚 checkstyle 0m 41s master passed
+1 💚 spotbugs 1m 37s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 13s the patch passed
+1 💚 compile 2m 50s the patch passed
+1 💚 javac 2m 50s the patch passed
+1 💚 checkstyle 0m 38s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 16m 51s Patch does not cause any errors with Hadoop 3.1.2 3.2.2 3.3.1.
+1 💚 spotbugs 1m 59s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 12s The patch does not generate ASF License warnings.
41m 22s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/2/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #4338
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti checkstyle compile
uname Linux 8b1fd70052cf 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / ea9bc92
Default Java AdoptOpenJDK-1.8.0_282-b08
Max. process+thread count 70 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/2/console
versions git=2.17.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@joshelser
Member

joshelser commented Apr 12, 2022

FYI, we have some folks running this on a cluster against S3 now. I'll find their GitHub IDs to tag them here, so we can keep you up to date on real test runs :)

edit: hat-tip to @chrajeshbabu for now
double edit: also hat-tip to @ragarkar and @rsnegi-gh

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 29s Docker mode activated.
-0 ⚠️ yetus 0m 2s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 3m 0s master passed
+1 💚 compile 0m 44s master passed
+1 💚 shadedjars 4m 44s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 39s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 59s the patch passed
+1 💚 compile 0m 53s the patch passed
+1 💚 javac 0m 53s the patch passed
+1 💚 shadedjars 5m 7s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 23s the patch passed
_ Other Tests _
-1 ❌ unit 233m 37s hbase-server in the patch failed.
256m 10s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/2/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR #4338
Optional Tests javac javadoc unit shadedjars compile
uname Linux c5fa78127f30 5.4.0-1071-aws #76~18.04.1-Ubuntu SMP Mon Mar 28 17:49:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / ea9bc92
Default Java AdoptOpenJDK-1.8.0_282-b08
unit https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/2/artifact/yetus-jdk8-hadoop3-check/output/patch-unit-hbase-server.txt
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/2/testReport/
Max. process+thread count 2638 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/2/console
versions git=2.17.1 maven=3.6.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 16s Docker mode activated.
-0 ⚠️ yetus 0m 4s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 2m 43s master passed
+1 💚 compile 0m 46s master passed
+1 💚 shadedjars 3m 40s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 27s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 35s the patch passed
+1 💚 compile 0m 47s the patch passed
+1 💚 javac 0m 47s the patch passed
+1 💚 shadedjars 3m 39s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 26s the patch passed
_ Other Tests _
-1 ❌ unit 238m 8s hbase-server in the patch failed.
256m 57s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/2/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #4338
Optional Tests javac javadoc unit shadedjars compile
uname Linux 5e1523f7f52c 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / ea9bc92
Default Java AdoptOpenJDK-11.0.10+9
unit https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/2/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-server.txt
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/2/testReport/
Max. process+thread count 2604 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/2/console
versions git=2.17.1 maven=3.6.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

return Collections.emptyList();
} else {
// Finished, commit the writer's results.
return commitWriter(writer, fd, request);
Contributor

The commitWriter here just means appending metadata and closing the writer; it does not mean recording the files in the SFT...

We still need to add something after the replaceStoreFile call to remove the writer...

So maybe you are right: we need to use a Map instead of a Set, with the CompactionRequestImpl as the key?

@joshelser
Member

FWIW, we had a big YCSB load running from multiple nodes in parallel against a 5-node cluster with this change, and it ran happily for 6 hours (it eventually failed due to an OOME, which I think is just a misconfiguration on our side).

@apurtell
Contributor Author

Last set of changes broke tests so I will revert them.

@Apache9 I think we need to go back to the drawing board here. We need to put the call out to SFT logic somewhere else, or we need to make Compactor something that is created per compaction as was the assumption behind the changes that added the 'writer' field and requires the semantics offered by that. I will try the latter.

@Apache9
Contributor

Apache9 commented Apr 13, 2022

Last set of changes broke tests so I will revert them.

@Apache9 I think we need to go back to the drawing board here. We need to put the call out to SFT logic somewhere else, or we need to make Compactor something that is created per compaction as was the assumption behind the changes that added the 'writer' field and requires the semantics offered by that. I will try the latter.

Creating a Compactor per compaction may be too big a change? FWIW, we have a Compactor instance in StoreEngine; if we want to make it per compaction, it will require a very big refactoring.

So I suggest we add something like a compactionId to the CompactionRequest interface and use it as the key for our map. When calling StoreEngine.replaceStoreFile, we pass this id in; then we could use it to remove the writer in the Compactor.

WDYT?
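
For example (a sketch of the suggestion only; the id accessor and cleanup method below do not exist yet and are hypothetical):

// In the Compactor: key the in-progress writers by compaction id instead of using a flat set.
private final ConcurrentMap<Long, StoreFileWriter> writersById = new ConcurrentHashMap<>();

// In HStore, after the compaction produced newFiles:
storeEngine.replaceStoreFile(compactedFiles, newFiles, request.getCompactionId());
// ...and once the SFT update inside replaceStoreFile has succeeded:
compactor.removeWriter(request.getCompactionId());  // hypothetical cleanup keyed by the id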

@apurtell
Contributor Author

apurtell commented Apr 13, 2022

So I suggest we add something like a compactionId to the CompactionRequest interface and use it as the key for our map. When calling StoreEngine.replaceStoreFile, we pass this id in; then we could use it to remove the writer in the Compactor.

I think it's messy and leaks abstraction.

We are not going to remove the writer in Compactor. Compactor will create the writer and leave it in the map. External code will call some new API to remove the writer only after the appropriate SFT methods have been called. So this internal detail of Compactor leaks out to all of the users.

Edit: I had a change of heart because the result is not bad and solves a couple of related problems.

Anyway, my other idea wouldn't work because of resetCompactionWriter in StoreEngine, which assumes that Compactor is a singleton, even though the other SFT changes assume it is per compaction.

@apurtell
Contributor Author

apurtell commented Apr 13, 2022

@Apache9 I have been testing with my original workaround for what it's worth, #4334 . That change does not allow concurrent compaction against a given store, respecting that Compactor is not thread safe for now. It works well. The performance of the test scenario is unchanged from baseline without any SFT changes. As an option to unblock us we could use it for now and come back to the implementation issues with SFT and compaction in a follow up issue.

@Apache9
Contributor

Apache9 commented Apr 13, 2022

Maybe we should not call it resetCompactionWriter? We could just call it cleanupCompaction or something else, which indicates that the compaction has finally finished and that the compactor should release all the resources related to this compaction.

@Apache9
Contributor

Apache9 commented Apr 13, 2022

@Apache9 I have been testing with my original workaround for what it's worth, #4334 . That change does not allow concurrent compaction against a given store, respecting that Compactor is not thread safe for now. It works well. The performance of the test scenario is unchanged from baseline without any SFT changes. As an option to unblock us we could use it for now and come back to the implementation issues with SFT and compaction in a follow up issue.

I do not think this is a good way to solve the problem; we did allow concurrent compactions to happen at the same time in the past...

@apurtell
Contributor Author

Maybe we should not call it resetCompactionWriter? We could just call it cleanupCompaction or something else, which indicates that the compaction has finally finished and that the compactor should release all the resources related to this compaction.

I still think it is messy.
We have this PR in progress already, so I will make this change so we can at least look at it.

@apurtell
Contributor Author

I do not think this is a good way to solve the problem; we did allow concurrent compactions to happen at the same time in the past...

I agree. It would be to unblock us and give us more time to think about the SFT design around compaction, not a permanent solution. Anyway, I will update this PR as promised soon.

@joshelser
Member

That change does not allow concurrent compaction against a given store, respecting that Compactor is not thread safe for now. It works well. The performance of the test scenario is unchanged from baseline without any SFT changes. As an option to unblock us we could use it for now and come back to the implementation issues with SFT and compaction in a follow up issue.

I agree with your feelings, Andrew. Correctness first and then optimization.

we did allow concurrent compactions to happen at the same time in the past...

Acknowledging this too: yes, we should be able to compact two distinct subsets of the files in a store concurrently. And, I could see value in doing so (e.g. compacting three larger files in a store and also wanting to compact a few smaller hfiles created from memstore flushes -- we don't want to wait for that big compaction to finish to run the smaller one).

@apurtell
Contributor Author

apurtell commented Apr 13, 2022

Posting an update soon. Making sure compaction unit tests all pass first.

@apurtell
Contributor Author

@joshelser We have the other approach to fall back on but I think we can make the current code work with some polish.

Compactions might be concurrent against a given store and the Compactor is
shared among them. Do not put mutable state into shared class fields. All
Compactor class fields should be final or effectively final.
'keepSeqIdPeriod' is an exception to this rule because unit tests may set it.
@apurtell
Contributor Author

Pushed an update that brings all of the above discussion together.

  • Compactions might be concurrent against a given store and the Compactor is shared among them. Add a comment to the class javadoc to not put mutable state into shared class fields. Make all Compactor class fields final, with the exception of keepSeqIdPeriod, which is set by unit tests.
  • Pass writer and progress instances through compaction as method parameters of performCompaction to improve the MT-safety of these code paths (see the sketch after this list).
  • Scope compaction progress to each compaction to improve MT-safety overall and the accuracy of compaction progress reporting.
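
A rough sketch of the second point (signatures are approximate, not the literal diff):

// Writer and progress become per-compaction objects passed down the call chain,
// rather than mutable Compactor fields shared by concurrent compactions.
CompactionProgress progress = new CompactionProgress(fd.maxKeyCount);
T writer = sinkFactory.createWriter(scanner, fd, dropCache, major);  // per-compaction writer
boolean finished = performCompaction(fd, scanner, writer, smallestReadPoint, cleanSeqId,
  throughputController, request, progress);
// Neither writer nor progress lives in a shared field, so concurrent compactions
// against the same store cannot clobber each other's state.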

All compaction unit tests pass.

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hbase.util.compaction.TestMajorCompactionTTLRequest
[INFO] Running org.apache.hadoop.hbase.util.compaction.TestMajorCompactionRequest
[INFO] Running org.apache.hadoop.hbase.quotas.policies.TestNoWritesCompactionsViolationPolicyEnforcement
[INFO] Running org.apache.hadoop.hbase.regionserver.TestMinorCompaction
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.427 s - in org.apache.hadoop.hbase.quotas.policies.TestNoWritesCompactionsViolationPolicyEnforcement
[INFO] Running org.apache.hadoop.hbase.regionserver.TestDateTieredCompactionPolicyOverflow
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.942 s - in org.apache.hadoop.hbase.regionserver.TestDateTieredCompactionPolicyOverflow
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.878 s - in org.apache.hadoop.hbase.util.compaction.TestMajorCompactionTTLRequest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.918 s - in org.apache.hadoop.hbase.util.compaction.TestMajorCompactionRequest
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactionState
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactionArchiveConcurrentClose
[INFO] Running org.apache.hadoop.hbase.regionserver.throttle.TestCompactionWithThroughputController
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.935 s - in org.apache.hadoop.hbase.regionserver.TestMinorCompaction
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactionWithCoprocessor
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.828 s - in org.apache.hadoop.hbase.regionserver.TestCompactionArchiveConcurrentClose
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactionWithByteBuff
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 34.662 s - in org.apache.hadoop.hbase.regionserver.TestCompactionWithByteBuff
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactionFileNotFound
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 18.873 s - in org.apache.hadoop.hbase.regionserver.TestCompactionFileNotFound
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactionArchiveIOException
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.024 s - in org.apache.hadoop.hbase.regionserver.TestCompactionArchiveIOException
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactionLifeCycleTracker
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 84.897 s - in org.apache.hadoop.hbase.regionserver.throttle.TestCompactionWithThroughputController
[INFO] Running org.apache.hadoop.hbase.regionserver.TestMajorCompaction
[WARNING] Tests run: 3, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 16.319 s - in org.apache.hadoop.hbase.regionserver.TestCompactionLifeCycleTracker
[INFO] Running org.apache.hadoop.hbase.regionserver.TestDateTieredCompactionPolicy
[INFO] Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.984 s - in org.apache.hadoop.hbase.regionserver.TestDateTieredCompactionPolicy
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactionInDeadRegionServer
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 26.748 s - in org.apache.hadoop.hbase.regionserver.TestCompactionInDeadRegionServer
[INFO] Running org.apache.hadoop.hbase.regionserver.querymatcher.TestCompactionScanQueryMatcher
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.896 s - in org.apache.hadoop.hbase.regionserver.querymatcher.TestCompactionScanQueryMatcher
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompaction
[INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 199.59 s - in org.apache.hadoop.hbase.regionserver.TestCompactionWithCoprocessor
[INFO] Running org.apache.hadoop.hbase.regionserver.TestDateTieredCompactionPolicyHeterogeneousStorage
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.228 s - in org.apache.hadoop.hbase.regionserver.TestDateTieredCompactionPolicyHeterogeneousStorage
[INFO] Running org.apache.hadoop.hbase.regionserver.compactions.TestFIFOCompactionPolicy
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 219.608 s - in org.apache.hadoop.hbase.regionserver.TestCompactionState
[INFO] Running org.apache.hadoop.hbase.regionserver.compactions.TestStripeCompactionPolicy
[INFO] Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.35 s - in org.apache.hadoop.hbase.regionserver.compactions.TestStripeCompactionPolicy
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactionAfterBulkLoad
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 24.276 s - in org.apache.hadoop.hbase.regionserver.compactions.TestFIFOCompactionPolicy
[INFO] Running org.apache.hadoop.hbase.master.region.TestMasterRegionCompaction
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.307 s - in org.apache.hadoop.hbase.regionserver.TestCompactionAfterBulkLoad
[INFO] Running org.apache.hadoop.hbase.rsgroup.TestRSGroupMajorCompactionTTL
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.496 s - in org.apache.hadoop.hbase.master.region.TestMasterRegionCompaction
[INFO] Running org.apache.hadoop.hbase.mob.TestMobStoreCompaction
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.738 s - in org.apache.hadoop.hbase.mob.TestMobStoreCompaction
[INFO] Running org.apache.hadoop.hbase.mob.TestMobCompactionWithDefaults
[INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 208.078 s - in org.apache.hadoop.hbase.regionserver.TestCompaction
[INFO] Running org.apache.hadoop.hbase.mob.TestMobCompactionRegularRegionBatchMode
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 115.164 s - in org.apache.hadoop.hbase.rsgroup.TestRSGroupMajorCompactionTTL
[INFO] Running org.apache.hadoop.hbase.mob.TestMobCompactionOptRegionBatchMode
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 187.897 s - in org.apache.hadoop.hbase.mob.TestMobCompactionWithDefaults
[INFO] Running org.apache.hadoop.hbase.mob.TestMobCompactionOptMode
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 179.166 s - in org.apache.hadoop.hbase.mob.TestMobCompactionOptMode
[INFO] Running org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreadedWithBasicCompaction
[WARNING] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.022 s - in org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreadedWithBasicCompaction
[INFO] Running org.apache.hadoop.hbase.client.TestMobRestoreSnapshotFromClientGetCompactionState
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 290.012 s - in org.apache.hadoop.hbase.mob.TestMobCompactionOptRegionBatchMode
[INFO] Running org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreadedWithEagerCompaction
[WARNING] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.026 s - in org.apache.hadoop.hbase.client.TestAsyncTableGetMultiThreadedWithEagerCompaction
[INFO] Running org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientGetCompactionState
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 25.904 s - in org.apache.hadoop.hbase.client.TestMobRestoreSnapshotFromClientGetCompactionState
[INFO] Running org.apache.hadoop.hbase.TestAcidGuaranteesWithNoInMemCompaction
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 321.549 s - in org.apache.hadoop.hbase.mob.TestMobCompactionRegularRegionBatchMode
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 31.118 s - in org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientGetCompactionState
[INFO] Tests run: 27, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 688.559 s - in org.apache.hadoop.hbase.regionserver.TestMajorCompaction
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 201.193 s - in org.apache.hadoop.hbase.TestAcidGuaranteesWithNoInMemCompaction
[INFO] Results:
[WARNING] Tests run: 181, Failures: 0, Errors: 0, Skipped: 3

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 56s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+1 💚 mvninstall 2m 25s master passed
+1 💚 compile 2m 15s master passed
+1 💚 checkstyle 0m 35s master passed
+1 💚 spotbugs 1m 13s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 1m 59s the patch passed
+1 💚 compile 2m 8s the patch passed
+1 💚 javac 2m 8s the patch passed
-0 ⚠️ checkstyle 0m 33s hbase-server: The patch generated 1 new + 39 unchanged - 0 fixed = 40 total (was 39)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 11m 23s Patch does not cause any errors with Hadoop 3.1.2 3.2.2 3.3.1.
+1 💚 spotbugs 1m 16s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 8s The patch does not generate ASF License warnings.
29m 46s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/3/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #4338
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti checkstyle compile
uname Linux 609aeac8c873 5.4.0-1071-aws #76~18.04.1-Ubuntu SMP Mon Mar 28 17:49:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / f990f56
Default Java AdoptOpenJDK-1.8.0_282-b08
checkstyle https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/3/artifact/yetus-general-check/output/diff-checkstyle-hbase-server.txt
Max. process+thread count 64 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/3/console
versions git=2.17.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Contributor

Apache9 commented Apr 14, 2022

I think the key problem here is that compactionTargets is used during the whole compaction lifetime, but the state is only stored in the compactor, which is only used for generating the new store files, so no matter what we do it is still very awkward.

I have another idea in mind: maybe we could just pass a StoreFileWriterTargetCollector to the compactor, and the compactor implementation will add each new target to the collector when creating a new store file writer. The collector is stored in StoreEngine or HStore, so we are free to clean it up at any time, and the compactor does not need to do the cleanup work any more. I could implement a POC to see if it works. In general, I think we should also track the targets of flushed store files, but this is a missing part in the original patch of HBASE-26271.
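
A sketch of that shape, with the collector reduced to a Consumer and the state owned by the store (names and the extra compact parameter are hypothetical):

// Hypothetical sketch: HStore owns the tracking state; the compactor only reports
// each writer it creates through a Consumer.
Set<Path> targets = ConcurrentHashMap.newKeySet();
inFlightTargets.add(targets);  // hypothetical store-level field read by the cleaner chore
try {
  List<Path> newFiles =
    compactor.compact(request, throughputController, user, targets::add);
  storeEngine.replaceStoreFile(compactedFiles, newFiles);  // commit to the SFT first
} finally {
  inFlightTargets.remove(targets);  // then clean up, try/finally style
}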

Thanks.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 39s Docker mode activated.
-0 ⚠️ yetus 0m 2s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 3m 46s master passed
+1 💚 compile 0m 34s master passed
+1 💚 shadedjars 3m 52s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 23s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 6s the patch passed
+1 💚 compile 0m 34s the patch passed
+1 💚 javac 0m 34s the patch passed
+1 💚 shadedjars 3m 50s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 21s the patch passed
_ Other Tests _
+1 💚 unit 183m 11s hbase-server in the patch passed.
201m 9s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/3/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR #4338
Optional Tests javac javadoc unit shadedjars compile
uname Linux dd6216435bbc 5.4.0-1071-aws #76~18.04.1-Ubuntu SMP Mon Mar 28 17:49:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / f990f56
Default Java AdoptOpenJDK-1.8.0_282-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/3/testReport/
Max. process+thread count 3052 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/3/console
versions git=2.17.1 maven=3.6.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 20s Docker mode activated.
-0 ⚠️ yetus 0m 4s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 2m 35s master passed
+1 💚 compile 0m 48s master passed
+1 💚 shadedjars 3m 41s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 27s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 35s the patch passed
+1 💚 compile 0m 45s the patch passed
+1 💚 javac 0m 45s the patch passed
+1 💚 shadedjars 3m 38s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 26s the patch passed
_ Other Tests _
+1 💚 unit 206m 49s hbase-server in the patch passed.
225m 17s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/3/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #4338
Optional Tests javac javadoc unit shadedjars compile
uname Linux 8a34d59bfcd2 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / f990f56
Default Java AdoptOpenJDK-11.0.10+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/3/testReport/
Max. process+thread count 2669 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4338/3/console
versions git=2.17.1 maven=3.6.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Contributor

Apache9 commented Apr 14, 2022

Please see if this POC could make things better...

I've introduced a StoreFileWriterCreationTracker (actually in most places it is just a Consumer) for recording the creation of a StoreFileWriter, and this time we could also track the written files for flush. The trackers are maintained in HStore, so in the compactor we do not need to store them any more.

Apache9@79fa3a9

Member

@joshelser joshelser left a comment

Sorry for taking so long before really digging into the code. Your solution makes sense to me, Andrew. I also see why Duo doesn't think this is optimal.

Going over to his commit to look at that now.

Comment on lines +383 to +384
// This signals that the target file is no longer written and can be cleaned up
completeCompaction(request);
Member

Clarifying: we only need to call completeCompaction in the exceptional case (compaction didn't finish normally)? And commitWriter() down below is doing the same kind of cleanup work that completeCompaction() here is doing?

if (writer instanceof StoreFileWriter) {
targets.add(((StoreFileWriter) writer).getPath());
} else {
((AbstractMultiFileWriter) writer).writers().stream()
Member

Suggest: make this an else if (writer instanceof AbstractMultiFileWriter) to be defensive, and have the else branch fail loudly if something goes wrong.
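
Something like this (a sketch completing the truncated snippet above in the obvious way):

if (writer instanceof StoreFileWriter) {
  targets.add(((StoreFileWriter) writer).getPath());
} else if (writer instanceof AbstractMultiFileWriter) {
  ((AbstractMultiFileWriter) writer).writers().stream()
    .map(StoreFileWriter::getPath).forEach(targets::add);
} else {
  // Fail loudly instead of silently mis-reporting compaction targets.
  throw new IllegalStateException("Unexpected writer type: " + writer.getClass().getName());
}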

Comment on lines +609 to +610
writerMap.remove(request);
progressMap.remove(request);
Member

I'm assuming that you're generally OK having these two synchronized data structures updated not under the same lock? Given the current use of writerMap and progressMap, nothing is jumping out at me as bad as you currently have it. Just thinking out loud.

Member

Do we care if either of the remove() calls returns null (i.e. the map doesn't have the mapping in it)?

@apurtell
Contributor Author

apurtell commented Apr 14, 2022

Let @Apache9 propose a PR for his solution that moves the scope of where this state is tracked; that makes sense to me.
No hard feelings. I need to conserve my time. We shouldn't be pursuing two different solutions. Let's just review the one that is cleaner; I think both Duo's proposal and Josh's comment expressed that preference.

@apurtell apurtell closed this Apr 14, 2022
@apurtell apurtell deleted the HBASE-26938 branch April 14, 2022 22:57
@apurtell
Contributor Author

One thing I will note is I also tried to clean up progress reporting, which will need to be separately addressed.

@Apache9
Contributor

Apache9 commented Apr 15, 2022

Let @Apache9 propose a PR for his solution that moves the scope of where this state is tracked; that makes sense to me. No hard feelings. I need to conserve my time. We shouldn't be pursuing two different solutions. Let's just review the one that is cleaner; I think both Duo's proposal and Josh's comment expressed that preference.

Oh, sorry, I didn't mean to push my solution. The reason I implemented a POC is that we all seem to think the current approach in this PR is a bit messy, since we cannot clean up the state in the 'try...finally' style, so I just wanted to give an impression of what the code would look like if we could do the cleanup in the 'try...finally' style.

In fact I think the current solution in this PR is good enough to solve the problem for now, and all related classes are IA.Private, so we are free to apply this PR first and then switch to another approach in the future.

So I'm neutral on these two approaches, especially if we want to fix the problem ASAP. Then we will have plenty of time to discuss what a better solution is, as it will not block the 2.5.0 release.

Just my thoughts; sorry if I didn't say this clearly before posting the POC. Anyway, if you guys think the approach in the POC is generally good, let me convert it to a PR so you can review it better.

Thanks~
