HBASE-26938 Compaction failures after StoreFileTracker integration #4338
Conversation
🎊 +1 overall
💔 -1 overall
// This step is necessary for the correctness of BrokenStoreFileCleanerChore. It lets the
// CleanerChore know that compaction is done and the file can be cleaned up if compaction
// has failed.
storeEngine.resetCompactionWriter();
This is not needed any more?
We can leave it in the API, but the current implementation only sets the writer field to null, and the method bodies become empty once that field is converted into a parameter, so I removed it.
If we are going to keep it, we need to use a Map instead of a Set to track the writers, and we somehow need to pass a key as a parameter to abort and remove a StoreFileWriter instance. What should be the key? The CompactionRequestImpl? I do not think there is a requirement to abort compaction writers in this way. We abort compactions today by interrupting the thread.
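For illustration, a minimal sketch of the Map-vs-Set trade-off just described, using stand-in classes rather than the real HBase types: a Set of writers offers no handle for targeting one compaction's writer, while a Map keyed by the request does.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Stand-ins for the real HBase types; this is a sketch, not the actual API.
class StoreFileWriter {
  void abort() { /* close and delete the partially written file */ }
}
class CompactionRequestImpl { }

class WriterTracking {
  // A Set can answer "which writers exist?" but gives no handle for
  // aborting or removing the writer of one specific compaction:
  private final Set<StoreFileWriter> writerSet = ConcurrentHashMap.newKeySet();

  // A Map keyed by the request lets a caller target a single writer:
  private final Map<CompactionRequestImpl, StoreFileWriter> writerMap =
      new ConcurrentHashMap<>();

  void abortCompaction(CompactionRequestImpl request) {
    StoreFileWriter writer = writerMap.remove(request);
    if (writer != null) {
      writer.abort();
    }
  }
}
```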
If BrokenStoreFileCleanerChore will not function correctly without this, then it will need modification.
I think BrokenStoreFileCleanerChore sees the same results from getCompactionTargets after these changes. When the compaction is finished, the StoreFileWriter is removed from the set in the finally block of compact, so getCompactionTargets will not include the files being written by that writer after that point. That is the same effect as in the previous implementation, where resetCompactionWriter set the writer field to null and the files being written by that writer likewise stopped appearing in getCompactionTargets results. But the timing has changed, that is true.
Logically we need this for correctness. IIRC, the problem here is that we can only clean up the writer instance after we successfully commit the store files to the SFT, i.e., after the replaceStoreFile method. That's why we cannot simply remove the writer instance in commitWriter; otherwise there could be data loss, i.e., the BrokenStoreFileCleanerChore may delete store files which were just written but have not been added to the SFT yet...
Let me check again if the new implementation can solve the problem.
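To make the ordering constraint concrete, here is a hedged sketch with stub types (method names such as replaceStoreFiles on the facade are illustrative, not the actual HBase signatures): the writer stays tracked until the new files are committed to the SFT, closing the window in which BrokenStoreFileCleanerChore could delete freshly written but uncommitted files.

```java
import java.nio.file.Path;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stub types standing in for the real HBase classes.
class StoreFileWriter {
  List<Path> close() { return List.of(); } // returns the files it produced
}
class CompactionRequestImpl { }
interface StoreFileTrackerFacade {
  void replaceStoreFiles(Collection<Path> compacted, Collection<Path> created);
}

class CompactionCommitSketch {
  private final Map<CompactionRequestImpl, StoreFileWriter> writerMap =
      new ConcurrentHashMap<>();

  List<Path> commit(CompactionRequestImpl request, Collection<Path> compactedFiles,
      StoreFileTrackerFacade sft) {
    StoreFileWriter writer = writerMap.get(request);
    if (writer == null) {
      return List.of(); // nothing tracked for this request
    }
    // 1. Close the writer: the new files now exist on disk, but are not
    //    yet recorded in the StoreFileTracker.
    List<Path> newFiles = writer.close();
    // 2. Commit to the SFT. Until this returns, the writer must remain in
    //    writerMap so the cleaner chore treats the files as in progress.
    sft.replaceStoreFiles(compactedFiles, newFiles);
    // 3. Only now is it safe to stop tracking the writer.
    writerMap.remove(request);
    return newFiles;
  }
}
```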
Thanks for that, you know BrokenStoreFileCleanerChore best.
Ah actually there is even a problem here in the current code, let me fix it...
Fixed, at least now we do not remove the writer from the set until after commit.
Pushed updates responding to review feedback.
Force-pushed from ce237a9 to 55206be.
🎊 +1 overall
FYI, we have some folks running this on a cluster against S3 now. I'll find their GitHub IDs to tag them here, so we can keep you up to date on real test runs :) Edit: hat-tip to @chrajeshbabu for now.
💔 -1 overall
return Collections.emptyList();
} else {
  // Finished, commit the writer's results.
  return commitWriter(writer, fd, request);
The commitWriter here just appends metadata and closes the writer; it does not record the files in the SFT...
We still need to add something after the replaceStoreFile call to remove the writer...
So maybe you are right: we need to use a Map instead of a Set, and the key can be the CompactionRequestImpl?
FWIW, we had a big …
Last set of changes broke tests so I will revert them. @Apache9 I think we need to go back to the drawing board here. We need to put the call out to SFT logic somewhere else, or we need to make Compactor something that is created per compaction, which was the assumption behind the changes that added the 'writer' field and which its semantics require. I will try the latter.
Creating a Compactor per compaction may be too big a change? FWIW, we have a Compactor instance in StoreEngine; if we want to make it per compaction, it will cause a very big refactoring. So I suggest that we add something like a compactionId to the CompactionRequest interface and use it as the key for our map. When calling StoreEngine.replaceStoreFile, we pass this id in; then we can use it to remove the writer in the Compactor. WDYT?
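A small, hypothetical sketch of that suggestion (the real CompactionRequest interface has no such id today, per this discussion; every name here is an assumption):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical: each request gets an id at creation time; the id is passed
// through the replace call so the Compactor can evict the matching writer.
class StoreFileWriter { }

class IdKeyedCompactor {
  private static final AtomicLong NEXT_ID = new AtomicLong();
  private final Map<Long, StoreFileWriter> writerMap = new ConcurrentHashMap<>();

  long register(StoreFileWriter writer) {
    long compactionId = NEXT_ID.incrementAndGet();
    writerMap.put(compactionId, writer);
    return compactionId;
  }

  // Called after the StoreEngine's SFT commit succeeds.
  void replaceStoreFilesCompleted(long compactionId) {
    writerMap.remove(compactionId);
  }
}
```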
Edit: I had a change of heart because the result is not bad and solves a couple of related problems. Anyway, my other idea wouldn't work because of resetCompactionWriter in StoreEngine, which assumes that Compactor is a singleton, even though the other SFT changes assume it is per compaction.
@Apache9 I have been testing with my original workaround for what it's worth, #4334. That change does not allow concurrent compaction against a given store, respecting that Compactor is not thread safe for now. It works well. The performance of the test scenario is unchanged from baseline without any SFT changes. As an option to unblock us, we could use it for now and come back to the implementation issues with SFT and compaction in a follow-up issue.
Maybe we do not call it resetCompactionWriter? We could just call it cleanupCompaction or something else, which indicates that the compaction is finally finished and the compactor should release all the resources related to this compaction.
I do not think this is a good way to solve the problem; we have allowed concurrent compactions on the same store at the same time in the past...
I still think it is messy.
I agree. It would be to unblock us and give more time to think about SFT design around compaction, not a permanent solution. Anyway I will update this PR as promised soon.
I agree with your feelings, Andrew. Correctness first and then optimization.
Acknowledging this too: yes, we should be able to compact two distinct subsets of the files in a store concurrently. And I could see value in doing so (e.g. compacting three larger files in a store while also compacting a few smaller hfiles created from memstore flushes -- we don't want to wait for the big compaction to finish before running the smaller one).
Posting an update soon. Making sure compaction unit tests all pass first.
@joshelser We have the other approach to fall back on but I think we can make the current code work with some polish.
Compactions might be concurrent against a given store and the Compactor is shared among them. Do not put mutable state into shared class fields. All Compactor class fields should be final or effectively final. 'keepSeqIdPeriod' is an exception to this rule because unit tests may set it.
Pushed an update that brings all of the above discussion together.
All compaction unit tests pass.
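A hedged illustration of the rule in that commit message (the writerMap and progressMap field names follow the diff fragments quoted later in this thread; the rest is an assumption): the shared Compactor keeps all state in final concurrent maps keyed by the request, rather than in mutable instance fields.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-ins for the real types.
class CompactionRequestImpl { }
class StoreFileWriter { }
class CompactionProgress { }

class SharedCompactorSketch {
  // Shared across concurrent compactions of the same store, so every field
  // is final; per-compaction state lives in maps keyed by the request.
  private final Map<CompactionRequestImpl, StoreFileWriter> writerMap =
      new ConcurrentHashMap<>();
  private final Map<CompactionRequestImpl, CompactionProgress> progressMap =
      new ConcurrentHashMap<>();

  void start(CompactionRequestImpl request, StoreFileWriter writer) {
    writerMap.put(request, writer);
    progressMap.put(request, new CompactionProgress());
  }

  void cleanup(CompactionRequestImpl request) {
    // Release everything associated with this one compaction.
    writerMap.remove(request);
    progressMap.remove(request);
  }
}
```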
🎊 +1 overall
I think the key problem here is that the compaction targets are used during the whole compaction lifetime, but the state is only stored in the Compactor, which is only used for generating the new store files, so no matter what we do it is still very awkward. I have another idea: maybe we could just pass a StoreFileWriterTargetCollector to the compactor, and the compactor implementation will add a new target to the collector when creating a new store file writer. The collector is stored in StoreEngine or HStore, so we are free to clean it up at any time; the compactor does not need to do the cleanup work any more. I could implement a POC to see if it works. In general, I think we should also track the targets of flushed store files as well, but this is a missing part in the original patch of HBASE-26271. Thanks.
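A rough sketch of that collector idea (the thread later notes the tracker is "just a Consumer" in most places; class names and structure here are assumptions): the store owns the set of in-flight targets and hands a callback to whoever creates writers, so cleanup no longer lives in the compactor.

```java
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

class StoreSketch {
  // The store, not the compactor, owns the in-flight writer targets.
  private final Set<Path> inFlightTargets = ConcurrentHashMap.newKeySet();

  Consumer<Path> newWriterCreationTracker() {
    return inFlightTargets::add;
  }

  void compactionOrFlushFinished(Path target) {
    // The store can clean up at any time, e.g. after the SFT commit.
    inFlightTargets.remove(target);
  }
}

class CompactorSketch {
  void createWriter(Path target, Consumer<Path> creationTracker) {
    creationTracker.accept(target); // record before any bytes are written
    // ... open the actual store file writer at 'target' ...
  }
}
```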
🎊 +1 overall
Please see if this POC could make things better... I've introduced a StoreFileWriterCreationTracker (actually in most places it is just a Consumer) for recording the creation of a StoreFileWriter, and this time we can also track the written files for flushes. The trackers are maintained in HStore, so the compactor does not need to store them any more.
Sorry for taking so long before really digging into the code. Your solution makes sense to me, Andrew. I also see why Duo doesn't think this is optimal.
Going over to his commit to look at that now.
// This signals that the target file is no longer written and can be cleaned up
completeCompaction(request);
Clarifying: we only need to call completeCompaction in the exceptional case (compaction didn't finish normally)? And commitWriter() down below is doing the same kind of cleanup work that completeCompaction() here is doing?
if (writer instanceof StoreFileWriter) {
  targets.add(((StoreFileWriter) writer).getPath());
} else {
  ((AbstractMultiFileWriter) writer).writers().stream()
Suggest: make this an else if (writer instanceof AbstractMultiFileWriter) to be defensive, and have the else branch fail loudly if something goes wrong.
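A minimal sketch of that suggestion using stand-in writer types: each known type gets an explicit branch, and anything unexpected fails loudly instead of being mis-cast.

```java
import java.nio.file.Path;
import java.util.Collection;
import java.util.List;

// Stand-ins for the real writer types.
class StoreFileWriter {
  private final Path path;
  StoreFileWriter(Path path) { this.path = path; }
  Path getPath() { return path; }
}
class AbstractMultiFileWriter {
  List<StoreFileWriter> writers() { return List.of(); }
}

class TargetCollector {
  static void addTargets(Object writer, Collection<Path> targets) {
    if (writer instanceof StoreFileWriter) {
      targets.add(((StoreFileWriter) writer).getPath());
    } else if (writer instanceof AbstractMultiFileWriter) {
      ((AbstractMultiFileWriter) writer).writers()
          .forEach(w -> targets.add(w.getPath()));
    } else {
      // Fail loudly on an unknown writer type instead of a blind cast.
      throw new IllegalStateException("Unexpected writer type: " + writer);
    }
  }
}
```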
writerMap.remove(request);
progressMap.remove(request);
I'm assuming that you're generally OK with having these two synchronized data structures updated not under the same lock? Given the current use of writerMap and progressMap, nothing is jumping out at me as bad as you currently have it. Just thinking out loud.
Do we care if either of the remove() calls returns null (i.e. the map doesn't have the mapping in it)?
Let @Apache9 propose a PR for his solution that moves the scope of where this state is tracked; that makes sense to me.
One thing I will note is I also tried to clean up progress reporting, which will need to be separately addressed.
Oh, sorry, I didn't mean to push my solution. The reason I implemented a POC is that we all seem to think the current approach in this PR is a bit messy, since we cannot clean up the state in the 'try...finally' style, so I wanted to give an impression of what the code would look like if we could do the cleanup that way. In fact I think the current solution in this PR is good enough to solve the problem for now, and all related classes are IA.Private, so we are free to apply this PR first and then change to another approach in the future. So I am neutral on these two approaches, especially if we want to fix the problem ASAP. Then we will have plenty of time to discuss what a better solution is, as it will not block the 2.5.0 release. Just my thoughts, sorry if I didn't say this clearly before posting the POC. Anyway, if you guys think the approach in the POC is generally good, let me convert it to a PR so you can review it better. Thanks~
Compactions might be concurrent against a given store and the Compactor is shared among them. Do not put mutable state into shared class fields. All Compactor class fields should be final. At the moment 'keepSeqIdPeriod' is an exception to this rule because some unit tests change it.
Compactor#getProgress and Compactor#getCompactionTargets now return union results of all compactions in progress against the store.
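A hedged sketch of what those union results could look like over the per-request maps (types simplified; the field names on CompactionProgress mirror HBase's, the rest is assumed): getCompactionTargets concatenates the targets of every in-progress compaction, and getProgress sums the per-request progress.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for the real types.
class CompactionRequestImpl { }
class CompactionProgress {
  long totalCompactingKVs;
  long currentCompactedKVs;
}

class UnionViewSketch {
  private final Map<CompactionRequestImpl, List<Path>> targetsMap =
      new ConcurrentHashMap<>();
  private final Map<CompactionRequestImpl, CompactionProgress> progressMap =
      new ConcurrentHashMap<>();

  // Union of the file targets of every compaction currently in progress.
  List<Path> getCompactionTargets() {
    List<Path> all = new ArrayList<>();
    targetsMap.values().forEach(all::addAll);
    return all;
  }

  // Aggregate progress across every compaction currently in progress.
  CompactionProgress getProgress() {
    CompactionProgress total = new CompactionProgress();
    for (CompactionProgress p : progressMap.values()) {
      total.totalCompactingKVs += p.totalCompactingKVs;
      total.currentCompactedKVs += p.currentCompactedKVs;
    }
    return total;
  }
}
```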