Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only SyncClosedLogs for multiple CFs #4460

Closed
wants to merge 1 commit into from

Conversation

solicomo
Copy link
Contributor

@solicomo solicomo commented Oct 5, 2018

Call SyncClosedLogs() only if there are more than one column families.

Update several unit tests (in fault_injection_test and db_flush_test) correspondingly.

See #3840 for more info.

@solicomo
Copy link
Contributor Author

solicomo commented Oct 9, 2018

@miasantreble PTAL.

@riversand963
Copy link
Contributor

Nice. Also cc @ajkr

@riversand963
Copy link
Contributor

#4416 is also duplicate of this. I will close #4416 after merging this one.

@solicomo
Copy link
Contributor Author

#4416 is also duplicate of this. I will close #4416 after merging this one.

Sorry. I didn't know you are working on it.

@riversand963
Copy link
Contributor

I took a pass, and found that you removed the case of single column family, which should fail according to the test result of #4416 . Can you add a test for single column family case, too?

@solicomo
Copy link
Contributor Author

I took a pass, and found that you removed the case of single column family, which should fail according to the test result of #4416 . Can you add a test for single column family case, too?

Are you saying test DBFlushTest::SyncFail?

@riversand963
Copy link
Contributor

I took a pass, and found that you removed the case of single column family, which should fail according to the test result of #4416 . Can you add a test for single column family case, too?

Are you saying test DBFlushTest::SyncFail?

Yes. Original implementation always calls SyncClosedLogs since the number of column families is always greater than 0. This PR changed the logic such that SyncClosedLogs is not called when there is only one column family. Therefore, we are interested in what would happen if there is only one column family in addition to your updated version of the test.

Copy link
Contributor

@riversand963 riversand963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @solicomo for this PR. I've left a comment.

@solicomo
Copy link
Contributor Author

I took a pass, and found that you removed the case of single column family, which should fail according to the test result of #4416 . Can you add a test for single column family case, too?

Are you saying test DBFlushTest::SyncFail?

Yes. Original implementation always calls SyncClosedLogs since the number of column families is always greater than 0. This PR changed the logic such that SyncClosedLogs is not called when there is only one column family. Therefore, we are interested in what would happen if there is only one column family in addition to your updated version of the test.

Agreed. I'm working on it.
Thanks for your suggestion.

@solicomo solicomo force-pushed the issue-3840-SyncClosedLogs branch from 8ab5782 to 94781ae Compare October 14, 2018 00:07
@solicomo
Copy link
Contributor Author

Hi @riversand963 , the unit test is added. Could you please review it again?
Thanks.

@riversand963
Copy link
Contributor

Hi @solicomo , thanks for working on this. I noticed that you added a column family 'pikachu' in DBFlushTest.SyncFail. Therefore, in DBFlushTest.SyncFail, the number of column families now is 2 ('default' and 'pikachu'), and SyncClosedLogs is still always called.
Also, is it necessary to add one column family in FaultInjectionTest? The change you made in db_impl_compaction_flush.cc will mostly affect cases in which there is only one column family.

@solicomo
Copy link
Contributor Author

Hi @solicomo , thanks for working on this. I noticed that you added a column family 'pikachu' in DBFlushTest.SyncFail. Therefore, in DBFlushTest.SyncFail, the number of column families now is 2 ('default' and 'pikachu'), and SyncClosedLogs is still always called.
Also, is it necessary to add one column family in FaultInjectionTest? The change you made in db_impl_compaction_flush.cc will mostly affect cases in which there is only one column family.

I think DBFlushTest.SyncFail is trying to test the behavior of RocksDB when the filesystem is inactive.
That's why fault_injection_env is used and set to be inactive before Sync.

From the SyncPoints in DBFlushTest.SyncFail, we can also know that SyncClosedLogs() is expected to be called.

To make sure SyncClosedLogs() can be called, we need at least two column families.

Similar thing to the tests in FaultInjectionTest.

I hope I understand the code right. Please let me know if I am wrong.
Thank you.

@riversand963
Copy link
Contributor

@solicomo Due to the sync points dependency in DBFlushTest.SyncFail, I agree that adding a column family here is necessary to make it pass, otherwise the test will wait forever.
However, in FaultInjectionTest, can you try not add extra column family? My local run (also #4416) shows that, with the proposed change to FlushMemTableToOutputFile, FaultInjectionTest fails because certain logs are missing. They are missing because we no longer sync closed logs with the new change since there is only one column family. This means FaultInjectionTest catches a bug, which is a good thing. The reason why your PR passes this test is because you have two column families, thus RocksDB still sync the closed logs. I don't think we can change this test so that only one code path is followed, while the other causes problem. Please let me know if I have missed anything. Thanks

@solicomo
Copy link
Contributor Author

@solicomo Due to the sync points dependency in DBFlushTest.SyncFail, I agree that adding a column family here is necessary to make it pass, otherwise the test will wait forever.
However, in FaultInjectionTest, can you try not add extra column family? My local run (also #4416) shows that, with the proposed change to FlushMemTableToOutputFile, FaultInjectionTest fails because certain logs are missing. They are missing because we no longer sync closed logs with the new change since there is only one column family. This means FaultInjectionTest catches a bug, which is a good thing. The reason why your PR passes this test is because you have two column families, thus RocksDB still sync the closed logs. I don't think we can change this test so that only one code path is followed, while the other causes problem. Please let me know if I have missed anything. Thanks

Are you saying, even SyncClosedLogs() is not called, the "missing logs" are still expected to exist?
If they don't exist, there must be a bug and we need to figure out why they don't exist.
Do I understand right?

@riversand963
Copy link
Contributor

Sorry for the confusion.
I suggested that we do not add an extra column family to force the control flow of the program to take a specific branch.
If you try this suggestion, you can see that the tests will fail because certain files are missing. We need to understand whether these missing files will hurt the reliability of the database.

@solicomo
Copy link
Contributor Author

Sorry for the confusion.
I suggested that we do not add an extra column family to force the control flow of the program to take a specific branch.
If you try this suggestion, you can see that the tests will fail because certain files are missing. We need to understand whether these missing files will hurt the reliability of the database.

OK, got it.
I'm gonna investigate why we lost those log files.
Thank you.

@solicomo solicomo force-pushed the issue-3840-SyncClosedLogs branch 2 times, most recently from 4f0ca8d to 5e95605 Compare October 25, 2018 05:32
@solicomo solicomo changed the title only SyncClosedLogs for multiple CFs Fix #3840: only SyncClosedLogs for multiple CFs Oct 25, 2018
@solicomo solicomo force-pushed the issue-3840-SyncClosedLogs branch from 5e95605 to a0c88a0 Compare October 29, 2018 12:20
@solicomo
Copy link
Contributor Author

Status Update: I am still working on it. It's almost there.

I'm sorry for the delay. But don't blame me, blame Zelda and Nintendo.

@solicomo solicomo force-pushed the issue-3840-SyncClosedLogs branch 2 times, most recently from 6255f44 to 8979c08 Compare October 31, 2018 21:06
@solicomo
Copy link
Contributor Author

I rebased and squashed my commits and reworded the commit message.

@riversand963 Could you take a look again?

The reason of test failure DBFlushTest.SyncFail is FaultInjectionTestEnv fails to untrack the files already deleted.
See summary in commit message for details.

@siying Could you please take a look, too? Because it is related to DBWALTestWithEnrichedEnv.SkipDeletedWALs.

Thanks.

db/db_wal_test.cc Show resolved Hide resolved
fname = dir_to_sync + "/" + to_delete;
fname = dir_to_sync
+ (dir_to_sync.back() == '/' || to_delete.front() == '/' ? "" : "/")
+ to_delete;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert dir_to_sync and to_delete are not empty.
Actually, maybe dir_to_sync can be empty?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my last comment. If you address this, I'll defer to @riversand963 to accept and merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new commit is pushed.

Actually, I think maybe we need something like canonicalize_file_name() or realpath().

Thanks for your review.

db/db_wal_test.cc Show resolved Hide resolved
Summary:

Call `SyncClosedLogs()` only if there are more than one column families.
See facebook#3840 for more info.

The fix is pretty simple. We only need to change 0 to 1. But the hard part of this PR is fixing the test failures uncovered by this fix.

- The first test failure came up is `DBFlushTest.SyncFail`. It is the easiest one.
  This test is trying to test the behavior of `SyncClosedLogs` when the file system is inactive. So it expects `SyncClosedLogs` can be called.
  To make sure `SyncClosedLogs` can be called, we have to create/open RocksDB with at least 2 column families.

Extra '/' is to blame for all the following test failures.

- The next test failure is `FaultInjectionTestSplitted.FaultTest`.
  After several steps, this test will try to delete all the files which is not fully synced.
  `FaultInjectionTestEnv` is used in this test case. `FaultInjectionTestEnv` is tracking all the files opened and deleted.
  Unfortunately, when a file is opened, `FaultInjectionTestEnv` is told that the file name is like `testrocksdb-26040/fault_test_2273688893793634580/000003.log`,
  but when that file is deleted, `FaultInjectionTestEnv` is told that the file name is like `testrocksdb-26040/fault_test_2273688893793634580//000003.log`.
  See the extra '/'?

  This extra '/' is added to the file name [on line 459 in `db_impl_files.cc`](https://github.com/facebook/rocksdb/blob/v5.15.10/db/db_impl_files.cc#L459).

  But why this test case passed before? Maybe it passed "by mistake".

- `DBWALTestWithEnrichedEnv.SkipDeletedWALs` is the last test failure.
  After I fixed the extra '/' issue, I got this test failure.
  This test case was added in this commit: facebook@d595492
  This commit tries to skip deleted WALs during recovery. If I am right, it works only when `allow_2pc` is enabled.
  But in this test case, `allow_2pc` is not enabled at all.

  Again, why this test case passed before? I think it is the extra '/' who should be blamed.
@solicomo solicomo force-pushed the issue-3840-SyncClosedLogs branch from 8979c08 to efbd9a3 Compare November 1, 2018 23:10
@riversand963
Copy link
Contributor

Good job! @solicomo, thanks for the improvement you make. Let me import your change for our internal build/check. Once that's done, we can merge it.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@sagar0 sagar0 changed the title Fix #3840: only SyncClosedLogs for multiple CFs Only SyncClosedLogs for multiple CFs Nov 13, 2018
ajkr pushed a commit to cockroachdb/rocksdb that referenced this pull request Apr 10, 2019
)

Summary:
Call `SyncClosedLogs()` only if there are more than one column families.

Update several unit tests (in `fault_injection_test` and `db_flush_test`) correspondingly.

See facebook#3840 for more info.
Pull Request resolved: facebook#4460

Differential Revision: D12896377

Pulled By: riversand963

fbshipit-source-id: f49afdaec32568f12f001219a3aec1dfde3b32bf
ajkr pushed a commit to cockroachdb/rocksdb that referenced this pull request Apr 16, 2019
)

Summary:
Call `SyncClosedLogs()` only if there are more than one column families.

Update several unit tests (in `fault_injection_test` and `db_flush_test`) correspondingly.

See facebook#3840 for more info.
Pull Request resolved: facebook#4460

Differential Revision: D12896377

Pulled By: riversand963

fbshipit-source-id: f49afdaec32568f12f001219a3aec1dfde3b32bf
@ajkr ajkr mentioned this pull request Jun 26, 2019
32 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants