
Disagg: Add nullptr check on S3LockLocalManager #9387

Merged
8 commits merged on Sep 2, 2024

Conversation

CalvinNeo
Member

@CalvinNeo CalvinNeo commented Aug 30, 2024

What problem does this PR solve?

Issue Number: ref #8673 ref #9394

Problem Summary:

What is changed and how it works?


We have not yet figured out why there is a nullptr crash; this PR adds some checks along with logging to help find the reason.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

a
Signed-off-by: Calvin Neo <[email protected]>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 30, 2024
@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Aug 30, 2024
@JaySon-Huang
Contributor

/hold
We need to figure out why the s3lock_client is nullptr when S3LockLocalManager::createS3Lock is called

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 30, 2024
@CalvinNeo
Member Author

CalvinNeo commented Aug 30, 2024

/hold We need to figure out why the s3lock_client is nullptr when S3LockLocalManager::createS3Lock is called

Because it is lazily initialized in UniversalPageStorageService::uploadCheckpointImpl. So if the store has not uploaded any checkpoint and a FAP snapshot comes, it could fire. @JaySon-Huang
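The race described above can be sketched as follows. This is a minimal, hypothetical stand-in (the names `S3LockClient` and `LockManager` are invented for illustration, not the actual TiFlash types): a client pointer that is only set lazily, and a consumer that may run before initialization. The defensive variant fails with a descriptive error instead of dereferencing a null pointer and dying with SIGSEGV, which is the spirit of the check this PR adds.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the real S3 lock client.
struct S3LockClient
{
    std::string createLock(const std::string & key) { return "lock:" + key; }
};

// Hypothetical stand-in for S3LockLocalManager.
struct LockManager
{
    // Lazily set on the first checkpoint upload; nullptr until then.
    std::shared_ptr<S3LockClient> s3lock_client;

    // Defensive variant: surface a descriptive error instead of
    // dereferencing a null pointer and crashing the process.
    std::string createS3Lock(const std::string & key)
    {
        if (s3lock_client == nullptr)
            throw std::runtime_error("s3lock_client is not inited, key=" + key);
        return s3lock_client->createLock(key);
    }
};
```

Calling `createS3Lock` before the client is assigned now produces a diagnosable error rather than a segfault; after assignment it behaves normally.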

@CalvinNeo
Member Author

/retest

@JaySon-Huang
Contributor

JaySon-Huang commented Aug 30, 2024

@CalvinNeo The waitUntilInited in createS3LockForWriteBatch should block until s3lock_client is inited

void S3LockLocalManager::createS3LockForWriteBatch(UniversalWriteBatch & write_batch)
{
waitUntilInited();

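A wait-until-inited gate like the one quoted above is typically built from a mutex, a condition variable, and a flag. The sketch below is an assumption about the general pattern, not the actual TiFlash implementation (`InitGate`, `markInited`, and `isInited` are invented names): waiters block on the condition variable until the initializer flips the flag and notifies.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical sketch of a wait-until-inited gate, similar in spirit to
// the waitUntilInited() call quoted above.
class InitGate
{
public:
    // Called once by the initializer (e.g. after the first checkpoint upload).
    void markInited()
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            inited = true;
        }
        cv.notify_all();
    }

    // Blocks the caller until markInited() has been called.
    // The predicate form of wait() handles spurious wakeups.
    void waitUntilInited()
    {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [this] { return inited; });
    }

    bool isInited()
    {
        std::lock_guard<std::mutex> lock(mtx);
        return inited;
    }

private:
    std::mutex mtx;
    std::condition_variable cv;
    bool inited = false;
};
```

If the real gate works like this, a consumer that went through `waitUntilInited()` should never observe a null client, which is why the crash path is puzzling.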
@JaySon-Huang
Contributor

JaySon-Huang commented Aug 30, 2024

And the logs show that a checkpoint was uploaded right before this crash happened

[2024/08/28 07:53:21.562 +00:00] [ERROR] [BaseDaemon.cpp:563] ["
  0xaaaad36f7a2c    faultSignalHandler(int, siginfo_t*, void*) [tiflash+122124844]
                    libs/libdaemon/src/BaseDaemon.cpp:214
  0xffff9594e850    <unknown symbol> [linux-vdso.so.1+2128]
  0xaaaad472ac58    DB::PS::V3::S3LockLocalManager::createS3Lock(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, DB::S3::S3FilenameView const&, unsigned long) [tiflash+139111512]
                    dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.cpp:229
  0xaaaad472a57c    DB::PS::V3::S3LockLocalManager::createS3LockForWriteBatch(DB::UniversalWriteBatch&) [tiflash+139109756]
                    dbms/src/Storages/Page/V3/Universal/S3LockLocalManager.cpp:173
  0xaaaad4720150    DB::UniversalPageStorage::write(DB::UniversalWriteBatch&&, DB::PS::V3::PageType, std::__1::shared_ptr<DB::WriteLimiter> const&) const [tiflash+139067728]
                    dbms/src/Storages/Page/V3/Universal/UniversalPageStorage.cpp:118
  0xaaaad462dc24    DB::PageWriter::write(DB::WriteBatchWrapper&&, std::__1::shared_ptr<DB::WriteLimiter>) const [tiflash+138075172]
                    dbms/src/Storages/Page/PageStorage.cpp:780
  0xaaaad3495588    DB::DM::WriteBatches::writeLogAndData() [tiflash+119625096]
                    dbms/src/Storages/DeltaMerge/WriteBatchesImpl.h:146
  0xaaaad3573f3c    DB::DM::StableValueSpace::createFromCheckpoint(std::__1::shared_ptr<DB::Logger> const&, DB::DM::DMContext&, std::__1::shared_ptr<DB::UniversalPageStorage>, unsigned long, DB::DM::WriteBatches&) [tiflash+120536892]
                    dbms/src/Storages/DeltaMerge/StableValueSpace.cpp:283
  0xaaaad34ef63c    DB::DM::Segment::createTargetSegmentsFromCheckpoint(std::__1::shared_ptr<DB::Logger> const&, DB::DM::DMContext&, unsigned long, std::__1::vector<DB::DM::Segment::SegmentMetaInfo, std::__1::allocator<DB::DM::Segment::SegmentMetaInfo>> const&, DB::DM::RowKeyRange const&, std::__1::shared_ptr<DB::UniversalPageStorage>, DB::DM::WriteBatches&) [tiflash+119993916]
                    dbms/src/Storages/DeltaMerge/Segment.cpp:490
  0xaaaad407d870    DB::DM::DeltaMergeStore::buildSegmentsFromCheckpointInfo(std::__1::shared_ptr<DB::DM::DMContext> const&, DB::DM::RowKeyRange const&, std::__1::shared_ptr<DB::CheckpointInfo> const&) const [tiflash+132110448]
                    dbms/src/Storages/DeltaMerge/DeltaMergeStore_Ingest.cpp:1140
  0xaaaad4068954    DB::DM::DeltaMergeStore::buildSegmentsFromCheckpointInfo(DB::Context const&, DB::Settings const&, DB::DM::RowKeyRange const&, std::__1::shared_ptr<DB::CheckpointInfo> const&) [tiflash+132024660]
                    dbms/src/Storages/DeltaMerge/DeltaMergeStore.h:401
  0xaaaad4c469ac    DB::FastAddPeerImplWrite(DB::TMTContext&, DB::TiFlashRaftProxyHelper const*, unsigned long, unsigned long, std::__1::tuple<std::__1::shared_ptr<DB::CheckpointInfo>, std::__1::shared_ptr<DB::Region>, raft_serverpb::RaftApplyState, raft_serverpb::RegionLocalState>&&, unsigned long) [tiflash+144468396]
                    dbms/src/Storages/KVStore/MultiRaft/Disagg/FastAddPeer.cpp:348
  0xaaaad4c47e30    DB::FastAddPeerImpl(std::__1::shared_ptr<DB::FastAddPeerContext>, DB::TMTContext&, DB::TiFlashRaftProxyHelper const*, unsigned long, unsigned long, unsigned long) [tiflash+144473648]
                    dbms/src/Storages/KVStore/MultiRaft/Disagg/FastAddPeer.cpp:455
  0xaaaad4c4b30c    std::__1::__function::__func<FastAddPeer::$_5, std::__1::allocator<FastAddPeer::$_5>, DB::FastAddPeerRes ()>::operator()() [tiflash+144487180]
                    dbms/src/Storages/KVStore/MultiRaft/Disagg/FastAddPeer.cpp:670
  0xaaaad4c4f670    std::__1::packaged_task<DB::FastAddPeerRes ()>::operator()() [tiflash+144504432]
                    /usr/lib/llvm-17/bin/../include/c++/v1/future:1891
  0xaaaad4c4f2dc    DB::AsyncTasks<unsigned long, std::__1::function<DB::FastAddPeerRes ()>, DB::FastAddPeerRes>::addTaskWithCancel(unsigned long, std::__1::function<DB::FastAddPeerRes ()>, std::__1::function<void ()>)::'lambda'()::operator()() const [tiflash+144503516]
                    dbms/src/Storages/KVStore/Utils/AsyncTasks.h:289
  0xaaaace47903c    DB::ThreadPoolImpl<DB::ThreadFromGlobalPoolImpl<false>>::worker(std::__1::__list_iterator<DB::ThreadFromGlobalPoolImpl<false>, void*>) [tiflash+35622972]
                    /usr/lib/llvm-17/bin/../include/c++/v1/__functional/function.h:517
  0xaaaace47bd94    std::__1::__function::__func<DB::ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<bool DB::ThreadPoolImpl<DB::ThreadFromGlobalPoolImpl<false>>::scheduleImpl<bool>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(bool&&)::'lambda'(), std::__1::allocator<DB::ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<bool DB::ThreadPoolImpl<DB::ThreadFromGlobalPoolImpl<false>>::scheduleImpl<bool>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>(bool&&)::'lambda'()>, void ()>::operator()() [tiflash+35634580]
                    dbms/src/Common/UniThreadPool.cpp:160
  0xaaaace477b38    DB::ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) [tiflash+35617592]
                    /usr/lib/llvm-17/bin/../include/c++/v1/__functional/function.h:517
  0xaaaace479f9c    void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void DB::ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, long, std::__1::optional<unsigned long>, bool)::'lambda0'()>>(void*) [tiflash+35626908]
                    dbms/src/Common/UniThreadPool.cpp:160
  0xffff9201d5c8    start_thread [libc.so.6+513480]
                    ./nptl/./nptl/pthread_create.c:442"] [source=BaseDaemon] [thread_id=2007]
[2024/08/28 07:53:17.340 +00:00] [ERROR] [BaseDaemon.cpp:416] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=2007]
[2024/08/28 07:53:17.340 +00:00] [ERROR] [BaseDaemon.cpp:399] ["Address: NULL pointer."] [source=BaseDaemon] [thread_id=2007]
[2024/08/28 07:53:17.340 +00:00] [ERROR] [BaseDaemon.cpp:371] ["(from thread 1957) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=2007]
[2024/08/28 07:53:17.340 +00:00] [ERROR] [BaseDaemon.cpp:370] [########################################] [source=BaseDaemon] [thread_id=2007]
[2024/08/28 07:53:16.829 +00:00] [INFO] [UniversalPageStorageService.cpp:294] ["Upload checkpoint success, upload_sequence=3471 incremental_bytes=3838 compact_bytes=0"] [source=UniPSService] [thread_id=4]
[2024/08/28 07:53:16.828 +00:00] [INFO] [UniversalPageStorage.cpp:597] ["Checkpoint result: files=[/data0/tmp/checkpoint_upload_3471/dat_3471_0] dump_snapshot=0.012s dump_data=0.074s upload=0.178s copy_checkpoint_info=0.004s total=0.270s sequence=3471 CPDataDumpStats{incremental_data_bytes=3838 compact_data_bytes=0 n_records{total=18216 pages_unchanged=15902 pages_compact=0 pages_incremental=57 ext_pages=2165 ref_pages=46 delete=46 other=0} types[{type=Unknown keys=0 bytes=0} {type=Log keys=128 bytes=0} {type=Data keys=2257 bytes=0} {type=Meta keys=6474 bytes=1097} {type=KVStore keys=1492 bytes=1757} {type=RaftEngine keys=0 bytes=0} {type=KVEngine keys=7865 bytes=984} {type=LocalKV keys=0 bytes=0}]}"] [source=uni_write] [thread_id=4]
[2024/08/28 07:53:16.597 +00:00] [INFO] [PageDirectory.cpp:2233] ["Dumped snapshot to edits.[sequence=406856]"] [source=uni_write_3471] [thread_id=4]
[2024/08/28 07:53:16.570 +00:00] [INFO] [PageDirectory.cpp:2233] ["Dumped snapshot to edits.[sequence=406856]"] [source=uni_write] [thread_id=4]
[2024/08/28 07:45:09.231 +00:00] [INFO] [UniversalPageStorageService.cpp:294] ["Upload checkpoint success, upload_sequence=3470 incremental_bytes=9026 compact_bytes=0"] [source=UniPSService] [thread_id=4]
[2024/08/28 07:45:09.230 +00:00] [INFO] [UniversalPageStorage.cpp:597] ["Checkpoint result: files=[/data0/tmp/checkpoint_upload_3470/dat_3470_0] dump_snapshot=0.012s dump_data=0.070s upload=0.285s copy_checkpoint_info=0.004s total=0.373s sequence=3470 CPDataDumpStats{incremental_data_bytes=9026 compact_data_bytes=0 n_records{total=18131 pages_unchanged=15900 pages_compact=0 pages_incremental=2 ext_pages=2151 ref_pages=39 delete=39 other=0} types[{type=Unknown keys=0 bytes=0} {type=Log keys=128 bytes=8771} {type=Data keys=2229 bytes=0} {type=Meta keys=6453 bytes=255} {type=KVStore keys=1485 bytes=0} {type=RaftEngine keys=0 bytes=0} {type=KVEngine keys=7836 bytes=0} {type=LocalKV keys=0 bytes=0}]}"] [source=uni_write] [thread_id=4]
[2024/08/28 07:45:08.895 +00:00] [INFO] [PageDirectory.cpp:2233] ["Dumped snapshot to edits.[sequence=406693]"] [source=uni_write_3470] [thread_id=4]
[2024/08/28 07:45:08.870 +00:00] [INFO] [PageDirectory.cpp:2233] ["Dumped snapshot to edits.[sequence=406695]"] [source=uni_write] [thread_id=4]
...
[2024/08/28 04:35:01.281 +00:00] [INFO] [S3LockLocalManager.cpp:109] ["restore from S3 finish, elapsed=0.175s last_upload_sequence=3457 num_s3_entries=18174 num_copied_entries=15945 last_prefix=file_format: 1 local_sequence: 406486 last_local_sequence: 406472 create_at_ms: 1724808302712 writer_info { store_id: 34623828 version: \"v7.1.0-alpha-640-gee20c4ff9\" version_git: \"ee20c4ff9162e414dfb9f8f845238d608aad4f83\" start_at_ms: 1723531926000 remote_info { type_name: \"S3\" root: \"tiflash-prod-s0001/\" } }"] [thread_id=4]

@CalvinNeo
Member Author

CalvinNeo commented Sep 2, 2024

@CalvinNeo The waitUntilInited in createS3LockForWriteBatch should block until s3lock_client is inited

void S3LockLocalManager::createS3LockForWriteBatch(UniversalWriteBatch & write_batch)
{
waitUntilInited();

It is strange that this panic happens; I can find no path that leads to it. Maybe we can add some logs for it. @JaySon-Huang

@JaySon-Huang JaySon-Huang changed the title Disagg: Fix s3 lock nullptr Disagg: Add nullptr check on S3LockLocalManager Sep 2, 2024
@JaySon-Huang
Contributor

/hold cancel

We have not yet figured out why there is a nullptr crash; this PR adds some checks along with logging to help find the reason

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 2, 2024
Contributor

@JaySon-Huang JaySon-Huang left a comment


lgtm

Contributor

ti-chi-bot bot commented Sep 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JaySon-Huang, Lloyd-Pottiger

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JaySon-Huang,Lloyd-Pottiger]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Sep 2, 2024
Contributor

ti-chi-bot bot commented Sep 2, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-08-30 03:48:58.163681585 +0000 UTC m=+1101333.298131770: ☑️ agreed by Lloyd-Pottiger.
  • 2024-09-02 10:23:43.638474221 +0000 UTC m=+267148.156527146: ☑️ agreed by JaySon-Huang.

Contributor

ti-chi-bot bot commented Sep 2, 2024

@CalvinNeo: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

and trigger some heavy tests which do not always run when the PR is updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@JaySon-Huang
Contributor

/retest

@ti-chi-bot ti-chi-bot bot merged commit 0761d45 into pingcap:master Sep 2, 2024
5 checks passed
CalvinNeo added a commit to CalvinNeo/tiflash that referenced this pull request Sep 23, 2024
…gcap#282)

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.