Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data race in tunnel_set/receiver_set and establish call data #5650

Merged

Conversation

windtalker
Copy link
Contributor

@windtalker windtalker commented Aug 18, 2022

What problem does this PR solve?

Issue Number: ref #5095, close #5651

Problem Summary:

  • Data race in tunnel_set/receiver_set:
    In Some refinements of mpp_exchange_receiver_map and MPPTunnelSet #5132, it try to use the atomic variable status to protect tunnel_set/receiver_set in multiple thread read/write(MPPTask::runImpl thread to write and MPPTask::abort thread to read):
    if (status != INITIALIZING)
    throw Exception(fmt::format("The tunnel {} can not be registered, because the task is not in initializing state", tunnel->id()));
    tunnel_set->registerTunnel(MPPTaskId{task_meta.start_ts(), task_meta.task_id()}, tunnel);

    Unfortunately, it turns out not working becase the value of status can be changed after L145 and before L147
  • Data race in establish call data:
    void EstablishCallData::writeErr(const mpp::MPPDataPacket & packet)
    {
    state = ERR_HANDLE;
    if (write(packet))
    err_status = grpc::Status::OK;
    else
    err_status = grpc::Status(grpc::StatusCode::UNKNOWN, "Write error message failed for unknown reason.");
    }

    After write(packet) in L109, the call data is put back to completion queue, and it can be deleted anytime, so there is a chance that before executing L110, the call data itself has been deleted.

What is changed and how it works?

  • Add lock to protect tunnel_set and receiver_set
  • Don't access the member variable after write(packet), since write(packet) always return true, just set err_status to OK before write(packet)

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
    run random failpoint tests
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Aug 18, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • bestwoody
  • yibin87

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 18, 2022
}
LOG_FMT_TRACE(log, "waiting consumer finish!");
waitForSenderFinish(/*allow_throw=*/false);
close("");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In MPPTask's destruct, it will close all tunnels, so it is safe to make this change.

MPPTask::~MPPTask()
{
/// MPPTask maybe destructed by different thread, set the query memory_tracker
/// to current_memory_tracker in the destructor
if (current_memory_tracker != memory_tracker)
current_memory_tracker = memory_tracker;
closeAllTunnels("");
if (schedule_state == ScheduleState::SCHEDULED)

Copy link
Contributor

@bestwoody bestwoody Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if one of tunnels throw error in tunnel::close() when closeAllTunnels is called? Is it safe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MPPTunnel::close() should not throw error, if it throws error, I think maybe the TiFlash server will be terminated.

@windtalker windtalker removed needs-cherry-pick-release-6.0 Type: Need cherry pick to release-6.0 needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. labels Aug 18, 2022
Copy link
Contributor

@bestwoody bestwoody left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Aug 18, 2022
Copy link
Contributor

@yibin87 yibin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Aug 18, 2022
@yibin87
Copy link
Contributor

yibin87 commented Aug 18, 2022

/merge

@ti-chi-bot
Copy link
Member

@yibin87: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 2d9d10b

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Aug 18, 2022
@windtalker
Copy link
Contributor Author

/run-unit-test

@ti-chi-bot
Copy link
Member

@windtalker: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@windtalker
Copy link
Contributor Author

/rebuild

@windtalker
Copy link
Contributor Author

/run-integration-test

@sre-bot
Copy link
Collaborator

sre-bot commented Aug 18, 2022

Coverage for changed files

Filename                       Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Coprocessor/DAGContext.cpp          97                34    64.95%          26                 6    76.92%         175                44    74.86%          66                35    46.97%
Coprocessor/DAGContext.h            37                 9    75.68%          28                 7    75.00%          85                22    74.12%          12                 6    50.00%
EstablishCall.cpp                   83                71    14.46%          15                11    26.67%         135               117    13.33%          36                35     2.78%
Mpp/MPPTask.cpp                    511               203    60.27%          27                 8    70.37%         457               171    62.58%         186               115    38.17%
Mpp/MPPTask.h                        4                 1    75.00%           4                 1    75.00%           6                 1    83.33%           0                 0         -
Mpp/MPPTunnel.cpp                  380                75    80.26%          22                 1    95.45%         310                48    84.52%         142                42    70.42%
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                             1112               393    64.66%         122                34    72.13%        1168               403    65.50%         442               233    47.29%

Coverage summary

Functions  MissedFunctions  Executed  Lines   MissedLines  Cover
18384      8351             54.57%    211845  86351        59.24%

full coverage report (for internal network access only)

@ti-chi-bot ti-chi-bot merged commit 7428d37 into pingcap:master Aug 18, 2022
ti-chi-bot pushed a commit to ti-chi-bot/tiflash that referenced this pull request Aug 18, 2022
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #5652.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-cherry-pick-release-6.2 release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data race in tunnel_set/receiver_set and establish call data
5 participants