Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [ml] fix key_shared mode delivery are order out order after a consumers reconnection #23803

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Jan 2, 2025

Motivation

  • There is an issue that leads to repeated consumption when replay reading and cursor.rewind execute concurrently.

Information

  • read-position: 3:21
  • mark-deleted-position: 3: -1`
  • acked messages: 3:1 ~ 3:20
time replay reading cursor.rewind
1 Start a replay reading after message redelivery(3:0)
2 Replay reading is in-progress
3 cursor.rewind, which may be triggered by many cases
4 cursor.readPosition back to the first entry(3:0)
5 read completely, send 3:0 to consumer
6 read completely, send 3:0 ~ 3:20 to consumer

Issue: the consumer received [3:0, 3:0, 3:1, 3:2....3:20]. you can reproduce the issue by the new test testMixedReplayReadingAndNormalReading("testRepeatedDelivery")`

Modifications

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode self-assigned this Jan 2, 2025
Copy link

github-actions bot commented Jan 2, 2025

@poorbarcode Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@lhotari
Copy link
Member

lhotari commented Jan 3, 2025

Please update the PR title. It currently includes [ml] and this PR doesn't seem to have anything to do with the managed ledger module.
One details about PR titles: don't add a space between the prefixes. instead of of [fix] [broker] use [fix][broker] prefix.

@lhotari
Copy link
Member

lhotari commented Jan 7, 2025

@poorbarcode PIP-282 (implementation PR #21953) was made to address known out-of-order issues in the Key_Shared implementation. PIP-282 was targeted for Pulsar 4.0 and later replaced with PIP-379 implementation. It feels that major changes to the Pulsar 3.0 Key_Shared implementation should be made carefully due to the high chance of changes causing other regressions.
This PR doesn't provide a test that reproduces the issue. Just wondering why we should keep on making major changes to Pulsar 3.0 Key_Shared implementation due to the risk of regressions. It would be better that customers move on to Pulsar 4.0 and the PIP-379 implementation where many of the problems have been addressed.

@lhotari
Copy link
Member

lhotari commented Jan 23, 2025

@poorbarcode This diff of this PR is currently messed up, so hard to see what the changes are.
Just wondering if the correct solution would be to prevent replay reads and normal reads to happen at the same time? I reverted such changes from the recent PR with this commit 06827df .
There's currently a problem that when a replay read starts, it doesn't cancel a pending read which is waiting for new entries.

@poorbarcode poorbarcode force-pushed the fix/shouldRewindBeforeReadingOrReplaying branch from 1591c06 to 3cbf13d Compare January 23, 2025 09:42
@poorbarcode
Copy link
Contributor Author

poorbarcode commented Jan 23, 2025

@lhotari

@poorbarcode This diff of this PR is currently messed up, so hard to see what the changes are.

Rebased master, it is clear now.

Just wondering if the correct solution would be to prevent replay reads and normal reads to happen at the same time? I reverted such changes from the recent PR with this commit 06827df .

  • A normal reading can not start when there is a pending reading
  • A replay reading can start when there is a pending normal reading because maybe there are no more entries to read(the normal reading is just pending there and waiting for new messages)

I will modify the Motivation to make every thing clear late

@lhotari
Copy link
Member

lhotari commented Jan 23, 2025

  • A replay reading can start when there is a pending normal reading because maybe there are no more entries to read(the normal reading is just pending there and waiting for new messages)

"the normal reading is just pending there and waiting for new messages". It's better to cancel it since it does cause later problems. There's code to handle it, but the unnecessary work could be completely avoided.

@poorbarcode
Copy link
Contributor Author

poorbarcode commented Jan 23, 2025

@lhotari

"the normal reading is just pending there and waiting for new messages". It's better to cancel it since it does cause later problems. There's code to handle it, but the unnecessary work could be completely avoided.

  • It may lead E2E latency increases
  • The pending normal read will be looped to set and cancel
  • It may be a conflict with the feature read ahead before all messages sent are acknowledged
  • (Highlight) your solution can not solve the issue that this PR solves, see more detail at Motivation

@lhotari
Copy link
Member

lhotari commented Jan 23, 2025

  • It may lead E2E latency increases

There would be an opposite impact when handled correctly. When there's a race between a pending read and a replay read, either one will be discarded. This is why extra work is performed and that leads to increased latencies eventually since unnecessary reads are performed. There's currently no broker side caching for replay reads, so discarded replay reads will cause extra traffic between broker and bookies. To avoid the extra latency, when the pending read gets cancelled and once the replay read is complete, a new pending read should be issued so that new entries get handled immediately when they are available. I'll be doing this type of change to address #23845, which is one of the issues where a read doesn't continue when a read gets skipped.

  • The pending normal read will be looped to set and cancel
  • It may be a conflict with the feature read ahead before all messages sent are acknowledged

Please explain more about these items since I didn't catch the meaning.

  • (Highlight) your solution can not solve the issue that this PR solves, see more detail at Motivation

When I made the comment, there was no description for this PR, so I'll check the updated description.

@poorbarcode
Copy link
Contributor Author

@lhotari

BTW, we can start a new email channel to talk about this.

It may lead E2E latency increases
There would be an opposite impact when handled correctly.

I consider that the current implementation is the best solution

The pending normal read will be looped to set and cancel
It may be a conflict with the feature read ahead before all messages sent are acknowledged
Please explain more about these items since I didn't catch the meaning.

The case Read Ahead:

  • there is consumer stuck, which leads messages being pushed into redeliver-queue
  • there is consumer not stuck, which need to read new messages

I think we should not cancel pending normal read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants