[fix] [ml] fix key_shared mode delivery are order out order after a consumers reconnection #23803

poorbarcode · 2025-01-02T11:00:22Z

Motivation

There is an issue that leads to repeated consumption when replay reading and cursor.rewind execute concurrently.

Information

read-position: 3:21
mark-deleted-position: 3: -1`
acked messages: 3:1 ~ 3:20

time	replay reading	`cursor.rewind`
1	Start a replay reading after message redelivery(`3:0`)
2	Replay reading is in-progress
3		`cursor.rewind`, which may be triggered by many cases
4		`cursor.readPosition` back to the first entry(`3:0`)
5	read completely, send `3:0` to consumer
6		read completely, send `3:0 ~ 3:20` to consumer

Issue: the consumer received [3:0, 3:0, 3:1, 3:2....3:20]. you can reproduce the issue by the new test testMixedReplayReadingAndNormalReading("testRepeatedDelivery")`

Modifications

Fix the issue
Add a test for the reverted PR, see more detail here: [fix][broker] Revert "[fix][broker] Cancel possible pending replay read in cancelPendingRead (#23384)" #23855 (comment)

Documentation

doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository

PR in forked repository: x

github-actions · 2025-01-02T11:00:51Z

@poorbarcode Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

lhotari · 2025-01-03T10:03:45Z

Please update the PR title. It currently includes [ml] and this PR doesn't seem to have anything to do with the managed ledger module.
One details about PR titles: don't add a space between the prefixes. instead of of [fix] [broker] use [fix][broker] prefix.

lhotari · 2025-01-07T10:02:07Z

@poorbarcode PIP-282 (implementation PR #21953) was made to address known out-of-order issues in the Key_Shared implementation. PIP-282 was targeted for Pulsar 4.0 and later replaced with PIP-379 implementation. It feels that major changes to the Pulsar 3.0 Key_Shared implementation should be made carefully due to the high chance of changes causing other regressions.
This PR doesn't provide a test that reproduces the issue. Just wondering why we should keep on making major changes to Pulsar 3.0 Key_Shared implementation due to the risk of regressions. It would be better that customers move on to Pulsar 4.0 and the PIP-379 implementation where many of the problems have been addressed.

lhotari · 2025-01-23T09:34:20Z

@poorbarcode This diff of this PR is currently messed up, so hard to see what the changes are.
Just wondering if the correct solution would be to prevent replay reads and normal reads to happen at the same time? I reverted such changes from the recent PR with this commit 06827df .
There's currently a problem that when a replay read starts, it doesn't cancel a pending read which is waiting for new entries.

…onsumers reconnection

poorbarcode · 2025-01-23T09:48:00Z

@lhotari

@poorbarcode This diff of this PR is currently messed up, so hard to see what the changes are.

Rebased master, it is clear now.

Just wondering if the correct solution would be to prevent replay reads and normal reads to happen at the same time? I reverted such changes from the recent PR with this commit 06827df .

A normal reading can not start when there is a pending reading
A replay reading can start when there is a pending normal reading because maybe there are no more entries to read(the normal reading is just pending there and waiting for new messages)

I will modify the Motivation to make every thing clear late

lhotari · 2025-01-23T09:51:19Z

A replay reading can start when there is a pending normal reading because maybe there are no more entries to read(the normal reading is just pending there and waiting for new messages)

"the normal reading is just pending there and waiting for new messages". It's better to cancel it since it does cause later problems. There's code to handle it, but the unnecessary work could be completely avoided.

poorbarcode · 2025-01-23T14:10:03Z

@lhotari

"the normal reading is just pending there and waiting for new messages". It's better to cancel it since it does cause later problems. There's code to handle it, but the unnecessary work could be completely avoided.

It may lead E2E latency increases
The pending normal read will be looped to set and cancel
It may be a conflict with the feature read ahead before all messages sent are acknowledged
(Highlight) your solution can not solve the issue that this PR solves, see more detail at Motivation

lhotari · 2025-01-23T14:52:25Z

It may lead E2E latency increases

There would be an opposite impact when handled correctly. When there's a race between a pending read and a replay read, either one will be discarded. This is why extra work is performed and that leads to increased latencies eventually since unnecessary reads are performed. There's currently no broker side caching for replay reads, so discarded replay reads will cause extra traffic between broker and bookies. To avoid the extra latency, when the pending read gets cancelled and once the replay read is complete, a new pending read should be issued so that new entries get handled immediately when they are available. I'll be doing this type of change to address #23845, which is one of the issues where a read doesn't continue when a read gets skipped.

The pending normal read will be looped to set and cancel

It may be a conflict with the feature read ahead before all messages sent are acknowledged

Please explain more about these items since I didn't catch the meaning.

(Highlight) your solution can not solve the issue that this PR solves, see more detail at Motivation

When I made the comment, there was no description for this PR, so I'll check the updated description.

poorbarcode · 2025-01-24T02:13:14Z

@lhotari

BTW, we can start a new email channel to talk about this.

It may lead E2E latency increases
There would be an opposite impact when handled correctly.

I consider that the current implementation is the best solution

The pending normal read will be looped to set and cancel
It may be a conflict with the feature read ahead before all messages sent are acknowledged
Please explain more about these items since I didn't catch the meaning.

The case Read Ahead:

there is consumer stuck, which leads messages being pushed into redeliver-queue
there is consumer not stuck, which need to read new messages

I think we should not cancel pending normal read.

poorbarcode self-assigned this Jan 2, 2025

github-actions bot added the doc-label-missing label Jan 2, 2025

poorbarcode mentioned this pull request Jan 2, 2025

[fix] [broker] Fix items in dispatcher.recentlyJoinedConsumers are out-of-order, which may cause a delivery stuck #23802

Merged

4 tasks

github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Jan 2, 2025

poorbarcode mentioned this pull request Jan 17, 2025

[fix][broker] Revert "[fix][broker] Cancel possible pending replay read in cancelPendingRead (#23384)" #23855

Merged

4 tasks

github-actions bot added the PIP label Jan 23, 2025

poorbarcode added 4 commits January 23, 2025 17:41

[fix] [ml] fix key_shared mode delivery are order out order after a c…

c44e8f3

…onsumers reconnection

fix for 4.0

d673504

add tests

8e0947d

fix issue

3cbf13d

poorbarcode force-pushed the fix/shouldRewindBeforeReadingOrReplaying branch from 1591c06 to 3cbf13d Compare January 23, 2025 09:42

poorbarcode added release/3.0.10 release/3.3.5 release/4.0.3 labels Jan 23, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] [ml] fix key_shared mode delivery are order out order after a consumers reconnection #23803

[fix] [ml] fix key_shared mode delivery are order out order after a consumers reconnection #23803

poorbarcode commented Jan 2, 2025 •

edited

Loading

github-actions bot commented Jan 2, 2025

lhotari commented Jan 3, 2025

lhotari commented Jan 7, 2025 •

edited

Loading

lhotari commented Jan 23, 2025

poorbarcode commented Jan 23, 2025 •

edited

Loading

lhotari commented Jan 23, 2025

poorbarcode commented Jan 23, 2025 •

edited

Loading

lhotari commented Jan 23, 2025

poorbarcode commented Jan 24, 2025

[fix] [ml] fix key_shared mode delivery are order out order after a consumers reconnection #23803

Are you sure you want to change the base?

[fix] [ml] fix key_shared mode delivery are order out order after a consumers reconnection #23803

Conversation

poorbarcode commented Jan 2, 2025 • edited Loading

Motivation

Modifications

Documentation

Matching PR in forked repository

github-actions bot commented Jan 2, 2025

lhotari commented Jan 3, 2025

lhotari commented Jan 7, 2025 • edited Loading

lhotari commented Jan 23, 2025

poorbarcode commented Jan 23, 2025 • edited Loading

lhotari commented Jan 23, 2025

poorbarcode commented Jan 23, 2025 • edited Loading

lhotari commented Jan 23, 2025

poorbarcode commented Jan 24, 2025

poorbarcode commented Jan 2, 2025 •

edited

Loading

lhotari commented Jan 7, 2025 •

edited

Loading

poorbarcode commented Jan 23, 2025 •

edited

Loading

poorbarcode commented Jan 23, 2025 •

edited

Loading