
feat: SIGNAL-7060 preemtively drop large messages from queue #101

Merged: 16 commits into main, Oct 1, 2024

Conversation

seungjinstord
Contributor

Related Ticket(s)

SIGNAL-7060

Checklist

Problem

We added logic to drop large messages during termination, but we still observed raw :message_too_large errors bubbling up from brod. Given the sheer traffic on the wms-service--firehose topic, these errors show up once or twice every other day.

Details

Preemptively drop large messages before a send is even attempted. Prior to this PR, AsyncWorker.build_message_batch() would push messages out to Kafka without checking their size.

This PR adds the check at the AsyncWorker.queue() stage of the lifecycle: when a request comes into this GenServer to put new messages on the queue, any message that exceeds state.max_request_bytes is dropped.

Dropped messages go through the existing logging code, which should surface them in Datadog.
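A minimal sketch of that queue-time filter (the module, function name, and logging fields here are illustrative assumptions, not Kafee's actual implementation; only the `max_request_bytes` limit comes from the PR):

```elixir
defmodule QueueFilterSketch do
  require Logger

  # Drop any message whose serialized size exceeds the limit, logging
  # each drop so it surfaces in Datadog via the usual log pipeline.
  def drop_large_messages(messages, max_request_bytes) do
    Enum.filter(messages, fn message ->
      size = :erlang.external_size(message)

      if size > max_request_bytes do
        Logger.error("Dropping message larger than max_request_bytes",
          size: size,
          max_request_bytes: max_request_bytes
        )

        false
      else
        true
      end
    end)
  end
end
```

Filtering at enqueue time means the oversized message never reaches brod, so no :message_too_large error can bubble up for it later.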

@seungjinstord seungjinstord self-assigned this Sep 24, 2024
@seungjinstord seungjinstord requested a review from a team as a code owner September 24, 2024 00:13
@seungjinstord
Contributor Author

seungjinstord commented Sep 24, 2024

FYI the multi-partition test is flaky due to the termination race condition. I haven't fully caught it yet - will probably deal with it in a separate ticket.

UPDATE: actually the fix was easy - so it's included here, in 44f397b
UPDATE: the fix actually was merged as a separate PR.

@seungjinstord seungjinstord requested a review from a team September 24, 2024 00:27
```diff
 @doc false
 def handle_cast({:queue, messages}, state) do
-  new_queue = :queue.join(state.queue, :queue.from_list(messages))
+  new_messages_queue = messages |> :queue.from_list() |> queue_without_large_messages(state.max_request_bytes)
```
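The helper `queue_without_large_messages/2` isn't shown in this excerpt; a plausible sketch, assuming it simply filters the Erlang `:queue` by each message's encoded size (the module name and exact check are assumptions, not Kafee's actual code):

```elixir
defmodule AsyncWorkerSketch do
  # Hypothetical stand-in for Kafee's private helper: keep only the
  # messages whose external (serialized) size fits under the limit.
  def queue_without_large_messages(queue, max_request_bytes) do
    :queue.filter(
      fn message -> :erlang.external_size(message) <= max_request_bytes end,
      queue
    )
  end
end
```

`:queue.filter/2` preserves queue order, so the surviving messages are still batched in the order they were enqueued.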
Contributor


Just to make sure I understand: the primary change here is that before this PR, we would still attempt to publish messages that are too large unless the async worker process was terminating, in which case we would filter them out.

Whereas, with this change we will pre-emptively drop the large messages and never try to publish them. Is that correct?

I think that makes sense from a functional perspective, but I want to double check my understanding before approving!

Contributor Author

@seungjinstord seungjinstord Sep 24, 2024


Yup, that's 100% correct. Originally we would ignore state.max_request_bytes on the first send attempt, just push first, and then handle the errors bubbling up from brod.

Adding to the context: due to increased traffic and message size, we are observing brod's related processes failing, and the AsyncWorker layer isn't getting the "raw" error messages with the actual message payload (it seems brod swallows it before it bubbles up).

So the other approach is: since we already know and have set state.max_request_bytes, we should be safe filtering the large messages out ahead of time, so the queue never even holds them.
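Since the worker already holds `state.max_request_bytes`, the check can happen before any network call. One rough way to estimate a key/value message's payload size (an illustrative assumption; brod's actual request accounting includes protocol framing that this ignores):

```elixir
# Rough payload-size estimate for a key/value message map, assuming
# both fields are binaries. Ignores Kafka protocol overhead.
estimate = fn %{key: key, value: value} ->
  byte_size(key) + byte_size(value)
end

message = %{key: "order-123", value: String.duplicate("a", 100)}
estimate.(message)
# 109 (9 bytes of key + 100 bytes of value)
```

Because the estimate undercounts the wire size, a real check would compare it against the limit with some headroom rather than exactly at max_request_bytes.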

kinson
kinson previously approved these changes Sep 24, 2024
Contributor

@kinson kinson left a comment


I'm not sure how urgent this is, but perhaps it's worth waiting to get a review from @btkostner as well 🧠

@seungjinstord
Contributor Author

seungjinstord commented Sep 24, 2024

Yup, hopefully with the compression tweak at wms-service we won't run into the large-message error as frequently.

kinson
kinson previously approved these changes Sep 24, 2024
@kinson kinson requested a review from btkostner September 24, 2024 19:10
Contributor

@btkostner btkostner left a comment


Code looks good, and it includes logging for when something is dropped from the queue, which was my biggest concern. I might add a bit more docs on the silent failures when adding to the queue, just so it doesn't surprise devs.

Added a comment to hopefully clean up the test and make it a bit easier to understand / test what we want.

test/kafee/producer/async_worker_test.exs (outdated review comments, resolved)
@seungjinstord seungjinstord dismissed btkostner’s stale review October 1, 2024 15:05

changes were done; see comments

@seungjinstord seungjinstord merged commit 0baeae8 into main Oct 1, 2024
12 checks passed
@seungjinstord seungjinstord deleted the SIGNAL-7060-another-fix-for-large-message branch October 1, 2024 18:51
seungjinstord pushed a commit that referenced this pull request Oct 3, 2024
An automated release has been created for you.
---


## [3.2.0](v3.1.2...v3.2.0) (2024-10-01)


### Features

* SIGNAL-7060 preemtively drop large messages from queue ([#101](#101)) ([0baeae8](0baeae8))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).