
feat: SIGNAL-7060 filter out large messages during termination instead of sending to Kafka #95

Merged

Conversation

seungjinstord (Contributor)

Related Ticket(s)

SIGNAL-7060


Problem

Messages larger than what Kafka allows create an error that is not captured by the existing case clause, mainly because it happens during the process termination flow.

The error tuple is as follows:
{:not_retriable, {:produce_response_error, "magnam-totam-nobis-est", 0, -1, :message_too_large}}

The GenServer crashing with that error appears to be the Brod producer or its supervisor.

It only happens when terminate_send has to deal with queued message(s) that are larger than allowed and has already pushed them to :brod.produce().
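For context, here is a minimal, hypothetical sketch of pattern matching that error tuple. The case clause, bound variable names, and return values are illustrative only, not the library's actual code:

```elixir
# Hypothetical sketch: matching the :message_too_large error tuple quoted
# above. `topic` and `partition` bind to the values inside the tuple; the
# IO.puts call and :drop return value are illustrative stand-ins.
result =
  {:not_retriable, {:produce_response_error, "magnam-totam-nobis-est", 0, -1, :message_too_large}}

case result do
  {:not_retriable, {:produce_response_error, topic, partition, _offset, :message_too_large}} ->
    IO.puts("dropping message too large for #{topic}/#{partition}")
    :drop

  other ->
    {:error, other}
end
```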

Details

I tried to catch it as another pattern match where we do this check in AsyncWorker, but to no avail.

While the processes are live and working, the existing case clause handles large messages and emits them to Datadog logs. The only case where the error escapes appears to be when terminate_send works with a large message (at least, that's how I was able to reproduce the issue in tests).

Therefore, large messages are filtered out of the queue during termination, since they already exceed what Kafka allows, and the rest are passed to Kafka.

We could move this logic down to send_message, but I wanted to be conservative with our changes and limit the blast radius.

The expected end result is that we get the "message too large" Datadog log (we weren't getting it in this edge-case scenario) along with the actual message.

Then, the next step is to try to trim down those messages when they are created.

@seungjinstord seungjinstord self-assigned this Sep 17, 2024
@seungjinstord seungjinstord requested a review from a team as a code owner September 17, 2024 23:08
@seungjinstord seungjinstord requested review from btkostner and a team September 17, 2024 23:08

# `:erlang.external_size` won't match what we actually
# send, but it should be under the limit that would cause Kafka errors
defp kafka_message_size_bytes(message_from_queue) do
seungjinstord (author):
Small refactoring as this logic is used in two functions
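Since only the head and doc comment of this helper appear in the diff, here is a self-contained sketch of what it plausibly computes; the function body and module name are assumptions, not the merged code:

```elixir
# Hypothetical completion of the helper whose head appears in the diff.
# :erlang.external_size/1 returns an upper bound on the byte size of a
# term encoded in the external term format (what term_to_binary/1 would
# produce), so it is a cheap way to pre-check against Kafka's limit.
defmodule SizeSketch do
  def kafka_message_size_bytes(message_from_queue) do
    :erlang.external_size(message_from_queue)
  end
end
```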

btkostner previously approved these changes Sep 18, 2024

@btkostner (Contributor) left a comment:
Nice work finding, fixing, and writing tests to confirm. This should be fine to merge as is but I did add two small optimization notes.

Comment on lines +355 to +356
@tag capture_log: true
test "any leftover messages that are large during shutdown gets logged and will not publish", %{
Reviewer:

Nice work on the tests 👍 I hope it wasn't too hard to write an accurate representation here.

seungjinstord (author):

It was fine, other examples were really helpful :)

Comment on lines 323 to 326
{messages_within_max_bytes, messages_beyond_max_bytes} =
state.queue
|> :queue.to_list()
|> split_messages_by_max_bytes(state.max_request_bytes)
Reviewer:

Micro optimization here if you want. Instead of moving to a list then to two lists, do it in one operation with a reduce. Shouldn't be a huge issue though.

https://www.erlang.org/doc/apps/stdlib/queue.html#fold/3

seungjinstord (author):

Ooh, I think that's a pretty significant optimization given this library is gonna be called many times haha. I will work on using that approach.

seungjinstord (author):

Updated to use :queue.fold in a7bc084
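A sketch of what the :queue.fold/3 based split the reviewer suggested might look like. This is not the actual commit; the module name, the `max_bytes` check via :erlang.external_size/1, and the helper name are stand-ins for the library's real split_messages_by_max_bytes/2 and configuration:

```elixir
# Partition a queue into {small, large} message lists in a single pass,
# instead of :queue.to_list/1 followed by a list split. :queue.fold/3
# visits items front to back, so we prepend into accumulators and
# reverse at the end to preserve queue order.
defmodule QueueSplit do
  def split_by_max_bytes(queue, max_bytes) do
    {small, large} =
      :queue.fold(
        fn message, {small, large} ->
          if :erlang.external_size(message) <= max_bytes do
            {[message | small], large}
          else
            {small, [message | large]}
          end
        end,
        {[], []},
        queue
      )

    {Enum.reverse(small), Enum.reverse(large)}
  end
end
```

Note that :queue.fold/3 requires OTP 24 or newer.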

@@ -276,6 +274,10 @@ defmodule Kafee.Producer.AsyncWorker do

@spec terminate_send(State.t()) :: :ok
defp terminate_send(state) do
# We only focus on triaging the queue in state. If there are messages too big, we log and don't send.
# Update state with queue just with messages that are acceptable
state = %{state | queue: state_queue_without_large_messages(state)}
Reviewer:

Heads up, this will be called recursively from line 293. Shouldn't be an issue but might take more time.

seungjinstord (author):

Ah then we can maybe take the message triaging out of terminate_send. Totally missed this recursion step, thanks!

seungjinstord (author):

Moved out in 36c6415
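The state_queue_without_large_messages helper named in the diff above could plausibly be a :queue.filter/2 over the state's queue. This sketch is an assumption built from the snippets in this PR, not the merged implementation; the State fields, Logger message, and module name are illustrative:

```elixir
# Hypothetical triage helper: keep only messages whose encoded size fits
# under the configured Kafka limit, logging each one that gets dropped.
# :queue.filter/2 returns a new queue containing the items for which the
# predicate returns true.
defmodule TriageSketch do
  require Logger

  def state_queue_without_large_messages(state) do
    :queue.filter(
      fn message ->
        within_limit? = :erlang.external_size(message) <= state.max_request_bytes

        unless within_limit? do
          Logger.error("Dropping queued message larger than Kafka's max request bytes")
        end

        within_limit?
      end,
      state.queue
    )
  end
end
```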

@kinson (Contributor) left a comment:

🚀🚀🚀

I don't have much to contribute here 😅 — thanks for tackling this @seungjinstord and for the quick reviews @btkostner !

test/kafee/producer/async_worker_test.exs (outdated review comment, resolved)
@kinson (Contributor) left a comment:

LGTM

@seungjinstord seungjinstord merged commit 3bf5f81 into main Sep 19, 2024
12 checks passed
@seungjinstord seungjinstord deleted the SIGNAL-7060-precheck-kafee-message-sizes-before-sending branch September 19, 2024 17:22
seungjinstord pushed a commit that referenced this pull request Sep 19, 2024
An automated release has been created for you.
---


## [3.1.0](v3.0.3...v3.1.0)
(2024-09-19)


### Features

* SIGNAL-7060 filter out large messages during termination instead of
sending to Kafka ([#95](#95))
([3bf5f81](3bf5f81))


### Miscellaneous

* Sync files with stordco/common-config-elixir
([#85](#85))
([c377a8e](c377a8e))
* Sync files with stordco/common-config-elixir
([#87](#87))
([5dcc51d](5dcc51d))
* Sync files with stordco/common-config-elixir
([#88](#88))
([51f6add](51f6add))
* Sync files with stordco/common-config-elixir
([#89](#89))
([f6b8668](f6b8668))
* Sync files with stordco/common-config-elixir
([#90](#90))
([ba97bf3](ba97bf3))
* Sync files with stordco/common-config-elixir
([#91](#91))
([2c013fe](2c013fe))
* Sync files with stordco/common-config-elixir
([#92](#92))
([5da1a94](5da1a94))
* Sync files with stordco/common-config-elixir
([#94](#94))
([33ed4d8](33ed4d8))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).