Skip to content
This repository has been archived by the owner on Nov 27, 2023. It is now read-only.

Task Supervisor for message consumers #179

Merged
merged 19 commits into from
May 25, 2020
Merged

Conversation

akoutmos
Copy link
Collaborator

@akoutmos akoutmos commented Apr 26, 2020

This PR is an initial attempt at addressing issue #46. Still in the process of adding tests and solidifying the implementation, so leaving the PR in draft until complete (also looking for implementation feedback).

Description

Originally, the GenRMQ.Consumer module would call spawn/1 to asynchronously execute a message handler. This change instead leverages Task.Supervisor to execute message processing handlers. This should be a transparent change to the end user as this does not impact how GenRMQ.Consumer concurrently executes message handlers.

This introduces the usage of OTP 21's handle_continue to ensure that the Task.Supervisor is started before any mailbox messages are processed. As a result, gen_rmq would no longer be compatible with OTP releases prior to 21.

Checklist

  • I have added unit tests to cover my changes.
  • I have improved the code quality of this repo. (refactoring, or reduced number of static analyser issues)
  • I have updated the documentation accordingly

@coveralls
Copy link

coveralls commented Apr 26, 2020

Coverage Status

Coverage increased (+0.7%) to 90.333% when pulling 72d45ee on task_supervisor_consumer into 73954a1 on master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 89.773% when pulling ccc0367 on task_supervisor_consumer into e120328 on master.

@akoutmos akoutmos changed the title Initial attempt at using a Task Supervisor for message consumers Task Supervisor for message consumers Apr 26, 2020
.travis.yml Show resolved Hide resolved
lib/consumer.ex Outdated Show resolved Hide resolved
@mkorszun
Copy link
Collaborator

@akoutmos looking good - left some questions.

@akoutmos
Copy link
Collaborator Author

akoutmos commented May 4, 2020

I added some preliminary tests for this functionality, updated the telemetry documentation (as there is now a telemetry event for failed tasks), and also refactored the consumer a bit to be able to act upon Tasks that it has spawned under the supervisor. There are still a few more edge cases that I want to cover in the tests, so I will keep this in draft until I feel as though I have covered those code paths.

@akoutmos akoutmos marked this pull request as ready for review May 13, 2020 19:42
@akoutmos akoutmos requested review from Shemeikka and vorce as code owners May 13, 2020 19:42
@akoutmos
Copy link
Collaborator Author

akoutmos commented May 13, 2020

I think this one is at a good point to review now. I added the ability for consumer processes to wait on in flight tasks under the Task.Supervisor before terminating and also added tests to validate the flow. I also added an additional telemetry event to report on when Tasks fail.

One question I have is if we should expose an optional configuration for how long to wait on running Tasks. Currently I am just using the default value that is set for yield_many https://hexdocs.pm/elixir/Task.html#yield_many/2

@vorce
Copy link
Collaborator

vorce commented May 14, 2020

@akoutmos I think yes we have to let users chose the task timeout. I suspect users will have very different expectations on their message processor, depending on their business logic.

@akoutmos
Copy link
Collaborator Author

I'll add that tonight and leave the default as the yield_many value.

@spier
Copy link
Contributor

spier commented May 14, 2020

@akoutmos can I just say: You are awesome :) Great contributions in this PR (and elsewhere). Thanks so much!

@akoutmos
Copy link
Collaborator Author

Thanks for the kind words @spier! I appreciate it :)

@akoutmos
Copy link
Collaborator Author

Went ahead and added the terminate_timeout option for users to configure how long to wait on in-flight Tasks. I also updated the docs to let them know about the option.

@mkorszun
Copy link
Collaborator

@akoutmos amazing job 🥇- I will reserve some time during the weekend to review it.

@@ -42,4 +42,9 @@ GenRMQ emits [Telemetry][telemetry] events for consumers. It currently exposes t
- Measurement: `%{time: System.monotonic_time}`
- Metadata: `%{module: atom, reason: atom}`

- `[:gen_rmq, :consumer, :task, :error]` - Dispatched by a GenRMQ consumer when a supervised Task fails to process a message
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

{:ok, state}
@doc false
@impl GenServer
def handle_continue(:init, state) do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lovely!

lib/consumer.ex Outdated
{:DOWN, ref, :process, _pid, reason},
%{module: module, config: config, running_tasks: running_tasks} = state
) do
if Map.has_key?(running_tasks, ref) do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in this case to the message? Will it be rejected/nacked?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call....we would probably want some sort of configuration so the user can decide what happens to the message if the task fails. Perhaps we also provide the option to send the message to the deadletter exchange. Also on that same note, we may also want some configuration around how long the Tasks take to complete. Currently it is set to the async_nolink value of 5 seconds. Thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should be able to set a callback to deal with the message? But I think the sensible default here would be to nack the message.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the callback, smth like: handle_error(reason, state). Then in this callback user has all the power to ack, reject or requeue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, are we capturing here also task timeouts?

Copy link
Collaborator Author

@akoutmos akoutmos May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My latest commit adds the ability to configure a timeout for tasks. It was a little tricky to add that in given that Task.Supervisor does not provide that functionality for async_nolink, but I think I have a good implementation in place to handle this functionality. I misspoke earlier in regards to the 5 second timeout...that is for async_stream and async_stream_nolink.

As for a custom callback to handle the failure, are you thinking about rewriting the consumer module to a macro so that we can leverage defoverridable if the user does not want to use the default error handler? Or would handle_error be a required behaviour callback that needs to be implemented by the user?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akoutmos I would prefer to avoid rewriting consumer module to a macro at this stage (just to limit the scope of changes).

Since these changes will be released as a major version, requiring error callback to be implemented should be fine, right? @vorce what do you think?

We could also consider skipping error callback and just reject the messages. The problem here is that users might skip dead-letter configuration for their consumers.

Copy link
Collaborator

@vorce vorce May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkorszun @akoutmos Hm hm. Yeah I think making each consumer implement a callback is OK. At least then we encourage users to think about this case. We can have a good implementation in the example consumer that logs and nacks the message.

Copy link
Collaborator Author

@akoutmos akoutmos May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'll add an additional callback handle_error that will be triggered on task exception or task timeout along with an example or 2. I agree that this is probably the way to go so that gen_rmq does not impose any assumptions upon the user. They can deal with the error the way that best fits their needs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, and thanks again for all the work @akoutmos

test/gen_rmq_consumer_test.exs Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@mkorszun mkorszun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akoutmos once again looks great. Thanks for the amazing work 👏 I agree with @vorce that it would be good to notify users whenever task fails (with default behaviour to dead-letter the message).

@akoutmos
Copy link
Collaborator Author

Another thing that I thought about today was that there is no way to throttle the consumer. With how it currently is, it can keep on consuming messages and spawning new tasks under the supervisor. Thoughts on adding an additional configuration to have the user set the number of concurrent tasks running? max_concurrent_tasks ?

@vorce
Copy link
Collaborator

vorce commented May 17, 2020

@akoutmos good point! I think we can introduce that later if it turns out to be a problem for some users. The current/master implementation has the same problem.

@akoutmos
Copy link
Collaborator Author

I added the handle_error callback for consumers that are configured to execute via Tasks. I'll need to also hook it up to trigger the handle_error callback when running in synchronous mode. But wanted to get some feedback on the approach before adding that in as well.

lib/consumer.ex Outdated
`reason` - atom denoting the type of error

## Examples:
To reject the message message that caused the Task to fail you can do something like so:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double message

@@ -0,0 +1,25 @@
defmodule GenRMQ.MessageTask do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea to wrap it like this 👍

@mkorszun
Copy link
Collaborator

@akoutmos error handling looks good 👍

@akoutmos
Copy link
Collaborator Author

I wrapped up that last of the tests and added handle_error callback support for non concurrent consumers. I'll give this a final look over tomorrow...but I think from a functionality perspective there is nothing else on my todo list. Let me know if you want to see any more tests/features :)

@mkorszun
Copy link
Collaborator

@akoutmos great, thank you 🙏 I plan to merge it today/tomorrow.

@akoutmos
Copy link
Collaborator Author

Updated some docs and some minor error handling tweaks. Nothing else pops out at me for clean up. All good on this side

@mkorszun mkorszun merged commit bbe426b into master May 25, 2020
@mkorszun mkorszun deleted the task_supervisor_consumer branch May 25, 2020 19:19
@spier
Copy link
Contributor

spier commented May 25, 2020

🎉

@spier
Copy link
Contributor

spier commented May 25, 2020

@mkorszun just curious, what is the release process for the library? I remember you mentioning that we would have to release a new major(?) version for this?

@mkorszun
Copy link
Collaborator

mkorszun commented May 25, 2020

@spier we need a major release because we have dropped support for some older versions of the OTP. Release process will involve:

  • bumping lib version to 3.0.0
  • documenting differences / migration steps between v2 and v3
  • merging all above changes to master and doing a github release
  • publishing to hex.pm

@spier
Copy link
Contributor

spier commented May 26, 2020

@mkorszun Thanks for the writeup.

I am going through the steps, to understand them better. I might write a new issue to document the process for ourselves, and additional admins in the future. I will try to review what you do for this 3.0.0 release and then try myself on writing this up if that is ok?

Notes

  • bumping lib version to 3.0.0 => at least in mix.exs and README.md.
  • documenting differences / migration steps between v2 and v3 => probably in the wiki? I looked for migrations steps from v1 to v2 as a sample. maybe we did not write those then?
  • merging all above changes to master and doing a github release => releases
  • publishing to hex.pm => https://hex.pm/packages/gen_rmq

(unrelated) 30k party

Btw I realized that we are at 28173 downloads all time and currently at 803 downloads last 7 days. So we should hit the 30k downloads milestone in about 2.5 weeks. How does one throw a virtual party? :)

@mkorszun
Copy link
Collaborator

@spier sounds good. Your notes are already quite precise :)

regarding 30k party - I believe we need it 🎉

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants