Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Flow jobs child failure handling is broken #2229

Closed
1 task done
CodingMeSwiftly opened this issue Oct 13, 2023 · 12 comments · Fixed by #2947
Closed
1 task done

[Bug]: Flow jobs child failure handling is broken #2229

CodingMeSwiftly opened this issue Oct 13, 2023 · 12 comments · Fixed by #2947
Labels
bug Something isn't working

Comments

@CodingMeSwiftly
Copy link
Contributor

Version

v4.6.3

Platform

NodeJS

What happened?

This is a follow up for #1469.

  • If a child job is failing, worker.on('failed', () => { ... } is invoked for the child job.
  • It is not invoked for the parent job failing, even if failParentOnFailure is set on the child.
  • Redis event stream shows: event: failed, jobId: xxx, failedReason: child xyz failed, prev: waiting-children, so the event is created, but the callback is never invoked.

According to #1469, this was fixed by #1481. However, it seems a regression was introduced with #1953.

Additionally, failParentOnFailure does not work with removeOnFail.
Expected behaviour: If failParentOnFailure is set for a child and the parent has removeOnFail, the parent should be removed from the queue.

How to reproduce.

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@CodingMeSwiftly CodingMeSwiftly added the bug Something isn't working label Oct 13, 2023
@roggervalf
Copy link
Collaborator

hi @CodingMeSwiftly, for the first point, worker failed events is emitted only for the current job that is moved to failed, so if the parent is also moved to failed, only the child as the current job will be send in the worker failed event, it's more useful a queueEvents instance in this case. For the 3rd point we have a test case where the event is invoked https://github.com/taskforcesh/bullmq/blob/master/tests/test_flow.ts#L1616-L1624, you can take a look

@roggervalf
Copy link
Collaborator

btw, the prs and issues your are linking are related to events that are used by queueEvents instance, workers don't read our event streams, workers only emit events for the current job that is being processed

@CodingMeSwiftly
Copy link
Contributor Author

Thank you for clearing things up about the difference between worker events and queue events.

Do you prefer to keep this issue open to track the failParentOnFailure / removeOnFail case, or should I open a dedicated issue for that (closing this issue)?

@matpen
Copy link

matpen commented Oct 15, 2023

Additionally, failParentOnFailure does not work with removeOnFail.
Expected behaviour: If failParentOnFailure is set for a child and the parent has removeOnFail, the parent should be removed from the queue.

I am experiencing this too, with BullMQ v4.12.3

@CodingMeSwiftly
Copy link
Contributor Author

This 'issue' might be more tricky to solve than initially thought.

Consider this scenario:

  • A parent waiting for two children which are running simultaneously
  • Child A fails, while Child B continues to run
  • If failParentOnFailure / removeOnFail would attempt to remove the parent, Child B could not be removed, because it is locked

According to https://docs.bullmq.io/guide/flows#jobs-removal:

If any of the jobs that would be removed happen to be locked, none of the jobs will be removed, and an exception will be thrown.

@matpen
Copy link

matpen commented Oct 18, 2023

I must say upfront that I just started using BullMQ, but my experience is that, when failParentOnFailure is set, the parent fails only once all children complete (and any one of them fails): it does not fail immediately. I especially tested this, because it is an important scenario in our use case.

In consideration of the scenario described by @CodingMeSwiftly, I would therefore dare to say that the expected behavior is the sequence as follows:

  1. Child A starts
  2. Child B starts
  3. Child A fails (note that nothing happens to the parent at this stage)
  4. Child B continues to run (this is the desired and current behavior I remarked above)
  5. Child B succeeds
  6. Parent fails in virtue of failParentOnFailure
  7. Parent shall be removed in virtue of removeOnFail: according to https://docs.bullmq.io/guide/flows#jobs-removal, this should recursively remove all children, which have meanwhile completed and therefore hold no lock on the parent anymore.

The key would be to defer the failure and removal to the end of the process. My understanding is that the former is already implemented this way, so the hope is that the expert maintainers can tap into the same mechanism to implement the latter.

@roggervalf
Copy link
Collaborator

hi @matpen, is Child A having failParentOnFailure as true as well? also for the case of removeOnFail we are still thinking on it

@matpen
Copy link

matpen commented Oct 20, 2023

@roggervalf in my actual configuration, yes: all children have both failParentOnFailure and removeOnFail.

Please find my actual config within, if it can help

await flowProducer.add({
    name: 'batch0',
    queueName: queue.name,
    data: { step: 0 },
    opts: {
        jobId: `${job.data.type}_${job.data.procedure}_batch0`,
        failParentOnFailure: true,
        removeOnComplete: true,
        removeOnFail: true,
    },
    children: childJobs.map(procedure => ({
        name: procedure,
        queueName: queue.name,
        data: { procedure },
        opts: {
            jobId: `${job.data.type}_${procedure}`,
            attempts: 2,
            backoff: {
                type: 'fixed',
                delay: 300_000,
            },
            failParentOnFailure: true,
            removeOnComplete: true,
            removeOnFail: true,
        },
    })),
});

For the general case I described in #2229 (comment), my guess is that it would make sense to fail the parent if just one of the children has failParentOnFailure and fails.

@roggervalf
Copy link
Collaborator

hi @matpen, I modified the test case a little bit and I cannot replicate your case #2241 as you can see from there, the failed event from the parent is emitted and one of the children is still active

@roggervalf
Copy link
Collaborator

pls try to modify one of our test cases to see if you can replicate your case

@matpen
Copy link

matpen commented Oct 21, 2023

Thank you, @roggervalf for trying to reproduce the case I described. By looking at the code you linked, I see now that there is a misunderstanding:

  • when you say "the parent fails", you seem to mean "the job failed event is emitted", and additionally perhaps "the job is moved to failed set"
  • when I said "the parent fails" in my comment above I meant "the parent prevents the children from running"

IIUC this means that, differently from what I described in my comment, the current sequence of events is

  1. Child A starts
  2. Child B starts
  3. Child A fails (note that nothing happens to the parent at this stage) (the system immediately fails the parent at this stage)
  4. Parent fails in virtue of failParentOnFailure, and should be removed in virtue of removeOnFail: this gives an error because Child B is still running (as documented here)
  5. Child B continues to run (this is the desired and current behavior I remarked above)
  6. Child B succeeds
  7. System is now in a bad state

If I got this right, the next question would be whether this approach is by design, or simply because it is (understandably) the most straightforward solution. And if so, whether it is possible to complicate the system and defer the removal of parents (as per the sequence in my previous comment).

Otherwise, it seems to me that failParentOnFailure cannot be really used effectively together with removeOnFail, unless we are sure that there is only one child.

@klawdyo
Copy link

klawdyo commented Mar 10, 2024

  • when I said "the parent fails" in my comment above I meant "the parent prevents the children from running"

Same problem here.
I would like prevent next items to be processed.

github-actions bot pushed a commit that referenced this issue Nov 30, 2024
## [5.30.1](v5.30.0...v5.30.1) (2024-11-30)

### Bug Fixes

* **flow:** allow using removeOnFail and failParentOnFailure in parents ([#2947](#2947)) fixes [#2229](#2229) ([85f6f6f](85f6f6f))
alexandresoro pushed a commit to alexandresoro/ouca that referenced this issue Dec 4, 2024
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [bullmq](https://bullmq.io/) ([source](https://github.com/taskforcesh/bullmq)) | dependencies | minor | [`5.29.1` -> `5.31.1`](https://renovatebot.com/diffs/npm/bullmq/5.29.1/5.31.1) |

---

### Release Notes

<details>
<summary>taskforcesh/bullmq (bullmq)</summary>

### [`v5.31.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.31.1)

[Compare Source](taskforcesh/bullmq@v5.31.0...v5.31.1)

##### Bug Fixes

-   **scheduler-template:** remove console.log when getting template information ([#&#8203;2950](taskforcesh/bullmq#2950)) ([3402bfe](taskforcesh/bullmq@3402bfe))

### [`v5.31.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.31.0)

[Compare Source](taskforcesh/bullmq@v5.30.1...v5.31.0)

##### Features

-   **queue:** enhance getJobScheduler method to include template information ([#&#8203;2929](taskforcesh/bullmq#2929)) ref [#&#8203;2875](taskforcesh/bullmq#2875) ([cb99080](taskforcesh/bullmq@cb99080))

### [`v5.30.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.30.1)

[Compare Source](taskforcesh/bullmq@v5.30.0...v5.30.1)

##### Bug Fixes

-   **flow:** allow using removeOnFail and failParentOnFailure in parents ([#&#8203;2947](taskforcesh/bullmq#2947)) fixes [#&#8203;2229](taskforcesh/bullmq#2229) ([85f6f6f](taskforcesh/bullmq@85f6f6f))

### [`v5.30.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.30.0)

[Compare Source](taskforcesh/bullmq@v5.29.1...v5.30.0)

##### Bug Fixes

-   **job-scheduler:** upsert template when same pattern options are provided ([#&#8203;2943](taskforcesh/bullmq#2943)) ref [#&#8203;2940](taskforcesh/bullmq#2940) ([b56c3b4](taskforcesh/bullmq@b56c3b4))

##### Features

-   **queue:** add getDelayedCount method \[python] ([#&#8203;2934](taskforcesh/bullmq#2934)) ([71ce75c](taskforcesh/bullmq@71ce75c))
-   **queue:** add getJobSchedulersCount method ([#&#8203;2945](taskforcesh/bullmq#2945)) ([38820dc](taskforcesh/bullmq@38820dc))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOC4xNDIuNyIsInVwZGF0ZWRJblZlciI6IjM4LjE0Mi43IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJkZXBlbmRlbmNpZXMiXX0=-->

Reviewed-on: https://git.tristess.app/alexandresoro/ouca/pulls/361
Reviewed-by: Alexandre Soro <[email protected]>
Co-authored-by: renovate <[email protected]>
Co-committed-by: renovate <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants