Parent job does not execute when child job fails #800

m4nuC · 2021-10-08T07:54:11Z

I have a FlowProducer that runs a parent job after a set of children jobs have been executed. There are about 10k children that are being run concurrently. I've set the option removeOnFail to true on both the children and the parents but it seems that if a child fail the execution of the parent just hangs.

Could this be a bug or is it a configuration issue on my side?

Here is what the code that create the job looks like:

export const initializeCollectionStats = async () => {
  const flowProducer = new FlowProducer({connection: redisClient, sharedConnection: true});
  const totalSupply = 10000
  const tokenJobs = [];
  for (let i = firstTokenID; i <= totalSupply; i++) {
    tokenJobs.push({
      name: 'child',
      data: {
        hello: 'world'
      },
      queueName: "queueName",
      opts: {
        removeOnFail: true,
        attempts: 2,
        timeout: 3000
      }
    })
  }

  const initJob = await flowProducer.add({
    name: 'initProject',
    data: {
      contractAddress
    },
    queueName: PROJECT_INIT_QUEUE,
    children: tokenJobs,
    opts: {
      attempts: 1
    }
  })

  return initJob
}

and the worker that run the children job looks like this:

const worker = new Worker(QUEUE_NAME, async (job) => {
    const { hello } = job.data
    try {
        const metadata = await doSomethingAsync(hello)
        return "done"
    } catch (e) {
        logger.error(e)
        return null
    }
}, { connection: redisClient, concurrency: 300, sharedConnection: true});

The text was updated successfully, but these errors were encountered:

manast · 2021-10-08T10:38:51Z

Currently by design, until all child jobs have been completed the parent job will not be processed.

m4nuC · 2021-10-11T08:41:49Z

@manast Gotcha thanks, so then the only way would be to catch errors and manually mark child jobs as completed or is there an option for this ? I assumed that removeOnFail was doing something like this.

m4nuC · 2021-10-11T10:32:15Z

@manast to clarify what I am trying to achieve is:

Process child job before parent (that's a given)
Retry child job on fail just like any other jobs
If max amount of retries is reach on children, then move job to fail
Once all child jobs are either failed or completed, then run parent.

manast · 2021-10-14T03:41:13Z

Regarding the two last points:

If max amount of retries is reach on children, then move job to fail

Once all child jobs are either failed or completed, then run parent.

This would imply new functionality. Namely that a parent job should be "processed" OR "failed" depending on different criteria. For example, it could be the case that we want the parent to fail if one child fails already, OR as you wrote complete if all children either fails or completes, etc.

Slind14 · 2021-11-22T13:50:26Z

This affects us as well. Is there a possibility to see a configurable outcome in the near future?

manast · 2021-11-22T15:33:30Z

@Slind14 what is your usecase?

Slind14 · 2021-11-22T15:41:09Z

@Slind14 what is your use-case?

We would like to keep a history of recently failed jobs within Redis. Otherwise removeOnFail would work, I guess.

bilalshaikh42 · 2022-01-20T06:51:39Z

We have the same issue. We have one top-level job with multiple sub-components. If any of the components fail, we would like for the parent job to be marked as failed so that we can log the errors into a database based on the status of the top-level job

manast · 2022-01-25T02:38:41Z

@bilalshaikh42 the problem here is that for example, you could have retries set up on the children, so you probably do not want to fail the parent at least until all the retries have been exhausted. The other common case is that a child job fails, but then you fix the reason for the failure, and then you want to manually retry the job, if the job finally completes we should complete the parent.

manast · 2022-01-25T03:52:40Z

@roggervalf you could link to this issue regarding the new feature to support failing parents.

bilalshaikh42 · 2022-01-25T12:54:34Z

@bilalshaikh42 the problem here is that for example, you could have retries set up on the children, so you probably do not want to fail the parent at least until all the retries have been exhausted. The other common case is that a child job fails, but then you fix the reason for the failure, and then you want to manually retry the job, if the job finally completes we should complete the parent.

Sure, in this case I would expect all the retires to be exhausted before the child job or the parent job is marked as failed. Similarly, even though one child job fails, I expect other child jobs would still be attempted/pass as long as they do not also depend on child job.

I had not considered the case you mention about manual intervention into a failed job to get the parent job to fail. To support that I guess there would need to be some status for parent other the failed, such as "failed-children" just as there is waiting and "waiting-children". This way if a child job is retired you could revaluate or rerun the parent.

In the way things are set up now, given just the parent job, is there a way to know when the child jobs have failed without iterating through them all? The status of the parent job will just be stuck on "wait-children" so our code was initially waiting forever instead of realizing that a child job has failed.

manast · 2022-01-28T06:13:05Z

@bilalshaikh42 we are trying to iron out different edge cases, so the case when child jobs fail will be improved. But we need to be very careful as any new change usually implies lots of new edge cases.

bilalshaikh42 · 2022-01-28T13:16:09Z

@manast yup, makes sense! Would be happy to help with feedback and/or testing out ideas

In the meantime, do you have any recommendations on the best way to check for a parent job who's dependencies have failed?

manast · 2022-02-13T10:15:43Z

@roggervalf do we have anything new to update here after the last updates to flows?

roggervalf · 2022-02-14T01:57:19Z

This is in my pending list, this week I can work on this feature 👀

C0rellana · 2022-06-06T20:57:48Z

Hi,
Is there any way to solve this problem?, I have the same problem, if the child job fails or stops, the parent job does not run

arslnb · 2023-02-11T08:54:10Z

+1 on this, with some notes:

If you have a parent job with many children jobs, and the parent job updates some sort of record (by accessing job.getChildrenValues()) it will never run if one of the child jobs fail.
If you set the child job to failParentOnFailure as true, the parent job is failed, but you cannot pick up the error with Worker.on() listener, as there is no error "thrown" in our use case (the child and parent have a different worker)
Setting removeOnFail to the child job doesn't work either

huksley · 2023-03-05T15:07:36Z

Hi, I stubled upon this issue as well. I have failing children jobs, and when all execution attempts on children jobs have been made (and failed), it should return execution to the parent job.

Perhaps adding getChildrenStates() method or similar to read states of all children job.

With current design, one should write children workers as bulletproof so they never fail, I don't think it is feasible.
Now parent stuck with queue returning jobs with getWaitingChildren() > 0

Leobaillard · 2023-06-06T21:05:23Z

Hi, this is affecting us as well. I agree with passed comments, I think it's unrealistic to expect that child jobs should never fail. Is this issue still worked on? Thanks!

roggervalf · 2023-06-06T21:44:31Z

hi @Leobaillard, we currently have failParentOnFailure option https://docs.bullmq.io/guide/flows/fail-parent, could it address your case?

Leobaillard · 2023-06-06T22:31:51Z

hi @Leobaillard, we currently have failParentOnFailure option https://docs.bullmq.io/guide/flows/fail-parent, could it address your case?

Hi! Thanks for your quick answer!

It does, in part.

There are still use cases where it would be nice to allow child jobs to fail if they are not critical to the success (or partial success) of the parent. Users would then be expected to handle this "partial" state in their business logic. This is useful to keep failed child job history while retaining the possibility to have the parent job executed.

Some sort of job report can then be generated by the app, listing the successful and failed child tasks.

theDanielJLewis · 2023-06-14T18:28:19Z

I'm in the same boat as everyone else.

It seems odd that the only two options right now for when a child fails are:

The parent does nothing and is never even added to the queue.
Use failParentOnFailure and the parent will always fail when a child fails, even if there were successful children to process.

I wish we had, either as the default or a different option, something like continueParentOnFailure that would allow children to fail but still add the parent to the queue and let the parent process separately.

Adding this option to children would allow us to let some children actually fail the parent, while letting others not affect the parent.

But doing nothing—not even appearing in the queue—just seems strange and not an obvious result.

roggervalf · 2023-06-14T18:49:40Z

hey @theDanielJLewis, we have a pr for it #1953 You also can take a look on that one, @manast and I are evaluating this new feature

theDanielJLewis · 2023-06-14T19:39:26Z

Awesome, @roggervalf! Only a week old, so I can have hope it will be merged soon. Thanks!

Nick-Lucas · 2024-01-23T21:15:14Z

This is also causing my project a problem now, thought I'd worked out how to run a tree of jobs and fetch the states afterwards to display what passed and failed, but having to detach failed children with removeDependencyOnFailure to finish the parent kind of defeats the point by hiding the errors rather than surfacing them

roggervalf · 2024-01-24T02:13:00Z

I think this issue is related to #2092

Nick-Lucas · 2024-01-24T13:44:26Z

I assume there's an underlying architectural problem, because it's not entirely clear why we need to disconnect a child and then find it again as per #2092, instead of the parent having a flag to continue on child failures and we use some API to check children manually and decide what state the Parent should end up in. We're going to end up doing that last leg regardless but it's a bit convoluted.

Anyway workaround for now seem to be:

Wrapping child jobs in try-catch-rethrow and on the final retry making the job pass but with some userland failed status as the return value
In the parent getting all the return values and deciding how to handle any failures

Arguably that's actually quite a good way to go about it regardless as it's more obvious than an incantation of several flags, and "completed" doesn't have to mean "succeeded"

rosslavery · 2024-01-24T14:11:20Z

I assume there's an underlying architectural problem, because it's not entirely clear why we need to disconnect a child and then find it again as per #2092, instead of the parent having a flag to continue on child failures and we use some API to check children manually and decide what state the Parent should end up in. We're going to end up doing that last leg regardless but it's a bit convoluted.

Anyway workaround for now seem to be:

Wrapping child jobs in try-catch-rethrow and on the final retry making the job pass but with some userland failed status as the return value

In the parent getting all the return values and deciding how to handle any failures

Arguably that's actually quite a good way to go about it regardless as it's more obvious than an incantation of several flags, and "completed" doesn't have to mean "succeeded"

The workaround you suggest is precisely what we do.

Then we have some helper utils to check for the special "succeeded but actually failed" return value, and allow the parent to continue or failed based on things like the percentage of child jobs that truly succeeded or failed.

sick-sw · 2024-09-20T15:10:12Z

Ran into the same issue here. It feels like the parent should have control to whether or not it should fail or not, based off of it's children. This is a little more intuitive. If we funneled all failures/completions etc to the parent to process, it would really make things easier, and the processor would actually hit (and leave waiting children depending on your action).

I ended up returning failed data from a try catch in the child processor and works fine (forcing a completion), although a little misleading, due to not seeing failed children in the queue. ignoreDependencyOnFailure didn't appear to do anything after testing.

It would be nice to see this feature allowing processing of the parent no matter the failures/completions etc, where the parent would decide it's fate and take action where needed.

roggervalf · 2024-09-26T01:18:57Z

hi @sick-sw, sounds like ignoreDependencyOnFailure is in fact what you are looking for https://docs.bullmq.io/guide/flows/ignore-dependency, in order to get children failures you need to use getFailedChildrenValues method if you want to do some logic with them.

manast added the enhancement New feature or request label Oct 14, 2021

roggervalf mentioned this issue Feb 24, 2022

feat(job): add removeDependencyOnFail option #1100

Closed

Chocobozzz mentioned this issue Mar 14, 2023

Video stuck in "being transcoded" state Chocobozzz/PeerTube#5691

Open

roggervalf mentioned this issue Jun 7, 2023

feat(job): add removeDependencyOnFailure option #1953

Merged

roggervalf mentioned this issue Feb 16, 2024

feat(flow): add ignoreDependencyOnFailure option #2426

Merged

roggervalf mentioned this issue Aug 1, 2024

fix(flow): fail parent on failure by default #2682

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parent job does not execute when child job fails #800

Parent job does not execute when child job fails #800

m4nuC commented Oct 8, 2021

manast commented Oct 8, 2021

m4nuC commented Oct 11, 2021 •

edited

Loading

m4nuC commented Oct 11, 2021

manast commented Oct 14, 2021

Slind14 commented Nov 22, 2021

manast commented Nov 22, 2021

Slind14 commented Nov 22, 2021

bilalshaikh42 commented Jan 20, 2022

manast commented Jan 25, 2022

manast commented Jan 25, 2022

bilalshaikh42 commented Jan 25, 2022

manast commented Jan 28, 2022

bilalshaikh42 commented Jan 28, 2022 •

edited

Loading

manast commented Feb 13, 2022

roggervalf commented Feb 14, 2022

C0rellana commented Jun 6, 2022

arslnb commented Feb 11, 2023

huksley commented Mar 5, 2023 •

edited

Loading

Leobaillard commented Jun 6, 2023 •

edited

Loading

roggervalf commented Jun 6, 2023

Leobaillard commented Jun 6, 2023 •

edited

Loading

theDanielJLewis commented Jun 14, 2023

roggervalf commented Jun 14, 2023

theDanielJLewis commented Jun 14, 2023

Nick-Lucas commented Jan 23, 2024

roggervalf commented Jan 24, 2024

Nick-Lucas commented Jan 24, 2024

rosslavery commented Jan 24, 2024

sick-sw commented Sep 20, 2024

roggervalf commented Sep 26, 2024

Parent job does not execute when child job fails #800

Parent job does not execute when child job fails #800

Comments

m4nuC commented Oct 8, 2021

manast commented Oct 8, 2021

m4nuC commented Oct 11, 2021 • edited Loading

m4nuC commented Oct 11, 2021

manast commented Oct 14, 2021

Slind14 commented Nov 22, 2021

manast commented Nov 22, 2021

Slind14 commented Nov 22, 2021

bilalshaikh42 commented Jan 20, 2022

manast commented Jan 25, 2022

manast commented Jan 25, 2022

bilalshaikh42 commented Jan 25, 2022

manast commented Jan 28, 2022

bilalshaikh42 commented Jan 28, 2022 • edited Loading

manast commented Feb 13, 2022

roggervalf commented Feb 14, 2022

C0rellana commented Jun 6, 2022

arslnb commented Feb 11, 2023

huksley commented Mar 5, 2023 • edited Loading

Leobaillard commented Jun 6, 2023 • edited Loading

roggervalf commented Jun 6, 2023

Leobaillard commented Jun 6, 2023 • edited Loading

theDanielJLewis commented Jun 14, 2023

roggervalf commented Jun 14, 2023

theDanielJLewis commented Jun 14, 2023

Nick-Lucas commented Jan 23, 2024

roggervalf commented Jan 24, 2024

Nick-Lucas commented Jan 24, 2024

rosslavery commented Jan 24, 2024

sick-sw commented Sep 20, 2024

roggervalf commented Sep 26, 2024

m4nuC commented Oct 11, 2021 •

edited

Loading

bilalshaikh42 commented Jan 28, 2022 •

edited

Loading

huksley commented Mar 5, 2023 •

edited

Loading

Leobaillard commented Jun 6, 2023 •

edited

Loading

Leobaillard commented Jun 6, 2023 •

edited

Loading