
Parent job does not execute when child job fails #800

Open
m4nuC opened this issue Oct 8, 2021 · 30 comments · Fixed by #2682
Labels
enhancement New feature or request

Comments

@m4nuC commented Oct 8, 2021

I have a FlowProducer that runs a parent job after a set of child jobs has been executed. There are about 10k children being run concurrently. I've set the option removeOnFail to true on both the children and the parent, but it seems that if a child fails, the execution of the parent just hangs.

Could this be a bug, or is it a configuration issue on my side?

Here is what the code that creates the jobs looks like:

export const initializeCollectionStats = async () => {
  // firstTokenID, redisClient, contractAddress, and PROJECT_INIT_QUEUE are
  // defined elsewhere in the application.
  const flowProducer = new FlowProducer({connection: redisClient, sharedConnection: true});
  const totalSupply = 10000
  const tokenJobs = [];
  for (let i = firstTokenID; i <= totalSupply; i++) {
    tokenJobs.push({
      name: 'child',
      data: {
        hello: 'world'
      },
      queueName: "queueName",
      opts: {
        removeOnFail: true,
        attempts: 2,
        timeout: 3000
      }
    })
  }

  const initJob = await flowProducer.add({
    name: 'initProject',
    data: {
      contractAddress
    },
    queueName: PROJECT_INIT_QUEUE,
    children: tokenJobs,
    opts: {
      attempts: 1
    }
  })

  return initJob
}

And here is what the worker that runs the child jobs looks like:

const worker = new Worker(QUEUE_NAME, async (job) => {
    const { hello } = job.data
    try {
        const metadata = await doSomethingAsync(hello)
        return "done"
    } catch (e) {
        logger.error(e)
        return null
    }
}, { connection: redisClient, concurrency: 300, sharedConnection: true});
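One thing worth noting about the snippet above: the catch block returns null instead of rethrowing, so from BullMQ's point of view the child job completes successfully, and the `attempts: 2` retry option never applies. A minimal sketch of a processor that lets the job actually fail (`doSomethingAsync` here is a hypothetical stand-in for the real work):

```javascript
// Hypothetical stand-in for the user's real async work.
const doSomethingAsync = async (input) => {
  if (input === undefined) throw new Error("missing input");
  return `processed:${input}`;
};

// Processor sketch: rethrow on error so BullMQ marks the job as failed
// and applies the `attempts` retry option; returning a value instead
// (as in the snippet above) makes the job count as completed.
const processChild = async (job) => {
  try {
    return await doSomethingAsync(job.data.hello);
  } catch (e) {
    console.error("child job failed:", e.message);
    throw e; // let BullMQ register the failure and schedule retries
  }
};
```

With this shape, a child that exhausts its retries ends up in the failed state instead of silently completing, which is the precondition for any of the parent-failure behaviors discussed below.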

@manast (Contributor) commented Oct 8, 2021

Currently by design, until all child jobs have been completed the parent job will not be processed.

@m4nuC (Author) commented Oct 11, 2021

@manast Gotcha, thanks. So then the only way would be to catch errors and manually mark child jobs as completed, or is there an option for this? I assumed that removeOnFail was doing something like this.

@m4nuC (Author) commented Oct 11, 2021

@manast To clarify, what I am trying to achieve is:

  • Process child jobs before the parent (that's a given)
  • Retry a child job on failure, just like any other job
  • If the max number of retries is reached on a child, move the job to failed
  • Once all child jobs are either failed or completed, run the parent.

@manast (Contributor) commented Oct 14, 2021

Regarding the last two points:

  • If the max number of retries is reached on a child, move the job to failed
  • Once all child jobs are either failed or completed, run the parent.

This would imply new functionality, namely that a parent job should be "processed" OR "failed" depending on different criteria. For example, it could be the case that we want the parent to fail as soon as one child fails, OR, as you wrote, to complete once all children have either failed or completed, etc.

manast added the "enhancement" (New feature or request) label on Oct 14, 2021
@Slind14 commented Nov 22, 2021

This affects us as well. Is there a possibility to see a configurable outcome in the near future?

@manast (Contributor) commented Nov 22, 2021

@Slind14 what is your usecase?

@Slind14 commented Nov 22, 2021

> @Slind14 what is your use-case?

We would like to keep a history of recently failed jobs within Redis. Otherwise removeOnFail would work, I guess.

@bilalshaikh42 (Contributor) commented:

We have the same issue: one top-level job with multiple sub-components. If any of the components fails, we would like the parent job to be marked as failed, so that we can log the errors into a database based on the status of the top-level job.

@manast (Contributor) commented Jan 25, 2022

@bilalshaikh42 the problem here is that, for example, you could have retries set up on the children, so you probably do not want to fail the parent at least until all the retries have been exhausted. The other common case is that a child job fails, but then you fix the reason for the failure and manually retry the job; if the job finally completes, we should complete the parent.

@manast (Contributor) commented Jan 25, 2022

@roggervalf you could link to this issue regarding the new feature to support failing parents.

@bilalshaikh42 (Contributor) commented:

> @bilalshaikh42 the problem here is that for example, you could have retries set up on the children, so you probably do not want to fail the parent at least until all the retries have been exhausted. The other common case is that a child job fails, but then you fix the reason for the failure, and then you want to manually retry the job, if the job finally completes we should complete the parent.

Sure, in this case I would expect all the retries to be exhausted before the child job or the parent job is marked as failed. Similarly, even though one child job fails, I would expect the other child jobs to still be attempted/passed, as long as they do not also depend on that child job.

I had not considered the case you mention, about manual intervention into a failed job. To support that, I guess there would need to be some status for the parent other than failed, such as "failed-children", just as there is "waiting" and "waiting-children". This way, if a child job is retried, you could re-evaluate or rerun the parent.

With the way things are set up now, given just the parent job, is there a way to know when the child jobs have failed without iterating through them all? The status of the parent job will just be stuck on "waiting-children", so our code was initially waiting forever instead of realizing that a child job had failed.
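One way to inspect a stuck parent without iterating child queues manually is BullMQ's Job#getDependencies, which reports the parent's processed children (as a map of job key to return value) and the still-unprocessed child job keys. The option and result shapes below are assumptions to verify against your BullMQ version; this is a sketch, not an official recipe:

```javascript
// Sketch: count a parent job's processed vs. remaining (unprocessed) children.
// Assumes BullMQ's Job#getDependencies, which returns `processed` as a map of
// jobKey -> return value and `unprocessed` as an array of job keys.
const countChildren = async (parentJob) => {
  const { processed = {}, unprocessed = [] } = await parentJob.getDependencies();
  return {
    processed: Object.keys(processed).length,
    unprocessed: unprocessed.length,
  };
};
```

Polling this (or calling it from a queue events listener) at least lets application code detect that a parent in "waiting-children" still has pending children, even if it cannot yet distinguish failed ones.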

@manast (Contributor) commented Jan 28, 2022

@bilalshaikh42 we are trying to iron out different edge cases, so the case where child jobs fail will be improved. But we need to be very careful, as any new change usually implies lots of new edge cases.

@bilalshaikh42 (Contributor) commented Jan 28, 2022

@manast Yup, makes sense! I'd be happy to help with feedback and/or by testing out ideas.

In the meantime, do you have any recommendations on the best way to check for a parent job whose dependencies have failed?

@manast (Contributor) commented Feb 13, 2022

@roggervalf do we have anything new to update here after the last updates to flows?

@roggervalf (Collaborator) commented:

This is on my pending list; I can work on this feature this week 👀

@C0rellana commented:

Hi,
Is there any way to solve this problem? I have the same issue: if a child job fails or stops, the parent job does not run.

@arslnb commented Feb 11, 2023

+1 on this, with some notes:

  • If you have a parent job with many child jobs, and the parent job updates some sort of record (by accessing job.getChildrenValues()), it will never run if one of the child jobs fails.
  • If you set failParentOnFailure to true on a child job, the parent job is failed, but you cannot pick up the error with a Worker.on() listener, as there is no error "thrown" in our use case (the child and the parent have different workers).
  • Setting removeOnFail on the child job doesn't work either.

@huksley commented Mar 5, 2023

Hi, I stumbled upon this issue as well. I have failing child jobs, and when all execution attempts on a child job have been made (and failed), it should return execution to the parent job.

Perhaps add a getChildrenStates() method or similar, to read the states of all child jobs.

With the current design, one would have to write child workers to be so bulletproof that they never fail, which I don't think is feasible. For now, the parent is stuck, with the queue reporting getWaitingChildren() > 0.

@Leobaillard commented Jun 6, 2023

Hi, this is affecting us as well. I agree with past comments; I think it's unrealistic to expect that child jobs will never fail. Is this issue still being worked on? Thanks!

@roggervalf (Collaborator) commented:

Hi @Leobaillard, we currently have the failParentOnFailure option (https://docs.bullmq.io/guide/flows/fail-parent). Could it address your case?
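For reference, failParentOnFailure is set per child in the flow definition, so only failures of the children that carry it propagate to the parent. A minimal sketch of such a flow definition (the queue and job names here are placeholders, not from this thread):

```javascript
// Sketch of a flow definition where a critical child failing (after all
// of its retries) also fails the parent. Names are placeholders.
const buildFlow = () => ({
  name: "parentJob",
  queueName: "parentQueue",
  children: [
    {
      name: "criticalChild",
      queueName: "childQueue",
      // failParentOnFailure propagates this child's final failure upward.
      opts: { attempts: 2, failParentOnFailure: true },
    },
  ],
});

// With a FlowProducer this would be submitted roughly as:
//   await new FlowProducer({ connection }).add(buildFlow());
```

Children without the flag keep the default behavior discussed in this thread, which is why mixing flagged and unflagged children lets some failures fail the parent while others do not.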

@Leobaillard commented Jun 6, 2023

> hi @Leobaillard, we currently have failParentOnFailure option https://docs.bullmq.io/guide/flows/fail-parent, could it address your case?

Hi! Thanks for your quick answer!

It does, in part.

There are still use cases where it would be nice to allow child jobs to fail if they are not critical to the success (or partial success) of the parent. Users would then be expected to handle this "partial" state in their business logic. This is useful for keeping a history of failed child jobs while retaining the possibility of having the parent job executed.

Some sort of job report can then be generated by the app, listing the successful and failed child tasks.

@theDanielJLewis commented:

I'm in the same boat as everyone else.

It seems odd that the only two options right now for when a child fails are:

  1. The parent does nothing and is never even added to the queue.
  2. Use failParentOnFailure, and the parent will always fail when a child fails, even if there were successful children to process.

I wish we had, either as the default or as a separate option, something like continueParentOnFailure, which would allow children to fail but still add the parent to the queue and let the parent process separately.

Adding this option to children would allow us to let some children actually fail the parent, while letting others not affect it.

But doing nothing—not even appearing in the queue—just seems strange and not an obvious result.

@roggervalf (Collaborator) commented:

Hey @theDanielJLewis, we have a PR for it: #1953. You can also take a look at that one; @manast and I are evaluating this new feature.

@theDanielJLewis commented:

Awesome, @roggervalf! It's only a week old, so I have hope it will be merged soon. Thanks!

@Nick-Lucas commented:

This is also causing my project a problem now. I thought I'd worked out how to run a tree of jobs and fetch the states afterwards to display what passed and failed, but having to detach failed children with removeDependencyOnFailure to finish the parent kind of defeats the point, by hiding the errors rather than surfacing them.

@roggervalf (Collaborator) commented:

I think this issue is related to #2092

@Nick-Lucas commented:

I assume there's an underlying architectural problem, because it's not entirely clear why we need to disconnect a child and then find it again as per #2092, instead of the parent having a flag to continue on child failures, plus some API to check the children manually and decide what state the parent should end up in. We're going to end up doing that last leg regardless, but it's a bit convoluted.

Anyway, the workaround for now seems to be:

  • Wrapping child jobs in try-catch-rethrow and, on the final retry, making the job pass but with some userland failed status as the return value
  • In the parent, getting all the return values and deciding how to handle any failures

Arguably that's actually quite a good way to go about it regardless, as it's more obvious than an incantation of several flags, and "completed" doesn't have to mean "succeeded".
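The two-step workaround above can be sketched as follows. The `SOFT_FAIL` marker and both helper names are made up for illustration; note also that how `attemptsMade` is counted at processing time has varied across BullMQ versions, so verify that detail against yours:

```javascript
// Userland "soft failure" sentinel: on the final attempt the child resolves
// with this status instead of throwing, so the parent still runs.
const SOFT_FAIL = "soft-fail";

const runChild = async (job, work) => {
  try {
    return { status: "ok", value: await work(job.data) };
  } catch (e) {
    // NOTE: attemptsMade semantics differ between BullMQ versions; verify.
    const isLastAttempt = job.attemptsMade + 1 >= (job.opts.attempts ?? 1);
    if (!isLastAttempt) throw e; // let BullMQ retry normally
    return { status: SOFT_FAIL, error: String(e.message ?? e) };
  }
};

// Parent side: split the children's return values into successes and failures.
const summarizeChildren = (childrenValues) => {
  const results = Object.values(childrenValues);
  return {
    succeeded: results.filter((r) => r.status === "ok"),
    failed: results.filter((r) => r.status === SOFT_FAIL),
  };
};
```

The trade-off, as noted later in the thread, is that soft-failed children show up as completed in the queue UI, so the failure history lives only in the return values.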

@rosslavery commented:

> I assume there's an underlying architectural problem, because it's not entirely clear why we need to disconnect a child and then find it again as per #2092, instead of the parent having a flag to continue on child failures and we use some API to check children manually and decide what state the Parent should end up in. We're going to end up doing that last leg regardless but it's a bit convoluted.
>
> Anyway workaround for now seem to be:
>
>   • Wrapping child jobs in try-catch-rethrow and on the final retry making the job pass but with some userland failed status as the return value
>   • In the parent getting all the return values and deciding how to handle any failures
>
> Arguably that's actually quite a good way to go about it regardless as it's more obvious than an incantation of several flags, and "completed" doesn't have to mean "succeeded"

The workaround you suggest is precisely what we do.

We then have some helper utils that check for the special "succeeded but actually failed" return value and allow the parent to continue or fail based on things like the percentage of child jobs that truly succeeded or failed.
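A sketch of that kind of helper, assuming each child returns an object with an `ok` boolean (a userland convention from the workaround, not a BullMQ field), and with `minSuccessRatio` as an illustrative threshold:

```javascript
// Decide the parent's fate from its children's userland return values.
// Throwing inside the parent processor marks the parent job as failed.
const assertEnoughChildrenSucceeded = (childrenValues, minSuccessRatio = 0.9) => {
  const results = Object.values(childrenValues);
  if (results.length === 0) return { ratio: 1, total: 0 };
  const succeeded = results.filter((r) => r && r.ok).length;
  const ratio = succeeded / results.length;
  if (ratio < minSuccessRatio) {
    throw new Error(
      `only ${succeeded}/${results.length} children truly succeeded`
    );
  }
  return { ratio, total: results.length };
};
```

In a parent processor, this would typically be called on the result of job.getChildrenValues(), with the threshold tuned per flow.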

@sick-sw commented Sep 20, 2024

Ran into the same issue here. It feels like the parent should have control over whether or not it fails, based on its children. That would be a little more intuitive. If we funneled all failures/completions etc. to the parent to process, it would really make things easier, and the processor would actually be hit (and could leave waiting children, depending on your action).

I ended up returning failure data from a try/catch in the child processor, which works fine (forcing a completion), although it's a little misleading, since failed children never appear in the queue. ignoreDependencyOnFailure didn't appear to do anything after testing.

It would be nice to see this feature allow processing of the parent no matter the failures/completions etc., where the parent would decide its fate and take action where needed.

@roggervalf (Collaborator) commented:

Hi @sick-sw, it sounds like ignoreDependencyOnFailure is in fact what you are looking for: https://docs.bullmq.io/guide/flows/ignore-dependency. To get the children's failures, use the getFailedChildrenValues method if you want to run some logic with them.
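Combining the two APIs mentioned, a parent processor might look like the sketch below (children added with `opts: { ignoreDependencyOnFailure: true }` so the parent runs even when some of them fail). The merge logic is illustrative, and the exact return shapes should be verified against your BullMQ version:

```javascript
// Parent processor sketch: children were added with
// `opts: { ignoreDependencyOnFailure: true }`, so the parent is processed
// even when some children fail. Both getters return maps keyed by child
// job key.
const processParent = async (job) => {
  const completed = await job.getChildrenValues(); // jobKey -> return value
  const failed = await job.getFailedChildrenValues(); // jobKey -> failure reason
  return {
    completedCount: Object.keys(completed).length,
    failedCount: Object.keys(failed).length,
    failures: failed,
  };
};
```

This keeps the real failed state on the children (visible in the queue) while still letting the parent decide its own fate from the aggregate.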
