Glob output error #46

Open
taniagmangolini opened this issue Jun 11, 2024 · 2 comments

Comments

taniagmangolini commented Jun 11, 2024

We have workflows running on Cromwell AWS whose tasks declare glob outputs (e.g. Array[File?] bamOutputs = glob("*.out.bam")).
On Cromwell 83.1-AWS these workflows run without any problem. However, when we run the same workflows on Cromwell 85.1-AWS, tasks with glob outputs do not start and the workflow fails with a NullPointerException.
The glob seems to work when it includes a directory (e.g. glob("dataset/part-*")), but the task fails when the directory is omitted (e.g. glob("part-*")).

Below are the two workflows we used to investigate the problem:

Workflow OK:

version 1.0

workflow Teste {
    input {
        File sampleR1 = "s3://bucket/teste/H5N1_S2_R1_001.fastq.gz"
        File sampleR2 = "s3://bucket/teste/H5N1_S2_R2_001.fastq.gz"
    }

    call MyTask {
       input:
            file1 = sampleR1,
            file2 = sampleR2
    }
}

task MyTask {
    input {
        File file1
        File file2
        String dockerImage = "00000.dkr.ecr.sa-east-1.amazonaws.com/prinseq:0.20.4--hdfd78af_5"
        #String dockerImage = ""
    }

    command <<<
        set -e -o pipefail
        touch prinseq.log
        mkdir dataset
        touch dataset/part-1.txt
        ls -la
    >>>

    runtime {
        docker: dockerImage
    }
    
    output {
        File log = "prinseq.log"
        Array[File] multiannoDataset = glob("dataset/part-*")
    }
}

Workflow with error:

version 1.0

workflow Teste {
    input {
        File sampleR1 = "s3://bucket/teste/H5N1_S2_R1_001.fastq.gz"
        File sampleR2 = "s3://bucket/teste/H5N1_S2_R2_001.fastq.gz"
    }

    call MyTask {
       input:
            file1 = sampleR1,
            file2 = sampleR2
    }
}

task MyTask {
    input {
        File file1
        File file2
        String dockerImage = "00000.dkr.ecr.sa-east-1.amazonaws.com/prinseq:0.20.4--hdfd78af_5"
        #String dockerImage = ""
    }

    command <<<
        set -e -o pipefail
        touch prinseq.log
        mkdir dataset
        touch dataset/part-1.txt
        ls -la
    >>>

    runtime {
        docker: dockerImage
    }
    
    output {
        File log = "prinseq.log"
        Array[File] multiannoDataset = glob("part-*")
    }
}

Error:
2024-06-10 21:55:35 cromwell-system-akka.dispatchers.engine-dispatcher-34 INFO - WorkflowExecutionActor-0da47f83-b464-48a4-a5fd-fd235af2d53a [UUID(0da47f83)]: Starting Teste.MyTask
2024-06-10 21:55:36 cromwell-system-akka.dispatchers.engine-dispatcher-34 INFO - Assigned new job execution tokens to the following groups: 0da47f83: 1
2024-06-10 21:55:36 cromwell-system-akka.dispatchers.backend-dispatcher-3411 ERROR - null
java.lang.NullPointerException: null

at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.generateGlobPaths(AwsBatchAsyncBackendJobExecutionActor.scala:440)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.globScript(AwsBatchAsyncBackendJobExecutionActor.scala:872)
at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$globScripts$1(StandardAsyncExecutionActor.scala:296)
at scala.collection.immutable.List.map(List.scala:246)
at scala.collection.immutable.List.map(List.scala:79)
at cromwell.backend.standard.StandardAsyncExecutionActor.globScripts(StandardAsyncExecutionActor.scala:296)
at cromwell.backend.standard.StandardAsyncExecutionActor.globScripts$(StandardAsyncExecutionActor.scala:295)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.globScripts(AwsBatchAsyncBackendJobExecutionActor.scala:96)
at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$commandScriptContents$9(StandardAsyncExecutionActor.scala:473)
at cats.SemigroupalArityFunctions.$anonfun$map2$1(SemigroupalArityFunctions.scala:30)
at cats.data.Validated.map(Validated.scala:559)
at cats.data.ValidatedApplicative.map(Validated.scala:1044)
at cats.data.ValidatedApplicative.map(Validated.scala:1042)
at cats.SemigroupalArityFunctions.map2(SemigroupalArityFunctions.scala:30)
at cats.SemigroupalArityFunctions.map2$(SemigroupalArityFunctions.scala:29)
at cats.Semigroupal$.map2(Semigroupal.scala:51)
at cats.syntax.Tuple2SemigroupalOps.mapN(TupleSemigroupalSyntax.scala:39)
at cromwell.backend.standard.StandardAsyncExecutionActor.commandScriptContents(StandardAsyncExecutionActor.scala:442)
at cromwell.backend.standard.StandardAsyncExecutionActor.commandScriptContents$(StandardAsyncExecutionActor.scala:393)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.commandScriptContents(AwsBatchAsyncBackendJobExecutionActor.scala:96)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.uploadScriptFile(AwsBatchAsyncBackendJobExecutionActor.scala:521)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.executeAsync(AwsBatchAsyncBackendJobExecutionActor.scala:532)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover(StandardAsyncExecutionActor.scala:1154)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover$(StandardAsyncExecutionActor.scala:1146)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.executeOrRecover(AwsBatchAsyncBackendJobExecutionActor.scala:96)
at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustExecuteOrRecover$1(AsyncBackendJobExecutionActor.scala:65)
at cromwell.core.retry.Retry$.withRetry(Retry.scala:46)
at cromwell.backend.async.AsyncBackendJobExecutionActor.withRetry(AsyncBackendJobExecutionActor.scala:61)
at cromwell.backend.async.AsyncBackendJobExecutionActor.cromwell$backend$async$AsyncBackendJobExecutionActor$$robustExecuteOrRecover(AsyncBackendJobExecutionActor.scala:65)
at cromwell.backend.async.AsyncBackendJobExecutionActor$$anonfun$receive$1.applyOrElse(AsyncBackendJobExecutionActor.scala:88)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
at akka.actor.Actor.aroundReceive(Actor.scala:539)
at akka.actor.Actor.aroundReceive$(Actor.scala:537)
at cromwell.backend.impl.aws.AwsBatchAsyncBackendJobExecutionActor.aroundReceive(AwsBatchAsyncBackendJobExecutionActor.scala:96)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:614)
at akka.actor.ActorCell.invoke(ActorCell.scala:583)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
at akka.dispatch.Mailbox.run(Mailbox.scala:229)
at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
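
One possible reading of the stack trace above, not confirmed against the Cromwell source: the exception is thrown in generateGlobPaths while the command script is still being built (globScript, commandScriptContents, uploadScriptFile), so it happens before the task's command runs and before any files could be matched. A JDK behavior that would fit the "works with a directory, fails without one" pattern is that java.nio.file.Path.getParent() returns null for a path with no directory component. The sketch below only demonstrates that JDK behavior. The class GlobParentSketch and the helper globbedDir are hypothetical names used for illustration and are not taken from Cromwell.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; not part of Cromwell.
final class GlobParentSketch {

    // Hypothetical helper: returns the directory part of a glob pattern,
    // falling back to "." when the pattern has no directory component.
    static String globbedDir(String globPattern) {
        Path parent = Paths.get(globPattern).getParent();
        return parent == null ? "." : parent.toString();
    }

    public static void main(String[] args) {
        // Note: '*' is a legal path character on Linux/macOS but not on Windows.

        // A pattern with a directory component has a parent.
        System.out.println(Paths.get("dataset/part-*").getParent()); // prints "dataset"

        // A bare pattern has no parent: getParent() returns null, so calling
        // a method on the result would throw a NullPointerException.
        System.out.println(Paths.get("part-*").getParent()); // prints "null"

        // The null-safe helper falls back to the current directory.
        System.out.println(globbedDir("part-*")); // prints "."
    }
}
```

If this is indeed the cause, a null-safe fallback such as the one in globbedDir, or keeping a directory prefix in the glob as in the working workflow, would avoid the null value.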

xquek commented Oct 15, 2024

We experienced the same issue here.

I have a draft PR that I think fixes the issue, but it is a bit of a mess, so there may be a more elegant fix.

With the draft PR applied, things seem to work for us, but I am not sure whether it breaks EFS, as I have only tested it with S3 files:
https://github.com/henriqueribeiro/cromwell/pull/45/files

Maybe @geertvandeweyer or @henriqueribeiro could take a look? I am happy to clean up my PR afterwards and to extract addSharedMemory from the rest of the code.

Let me know, thanks!

@geertvandeweyer
Collaborator

@taniagmangolini: could there be a mix of issues here?

You mention that Array[File?] with glob() didn't work. That's correct. My recent PR 55 on optional in/out files should handle that.

Your example here uses non-optional files (Array[File]), and the glob points to a folder containing no files matched by the glob (you glob outside the dataset folder). I haven't tested this, but maybe Cromwell complains about globs that return no files?
