Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move BUSCO.nf process to a bash script #393

Closed
jfy133 opened this issue Feb 20, 2023 · 0 comments
Closed

Move BUSCO.nf process to a bash script #393

jfy133 opened this issue Feb 20, 2023 · 0 comments
Assignees
Labels
enhancement New feature or request low-priority

Comments

@jfy133
Copy link
Member

jfy133 commented Feb 20, 2023

Description of feature

As this makes ugly warning messages when running -with-tower

I'm shamefully tagging @skrakau to look into this as the bash script scares me too much to try doing it myself 😅 (sorry!)

WARN: Tower request field `tasks.script` exceeds expected size | offending value: `
    # ensure augustus has write access to config directory
    if [ Y = "Y" ] ; then
        cp -r /usr/local/config/ augustus_config/
        export AUGUSTUS_CONFIG_PATH=augustus_config
    fi

    # place db in extra folder to ensure BUSCO recognizes it as path (instead of downloading it)
    if [ Y = "Y" ] ; then
        mkdir dataset
        mv bacteria_odb10 dataset/
    fi

    # set nullgob: if pattern matches no files, expand to a null string rather than to itself
    shopt -s nullglob

    # only used for saving busco downloads
    most_spec_db="NA"

    if busco --lineage_dataset dataset/bacteria_odb10         --mode genome         --in MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa         --cpu "2"         --out "BUSCO" > MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log 2> MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.err; then

        # get name of used specific lineage dataset
        summaries=(BUSCO/short_summary.specific.*.BUSCO.txt)
        if [ ${#summaries[@]} -ne 1 ]; then
            echo "ERROR: none or multiple 'BUSCO/short_summary.specific.*.BUSCO.txt' files found. Expected one."
            exit 1
        fi
        [[ $summaries =~ BUSCO/short_summary.specific.(.*).BUSCO.txt ]];
        db_name_spec="${BASH_REMATCH[1]}"
        most_spec_db=${db_name_spec}
        echo "Used specific lineage dataset: ${db_name_spec}"

        if [ Y = "Y" ]; then
            cp BUSCO/short_summary.specific.${db_name_spec}.BUSCO.txt short_summary.specific_lineage.${db_name_spec}.MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa.txt

            # if lineage dataset is provided, BUSCO analysis does not fail in case no genes can be found as when using the auto selection setting
            # report bin as failed to allow consistent warnings within the pipeline for both settings
            if egrep -q $'WARNING:      BUSCO did not find any match.' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log ; then
                echo "WARNING: BUSCO could not find any genes for the provided lineage dataset! See also MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log."
                echo -e "MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa    No genes" > "MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.failed_bin.txt"
            fi
        else
            # auto lineage selection
            if { egrep -q $'INFO:       \S+ selected' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log                 && egrep -q $'INFO: Lineage \S+ is selected, supported by ' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log ; } ||                 { egrep -q $'INFO: \S+ selected' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log                 && egrep -q $'INFO: The results from the Prodigal gene predictor indicate that your data belongs to the mollicutes clade. Testing subclades...' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log                 && egrep -q $'INFO:   Using local lineages directory ' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log ; }; then
                # the second statement is necessary, because certain mollicute clades use a different genetic code, are not part of the BUSCO placement tree, are tested separately
                # and cause different log messages
                echo "Domain and specific lineage could be selected by BUSCO."
                cp BUSCO/short_summary.specific.${db_name_spec}.BUSCO.txt short_summary.specific_lineage.${db_name_spec}.MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa.txt

                db_name_gen=""
                summaries_gen=(BUSCO/short_summary.generic.*.BUSCO.txt)
                if [ ${#summaries_gen[@]} -lt 1 ]; then
                    echo "No 'BUSCO/short_summary.generic.*.BUSCO.txt' file found. Assuming selected domain and specific lineages are the same."
                    cp BUSCO/short_summary.specific.${db_name_spec}.BUSCO.txt short_summary.domain.${db_name_spec}.MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa.txt
                    db_name_gen=${db_name_spec}
                else
                    [[ $summaries_gen =~ BUSCO/short_summary.generic.(.*).BUSCO.txt ]];
                    db_name_gen="${BASH_REMATCH[1]}"
                    echo "Used generic lineage dataset: ${db_name_gen}"
                    cp BUSCO/short_summary.generic.${db_name_gen}.BUSCO.txt short_summary.domain.${db_name_gen}.MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa.txt
                fi

                for f in BUSCO/run_${db_name_gen}/busco_sequences/single_copy_busco_sequences/*faa; do
                    cat BUSCO/run_${db_name_gen}/busco_sequences/single_copy_busco_sequences/*faa | gzip >MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_buscos.${db_name_gen}.faa.gz
                    break
                done
                for f in BUSCO/run_${db_name_gen}/busco_sequences/single_copy_busco_sequences/*fna; do
                    cat BUSCO/run_${db_name_gen}/busco_sequences/single_copy_busco_sequences/*fna | gzip >MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_buscos.${db_name_gen}.fna.gz
                    break
                done

            elif egrep -q $'INFO:       \S+ selected' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log && egrep -q $'INFO: No marker genes were found. Root lineage \S+ is kept' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log ; then
                echo "Domain could be selected by BUSCO, but no more specific lineage."
                cp BUSCO/short_summary.specific.${db_name_spec}.BUSCO.txt short_summary.domain.${db_name_spec}.MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa.txt

            elif egrep -q $'INFO:       \S+ selected' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log && egrep -q $'INFO: Not enough markers were placed on the tree \([0-9]*\). Root lineage \S+ is kept' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log ; then
                echo "Domain could be selected by BUSCO, but no more specific lineage."
                cp BUSCO/short_summary.specific.${db_name_spec}.BUSCO.txt short_summary.domain.${db_name_spec}.MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa.txt

            elif egrep -q $'INFO:       \S+ selected' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log && egrep -q $'INFO: Running virus detection pipeline' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log ; then
                # TODO double-check if selected dataset is not one of bacteria_*, archaea_*, eukaryota_*?
                echo "Domain could not be selected by BUSCO, but virus dataset was selected."
                cp BUSCO/short_summary.specific.${db_name_spec}.BUSCO.txt short_summary.specific_lineage.${db_name_spec}.MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa.txt
            else
                echo "ERROR: Some not expected case occurred! See MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log." >&2
                exit 1
            fi
        fi

        for f in BUSCO/run_${db_name_spec}/busco_sequences/single_copy_busco_sequences/*faa; do
            cat BUSCO/run_${db_name_spec}/busco_sequences/single_copy_busco_sequences/*faa | gzip >MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_buscos.${db_name_spec}.faa.gz
            break
        done
        for f in BUSCO/run_${db_name_spec}/busco_sequences/single_copy_busco_sequences/*fna; do
            cat BUSCO/run_${db_name_spec}/busco_sequences/single_copy_busco_sequences/*fna | gzip >MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_buscos.${db_name_spec}.fna.gz
            break
        done

    elif egrep -q $'ERROR:      No genes were recognized by BUSCO' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.err ; then
        echo "WARNING: BUSCO analysis failed due to no recognized genes! See also MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.err."
        echo -e "MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa    No genes" > "MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.failed_bin.txt"

    elif egrep -q $'INFO:       \S+ selected' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log && egrep -q $'ERROR:        Placements failed' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.err ; then
        echo "WARNING: BUSCO analysis failed due to failed placements! See also MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.err. Still using results for selected generic lineage dataset."
        echo -e "MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa    Placements failed" > "MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.failed_bin.txt"

        message=$(egrep $'INFO: \S+ selected' MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.log)
        [[ $message =~ INFO:[[:space:]]([_[:alnum:]]+)[[:space:]]selected ]];
        db_name_gen="${BASH_REMATCH[1]}"
        most_spec_db=${db_name_gen}
        echo "Used generic lineage dataset: ${db_name_gen}"
        cp BUSCO/auto_lineage/run_${db_name_gen}/short_summary.txt short_summary.domain.${db_name_gen}.MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa.txt

        for f in BUSCO/auto_lineage/run_${db_name_gen}/busco_sequences/single_copy_busco_sequences/*faa; do
            cat BUSCO/auto_lineage/run_${db_name_gen}/busco_sequences/single_copy_busco_sequences/*faa | gzip >MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_buscos.${db_name_gen}.faa.gz
            break
        done
        for f in BUSCO/auto_lineage/run_${db_name_gen}/busco_sequences/single_copy_busco_sequences/*fna; do
            cat BUSCO/auto_lineage/run_${db_name_gen}/busco_sequences/single_copy_busco_sequences/*fna | gzip >MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_buscos.${db_name_gen}.fna.gz
            break
        done

    else
        echo "ERROR: BUSCO analysis failed for some unknown reason! See also MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_busco.err." >&2
        exit 1
    fi

    # additionally output genes predicted with Prodigal (GFF3)
    if [ -f BUSCO/logs/prodigal_out.log ]; then
        mv BUSCO/logs/prodigal_out.log "MEGAHIT-MetaBAT2-test_minigut.unbinned.1.fa_prodigal.gff"
    fi

    # if needed delete temporary BUSCO files
    if [ Y ]; then
        find . -depth -type d -name "augustus_config" -execdir rm -rf "{}" \;
        find . -depth -type d -name "auto_lineage" -execdir rm -rf "{}" \;
        find . -depth -type d -name "run_*" -execdir rm -rf "{}" +
    fi

    cat <<-END_VERSIONS > versions.yml
    "NFCORE_MAG:MAG:BUSCO_QC:BUSCO":
        python: $(python --version 2>&1 | sed 's/Python //g')
        R: $(R --version 2>&1 | sed -n 1p | sed 's/R version //' | sed 's/ (.*//')
        busco: $(busco --version 2>&1 | sed 's/BUSCO //g')
    END_VERSIONS
    `, size: 10488 (max: 10240)
    ```
@jfy133 jfy133 added the enhancement New feature or request label Feb 20, 2023
@jfy133 jfy133 changed the title Move BUSCO_QC process to a bash script Move BUSCO.nf process to a bash script Feb 20, 2023
@jfy133 jfy133 closed this as completed May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request low-priority
Projects
None yet
Development

No branches or pull requests

2 participants