Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow genome selection inconsistent with few bugs: fixed vs runtime #1132

Open
jennaj opened this issue Nov 19, 2015 · 5 comments
Open

Workflow genome selection inconsistent with few bugs: fixed vs runtime #1132

jennaj opened this issue Nov 19, 2015 · 5 comments
Assignees

Comments

@jennaj
Copy link
Member

jennaj commented Nov 19, 2015

Few issues, some are tool related, but might be global.

Example tools: Tophat2 and SAM-to-BAM

Impact:

  • SAM-to-BAM genome selection can only be done within the editor, manually, without benefit of an active list of dbkeys. There is no way to select a genome from the history. No prompt is given to the user that if a custom genome is to be used, it must be converted to a custom build first, and that the dbkey must be a known (due to no active dbkey list).
  • Tophat2 genome selection can only be done from within the editor and when accessing a built-in index (does not include custom builds). Changing a fixed genome selection at runtime is not possible. Selecting a custom genome directly from the history is possible at runtime.

Detail Tophat2:

  • Fixed genome selection from built-in indexes (active list in editor)
  • Runtime genome selection from the history
  • Does NOT allow runtime genome selection from built-in indexes (even though the workflow submission form has the icon for this function) - Bug?

Detail BAM-to-SAM:

  • Fixed genome selection from built-in indexes but list is not active in the editor or submission form - Bug?
    • In editor, if an original extracted dbkey is altered to be set at runtime, it cannot be recovered when switching back to fixed mode
    • In editor, the dbkey is manually entered and can be left blank which will always trigger an error - Bug?
    • In view (not submission form), if the genome is not fixed and to be selected at runtime, option does not state Select at runtime for all tools (example tools compared below in graphics)
  • Editor does NOT allow option for runtime genome selection from the history - Enhancement?

Graphics demonstrating behavior:

Workflow Editor

  • editor sam2bam fixed genome option editor sam2bam fixed genome option
  • editor sam2bam runtime dbkey option editor sam2bam runtime dbkey option
  • editor tophat fixed dbkey option from index editor tophat fixed dbkey option from index
  • editor tophat runtime dbkey option from history editor tophat runtime dbkey option from history

Workflow Submission

  • running sam2bam fixed dbkey option click change (nothing can be selected) running sam2bam fixed dbkey option click change
  • running sam2bam runtime dbkey option (always blank, no way to enter/select genome) running sam2bam runtime dbkey option
  • running tophat fixed dbkey option from index (clicking on icon to modify genome selection does nothing)
    running tophat fixed dbkey option from index

Workflow View

  • view sam2bam dbkey fixed option (correct/expected example) view sam2bam dbkey fixed option
  • view sam2bam dbkey runtime option (no information about genome selection occurring at runtime given, see next for how view should look) view sam2bam dbkey runtime option
  • view tophat dbkey runtime option (correct/expected example)
    view tophat dbkey runtime option
@guerler guerler self-assigned this Nov 19, 2015
@jennaj
Copy link
Member Author

jennaj commented Nov 20, 2015

Related #979

@guerler
Copy link
Contributor

guerler commented Nov 20, 2015

@jennaj Thanks a lot for all the details and screenshots. I have tried to answer all questions:

  1. "SAM-to-BAM genome selection can only be done within the editor, manually, without benefit of an active list of dbkeys."
    Response: The genome selection is configured (in the tool xml) such that dbkeys are filtered by the metadata of the selected dataset. The list is empty because no dataset can be selected in the workflow editor.

  2. "There is no way to select a genome from the history."
    Response: The workflow editor does not support the selection of datasets from the history.

  3. "No prompt is given to the user that if a custom genome is to be used, it must be converted to a custom build first, and that the dbkey must be a known (due to no active dbkey list)."
    Response: This is an enhancement. We should have a separate git issue for this.

  4. "Tophat2 genome selection can only be done from within the editor and when accessing a built-in index (does not include custom builds). Changing a fixed genome selection at runtime is not possible. Selecting a custom genome directly from the history is possible at runtime."
    Response: This is a bug of the current run workflow tool form and will be resolved once we switch to the upcoming revised version.

  5. "In editor, if an original extracted dbkey is altered to be set at runtime, it cannot be recovered when switching back to fixed mode"
    Response: This is an enhancement we should implement. It makes this feature more user friendly.

  6. "In editor, the dbkey is manually entered and can be left blank which will always trigger an error - Bug?"
    Response: Yes, its a bug. The current run workflow form and/or the web-controller should properly validate the inputs. See also: 4.

  7. "In view (not submission form), if the genome is not fixed and to be selected at runtime, option does not state Select at runtime for all tools (example tools compared below in graphics). Editor does NOT allow option for runtime genome selection from the history - Enhancement?"
    Response: Yes, its an enhancement. This is related to Conditional selects must be pre-specified in workflow, not runtime configurable #979. It seems that the challenge here is that conditionals and repeats often alter the number of input datasets and the run workflow tool form is unable to account for new inputs and workflow pipes. However, we should at least enhance this and allow conditionals/repeats without dataset implications to be set at run time.

Thanks again.

@jennaj
Copy link
Member Author

jennaj commented Nov 30, 2015

Thanks Sam. I think we are in the same place. Plus I can clarify a bit. When I say make a ticket - that is my suggestion. If we all think that the SAMtools suite should be different, then that is something we can decide. It has usage implications that our community has already run into. And will more often with the generic workflow modeling (migration to open source shared workflow repos).

(github ticket is a placeholder for those we decide to actually create then link back here)

  1. "SAM-to-BAM genome selection can only be done within the editor, manually, without benefit of an active list of dbkeys."
    Response: The genome selection is configured (in the tool xml) such that dbkeys are filtered by the metadata of the selected dataset. The list is empty because no dataset can be selected in the workflow editor.

Yes, but I think that we should consider turning that off for this tool set. No genome filtering based on inputs. Why? If this tool is included it freezes the workflow to a specific genome, unless one has edit permissions and want to change the workflow itself. As before, and a biggie, Galaxy is now being abstracted with other workflow engines. Plus, most of our other tools allow genome selected/modification at run-time. (except when it is not working - the editor AND the run-time form indicate that genome selection can be made in either for certain tools, like Tophat - which is why I included it as an comparision example). Does that makes sense? This could mean a github ticket.

  1. "There is no way to select a genome from the history."
    Response: The workflow editor does not support the selection of datasets from the history.

It doesn't for this tool set, but it does for Tophat. This tool should be like the other. Actually, all tools that have a genome selection function should have this be the same (including the part about using a genome from the history - within the editor - that was in another observation I made). This is probably the same github issue as for 1)

  1. "No prompt is given to the user that if a custom genome is to be used, it must be converted to a custom build first, and that the dbkey must be a known (due to no active dbkey list)."
    Response: This is an enhancement. We should have a separate git issue for this.

I agree - github issue. Might be the same as 1) and 2). In effect, make the ref genome selection just like Tophat and most other tools. From a list of indexed genomes, same as if using it in the history (and without any filtering by inputs). Workflows are not always created from histories and we should support the design of workflows that have as few hard-set parameters as possible. Ref genome is a biggie to have hard-set. Prevents a ton of valid/common use cases.

  1. "Tophat2 genome selection can only be done from within the editor and when accessing a built-in index (does not include custom builds). Changing a fixed genome selection at runtime is not possible. Selecting a custom genome directly from the history is possible at runtime."
    Response: This is a bug of the current run workflow tool form and will be resolved once we switch to the upcoming revised version.

OK, so once Tophat is functional again we can use it as a model for other tools (again, this is why I picked it to compare to, but in the process uncovered the issues with it). I didn't look specifically to see which tools have this ref genome selection fully implimented. But the Samtools suite needs to. Rest should be examined (are probably just a few or few suites). Could be a master github ticket for all tools that need an update (as a check list) then pull-requests are attached as done?
Goal for ref genome selection across tools in workflow editor & run-time??
1 - ref genome from list of known builds, including custom builds in editor
2 - ref genome from the history in editor
3 - ref genome can be modifed at run-time, A) from list of known builds or 2) from the histroy
4 - UI has the same functionality and looks the same in all tools in all views for ref genome selection - e.g. consistency

  1. "In editor, if an original extracted dbkey is altered to be set at runtime, it cannot be recovered when switching back to fixed mode"
    Response: This is an enhancement we should implement. It makes this feature more user friendly.

Ok, this is a smaller change. I'll submit it as a enhancement github ticket

  1. "In editor, the dbkey is manually entered and can be left blank which will always trigger an error - Bug?"
    Response: Yes, its a bug. The current run workflow form and/or the web-controller should properly validate the inputs. See also: 4.

We can test this once the new version is on main, see if it still a bug, and needs a github issue

  1. "In view (not submission form), if the genome is not fixed and to be selected at runtime, option does not state Select at runtime for all tools (example tools compared below in graphics). Editor does NOT allow option for runtime genome selection from the history - Enhancement?"
    Response: Yes, its an enhancement. This is related to Conditional selects must be pre-specified in workflow, not runtime configurable #979. It seems that the challenge here is that conditionals and repeats often alter the number of input datasets and the run workflow tool form is unable to account for new inputs and workflow pipes. However, we should at least enhance this and allow conditionals/repeats without dataset implications to be set at run time.

It is related to #979. That one is a third style of genome selection in the editor and run-time function (as well as the tool form, but that is a style issue unrelated to function. I consider style secondary - I have only asked for functional consistency in the above - and 979 is also about functionality).

Although certainly using the same style in all tools (at least those we can change) is a big win. Maybe it needs to be moved out to a distinct function then called by tools (?). Or maybe it is and just not all tools use it (yet?).

Thanks for the great comments and help. I know this is a monster ticket. It is just the more I looked at a way to solve a user's problem with samtools for a workflow a nest of other issues popped out.

I'll wait for final feedback then make the tickets. Linking back here because I am SURE none of us want to go through the concepts in painful detail again, ha :)

@yhoogstrate
Copy link
Member

I am not sure if the following is of any help, but maybe it is so I'll post it. When I had a similar problem I fixed this at the tool side as follows (https://github.com/ErasmusMC-Bioinformatics/featurecounts_galaxy_wrapper/blob/master/featurecounts/featurecounts.xml):

instead of using 2 selects:

            <param name="source_select" type="select" label="GFF/GTF Source">
                <option value="indexed_filtered">Use a built-in index (which fits your reference)</option>
                <option value="history">Use reference from the history</option>
            </param>

I added an extra select that is designed for running in workflows:

            <param name="source_select" type="select" label="GFF/GTF Source">
                <option value="indexed_filtered">Use a built-in index (which fits your reference)</option>
                <option value="history">Use reference from the history</option>
                <option value="attribute">Use a built-in index based on the 'metadata.dbkey' attribute; ideal in workflows</option>
            </param>

When a tool gets executed, the last option (attribute) is selected, it will look for the dbkey in the input file, and scans the data table to obtain the corresponding entry:

    #if $reference_gene_sets_source.source_select == "indexed_filtered"
        "$reference_gene_sets_source.reference_gene_sets"
    #else if $reference_gene_sets_source.source_select == "history"
        "$reference_gene_sets_source.reference_gene_sets"
    #else
        #*
            This is a workaround to obtain the "genome.fa" file that
            corresponds to the dbkey of the alignments.
            Because this file is "calculated" during run-time, it can
            be used in a workflow.
        *#
        "${ filter( lambda x: str( x[0] ) == str( { alignment.metadata.dbkey:True for alignment in $alignments }.keys()[0] ), $__app__.tool_data_tables[ 'gene_sets' ].get_fields() )[0][2] }"
    #end if

@jennaj
Copy link
Member Author

jennaj commented Dec 4, 2015

Thanks @yhoogstrate !

The concept seems intersting to me, but let's see what @guerler and others think. Seem like it would be very useful to include enhanced genome select options, specifcally for workflow use, in such core tools.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants