Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sylph tool Wrapper #1518

Open
wants to merge 18 commits into
base: master
Choose a base branch
from
Open

Sylph tool Wrapper #1518

wants to merge 18 commits into from

Conversation

tcollins2011
Copy link
Contributor

Moving the sylph tool wrapper pull request to galaxytools due to the metadeata file associated with the .syldb database having a size of 6mb.

@tcollins2011
Copy link
Contributor Author

@astrovsky01 Connecting you to the PR

Copy link
Owner

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tcollins2011!

The DB is tiny? Is it the wrong DB?
A few initial comments inline.

tools/sylph/sylph.xml Outdated Show resolved Hide resolved
tools/sylph/sylph.xml Outdated Show resolved Hide resolved
tools/sylph/sylph.xml Outdated Show resolved Hide resolved
tools/sylph/sylph.xml Outdated Show resolved Hide resolved
<expand macro="input_database"/>
</inputs>
<outputs>
<data format="tabular" name="output" label="${tool.name} on ${on_string}" from_work_dir="output.tsv"/>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could add here as metadata the column names if you like. they seem to be static.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They query and profile commands both generate an output tabular file, but are slightly different (i.e for output_1.tabular and output_5.tabular are similar but slightly different, and only difference is query vs profile). We could split this into two tools if you think that would be valuable, though?

tools/sylph/tool_data_table_conf.xml.sample Outdated Show resolved Hide resolved
tools/sylph/tool_data_table_conf.xml.test Outdated Show resolved Hide resolved
tools/sylph/test-data/test R1.fq Outdated Show resolved Hide resolved
Copy link
Collaborator

@bebatut bebatut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tcollins2011! I put some comments.

A big one: it seems that both Python scripts come from https://github.com/bluenote-1577/sylph-utils and there is no LICENSE in the repository. So we are not allowed to take the scripts and add them here.
It might be better to find a way (maybe talking with authors) so they make the scripts available via conda maybe. That would avoid duplication

tools/sylph/.shed.yml Outdated Show resolved Hide resolved
tools/sylph/.shed.yml Outdated Show resolved Hide resolved
tools/sylph/.shed.yml Outdated Show resolved Hide resolved
tools/sylph/macros.xml Outdated Show resolved Hide resolved
Comment on lines 178 to 180
<param name="functions" type="select" label="ANI querying or taxonomic profiling" help="ANI(Average Nucleotide Identity) is a measure of nucleotide-level similarity between the genomes of two microogranisms. Profile querying involves comparing functional or sequence profiles derived from the genomic data.">
<option value="profile">Taxanomic Profile</option>
<option value="ani">ANI Query</option>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if it would not more sense to have 2 Galaxy tools (one for sylph profile, one for sylph query).
That would be closer of how people would use it via command-line. What do you think?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tcollins2011 any opinion here?

tools/sylph/sylph.xml Outdated Show resolved Hide resolved
#else:
&& ln -s "$database_select.metadata" "database.tsv.gz"
#end if
&& python "$__tool_directory__/sylph_to_taxprof.py" -s output.tsv -m database.tsv.gz -o metaphlan_
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably need to put in a # if "metaphlan" in $function.outputs:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metaphlan tool must be run regardless if krona, metphlan, or both are requested. The krona formatted file is generated from the metaphlan converted file. The #if part is about if we hand it back to the user

tools/sylph/sylph.xml Outdated Show resolved Hide resolved
tools/sylph/sylph.xml Outdated Show resolved Hide resolved
tools/sylph/sylph.xml Outdated Show resolved Hide resolved
@astrovsky01
Copy link
Contributor

astrovsky01 commented Oct 9, 2024

@bgruening

The DB is tiny? Is it the wrong DB? A few initial comments inline.

The DB we used in this for testing is basically just e. coli, and is as minimal as possible to run tests againstt. Even at that size, though, the metadata file was too big for the repo. That's also why the final test is just checking command text. The next tool we were planning to work on was one to generate sylph db and their associated metada files so users can build their own. In the meantime, I added a readme clarifying how to get the pre-made db and metadata.

tools/sylph/macros.xml Outdated Show resolved Hide resolved
@tcollins2011
Copy link
Contributor Author

@bgruening @astrovsky01 We could instead wrap a larger database for more thorough testing. We essentially made as small a database as possible when we trying to push this to tools-iuc, but if you think it would provide more thorough testing we could instead swap it out for a larger database. In theory though, the tiny database should allow for fully functional testing within galaxy.

tcollins2011 and others added 4 commits October 9, 2024 09:53
Adding a profile to the tool

Co-authored-by: Björn Grüning <[email protected]>
Changing to correct repository owner

Co-authored-by: Bérénice Batut <[email protected]>
Improving description of tool

Co-authored-by: Bérénice Batut <[email protected]>
Improved description of the tool

Co-authored-by: Bérénice Batut <[email protected]>
@bgruening
Copy link
Owner

@bgruening @astrovsky01 We could instead wrap a larger database for more thorough testing. We essentially made as small a database as possible when we trying to push this to tools-iuc, but if you think it would provide more thorough testing we could instead swap it out for a larger database. In theory though, the tiny database should allow for fully functional testing within galaxy.

All good, I was just confused because you said the DB is 6MB and I wanted to check if the submitted file was a mistake.

Updated due to publication

Co-authored-by: Bérénice Batut <[email protected]>
@astrovsky01
Copy link
Contributor

it seems that both Python scripts come from https://github.com/bluenote-1577/sylph-utils and there is no LICENSE in the repository.

@bebatut the krona converter is actually a modified version of the one wrapped in the Metaphlan tool in galaxy. I believe it's from Galaxy, too, since I don't see it in the metaphlan original repo

@bgruening
Copy link
Owner

It looks like test files are missing and the limiting is not passing.

@tcollins2011
Copy link
Contributor Author

@bgruening Alright I think everything should be passing now except the file size check.

<param name="functions" value="profile"/>
<param name="min_num_kmers" value="49"/>
</conditional>
<output name="output" value="output_2.tabular" compare="sim_size"/>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't use sim_size on text outputs, use asserts :)

<when value="cached">
<param label="Select a sylph database" name="sylph_database" type="select">
<options from_data_table="sylph_databases">
<validator message="No Sylph databases are available" type="no_options" />
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add here a filter for version 1 or something like that ... and then we include version 1 in the test file.

Whenever the tool changes to DB layout we increase this version.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping, you are not filtering the DB here according to the version

Copy link
Owner

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good to me, just 2 small comments.

@tcollins2011
Copy link
Contributor Author

@bgruening Sylph has been split into two separate tools now both within this directory. Additionally, I see that we are still failing the file size check, but I'm not sure if it is possible to reduce the database metadata files to a smaller size.

Copy link
Owner

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tcollins2011! Added a few more smaller comments!

<token name="@LICENSE@">MIT</token>
<token name="@DB_SELECTOR@"><![CDATA[
#if $database_select.select == 'cached':
ln -s '$database_select.sylph_database.fields.path/database.syldb' 'database.syldb' &&
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ln -s '$database_select.sylph_database.fields.path/database.syldb' 'database.syldb' &&
ln -s '$database_select.sylph_database.fields.path/database.syldb' 'database.syldb' &&

]]></token>
<token name="@SINGLE_INPUT@"><![CDATA[
#if $sketch.input.ext == 'fastqsanger'
#set $ext = 'fastq'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#set $ext = 'fastq'
#set $ext = 'fastq'

Comment on lines +101 to +102
<requirement type="package" version="3.11.9">python</requirement>
<requirement type="package" version="2.2.3">pandas</requirement>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<requirement type="package" version="3.11.9">python</requirement>
<requirement type="package" version="2.2.3">pandas</requirement>
<requirement type="package" version="3.11">python</requirement>
<requirement type="package" version="2.2">pandas</requirement>

</xml>
<xml name="creator">
<creator>
<organization name="University of Toronto" url="https://github.com/bluenote-1577/sylph"/>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is meant to acknowledge you, the Galaxy tool creator.

@@ -0,0 +1 @@
sylph_db sylph_db ${__HERE__}/sylph_db month,year
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

month,year should be something .. here isn' it?

It could be simply 1

<when value="cached">
<param label="Select a sylph database" name="sylph_database" type="select">
<options from_data_table="sylph_databases">
<validator message="No Sylph databases are available" type="no_options" />
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping, you are not filtering the DB here according to the version

-t \${GALAXY_SLOTS:-4}
database.syldb
sylph_sketches/*.sylsp
-o $output
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
-o $output
-o '$output`

Please check if the rest is also quoted

</param>
<!-- Only permitting fastq as tool input only allows fastq and fastq.gz as file ext -->
<when value="single">
<param name="input" type="data" format="fastq,fastq.gz,fastqsanger,fastqsanger.gz" label="Single-end input reads"/>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed, can we not just go with the multiple=true option?

<option value="single">Individual single-end reads</option>
<option value="single_group">Group of single-ended reads</option>
<option value="paired">One set of paired-end reads</option>
<option value="paired_group"> Group of paired-ended reads</option>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<option value="paired_group"> Group of paired-ended reads</option>
<option value="paired_group">Group of paired-ended reads</option>

@@ -0,0 +1,210 @@
<tool id="sylph_profile" name="sylph profile" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="23.0">
<expand macro='description'/>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be specific to this tool? (and the other)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants