Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix HISAT2 strandedness parameter #1470

Merged
merged 12 commits into from
Sep 26, 2017
Merged

Conversation

mblue9
Copy link
Contributor

@mblue9 mblue9 commented Sep 8, 2017

Submitting this PR to fix the bug reported here #1455

The suggested fix is from the reporter @NCEichner

@bgruening
Copy link
Member

Can we add a test and increase the version number?
Thanks @mblue9!

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 12, 2017

Hi @bgruening thanks for taking a look at this. I've added a test and increased the version number.

Something I don't understand yet with the wrapper versioning is how come I did not have to bump the version number for this kallisto bugfix #1463 but I do need to now for this bugfix (or should I really have bumped the kallisto version?)?

I added a test but I don't know if it's a good one, as all it does is check that a bam is output when the rna_strandedness parameter is set to F. I was thinking a better test could be to check that the header of the bam in the output contains "--rna-strandness FR". Is it possible to check the text of a BAM file header in a test?

@@ -78,7 +78,15 @@
<token name="@strandedness_parameters@">
#if str($spliced_options.spliced_options_selector) == "advanced":
#if str($spliced_options.rna_strandness).strip() != '':
--rna-strandness $spliced_options.rna_strandness
#if str($input_format.paired.paired_selector) == 'paired':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you use a tool for checking indentation? my eyes don't seem to be very reliable :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mblue9 Unfortunately I'm not aware of tool to check indentation for the Cheetah code inside XML, so basically it's my eyes (on GitHub) or searching for 4 spaces in the text editor.

Copy link
Member

@shiltemann shiltemann Sep 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mblue9 no worries, @nsoranzo just has special eagle-eye superpowers the rest of us don't possess ;) 🦅 👀

(but most editors have a setting to show whitespace characters and/or indentation guides, which helps for me)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for the tips!

@@ -464,7 +464,7 @@
<output name="output_unaligned_reads_l" file="test_unaligned_reads.fasta" />
<output name="output_unaligned_reads_r" file="test_unaligned_reads.fasta" />
</test>
<test><!-- Ensure fastqsanger.gz works -->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave the comment here and below, any specific reason to remove them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason to remove so I put those comments back. I only removed them because they were getting in the way of me commenting out tests while doing my own testing. I just forgot to put them back. I put them back after your review here but then I removed them again for more testing and then forgot to put them back again :( They're back again now.

<param name="history_item" ftype="fasta" value="phiX.fa" />
<param name="forward" ftype="fastqsanger" value="hisat_input_1_forward.fastq" />
<param name="reverse" ftype="fastqsanger" value="hisat_input_1_reverse.fastq" />
<param name="rna_strandedness" value="F" />
Copy link
Member

@nsoranzo nsoranzo Sep 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/rna_strandedness/rna_strandness/
This may actual cause the output bam to be different, I imagine)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well this highlighted that my test really was terrible! That strandness parameter was never used because of the typo so my test was checking against a incorrect bam! The lines diff check doesn't catch that but checking the header would, so if there is ever a way in the future to check a bam header that could be cool to help make tests like this one more robust.

@nsoranzo
Copy link
Member

Something I don't understand yet with the wrapper versioning is how come I did not have to bump the version number for this kallisto bugfix #1463 but I do need to now for this bugfix (or should I really have bumped the kallisto version?)?

I'm not sure this was ever formalised, but you generally need to increase the version number for:

  • modification of requirements
  • adding/removing params and outputs, or changing their type
  • everything else that may change how the tool works in a workflow.

You don't need to increase the version number for:

  • cosmetic changes, e.g. modifying tool name, description or help, param/output label or help

Changing the command (like in this case) is borderline, because it doesn't hinder workflow compatibility, but changes the output, so it may be a problem for reproducibility, if you really want to reproduce wrong results!
In my opinion, these changes don't need a version bump, but it's fine if you do it.

@yhoogstrate
Copy link
Member

yhoogstrate commented Sep 12, 2017

Hi there, great to see things getting picked up :)

I am not sure on this fix. It looks like the problem is more the way the input is requested. Currently the 'rna_strandness' is part of the spliced options. Would it not be better to put this close to the input format stuff? For example in a paired-end/sngle-end conditional?

<conditional name="mate_pairs">
	<input type=select> 
		<option value="single-end">
		<option value="paired-end">
	</option>

	<when pe>
		<conditional name="fastx" label="fasta/fastq">
			<input files>
			<input files>
		</conditional>
		
		<input name="strandness" help="strandness selection for pe">
			<option value="FR">
			<option value="RF">
		</input>
	</when>
	
	<when se>
		<conditional name="fastx" label="fasta/fastq">
			<input files>
		</conditional>
		
		<input name="strandness" help="strandness selection for se">
			<option value="F">
			<option value="R">
		</input>
	</when>
</conditional>

Then in the command you can run something like this, without hacking your values back:

 --rna-strandness {$mate_pairs.strandness}

- fixed the rna-strandness bug, now HISAT2 works correctly with paired-end stranded reads
- made the HISAT2 wrapper more consistent with the Bowtie2 wrapper
- removed parameters not listed in HISAT2 options (e.g. --ma, --dovetail)
- added some more parameters as options (e.g. no-templatelen-adjustment)
- updated HISAT2 version to 2.1.0 (and also updated the test data outputs)
- added options to output new summary format and summary file
- added test for the summary file output
- updated wrapper version
- added sections
@mblue9
Copy link
Contributor Author

mblue9 commented Sep 17, 2017

To fix this bug I've tried to make the change to how the input is requested, as suggested by @yhoogstrate (I thought that was a really good suggestion), however this was a more complicated wrapper than what I've worked on so far and I got a bit tangled up in errors for a bit.
But I managed to get myself out with the help of the Bowtie2 wrapper and in the process ended up making the HISAT2 wrapper more like the Bowtie2 one, as I found that wrapper easier to follow. I'm not sure if what I've done here is ok as I've made a lot of changes, in these new commits I've:

  • fixed the strandedness bug, now HISAT2 works correctly with paired-end stranded reads
  • made the HISAT2 wrapper more consistent with the Bowtie2 wrapper (I've changed the outputs so not sure if it might also fix this issue reported here HISAT2 in workflows: renaming output is broken #1478)
  • removed parameters not listed in HISAT2 options (e.g. --ma, --dovetail)
  • added some more parameters as options (e.g. no-templatelen-adjustment)
  • updated HISAT2 version to 2.1.0 (and also updated the test data outputs)
  • added options to output new summary format and summary file (the new summary format is compatiable with MultiQC https://github.com/ewels/MultiQC/blob/master/docs/modules/hisat2.md)
  • added test for the summary file output
  • updated wrapper version
  • added sections (I like the sections for reducing clutter but don't know if it's ok to add them like I've done here?)
  • I started adding in validators to the advanced options but then thought I should check if I should continue with this?

Now I don't know if in the process of trying to fix this bug I've made too many changes!! If you think so let me know and I can change things back.

@yhoogstrate
Copy link
Member

yhoogstrate commented Sep 18, 2017

@mblue9 I see two strand selection drop downs for the paired-end and paired-end (collection) data:

image

Is this intended?

I started adding in validators to the advanced options but then thought I should check if I should continue with this?

This is great but it doesn't need to test every (e.g. boolean) parameter because there are quite a lot. Conditionals are more important to test since they may introduce bugs if not all 's are correctly implemented. What I also prefer are small comments explaining what's being tested (like <!-- Ensure fastqsanger.gz works -->).

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 18, 2017

Thanks for the review @yhoogstrate!

The two drop downs are intended, in that both options are given by HISAT- one is for if the sequencing assay was stranded (the more common option I would say), the other is generally --fr for paired-end sequencing, but can differ for mate-pair sequencing, see here: https://gatkforums.broadinstitute.org/gatk/discussion/6327/paired-end-mate-pair

In general, paired-end reads tend to be in a FR orientation, have relatively small inserts (~300 - 500 bp), and are particularly useful for the sequencing of fragments that contain short repeat regions. Mate-pair fragments are generally in a RF conformation, contain larger inserts (~3 kb), and enable sequence coverage of genomic regions containing large structural rearrangements. Tandem reads can result from inversions and rearrangements during library preparation.

Thanks for the info on the tests! I'll take another look at what conditionals I should test and add some comments to the tests (I agree, they're helpful for quickly seeing what the tests are doing).

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 19, 2017

Hi @yhoogstrate I've now:

  • added in more validators
  • added comments to the tests to try to explain what they're doing
  • tried to make the stranded option a little clearer in the help section (having two very similar parameters is confusing!)

@yhoogstrate
Copy link
Member

Hi @mblue9, great, all the changes make me really happy! Sorry to say but I don't have time to go over everything today, but I will try to do this asap! This absolutely deserves a careful review.

one is for if the sequencing assay was stranded

I get it now and I should have seen this before shouting out loud.

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 21, 2017

😄 thanks @yhoogstrate glad they make you happy. I did make a lot of changes, including updating the test data, so it would be great to have a careful review from a few pairs of 👀

@yhoogstrate
Copy link
Member

yhoogstrate commented Sep 22, 2017

  • fixed the strandedness bug, now HISAT2 works correctly with paired-end stranded reads
    • Nice
  • made the HISAT2 wrapper more consistent with the Bowtie2 wrapper (I've changed the outputs so not sure if it might also fix this issue reported here HISAT2 in workflows: renaming output is broken #1478)
    • I am not sure but I believe the action type="metadata" is updated automatically in later versions of Galaxy. If this is indeed the case I believe all the dbkey metadata actions can be removed?
  • removed parameters not listed in HISAT2 options (e.g. --ma, --dovetail)
    • I cross-checked this in the hisat code. The argument is still applicable [https://github.com/infphilo/hisat2/blob/master/hisat2.cpp#L659] but functionality is disabled [https://github.com/infphilo/hisat2/blob/master/hisat2.cpp#L1378] so this makes sense to me
  • added some more parameters as options (e.g. no-templatelen-adjustment)
    • Good
  • updated HISAT2 version to 2.1.0 (and also updated the test data outputs)
    • This makes sense too, also the appropriate lines_diff.
  • added options to output new summary format and summary file (the new summary format is compatiable with MultiQC https://github.com/ewels/MultiQC/blob/master/docs/modules/hisat2.md) + added test for the summary file output
    • good and I really like that it is part of a selection box that is by default set to false
  • updated wrapper version
    • this is necessary indeed
  • added sections (I like the sections for reducing clutter but don't know if it's ok to add them like I've done here?)
    • I like this kind of structuring, but for example Select reference genome and Select reads could be merged together. If they are kept in a section I think they need to be expanded=True. I even wouldn't mind to take these out of sections.
  • I started adding in validators to the advanced options but then thought I should check if I should continue with this?
    • Some are not necessary (in_range while min and max are enabled) but I think they provide better interactivity as validators give a direct response (correct me if I'm wrong). No opinion on this so it's good unless somebody else objects

I added some changes myself:

  • s/[ ]$// whitespace removal at the line ends
  • paired end strandness parameter moved into the paired end macro
  • A bit more and updated text to the help section of strandness parameter
  • Moved some stuff for int_quals from label to help

I still see a small thing that could be changed in the hisat2 wrapper. The disallow softclipping+ is requesting to disallow something. Such settings are always subject to discussion, but over several courses we have learned it is often easier to reprase this to 'allow softclipping' and put it by default on yes/true. This does not exactly reflect the commandline input but is more intuitive.

You have recently suggested to update the iuc standards and added:

  • Provide a tool_data_table_conf.xml.test file, which is an uncommented version of tool_data_table_conf.xml.sample containing the path to the loc file for testing: <file path="${__HERE__}/test-data/data_table_name.loc" />
    I think this is a great oppurtinity to showcase ;) (I suppose this requires profile="17.05"?)

@nsoranzo
Copy link
Member

Some are not necessary (in_range while min and max are enabled) but I think they provide better interactivity as validators give a direct response (correct me if I'm wrong). No opinion on this so it's good unless somebody else objects

<validator type="in_range" min="0" /> and min="0" attribute added directly to the param element are completely equivalent, with the second one being more concise and probably preferred. So I'd remove the redundant in_range validators.

Provide a tool_data_table_conf.xml.test file, which is an uncommented version of tool_data_table_conf.xml.sample containing the path to the loc file for testing:
I think this is a great oppurtinity to showcase ;) (I suppose this requires profile="17.05"?)

That's been possible for a while and is not related to the tool XML syntax/interpretation (it's just for testing), so it doesn't require the new profile.

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 26, 2017

Thanks @nsoranzo and @yhoogstrate for the detailed feedback! And the edits! Especially like the XS info in the help. @yhoogstrate your other suggested changes sound good to me too so I've made them.

In this commit I've:

  • removed the no longer necessary actions=metadata dbkey from outputs, and also all the other actions from outputs (they are apparently also not necessary as the tests pass without them?)
  • removed the sections around reference genome and reads. I think that should also fix this workflow issue here: HISAT2 in workflows: renaming output is broken #1478 (although am still wondering if that's the best strategy for renaming samples or if there's a better way)
  • removed the redundant "in_range" with "min" validators
  • changed to Allow soft-clipping = true rather than Disallow softclipping = false (I agree that seems better)
  • added the tool_data_table_conf.xml.test file and created a test for using an in-built reference genome

@bgruening
Copy link
Member

Awesome stuff! Thanks a bunch @mblue9!!! And thanks for the review @yhoogstrate and @nsoranzo.

@bgruening bgruening merged commit 118b80f into galaxyproject:master Sep 26, 2017
@mblue9
Copy link
Contributor Author

mblue9 commented Sep 27, 2017

Great!! Will this new version get picked up soon by Main? As this bug will be affecting the results of people there who are using HISAT2 with paired-end stranded data.

@mblue9
Copy link
Contributor Author

mblue9 commented Sep 27, 2017

P.S. The GTN tutorials using HISAT2 will need to be updated (screenshots and text) if they're using the new version of the tool, I just created an issue for that: galaxyproject/training-material#552

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants