Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiQC: Update + fixing issues #1821

Merged
merged 12 commits into from
Apr 20, 2018
Merged

Conversation

bebatut
Copy link
Member

@bebatut bebatut commented Apr 8, 2018

Solved with this PR:

Not solved with this PR:

  • Support list of pairs for FastQC output (MultiQC for list of pairs (of FastQC output) #1658): technical limitation
  • Fix samtools flagstat weird output (MultiQC - samtools flagstat weird output #1608): it seems to be a MultiQC error
  • Use "element_identifier" as "sample name" for all tools (MultiQC - Use "element_identifier" as "sample name" for all tools #1595)
    • Tools properly using element_identifier as sample name
      • Bamtools
      • Bismark
      • Bowtie 2
      • BUSCO
      • GATK (BaseRecalibrator or VariantEval output)
      • HISAT2
      • HTSeq
      • Samtools
      • SnpEff
      • STAR
      • TopHat2: Sample name = element_identifier + align
    • Tools using some file content as sample name
      • Bcftools: Sample name = 3rd column of the line starting byID (https://github.com/ewels/MultiQC/blob/v1.5/multiqc/modules/bcftools/stats.py#L41)
      • Cutadapt/Trim Galore!: Sample name = last name of line starting with Command line (https://github.com/ewels/MultiQC/blob/v1.5/multiqc/modules/cutadapt/cutadapt.py#L104)
      • deepTools: Sample names = element_identifier(s) on which deepTools has been run
      • FastQC: Sample name = 2nd column of line starting with Filename
        Unknown status (the status of element_identifier parsing for these tools is unverified)
      • featureCounts: Sample names = column names with the element_identifier(s) on which featureCounts has been run
      • HiCExplorer: Sample name = concatenation of the path + 2nd column of the line starting with File
      • Flexbar: Sample name = 2nd column of line starting with Filename
      • Kallisto: Sample name = 2nd column of line starting with Run File (looks like a log output needs to be added to the kallisto wrapper to work with multiqc, see here log file from Kallisto  MultiQC/MultiQC#440)
      • MACS2: multiqc can work with macs2 .xls, could try to add to multiqc wrapper, see here: MACS2 reports MultiQC/MultiQC#632
      • Picard: Sample name = value after INPUT on the line starting with # picard.analysis
      • Prokka: Sample name = what is after the first 2 words on the line starting with organism
      • QUAST: Sample names = column names with the element_identifier(s) on which QUAST has been run
      • RSeQC: Gene Body Coverage uses element_identifier the other modules (xmls) don't
      • Salmon: multiqc can work with salmon, user request here: https://biostar.usegalaxy.org/p/27537/#27547
      • Samblaster
      • SortMeRNA: Sample name = extracted from the command line in the input file
      • Trimmomatic: Sample name = extracted from the command line in the input file
      • VCFTools

bebatut added 6 commits April 6, 2018 17:25
- Update to version 1.5
- Add new supported tools
- Escape identifiers
- Use os.path.join to generate path
- Add more token as shared code
@lparsons
Copy link
Contributor

lparsons commented Apr 9, 2018

@bebatut I can't replicate #1624 anymore, so I guess I'd say it's "fixed". Conda dependency resolution is black magic to me, and I suspect that something changed with an underlying package (and not with MultiQC packages) that both broke and then subsequently fixed things... 🤷‍♂️

At any rate, this is some excellent work! Thanks.

@nekrut
Copy link
Contributor

nekrut commented Apr 9, 2018

See also #1805

@bebatut
Copy link
Member Author

bebatut commented Apr 9, 2018

@nekrut I added the SnpEff module too and some tests 😄

@bebatut
Copy link
Member Author

bebatut commented Apr 9, 2018

Thanks @lparsons for the check!

@bebatut bebatut changed the title [WIP] MultiQC: Update + fixing issues MultiQC: Update + fixing issues Apr 13, 2018
@bebatut
Copy link
Member Author

bebatut commented Apr 13, 2018

I think this PR is ready for review 😄

@lparsons
Copy link
Contributor

Fantastic work @bebatut! One thought I had was that for tools that put the input filename into the output (e.g. BCFTools, cutadapt, etc.) perhaps it would be a good "best practice" to first link the input files using the element_identifier. That would be helpful not only when MultiQC parses the files, but when the user is looking at them too. @bgruening any thoughts?

@bgruening
Copy link
Member

@lparsons you mean linking inside the cutadapt wrapper?

@lparsons
Copy link
Contributor

@bgruening Yes, within the wrapper of tools that include the input filenames as part of the output, link the input files first to a name that includes the element_identifier so they output files look a bit nicer and are more interpretable. We already do that for some tools, but we could include it as "best practice".

@bgruening
Copy link
Member

I feel a little bit if this is really needed and a best practice it should be part of the job staging and Galaxy should provide a variable that we can simply use. So Galaxy does the linking for us and a tool can use it.

We should move this discussion to an other issue I guess. If you are fine with the PR please approve and/or merge it.

Copy link
Contributor

@lparsons lparsons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

@mblue9
Copy link
Contributor

mblue9 commented Apr 14, 2018

Thanks a lot for working on this @bebatut I'm excited to use it!
I tested a few of the modules with collections and they seem to work well e.g. fastqc, hisat2, featurecounts so looks good to me 👍
@lparsons I'll try to get the cutadapt wrapper working properly with element_identifier, as you're right, would be good to have that, as otherwise cutadapt can't be easily used with multiqc (samples get called meaningless names e.g. "input_f"). PR for that is here: #1828

Copy link
Contributor

@mblue9 mblue9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested adding some text into the "Report title" field and if I do that, I get an empty "0 bytes" html output (tested with a picard markduplicates output), if I add the same text to the "Custom comment" the report is correctly produced with the comment.

@@ -647,6 +769,9 @@ sp:
</when>
</conditional>
</repeat>
<param name="title" type="text" value="" optional="true" label="Report title" help="It is rinted as page header"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/rinted/printed/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -411,15 +511,18 @@ sp:
<!--<option value="salmon">Salmon</option>-->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the module names from RSeQC here? "(bam_stat, gene_body_coverage, infer_experiment, ...)"
as why list them for rseqc and then not picard etc? I don't think they're needed here imho.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done... Thanks

@bebatut
Copy link
Member Author

bebatut commented Apr 16, 2018

@mblue9 Thanks for having tested the title and comment. I fixed the issue (change in the generated file names) and added some tests

@mblue9
Copy link
Contributor

mblue9 commented Apr 17, 2018

Thanks a lot @bebatut ! LGTM Will I merge it?

@joachimwolff
Copy link
Contributor

HiCExplorer does not work if you use files with identical name in Galaxy.
Error message:
Fatal error: Exit code 1 () ln: failed to create symbolic link 'multiqc_WDir/hicexplorer_0/Galaxy10-__ob__hicBuildMatrix_raw_QC__cb___txt': File exists

Just upload the following file twice.
Galaxy10-[hicBuildMatrix_raw_QC].txt

@bebatut
Copy link
Member Author

bebatut commented Apr 17, 2018

This isuse is probably not limited to HiCExplorer module. We probably need to fix that for all modules by adding a sub folder for each file (if we want to keep the file identifier intact)

@mblue9
Copy link
Contributor

mblue9 commented Apr 18, 2018

HiCExplorer does not work if you use files with identical name in Galaxy.

Just to check as am curious, are there use cases for analysing files with identical names? As that doesn't seem like a good idea to me, wouldn't it be better to recommend using unique file names? (saying that I do have the current situation where macs gives the same name to all files if using a control with collections which is not good imo)

@joachimwolff
Copy link
Contributor

joachimwolff commented Apr 18, 2018

I appreciate that this is a special case. It happens e.g. if you run a workflow with multiple input (Batch processing) and all QC reports are collected in the end, if they have an identical Galaxy name, MultiQC crashes.

It is quite easy to fix it, just add an unique prefix/suffix to each file and we are more stable.

@lparsons
Copy link
Contributor

@joachimwolff My concern would be ensuring that the sample identifiers remain the unadulterated element_identifier across all tools. I'd be more in favor of some sort of check in the tool xml for uniqueness first. Not sure if that's doable with existing validators, but perhaps adding a new validator or falling back to coding it into the wrapper.

@bgruening
Copy link
Member

@joachimwolff that is a bigger issue as we want to preserver the element_identifier.
@lparsons yes this is doable, I discussed this with @bebatut already.

As this was "broken" before and currently is I don't consider this as blocking and will merge now as it contains so many other useful fixes.

Thanks @bebatut!

@bgruening bgruening merged commit 13875b1 into galaxyproject:master Apr 20, 2018
@lparsons
Copy link
Contributor

@bebatut Fantastic work on MultiQC, I find it a very useful tool. I was wondering if you had any insight as to why Salmon is commented out. I was looking into enabling it in the Galaxy wrapper but thought I'd see if there was already some work done... Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants