Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor GATKTool so that more tools can comfortably extend it directly #4341

Open
droazen opened this issue Feb 5, 2018 · 7 comments
Open

Comments

@droazen
Copy link
Contributor

droazen commented Feb 5, 2018

As discussed in #2471 (comment), we need to refactor GATKTool so that all non-Spark tools can comfortably extend it rather than extending CommandLineProgram directly, as some tools currently do. In particular, we need to:

  • Provide a mechanism for subclasses to selectively disable engine-wide arguments such as -I completely (and also the ability to override with their own version of an argument).

  • Access necessary datasources outside of the engine package.

  • Add the ability to register input metadata such as sequence dictionaries, so that standard validation rules can be enforced across the toolkit.

  • Add the ability for each tool to change the defaults for engine arguments such as --interval-merging-rule

@droazen droazen self-assigned this Feb 5, 2018
@droazen droazen added the Engine label Feb 5, 2018
@droazen droazen added this to the Engine-1Q2018 milestone Feb 5, 2018
@droazen
Copy link
Contributor Author

droazen commented Feb 5, 2018

@samuelklee Anything else you'd add to this list?

@samuelklee
Copy link
Contributor

samuelklee commented Feb 5, 2018

This looks like a good start! I'm not quite sure how you are planning to handle dictionary validation, specifically, but you can take a look at the CNV plotting tools (PlotDenoisedCopyRatios and PlotModeledSegments) to see what level of validation we currently do. We can discuss further in person if you like.

(Also, note that those tools take a sequence dictionary as an input to specify which contigs should be plotted; typically, this will be a subset of the full dictionary that excludes alt contigs, etc. Requiring this sequence-dictionary input is somewhat vestigial; previous versions of the pipeline did not include dictionaries in the headers of all CNV data files. Part of making these tools into GATKTools could include switching over to -L to specify regions for plotting.)

Finally, are the changes to -imr mentioned in #2471 going to be addressed in a separate issue?

@droazen
Copy link
Contributor Author

droazen commented Feb 5, 2018

@samuelklee For -imr, you just need a way to override the default setting, correct? Or is it more than that? I've added a bullet point to the list above that captures what I think you need, but feel free to supplement it with additional requirements.

@samuelklee
Copy link
Contributor

Overriding the defaults would get us most of the way there. Right now, we perform the following check on the IAC and fail if the defaults aren't changed to values that the CNV tools require, which is awkward:

   /**
     * Validate that the interval-argument collection parameters minimally modify the input intervals.
     */
    public static void validateIntervalArgumentCollection(final IntervalArgumentCollection intervalArgumentCollection) {
        Utils.validateArg(intervalArgumentCollection.getIntervalSetRule() == IntervalSetRule.UNION,
                "Interval set rule must be set to UNION.");
        Utils.validateArg(intervalArgumentCollection.getIntervalExclusionPadding() == 0,
                "Interval exclusion padding must be set to 0.");
        Utils.validateArg(intervalArgumentCollection.getIntervalPadding() == 0,
                "Interval padding must be set to 0.");
        Utils.validateArg(intervalArgumentCollection.getIntervalMergingRule() == IntervalMergingRule.OVERLAPPING_ONLY,
                "Interval merging rule must be set to OVERLAPPING_ONLY.");
    }

If we override defaults, we'd still perform the check to make sure the user didn't muck with them, but it'd still be nicer than forcing the user to change the original defaults on their own.

However, there are still two more awkward points: 1) there is no value for -interval-set-rule that does nothing to the incoming intervals (you must either union or intersect), and 2) we have to specify our own padding argument in PreprocessIntervals that is distinct from -interval-padding, since we want to implement our own padding. The first can be easily addressed by adding an option to do nothing to the intervals; I'm not so sure what the best way to handle the second would be, so we can punt on it if it'd be more work than it's worth---these are relatively minor pain points, in the end.

@samuelklee
Copy link
Contributor

Note #4439, which concerns a Picard tool that might also need options for interval merging exposed. Just something to be aware of---I'm guessing that it's probably a bit ambitious to have identical options for all interval inputs to both Picard and GATK tools?

@droazen
Copy link
Contributor Author

droazen commented Apr 16, 2019

One part of this ticket is done: #4964 added accessors that allow direct descendants of GATKTool to directly access engine datasources, while still forbidding direct access for tools that extend a Walker base class (except for Walker types living in the engine package, which still have access).

@samuelklee
Copy link
Contributor

Not sure if there's a more relevant open issue, but just making note of #6924 here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants