Update STAR version + new options for STARsolo #5060

Merged
merged 36 commits into from
Feb 17, 2023

lldelisle
Contributor

FOR CONTRIBUTOR:

  • - I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • - License permits unrestricted use (educational + commercial)
  • - This PR adds a new tool or tool collection
  • - This PR updates an existing tool or tool collection
  • - This PR does something else (explain below)

@lldelisle
Contributor Author

@mtekman , @wm75, @JasperO98 , @pavanvidem, @ieguinoa
If you are interested in reviewing...
Thanks

@lldelisle
Contributor Author

Just to be clear:

  • I did not update the BAM files for the STAR tests (the sizes matched and the tests passed).
  • When outSAMattributes CB or UB is used, the way the BAM is sorted changes, but I did not expose this to the user.

Contributor

@mtekman mtekman left a comment

Looks good on my side

@pavanvidem
Member

pavanvidem commented Jan 17, 2023

@lldelisle I can look at it in the evening.

I have the following features/params/UI rearrangements on my TODO list for STARsolo.

  • Smart-seq mapping sometimes fails with
EXITING because of fatal error: buffer size for SJ output is too small
Solution: increase input parameter --limitOutSJcollapsed

This can be fixed by adding a --limitOutSJcollapsed param.

  • Allow compatibility with CEL-seq. The main difference from 10x that I see is that we do not have a barcode whitelist. Instead, in CEL-seq we have the lengths of the barcodes and UMIs. Check this issue for an example call. We can already build that command with the custom chemistry options in our current wrapper, but we have to make --soloCBwhitelist optional and adapt the Cheetah template.
  • When filtering is turned on, we should additionally output the non-filtered data. At the moment, only the filtered matrix is output. This does not allow users to try other filtering options, for example with DropletUtils.
  • Organize the UI better. Some options are not compatible with all protocols. For example, I am not sure whether --soloUMIfiltering is used for Smart-seq.

Do you think it is reasonable to add (some of) them to this PR?

@lldelisle
Contributor Author

@pavanvidem sure.
Do you want to write it or do you want me to write it?

@lldelisle
Contributor Author

I just realized that the Chemistry parameter proposes "Cell Ranger v2", but Cell Ranger is the name of the software; I think it should be "Chromium chemistry v2"...

@pavanvidem
Member

@pavanvidem sure. Do you want to write it or do you want me to write it?

I will work on it. If you wish please contribute :)

@pavanvidem
Member

I just realized that the Chemistry parameter proposes "Cell Ranger v2", but Cell Ranger is the name of the software; I think it should be "Chromium chemistry v2"...

Good catch, that is correct.

@lldelisle
Contributor Author

@pavanvidem , here is what I have in mind, tell me if this would work for you:

  • For --limitOutSJcollapsed, I can just copy the limits block from STAR to STARsolo.
  • CEL-seq, I see 2 possibilities:
    • We add a new solo_type "CEL-seq" which would use:
    --soloType CB_UMI_Simple 
    --soloCBstart 7
    --soloCBlen 6  
    --soloUMIstart 1  
    --soloUMIlen 6  
    --soloBarcodeMate 2 
    --soloCBwhitelist None
    --clip5pNbases 0 36
    
    • Or we simply set soloCBwhitelist optional in CB_UMI_Simple, this is more flexible.
  • For the Counts/UMI outputs, I think that in most cases people don't really want to see the counts, so I propose that we add a parameter to make them appear.
  • For --soloUMIfiltering you are right, this is only compatible with --soloUMIdedup 1MM_CR. I can make a conditional.
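
A minimal sketch of the second CEL-seq option (optional whitelist in CB_UMI_Simple). The helper `build_solo_args` is hypothetical and written in Python only for illustration; the actual wrapper builds the command in a Cheetah template. The values are the ones from the example call above.

```python
from typing import List, Optional


def build_solo_args(cb_start: int, cb_len: int, umi_start: int, umi_len: int,
                    whitelist: Optional[str] = None) -> List[str]:
    """Assemble STARsolo barcode arguments (hypothetical helper, not wrapper code)."""
    args = [
        "--soloType", "CB_UMI_Simple",
        "--soloCBstart", str(cb_start),
        "--soloCBlen", str(cb_len),
        "--soloUMIstart", str(umi_start),
        "--soloUMIlen", str(umi_len),
    ]
    # Without a whitelist file, pass the literal string "None",
    # as in the CEL-seq example call quoted above.
    args += ["--soloCBwhitelist", whitelist if whitelist else "None"]
    return args


# CEL-seq: no whitelist, barcode and UMI positions taken from the example above.
celseq_args = build_solo_args(cb_start=7, cb_len=6, umi_start=1, umi_len=6)
print(" ".join(celseq_args))
```

Keeping the barcode and UMI positions as parameters is what makes this option more flexible than a fixed "CEL-seq" solo_type.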

@pavanvidem
Member

  • True, there are already Limits in STAR. Maybe not all limits are useful, but it doesn't hurt to add them too. Then we can move the Limits block to macros.
  • I like the 2nd option. With the first option, we're not flexible in choosing barcode and UMI lengths.
  • Then, we can add a check box to make the unfiltered results appear.
  • Perfect

FYI, I just started working on the above options sequentially.

@lldelisle
Contributor Author

Perfect, tell me if you need help.

make `--soloCBwhitelist` optional and add limits macro for Smart-Seq
@bernt-matthias
Contributor

@lldelisle
Contributor Author

Good catch @bernt-matthias , thanks. Can you target your new PR to my branch?
@pavanvidem where are you with your implementation?

@pavanvidem
Member

Good catch @bernt-matthias , thanks. Can you target your new PR to my branch? @pavanvidem where are you with your implementation?

Last week I created a patch on your fork: lldelisle#2

@bernt-matthias
Contributor

I just pushed 033cd4a, hope that is fine.

plus:

- removed a few redundant name attribs
- increases profile
Member

@pavanvidem pavanvidem left a comment

@lldelisle Fantastic! I have a final comment on cell filtering. Does it make sense to allow droplet-based filtering for Smart-Seq? Technically it is possible. But do people use this option? Maybe I'm trying to optimize too much. I'm already very happy with the current state of the wrapper and it can be merged. Thanks a lot for your efforts!!

@lldelisle
Contributor Author

Maybe you are right, it does not make a lot of sense, but changing it would mean putting the conditional block filter inside the conditional block sc instead of the advanced settings. I don't mind doing it, but I feel it is OK to keep all options even if the user selected 'Smart-seq'. Tell me what you prefer.

@pavanvidem
Member

As I said, it is all good and it does not break anything. Ready to merge from my side.

@lldelisle
Contributor Author

@bgruening I think @pavanvidem is not a 'member' and his approval does not allow me to merge the PR. Could you approve it, or would you prefer a second review?

Contributor

@wm75 wm75 left a comment

Just a bit of (important) housekeeping.
For everything else I trust you and other reviewers :)

tools/rgrnastar/macros.xml
tools/rgrnastar/rg_rnaStar.xml (outdated)
tools/rgrnastar/rg_rnaStarSolo.xml (outdated)
@@ -5,7 +5,8 @@
the index versions in sync, but you should manually adjust the +galaxy
version number. -->
Contributor

<tool id="rna_star_index_builder_data_manager" name="rnastar index versioned" tool_type="manage_data" version="@IDX_VERSION@" profile="19.05">

is what this comment refers to.
This PR should, because of the linked macros file, trigger deployment of a new version of the DM, too, so you need to bump the DM version to version="@IDX_VERSION@+galaxy1"

Contributor Author

You mean we should change the version because we changed STAR version or because of something else?

Contributor

I guess bumping the IDX_VERSION_SUFFIX does not hurt, but I do not understand why we should do it?

Contributor Author

Maybe to indicate that we use the STAR version @TOOL_VERSION@ instead of the STAR version @IDX_VERSION@ to build the index...

Contributor

@bernt-matthias @lldelisle Just to explain things: the reason for symlinking the macros file is to keep the IDX_VERSION in one place only so that when you update the tool wrapper to a STAR version that requires a newer index format, you'd automatically deploy also a DM that can create these indexes.

The "downside" is that any changes to the tool wrapper macros file will silently affect the DM. So in this case the next version of the DM will use the 2.7.10b version of star for building indexes. These should be identical to ones built with older versions, but it's good to bump the DM wrapper version to be able to trace things back.
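
To illustrate the coupling: a minimal sketch, assuming a simplified stand-in for token substitution (the function `expand_tokens` and the version value are hypothetical, not Galaxy code), of how the shared IDX_VERSION token feeds the DM version string and what the +galaxy1 bump changes.

```python
def expand_tokens(template: str, tokens: dict) -> str:
    """Replace @NAME@ placeholders with their values (simplified stand-in, not Galaxy code)."""
    for name, value in tokens.items():
        template = template.replace(f"@{name}@", value)
    return template


# Hypothetical token value; the real one lives in the shared macros file.
shared_tokens = {"IDX_VERSION": "1.2.3"}

# DM version attribute before and after the suggested bump.
before_bump = expand_tokens("@IDX_VERSION@", shared_tokens)
after_bump = expand_tokens("@IDX_VERSION@+galaxy1", shared_tokens)
print(before_bump, "->", after_bump)
```

Because the index format version itself is unchanged, only the suffix distinguishes the DM release that builds indexes with the newer STAR binary.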

Contributor

Seems legit.

Could a more expressive filter help here? For instance, we could just store the STAR version that was used to create an entry in the datatable, and then just filter datatable entries for a min (or max) required STAR version.
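
A rough sketch of what such a filter could look like, written in Python for illustration only. The `star_version` column, the entries, and both function names are assumptions; nothing like this exists in the current datatable.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_star_version(version: str) -> Tuple[int, int, int, str]:
    """Split a STAR version such as '2.7.10b' into comparable parts."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)([a-z]?)", version)
    if match is None:
        raise ValueError(f"unrecognized STAR version: {version}")
    major, minor, patch, letter = match.groups()
    return int(major), int(minor), int(patch), letter


def filter_indexes(entries: List[Dict[str, str]],
                   min_version: Optional[str] = None,
                   max_version: Optional[str] = None) -> List[Dict[str, str]]:
    """Keep entries whose (hypothetical) star_version column lies within the bounds."""
    kept = []
    for entry in entries:
        version = parse_star_version(entry["star_version"])
        if min_version is not None and version < parse_star_version(min_version):
            continue
        if max_version is not None and version > parse_star_version(max_version):
            continue
        kept.append(entry)
    return kept


# Made-up datatable entries for illustration.
entries = [
    {"value": "idx_a", "star_version": "2.7.4a"},
    {"value": "idx_b", "star_version": "2.7.10b"},
]
recent = filter_indexes(entries, min_version="2.7.5a")
print([entry["value"] for entry in recent])
```

Versions are parsed into tuples because a plain string comparison would sort 2.7.10b before 2.7.4a.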

Contributor Author

I don't think it is easy to add a new column to a table; this would require changing the table (which is what happened when we added a new column for the 'genomeVersion')...

Contributor

Yes, that would require a new table. I'm also not so sure whether that would improve the situation much.
min and max version checks in tool wrappers also need quite some discipline to maintain, and the max check in particular doesn't work backwards, i.e., at the time of writing a tool wrapper version the max value is typically unknown still so there's always at least one wrapper version that will display all newer index versions.
What would be comparably easy to do is to remove the symlink and have the DM use its own macro, which then needs to be maintained separately, but would maybe come with fewer surprises.

Anyway, I don't think this should hold back this PR any longer. If we want to decouple the DM from the tool wrapper, we should do it in its own PR where the decision will be more discoverable than as part of a giant PR like this one.

Contributor Author

I agree.

@lldelisle
Contributor Author

Let's see if I did not break everything...

Contributor

@wm75 wm75 left a comment

Thanks for the fixes, looks great to me now!

@lldelisle
Contributor Author

Youhou! Who wants to click or find something to change?

@wm75
Contributor

wm75 commented Feb 17, 2023

Is merging OK with you, @bernt-matthias?

@bernt-matthias bernt-matthias merged commit ae6b59a into galaxyproject:main Feb 17, 2023
@bernt-matthias
Contributor

Sure. Thanks @lldelisle

@lldelisle lldelisle deleted the starsolo_update branch February 17, 2023 08:30
@lldelisle
Contributor Author

🎉 this was a HUGE PR.

@lldelisle
Contributor Author

A great collaboration. I love you galaxy people.

@lldelisle
Contributor Author

This is probably worth a blog post...

@pavanvidem
Member

Let's see if I did not break everything...

The most important thing is that this was huge progress. Thanks! I'm using this tool very frequently for 10x and Smart-seq data. I also have some iCell (needs the optional barcode whitelist param), CEL-seq and large Smart-seq (needs limit params) data in the queue. I will be able to tell in the coming weeks if we broke something ;-)

@bgruening
Member

This is probably worth a blog post...

YEAH! 💯

@bernt-matthias
Contributor

Have we fixed all those issues #5060 (comment)? If so, we can close them.

@bernt-matthias
Contributor

Hmm. Can't find the workflow run for the merge. Can someone double check?

@bernt-matthias
Contributor

At least the updates have not arrived at the TS yet.

@bernt-matthias
Contributor

Arg, the commit message contained [no ci] ...

@bgruening
Member

I'll take care of it. Thanks @bernt-matthias

@lldelisle
Contributor Author

Sorry, I just found a bug, see #5144

@lldelisle
Contributor Author

In fact 2 bugs... 😔 I opened a PR with a fix in #5145

6 participants