Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

symlink specific gridss index files #54

Merged
merged 4 commits into from
Jun 6, 2024
Merged

symlink specific gridss index files #54

merged 4 commits into from
Jun 6, 2024

Conversation

casslitch
Copy link
Contributor

genome_gridss_index folder may contain additional files that are staged by nextflow if they are part of the inputs (e.g. someone might have the .dict file in this folder if they created it via gridss.PrepareReference). This causes an error when we try to symlink the entire contents of the folder. Suggestion: symlink only specific files of interest.

genome_gridss_index folder may contain additional files that are staged by nextflow if they are part of the inputs (e.g. someone might have the .dict file in this folder if they created it via gridss.PrepareReference). This causes an error when we try to symlink the entire contents of the folder. Could we change this line to symlink the specific files of interest?
@casslitch casslitch requested a review from scwatts June 4, 2024 13:45
Copy link

github-actions bot commented Jun 4, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit d0a920f

+| ✅ 177 tests passed       |+
#| ❔   5 tests were ignored |#
!| ❗  15 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • nextflow_config - Config manifest.version should end in dev: 1.0.0
  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required
  • schema_params - Schema param panel not found from nextflow config
  • schema_params - Schema param genome_version not found from nextflow config
  • schema_params - Schema param genome_type not found from nextflow config
  • schema_params - Schema param ref_data_hmf_data_path not found from nextflow config
  • schema_params - Schema param ref_data_panel_data_path not found from nextflow config
  • schema_params - Schema param ref_data_virusbreakenddb_path not found from nextflow config
  • schema_params - Schema param ref_data_hla_slice_bed not found from nextflow config

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.14.1
  • Run at 2024-06-05 14:24:46

@scwatts
Copy link
Collaborator

scwatts commented Jun 5, 2024

Thanks for opening the PR! I don't think this affects users where the GRIDSS index was prepared with the builtin functionality but understand the current set up may have issues for externally created GRIDSS indexes that contain clashing file names with other staged files as you've pointed out.

I'm not sure what the best approach is here - only support precisely the expected GRIDSS index fileset or support all/common deviations from these expectations?

Leaning towards the former but will accommodate in the case that you feel other users may experience the same issue. If that's the case, I'd suggest one of two approaches:

  • explicitly include all index files: find -L <dir> -regex '.*\.\(amb\|ann\|pac\|gridsscache\|sa\|bwa\|img\|alt\)'
  • force overwrite any existing files/symlinks: ln -sf <args>

While the former requires strong alignment with expected files I'd probably prefer it since I find it safer than forceful replacement. And if making this change it would be good to apply to all instances where the GRIDSS index is used for consistency.

@casslitch
Copy link
Contributor Author

Hi Stephen,

Thanks for looking into this!
I agree, this won't affect users who've generated the index from the built-in functionality. My thinking was that other users might be affected if they've pre-generated the gridss index themselves.

I like the first option of explicitly including all index files.

However, based on the current module, the genome_bwa_index folder isn't staged. Therefore only the files in the genome_gridss_index folder will be picked up (I think that's just the img and gridsscache files).
Are the bwa index files required for this module? If so, I think 'path genome_bwa_index' needs to be added the inputs.

@scwatts
Copy link
Collaborator

scwatts commented Jun 5, 2024

Are the bwa index files required for this module? If so, I think 'path genome_bwa_index' needs to be added the inputs.

Ah, the reference genome indexes have been rearranged recently and the new GRIDSS index directory contains the following files:

  • GRIDSS index: *.gridsscache
  • BWA index image: *.img
  • BWA index files: *.amb, *.ann, *.pac, *.sa, *.bwt
  • ALT file (optional): *.alt
Current GRIDSS index directory for GRCh38_hmf (click to show)
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.amb
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.ann
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bwt
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gridsscache
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.img
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.pac
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.sa

Sorry for the confusion!

My thinking was that other users might be affected if they've pre-generated the gridss index themselves. I like the first option of explicitly including all index files.

Okay, let's make this change then. Can you adjust each of the symlink/find commands for the GRIDSS index directory on your PR branch according to the first option above and also change the merge PR base branch to dev? Once merged, I'll cherry-pick your commits into features-block-one-preview.

@casslitch
Copy link
Contributor Author

casslitch commented Jun 5, 2024

Thanks Stephen, sounds good!
Thanks for describing the new structure. Just to clarify, should I still add genome_bwa_index to the inputs for this module? For example, if the user has pre-generated the bwa index files and they sit in a different folder to genome_gridss_index, then they won't be staged.
I agree this won't affect users who are using the downloaded genome_gridss_index folder which contains all the necessary files.

@scwatts
Copy link
Collaborator

scwatts commented Jun 5, 2024

No, the user will need to place all the BWA index files under the same GRIDSS index directory themselves. I think this is okay since (1) it is required to create the GRIDSS index files anyway, and (2) the BWA index files are not used anywhere else in the workflow.

Once the changes have been made, let's review and test!

Accomodates users using externally created GRIDSS indexes that may contain clashing file names.
@casslitch casslitch changed the base branch from features-block-one-preview to dev June 5, 2024 09:34
@casslitch
Copy link
Contributor Author

Thanks Stephen for explaining, agree with that logic!

@scwatts
Copy link
Collaborator

scwatts commented Jun 5, 2024

I'm going to rebase your commits on top of dev and then force push - you may need to delete your local branch and fetch it again.

I noticed I made a typo in the find command for 'bwa', that should be 'bwt' instead. If you can update that I'll test these changes tomorrow.

@scwatts scwatts force-pushed the casslitch-patch-1 branch from 6a26bee to fb49fd0 Compare June 5, 2024 11:52
@scwatts
Copy link
Collaborator

scwatts commented Jun 5, 2024

The 'Run pipeline stubs' check is showing that the new back-slashes need to be escaped:

find -L ${genome_gridss_index} -regex '.*\\.\\(amb\\|ann\\|pac\\|gridsscache\\|sa\\|bwt\\|img\\|alt\\)'

Escaped back-slashes.
@scwatts
Copy link
Collaborator

scwatts commented Jun 6, 2024

Okay, changes now look good.

I reproduced the issue with the dev branch using simulated WGS t/n data then successfully tested these changes for the GRCh38_hmf reference genome with both the default GRIDSS index directory and a 'custom' one that contains .{fai,dict} files.

@casslitch casslitch merged commit b5c1daf into dev Jun 6, 2024
4 checks passed
@casslitch casslitch deleted the casslitch-patch-1 branch June 6, 2024 06:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants