New NCBI to AnnData tutorial #4480

Merged: 64 commits, Dec 13, 2023


hexhowells
Collaborator

New tutorial that takes raw NCBI data and processes it into an AnnData object with metadata annotations. It still requires some proofreading and minor updates!

@shiltemann shiltemann marked this pull request as ready for review November 9, 2023 10:42
@shiltemann shiltemann requested a review from a team as a code owner November 9, 2023 10:42
@shiltemann
Member

@hexhowells awesome, thanks a lot!

Could you put a copy of the input data on Zenodo? We have automation that populates a Galaxy Data Library with tutorial data from Zenodo. This way, learners who are dealing with bad internet connections or speeds can import the data directly from the data library, circumventing their own networks.

@hexhowells
Collaborator Author

I've made a Zenodo record and added a link to it in the tutorial. It requires review before being published but should be good after that!

Collaborator

@mtekman mtekman left a comment

Looks good! I just had some small word changes.

Question: do the users really need to manually download the data first, or is that part an optional step in the tutorial?

@hexhowells
Collaborator Author

Downloading the data manually is an optional step. However, being able to find and download the relevant data is part of the tutorial, for those who may not be familiar with the process.

@shiltemann
Member

@hexhowells agreed, still important to show the standard process. The Zenodo copy can be the fallback option.

You should also be able to import the tar file from NCBI/GEO directly into Galaxy via URL, then use Galaxy's "Unzip" tool to unpack it into a collection with all the individual files.

@hexhowells
Collaborator Author

@shiltemann I was initially going to use the Unzip tool in Galaxy but was getting errors when trying to unzip the .GZ files. I'm not entirely sure what is causing the issue.
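
For anyone who wants to check the archive outside Galaxy, the unpacking step can be approximated locally. This is only a minimal sketch in plain Python, not what Galaxy's Unzip tool does internally, and it does not diagnose the error mentioned above. The function name `unpack_geo_tar` is hypothetical, and it assumes the download is a tar archive whose members are individually gzip-compressed text files.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_geo_tar(tar_path, out_dir):
    """Extract regular files from a tar archive, gunzipping any *.gz member.

    Hypothetical helper for illustration; returns the sorted output file names.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            # Keep only the base name so members cannot write outside out_dir.
            name = Path(member.name).name
            src = tar.extractfile(member)
            if name.endswith(".gz"):
                target = out_dir / name[:-3]  # drop the ".gz" suffix
                with gzip.open(src, "rb") as gz, open(target, "wb") as dst:
                    shutil.copyfileobj(gz, dst)
            else:
                target = out_dir / name
                with open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
            written.append(target.name)
    return sorted(written)
```

If a member fails to decompress here as well, that would suggest the problem lies with the file rather than with the Galaxy tool.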

Member

@pavanvidem pavanvidem left a comment

A very useful tutorial! Thanks @hexhowells !!

Comment on lines 128 to 137
GSM5353214_PA_AUG_PB_1A_S1.dge.txt
GSM5353215_PA_AUG_PB_1B_S2.dge.txt
GSM5353216_PA_PB1A_Pool_1_3_S50_L002_dge.txt
GSM5353217_PA_PB1A_Pool_2_S107_L004_dge.txt
GSM5353218_PA_PB1B_Pool_1_2_S74_L003_dge.txt
GSM5353219_PA_PB1B_Pool_2_S24_L001_dge.txt
GSM5353220_PA_PB1B_Pool_3_S51_L002_dge.txt
GSM5353221_PA_PB2A_Pool_1_3_S25_L001_dge.txt
GSM5353222_PA_PB2B_Pool_1_3_S52_L002_dge.txt
GSM5353223_PA_PB2B_Pool_2_S26_L001_dge.txt
Member

If we keep the manual download, then please move this step under the "Obtaining the Data" section, so the order is as follows: get the data into Galaxy (either by manual download or from Zenodo), then look at the metadata, and finally add tags.

Collaborator Author

This was added later because there are 53 raw files but only 10 are needed for the tutorial. The process of figuring out which files are needed happens in the "Understanding the Data" section. I'll see if I can reorganise it so that it makes more sense.
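
As an illustration of that selection step, the ten files can be picked out of a larger listing by their GSM accession prefix. This is a sketch only: it assumes the ten needed samples are exactly the ones listed in the commented lines (GSM5353214 through GSM5353223), and `NEEDED_ACCESSIONS` and `select_needed_files` are hypothetical names, not part of the tutorial.

```python
import re

# Assumption: the ten files quoted in the review comment are the needed ones,
# i.e. accessions GSM5353214 through GSM5353223 inclusive.
NEEDED_ACCESSIONS = {f"GSM{n}" for n in range(5353214, 5353224)}


def select_needed_files(filenames, needed=NEEDED_ACCESSIONS):
    """Return the file names whose leading GSM accession is in `needed`."""
    keep = []
    for name in filenames:
        match = re.match(r"(GSM\d+)_", name)
        if match and match.group(1) in needed:
            keep.append(name)
    return keep
```

Calling `select_needed_files` on the full list of extracted file names would keep only entries such as `GSM5353214_PA_AUG_PB_1A_S1.dge.txt` and drop everything else.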

> - *"Find pattern"*: `batch`
> - *"Replace with"*: `replicate`
>
> 2. {% tool [Cut](Cut1) %} with the following parameters:
Collaborator

I suspect you could speed this section up by removing the 'Cut' option and using the Multi-Join (combine multiple files) (Galaxy Version 1.1.1) instead, using the initial barcode column as the key, and THEN cutting once to remove it.
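
The idea of joining on the barcode first and cutting only once can be sketched in plain Python. This is not the Galaxy Multi-Join tool and makes no claim about its join semantics; the sketch assumes an inner join, tables given as lists of rows, and the barcode in the first column. The function names and the example values are hypothetical.

```python
def multi_join(tables, key_col=0):
    """Inner-join several row lists on a shared key column.

    The key is kept once, in the first position of each joined row.
    """
    # Build one lookup per additional table: key -> that row without its key.
    lookups = [
        {row[key_col]: row[:key_col] + row[key_col + 1:] for row in table}
        for table in tables[1:]
    ]
    joined = []
    for row in tables[0]:
        key = row[key_col]
        if all(key in lookup for lookup in lookups):
            merged = [key] + row[:key_col] + row[key_col + 1:]
            for lookup in lookups:
                merged.extend(lookup[key])
            joined.append(merged)
    return joined


def cut_first_column(rows):
    """Drop the key column once, after all tables have been joined."""
    return [row[1:] for row in rows]


# Hypothetical per-barcode metadata tables, keyed on the barcode column.
batch = [["AAAC", "0"], ["AAAG", "1"]]
replicate = [["AAAC", "PB1A"], ["AAAG", "PB1B"]]

joined = multi_join([batch, replicate])
metadata = cut_first_column(joined)
```

Joining first means the barcode column is removed in a single step at the end rather than being cut from each table separately.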


Lets now add the replicate column which tells us which rows are part of pools of the same patient and tumor location.

> <hands-on-title>Create replicate metadata</hands-on-title>
Collaborator

A few screenshots showing where you got this data from in the spreadsheet would be good. If you needed lines from the paper or the supplemental data/figures/text to help you decode what the hell was happening, screenshots of those would be good too. It's about walking the user through how you were able to figure this out.

Collaborator

@mtekman mtekman left a comment

Nice tutorial. I think it just gets a bit long with the commands; maybe reward the user with some kind of image, info box, or something else that gives them the satisfaction that the large parameter tool they just executed did something big.

>
{: .hands_on}

We will now add a column to indicate which sample each row came from using the sample ID's described earlier.
Collaborator

Such a big command should be rewarded with some kind of image or screenshot or something to engage the user again.

> 3. **Rename** {% icon galaxy-pencil %} output `Specimen Metadata`
>
{: .hands_on}

Collaborator

Same comment as above

@pavanvidem
Member

Workflow and tests are missing. @hexhowells do you have a workflow ready?

@hexhowells
Collaborator Author

@pavanvidem Yes, the tutorial was made from this workflow I built: https://usegalaxy.eu/u/hexhowells/h/ncbi

Collaborator

nice addition!

@pavanvidem
Member

@hexhowells I guess moving the tip snippet out of the "Tag your datasets" tip should fix the linting error.

Member

@pavanvidem pavanvidem left a comment

that hopefully fixes the linting error

Member

@pavanvidem pavanvidem left a comment

Fixed the links to the first 2 files, tested the tutorial, and added links to an example history and the workflow. Ready to merge! Thanks @hexhowells for this great tutorial!

@bgruening bgruening merged commit 9db8751 into galaxyproject:main Dec 13, 2023
3 checks passed
@bgruening
Member

🎉 !!!
