How do we pull data from NCBI GEO? #5

Open
jjbivona opened this issue Dec 2, 2020 · 4 comments

Comments


jjbivona commented Dec 2, 2020

Is it similar to using the wget function from Zenodo?

Collaborator

cgpu commented Dec 2, 2020

@jjbivona You can use wget if you know exactly which dataset you are interested in. Here is an example:

Let's say you want to access data from this project, which you found by browsing NCBI GEO:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68849

On this page you can find the FTP links (they work much like HTTP links).
Here is the wget command I use to retrieve a file from this NCBI GEO dataset:

# in the terminal window type:
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE68nnn/GSE68849/suppl/GSE68849_non-normalized.txt.gz
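If you need several files, it may help to see how the FTP path is put together. The sketch below rebuilds the URL above from the accession and the file name. The helper name `geo_suppl_url` is made up for this example, and the rule that the series folder is the accession with its last three digits replaced by `nnn` is an assumption read off the URL above; it is only meant for accessions with at least four digits.

```shell
#!/bin/sh
# Hypothetical helper (not part of any NCBI tool): build the FTP URL of a
# supplementary file from a GEO series accession and a file name.
# Assumption: the series directory is the accession with its last three
# digits replaced by "nnn", as in GSE68849 -> GSE68nnn.
geo_suppl_url() {
    acc="$1"    # e.g. GSE68849
    file="$2"   # e.g. GSE68849_non-normalized.txt.gz
    stub="${acc%???}nnn"   # strip the last three characters, append "nnn"
    printf 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/suppl/%s\n' "$stub" "$acc" "$file"
}

# Print the URL; pass it to wget to download, e.g.:
#   wget "$(geo_suppl_url GSE68849 GSE68849_non-normalized.txt.gz)"
geo_suppl_url GSE68849 GSE68849_non-normalized.txt.gz
```

For GSE68849 this prints the same URL as in the wget command above. It is still worth checking the link on the GEO page before downloading.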

We can also do this from R. Take a look at this forum answer:
https://www.biostars.org/p/335682/

I will make a note to add this to the repository wiki, https://github.com/lifebit-ai/dry-bench-skills-for-researchers/wiki. Thanks for pinging 👍, this will be valuable for more people.

cgpu self-assigned this Dec 2, 2020
cgpu added the docs label (Improvements or additions to documentation) Dec 2, 2020
Author

jjbivona commented Dec 2, 2020

Thank you!!

I was able to do it using the wget function. I also tried using R, but the GEOquery package isn't updated to work with 3.6. Are there ways around this? It seems like a useful function to quickly pull the data and keep everything within R.

I also noticed that the file from GEO was a .gz file. I tried to get the first couple of lines using
head GSE68849_non-normalized.txt.gz
But nothing happens. I'm guessing it needs to be converted to .csv

Collaborator

cgpu commented Dec 2, 2020

The .gz denotes that the file is compressed @jjbivona.

To decompress the retrieved .gz file:

# In the command line
gunzip GSE68849_non-normalized.txt.gz

After that you can read it into R as is. There is no need to convert it to csv, because data.table::fread() detects the delimiter automatically and handles most delimited text formats (tsv, txt, csv).

# In R
results <- data.table::fread(file = "GSE68849_non-normalized.txt")
head(results)
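Going back to the head attempt: running head directly on the .gz file reads the compressed bytes rather than the text. If you only want a quick look at the first lines and would rather keep the compressed file, something like the sketch below may work. The helper name `peek_gz` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: show the first lines of a gzip-compressed text file
# without writing a decompressed copy to disk.
peek_gz() {
    file="$1"
    n="${2:-5}"     # number of lines to show, default 5
    # gzip -dc decompresses to standard output and leaves the .gz file in place
    gzip -dc "$file" | head -n "$n"
}

# Example call, assuming the file from the wget step is in the current directory:
#   peek_gz GSE68849_non-normalized.txt.gz 3
```

Unlike gunzip, which replaces the .gz file with the decompressed one, this leaves the download untouched.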

GEOquery installation

To install this Bioconductor package, follow the instructions on the page and copy the installation command:

https://bioconductor.org/packages/release/bioc/html/GEOquery.html

# In R (assumes the BiocManager package is already installed)
BiocManager::install("GEOquery", update = FALSE)

This worked for me in Lifebit CloudOS.

Author

jjbivona commented Dec 3, 2020

Works now! Thank you.

In the future, if I get an out-of-date package from BiocManager, will update = FALSE solve the problem?
