Non-human species #19

guandailu opened this issue Oct 7, 2022 · 3 comments

@guandailu

Can this software be used in non-human species, e.g. cattle? If so, how can I build a pre-trained model? Can the human model be extended to other species?

@jmschrei
Owner

jmschrei commented Oct 8, 2022

Hi @FarmOmics.

Yes, Avocado can be applied to any compendium of bulk genomic experiments. However, you need many experiments across tissues and assays for Avocado to be accurate. I don't know whether cattle have that many genomic experiments performed in them.

Yes, the human model can be extended to other species (see https://www.biorxiv.org/content/10.1101/801183v3) if you have an alignment between species genomes or can remap the reads from the experiments performed in human to the cattle genome. The first is less computationally intensive, because you don't need to remap several thousand experiments. However, you still need to have many experiments performed in cattle.

Let me know if you have any other questions.

@guandailu
Author

I have cattle ChIP-seq data for five marks across ~20 tissues. Is this data set enough to train a model?
To train a model, the input data is in npz format, e.g. E117.H3K9me3.pilot.arcsinh.npz. What are the detailed steps for preparing this kind of data? I do have −log10 p-values for the ChIP-seq signals, by the way.
If I want to integrate human ChIP-seq data to train the model, can I lift over the human ChIP-seq signals to cattle coordinates?

@jmschrei
Owner

The way that Avocado is set up is that it can make predictions, even across species, for any assay that is measured at least once and any cell type that is assayed at least once. However, the predictions will be higher quality the more assays are available and the more related they are to the activity you're trying to predict. If you're trying to predict the binding of a very cell type-specific TF and only have a few histone modifications, you probably won't get great accuracy. But, if you're just trying to predict transcription from those histone modifications, you'll likely do pretty well because many histone modifications are correlated with expression.

The way you get your data into the model is just by extracting the -log10 p-values from your bigWig, probably using pyBigWig, and binning those values at 25bp resolution, taking the average across the positions in each bin. You can drop the last bin if the sequence length isn't divisible by 25.
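The binning step described above can be sketched roughly as follows. This is a minimal illustration, not code from the Avocado repository: the function names `bin_signal` and `bigwig_to_npz` are hypothetical, treating missing bigWig positions as zero is an assumption, and the arcsinh transform is only inferred from the "arcsinh" in the example file name. The array is saved under NumPy's default key (`arr_0`); whether that matches what your training script loads is something to check.

```python
import numpy as np

BIN_SIZE = 25  # resolution mentioned in the thread


def bin_signal(values, bin_size=BIN_SIZE):
    """Average a per-base signal into fixed-width bins.

    A trailing partial bin is dropped. NaNs (positions with no signal)
    are replaced with 0 before averaging, which is an assumption.
    """
    values = np.nan_to_num(np.asarray(values, dtype=np.float32), nan=0.0)
    n_bins = len(values) // bin_size
    return values[:n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)


def bigwig_to_npz(bigwig_path, chrom, out_path, bin_size=BIN_SIZE):
    """Hypothetical helper: read one chromosome from a bigWig of
    -log10 p-values, bin it, apply arcsinh (assumed), and save as .npz."""
    import pyBigWig  # imported here so bin_signal works without it

    bw = pyBigWig.open(bigwig_path)
    try:
        length = bw.chroms(chrom)              # chromosome length in bp
        signal = bw.values(chrom, 0, length)   # per-base values, NaN if missing
    finally:
        bw.close()

    binned = np.arcsinh(bin_signal(signal, bin_size))
    np.savez_compressed(out_path, binned)      # stored under the key 'arr_0'
    return binned
```

For example, `bigwig_to_npz("liver.H3K27ac.bigWig", "chr20", "liver.H3K27ac.chr20.arcsinh.npz")` would write one track for one chromosome; the file and chromosome names here are placeholders.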

Lifting over across species is more challenging because I didn't write clean code for that part. If you have some compute available, I'd actually recommend that you remap the human experiments you think are relevant to the cattle genome. The mapper will automatically take care of all the issues you might otherwise run into when using an alignment chain file (which is what I did). LiftOver would probably work as well.

Let me know if you have any other questions.
