npScarf for a metagenome bin? #11

Open
liuxianghui opened this issue Sep 26, 2017 · 3 comments

Comments

@liuxianghui

I assume that npScarf is designed for single-species bacterial genomes. Still, I want to check how it works for a bacterium in my metagenomic sample. With Illumina MiSeq data, I did the assembly with SPAdes and then binned the contigs using MetaBAT.
I tried to adapt the workflow for one of the good bins:
using bin1.fasta as spades.fasta
mapping the nanopore reads to bin1.fasta to create a SAM file.
jsa.np.npscarf -input ONT.sam --spadesDir='spades' -format sam -seq bin1.fasta -prefix bin1_spades > a.log
However, unlike with the example dataset, this takes a long time, and the output a.log becomes so large that I had to kill the job. Could you advise whether it is OK to run npScarf this way for a metagenome bin?

@hsnguyen
Collaborator

The assembly graph from metagenomic data might be too complicated to traverse for exhaustive gap-filling. I suggest running without the --spadesDir option and seeing how it goes.

Cheers,

@liuxianghui
Author

Could you kindly explain more about this process using the SPAdes graph?
(Please also specify in which Java file the process is implemented.)
It seems that without the spades folder the scaffolding process still runs, but it ends with more scaffolds. Are two contigs only merged if they are mapped to the same nanopore read and meet certain criteria? Could you explain more about the criteria? Is that the reason why it cannot be applied to a metagenomic bin?
To solve the problem, would it be OK to try the following steps?

  1. For a MetaBAT bin, extract the MiSeq paired-end short reads mapped to the contigs in the bin.
    1'. Also extract the contigs mapped by those paired-end short reads.
  2. Map the nanopore long reads to the contigs in this bin and generate the SAM file.
  3. Run jsa.np.npscarf to scaffold the contigs in the bin.
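
Step 1 above could be sketched roughly as follows, assuming the short reads were mapped to the full metagenome assembly and the alignments are available as SAM text. The function name `reads_in_bin` and the toy records are hypothetical; the sketch relies only on the standard SAM column layout (read name in column 1, reference name in column 3).

```python
def reads_in_bin(sam_lines, bin_contigs):
    """Return names of reads with at least one alignment to a contig in the bin."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:               # not a valid alignment record
            continue
        qname, rname = fields[0], fields[2]
        if rname in bin_contigs:          # unmapped reads have RNAME '*'
            names.add(qname)
    return names


# Toy SAM records (hypothetical data, not from the issue)
sam = [
    "@SQ\tSN:contig_1\tLN:5000",
    "pairA\t99\tcontig_1\t100\t60\t150M\t=\t400\t450\t*\t*",
    "pairA\t147\tcontig_1\t400\t60\t150M\t=\t100\t-450\t*\t*",
    "pairB\t99\tcontig_7\t10\t60\t150M\t=\t300\t440\t*\t*",
    "pairC\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
selected = reads_in_bin(sam, {"contig_1", "contig_2"})
print(sorted(selected))  # ['pairA']
```

The resulting name set could then be used to pull both mates of each pair out of the original FASTQ files before reassembling the bin on its own.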

It identifies the long reads that are aligned to two unique contigs, thereby establishing the relative position (that is, distance and orientation) of these contigs. To minimize the effect of false positives that can arise from aligning noisy long reads, npScarf groups reads that consistently support a particular relative position into a bridge and assigns the bridge a score based on the number of supporting reads and the alignment quality of these reads. When two unique contigs are connected by a bridge, they are merged into one larger unique contig. npScarf uses a greedy strategy based on Kruskal’s algorithm [39], which merges contigs from the highest scoring bridges. In the newly created contig, the gap is temporarily filled with the consensus sequence of the reads forming the bridge. npScarf then identifies repetitive contigs that are aligned to this consensus sequence, and uses these contigs to fill in the gap.
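
The greedy, Kruskal-style merging described in that passage can be illustrated with a minimal sketch. This is not npScarf's implementation: the names `build_bridges` and `greedy_scaffold` are hypothetical, the bridge score is assumed to be the sum of alignment qualities of the supporting reads, and orientation, distance, and the one-link-per-contig-end constraint are all ignored.

```python
from collections import defaultdict


def build_bridges(read_links):
    """Group per-read links (contig_a, contig_b, alignment_quality) into bridges.

    Assumption: a bridge's score is the sum of its reads' alignment qualities.
    """
    scores = defaultdict(float)
    for a, b, qual in read_links:
        scores[tuple(sorted((a, b)))] += qual
    return [(score, a, b) for (a, b), score in scores.items()]


def greedy_scaffold(contigs, bridges):
    """Accept bridges from highest to lowest score if they join two scaffolds."""
    parent = {c: c for c in contigs}

    def find(c):                      # union-find lookup with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    accepted = []
    for score, a, b in sorted(bridges, key=lambda br: br[0], reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb:                  # contigs are not yet in the same scaffold
            parent[ra] = rb
            accepted.append((a, b, score))

    groups = defaultdict(list)
    for c in contigs:
        groups[find(c)].append(c)
    return sorted(sorted(g) for g in groups.values()), accepted


# Toy data (hypothetical): five long reads linking three of four contigs
links = [("c1", "c2", 40), ("c2", "c1", 35), ("c2", "c3", 20),
         ("c1", "c3", 5), ("c1", "c3", 4)]
scaffolds, accepted = greedy_scaffold(["c1", "c2", "c3", "c4"], build_bridges(links))
print(scaffolds)  # [['c1', 'c2', 'c3'], ['c4']]
print(accepted)   # [('c1', 'c2', 75.0), ('c2', 'c3', 20.0)]
```

In this toy run the weakly supported c1-c3 bridge is skipped because both contigs already belong to the same scaffold by the time it is considered.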

@hsnguyen
Collaborator

npScarf currently uses the SPAdes assembly graph for the gap-filling step. Instead of using the consensus sequence, it now tries to find a path in the assembly graph that can plausibly connect the two contigs.
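
As a rough illustration of graph-based gap-filling (not the actual npScarf code), the sketch below enumerates simple paths between two contigs whose intermediate contigs fit within an estimated gap length. The function `find_paths`, the toy graph, and the `max_gap` parameter are hypothetical, and the sketch ignores strand and k-mer overlaps, which a real SPAdes graph has.

```python
def find_paths(graph, lengths, start, end, max_gap):
    """Enumerate simple paths start -> end whose intermediate contigs
    sum to at most max_gap bases.

    graph:   dict mapping a contig to the list of its successors
    lengths: dict mapping a contig to its length in bases
    """
    paths = []
    stack = [(start, [start], 0)]     # (current contig, path so far, bases used)
    while stack:
        node, path, used = stack.pop()
        for nxt in graph.get(node, []):
            if nxt == end:
                paths.append(path + [end])
            elif nxt not in path:     # keep paths simple (no revisits)
                new_used = used + lengths[nxt]
                if new_used <= max_gap:
                    stack.append((nxt, path + [nxt], new_used))
    return paths


# Toy graph (hypothetical): unique contigs A and B separated by repeats r1, r2
graph = {"A": ["r1", "r2"], "r1": ["B"], "r2": ["r1", "B"]}
lengths = {"A": 10000, "B": 8000, "r1": 300, "r2": 500}

print(sorted(find_paths(graph, lengths, "A", "B", max_gap=600)))
# [['A', 'r1', 'B'], ['A', 'r2', 'B']]
print(len(find_paths(graph, lengths, "A", "B", max_gap=1000)))
# 3
```

The number of candidate paths can grow very quickly as the graph becomes more tangled, which is consistent with the long run time and large log reported above for the metagenomic graph.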

We also have a version that works from the assembly graph from the beginning, but it's still experimental. If you clone the current git repository, you can play with jsa.dev.newScarf. It has a GUI to visualize how the assembly graph is resolved when long reads are used as bridges. The red vertices represent putatively unique contigs, while the black ones are repetitive or artifact contigs and can be ignored. It worked for a simple assembly graph (~500 nodes), but if you load in metagenomic data, it'd be too much to handle.

To sum up, the SPAdes assembly graph for metagenomic data is complicated and not yet supported by the tool. So for your problem, if you can bin the paired-end reads and run SPAdes on that subset (pretending we are assembling only one isolate), the resulting assembly graph would be simpler and possibly manageable by jsa.np.npScarf (or jsa.dev.newScarf). But again, we can't guarantee anything, since metagenomic assembly is a difficult problem and you can get errors and chimeric contigs right from the Illumina assembly step.
