Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

homopolish modpolish Index error #47

Open
giriarteS opened this issue May 25, 2022 · 6 comments
Open

homopolish modpolish Index error #47

giriarteS opened this issue May 25, 2022 · 6 comments

Comments

@giriarteS
Copy link

Can modpolish be used after racon+medaka? I am trying to polish my assembly (flye+6 rounds of racon+medaka) with modpolish but I get the following error:
"python3 /mnt/DATA/git_repos/homopolish/homopolish.py modpolish -a medaka/consensus.fasta -q /mnt/Backup/Genomes_test/raw_dna/minion/F.verticillioides/MRC826E/MRC826E_trim_fastq.gz -s /mnt/DATA/git_repos/homopolish/fungi.msh -o homopolish -t 50

Downloaded GCA_003316995.2_ASM331699v2_genomic.fna.gz
Downloaded GCA_003316975.2_ASM331697v2_genomic.fna.gz
Downloaded GCA_011035275.1_ASM1103527v1_genomic.fna.gz
Downloaded GCA_011036225.1_ASM1103622v1_genomic.fna.gz
Downloaded GCA_000149555.1_ASM14955v1_genomic.fna.gz
Downloaded GCF_000149555.1_ASM14955v1_genomic.fna.gz
Downloaded GCA_003317015.2_ASM331701v2_genomic.fna.gz
Downloaded GCA_011036015.1_ASM1103601v1_genomic.fna.gz
Downloaded GCA_011036905.1_ASM1103690v1_genomic.fna.gz

Reference = [/mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_003316995.2_ASM331699v2_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_003316975.2_ASM331697v2_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_011035275.1_ASM1103527v1_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_011036225.1_ASM1103622v1_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_000149555.1_ASM14955v1_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCF_000149555.1_ASM14955v1_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_003317015.2_ASM331701v2_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_011036015.1_ASM1103601v1_genomic.fna.gz.fna.gz, /mnt/Backup/Genomes_test/assembly/flye/F.verticillioides/MRC826E/homopolish/debug/homologous_sequences/GCA_011036905.1_ASM1103690v1_genomic.fna.gz.fna.gz]
Query = [medaka/consensus.fasta]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = homopolish/debug/ANI.txt

INFO [thread 0], skch::main, Count of threads executing parallel_for : 1
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 30851510
INFO [thread 0], skch::Sketch::index, unique minimizers = 4853959
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 1098191) ... (7088, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 26.7354 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 75.2743 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.0381457 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished

load Homo

Reference = [homopolish/debug/All_homologous_sequences.fna.gz]
Query = [medaka/consensus.fasta]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = homopolish/debug/truth_ANI.txt

INFO [thread 0], skch::main, Count of threads executing parallel_for : 1
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 30851510
INFO [thread 0], skch::Sketch::index, unique minimizers = 4853959
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 1098191) ... (7088, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 26.885 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 75.9234 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.0205193 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished
[M::mm_idx_gen::0.8991.00] collected minimizers
[M::mm_idx_gen::1.061
2.05] sorted minimizers
[M::main::1.0612.05] loaded/built the index for 83 target sequence(s)
[M::mm_mapopt_update::1.134
1.98] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 83
[M::mm_idx_stat::1.1761.94] distinct minimizers: 4248851 (99.24% are singletons); average occurrences: 1.018; average spacing: 9.991
[M::worker_pipeline::19.965
14.14] mapped 49912 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -cx asm5 --cs=long -t 50 medaka/consensus.fasta homopolish/debug/All_homologous_sequences.fna.gz
[M::main] Real time: 20.085 sec; CPU: 282.508 sec; Peak RSS: 4.439 GB
127.97561192512512
Traceback (most recent call last):
File "/mnt/DATA/git_repos/homopolish/homopolish.py", line 102, in
main()
File "/mnt/DATA/git_repos/homopolish/homopolish.py", line 77, in main
getPos(fixData,FLAGS.debug)
File "/mnt/DATA/git_repos/homopolish/modules/mod_polish.py", line 80, in getPos
H_misAry,H_AllAry = getPileUpAry(fixData,fixData.output_dir+"debug",fixData.output_dir+"debug/All_homologous_sequences.fna.gz")
File "/mnt/DATA/git_repos/homopolish/modules/mod_polish.py", line 146, in getPileUpAry
misAry,posData_ary = MismatchPileup(fixData.output_dir+"debug/truth.paf",len(fixData.seq))#get SNP position ATCG array
File "/mnt/DATA/git_repos/homopolish/modules/mod_polish.py", line 440, in MismatchPileup
coverage[start_pos] += 1
IndexError: index 2852431 is out of bounds for axis 0 with size 2852431"

Gloria

@ythuang0522
Copy link
Owner

Hi @giriarteS,

We haven't tested modpolish on fungi yet. Have you ever successfully run the original homopolish polish? If so, can you give us your fasta genome and fastq reads for debugging? modpolish is a new feature for genomes extensively edited by modifications untrained in the ONT basecalling. If your genomes are not extensively modified, you don't need to run modpolish.

@giriarteS
Copy link
Author

I run homopolish polish but I got the following for each contig:
python3 /mnt/DATA/git_repos/homopolish/homopolish.py polish -a medaka/consensus.fasta -m R9.4.pkl -s /mnt/DATA/git_repos/homopolish/fungi.msh -o homopolish -t 50

[2022/05/25 17:24] INFO: Stage: Select closely-related genomes
TIME Select closely-related genomes: 0 MINS 0 SECS.
This contig contig_75 closely-related genome is less than 5, not to polish...
[2022/05/25 17:24] INFO: RUN-ID: contig_197

@giriarteS
Copy link
Author

I successfully run homopolish polish:
homopolish.py polish -a medaka/consensus.fasta -m R9.4.pkl -l reference/F.verticillioides/GCF_000149555.1_ASM14955v1_genomic.fna.gz -o homopolish -t 25

@ythuang0522
Copy link
Owner

ythuang0522 commented May 27, 2022

We found the error was due to an insufficient number of related genomes (<5), a bug in modpolish but not original homopolish. It looks to us there are six in GenBank but only one in RefSeq. Therefore, as you have done, the local db option -l can specify the related genomes downloaded on your own. However, I would suggest you concatenate all the six related genomes into one fasta not just provided only one.

@giriarteS
Copy link
Author

I polished my genomes with several rounds of racon, then medaka, then homopolish. If I want to use illumina reads to polish with Pilon, should homopolish be run before or after pilon?

@ythuang0522
Copy link
Owner

It was trained on Medaka-polished genomes. We would recommend running homopolish right after Medaka and then followed by pilon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants