basecalling simulated reads #26

Closed
capoony opened this issue Sep 4, 2024 · 3 comments
Labels
question Further information is requested

Comments

@capoony

capoony commented Sep 4, 2024

Dear all,

fantastic software, many thanks!

I have a very naive question. After simulating reads for AmpliconSeq data, I would like to basecall the reads with guppy and then use the resulting FASTQ reads to benchmark my downstream pipeline. Unfortunately, I do not get a single read back that passed the quality filtering. What are the correct configs for guppy? I fear I am using the wrong settings here:

######## load dependencies #######

module load ONT/guppy_6.2.1_gpu

######## run analyses #######

guppy_basecaller \
--input_path data_${name}/fast5_pass \
--config dna_r10.4.1_e8.2_400bps_sup.cfg \
--compress_fastq \
--save_path SUP \
-x "cuda:0"

Please find below also the first few lines of the summary file.

filename	read_id	run_id	batch_id	channel	mux	start_time	duration	num_events	passes_filtering	template_start	num_events_template	template_duration	sequence_length_template	mean_qscore_template	strand_score_template	median_template	mad_template	scaling_median_template	scaling_mad_template
Flow1_pass_c97995_1.fast5	002206d6-3aae-48ed-ad5a-2bf5528ba046	c979953ae6ed4445ad3de5a23d1f2a4c	0	2845	1	8.000000	1.643500	1314	FALSE	8.000000	1314	1.643500	659	5.047206	2.888203	101.321495	21.638645	101.321495	21.638645
Flow1_pass_c97995_1.fast5	004c33d0-d1d2-48f9-b0f8-0358f7d187cc	c979953ae6ed4445ad3de5a23d1f2a4c	0	1940	1	4.000000	1.365250	1092	FALSE	4.000000	1092	1.365250	484	7.027078	2.824660	100.736664	21.931059	100.736664	21.931059
Flow1_pass_c97995_1.fast5	007fb85d-538c-4f4d-8f2a-37c3e44fcfb8	c979953ae6ed4445ad3de5a23d1f2a4c	0	1855	1	8.000000	1.842250	1473	FALSE	8.000000	1473	1.842250	781	4.801814	2.935820	100.151840	22.369680	100.151840	22.369680
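
The summary rows above can also be inspected programmatically to see how far the reads sit from a given quality cutoff. The following is a minimal sketch, not part of Icarust or Guppy: the function name `qscore_summary` and the cutoff values are hypothetical, and the example input keeps only four of the summary columns shown above.

```python
import csv
import io


def qscore_summary(summary_text, min_qscore):
    """Summarise mean_qscore_template from a tab-separated summary table.

    min_qscore is a hypothetical cutoff chosen by the caller; it is not
    read from any Guppy config file.
    """
    rows = list(csv.DictReader(io.StringIO(summary_text), delimiter="\t"))
    qscores = [float(row["mean_qscore_template"]) for row in rows]
    passing = sum(q >= min_qscore for q in qscores)
    return {"reads": len(qscores), "passing": passing, "max_qscore": max(qscores)}


# The three reads quoted in the issue, trimmed to four columns for brevity.
EXAMPLE = (
    "read_id\tpasses_filtering\tsequence_length_template\tmean_qscore_template\n"
    "002206d6-3aae-48ed-ad5a-2bf5528ba046\tFALSE\t659\t5.047206\n"
    "004c33d0-d1d2-48f9-b0f8-0358f7d187cc\tFALSE\t484\t7.027078\n"
    "007fb85d-538c-4f4d-8f2a-37c3e44fcfb8\tFALSE\t781\t4.801814\n"
)

if __name__ == "__main__":
    # Try a few illustrative cutoffs to see where the reads fall.
    for cutoff in (5.0, 7.0, 10.0):
        print(cutoff, qscore_summary(EXAMPLE, cutoff))
```

For a real run, the same function could be given the full contents of the summary file, since it only relies on the `mean_qscore_template` column being present.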

Thanks a lot,

Martin

@mattloose

Hi,

I'm fairly certain that @Adoni5 will chime in with some comments here, but Icarust is not going to give you high-quality basecalled data. The simulations are not precise; they are merely good enough to get back mappable data. As a consequence, the quality filter is best ignored. I certainly wouldn't use the sup model unless you really want to. It's not going to improve your data, as the signals are not simulated in a way that we would expect to perform well.
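
One way to ignore the quality filter after the fact is to pool every FASTQ file the basecaller wrote, regardless of which subfolder it landed in. The sketch below is an illustration only: the helper names `merge_fastq` and `count_reads` are hypothetical, and it assumes the basecaller wrote gzip-compressed `*.fastq.gz` files somewhere under the save path (for example in `pass` and `fail` subfolders). It relies on the fact that gzip files can be concatenated byte for byte.

```python
import gzip
import shutil
from pathlib import Path


def merge_fastq(save_path, merged_path):
    """Concatenate every *.fastq.gz found under save_path into merged_path.

    Returns the number of source files that were merged. The output file
    itself is skipped if it already exists inside save_path.
    """
    merged_path = Path(merged_path)
    sources = sorted(
        p for p in Path(save_path).rglob("*.fastq.gz")
        if p.resolve() != merged_path.resolve()
    )
    with open(merged_path, "wb") as out:
        for src in sources:
            with open(src, "rb") as handle:
                # Concatenated gzip members still form a valid gzip file.
                shutil.copyfileobj(handle, out)
    return len(sources)


def count_reads(fastq_gz):
    """Count reads in a gzipped FASTQ, assuming four lines per record."""
    with gzip.open(fastq_gz, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

With the command from the question, a call such as `merge_fastq("SUP", "SUP/all_reads.fastq.gz")` would pool whatever was written under `SUP`, and `count_reads` gives a quick check that reads were recovered.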

Icarust is designed to test adaptive sampling workflows and real-time analysis, but you should not trust the qualities of the base calls.

I hope that helps.

Matt

@mattloose mattloose added the question Further information is requested label Sep 4, 2024
@capoony
Author

capoony commented Sep 4, 2024

Dear Matt, that indeed helps a lot, many thanks!

best, Martin

@Adoni5
Contributor

Adoni5 commented Jan 9, 2025

Pretty much what Matt said. The simulated read identity for R10 data is about 0.9, which will align, but the reads are probably made up of pretty low Phred quality score bases, which results in low Q score reads!
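
As a rough back-of-the-envelope check, the standard Phred relation Q = -10 * log10(P_error) links an error rate to a quality score. The sketch below applies it under the simplifying assumption that alignment identity can be treated as per-base accuracy, which is only an approximation; the function names are hypothetical.

```python
import math


def identity_to_phred(identity):
    """Convert an identity (fraction of correct bases) to a Phred score.

    Assumes identity approximates per-base accuracy, so the error
    probability is 1 - identity.
    """
    if not 0.0 <= identity < 1.0:
        raise ValueError("identity must be in the range [0, 1)")
    return -10.0 * math.log10(1.0 - identity)


def phred_to_error(qscore):
    """Convert a Phred score back to an error probability."""
    return 10.0 ** (-qscore / 10.0)


if __name__ == "__main__":
    # An identity of 0.9 corresponds to a Phred score of about 10.
    print(identity_to_phred(0.9))
    # A mean Q score near 5, as in the summary file, implies a predicted
    # error rate of roughly 0.31.
    print(phred_to_error(5.047206))
```

Under this approximation an identity of 0.9 corresponds to roughly Q10, while the mean Q scores of about 5 to 7 reported in the summary file are the basecaller's own estimates and need not match the identity measured after alignment.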

The idea of Icarust was to test adaptive sampling frameworks and real-time analysis tools, and once simulated reads were of sufficient quality to do this, we stopped working on improving their identity!

@Adoni5 Adoni5 closed this as completed Jan 9, 2025