Issue with fastq extractor when basecalled with Dorado: ValueError: dictionary update sequence element #0 has length 1; 2 is required #34

Open
ClairePt opened this issue Nov 18, 2024 · 3 comments
@ClairePt

Hello,

Whenever I try to use toulligqc on files that have been basecalled with Dorado, I get the following error: ValueError: dictionary update sequence element #0 has length 1; 2 is required

I have encountered no issues when using the exact same command on the same sequencing data, the only difference being that the basecall was performed with Guppy.

Here is the exact command I use:
srun -p COMPUTE toulligqc --report-name TEST --fastq /home/eleonore.durand/Nanopores_set2_2024_fastq_NO_BACKUP/10_CR2_dorado_hac.fastq --html-report-path TOULLIGQC/TEST/report.html

And here is the full error message I get:
`ToulligQC version 2.7.1

  • Initialize extractors
  • Start Toulligqc info extractor
  • End of Toulligqc info extractor (done in 0m0.00s)
  • Start fastq extractor
Traceback (most recent call last):
  File "/eep/softwares/miniconda/envs/toulligqc-2.7.1/bin/toulligqc", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/eep/softwares/miniconda/envs/toulligqc-2.7.1/lib/python3.12/site-packages/toulligqc/toulligqc.py", line 426, in main
    extractor.init()
  File "/eep/softwares/miniconda/envs/toulligqc-2.7.1/lib/python3.12/site-packages/toulligqc/fastq_extractor.py", line 60, in init
    self.dataframe_1d = self._load_fastq_data()
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/eep/softwares/miniconda/envs/toulligqc-2.7.1/lib/python3.12/site-packages/toulligqc/fastq_extractor.py", line 253, in _load_fastq_data
    run_info = self.check_fastq()
               ^^^^^^^^^^^^^^^^^^
  File "/eep/softwares/miniconda/envs/toulligqc-2.7.1/lib/python3.12/site-packages/toulligqc/fastq_extractor.py", line 360, in check_fastq
    metadata = dict(x.split("=") for x in first_line.split(" ")[1:])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: dictionary update sequence element #0 has length 1; 2 is required
srun: error: seed: task 0: Exited with exit code 1`

I have tried several fixes, including pinning pandas==2.1.4 and numpy==1.26.4, but I always get the same error message.
What could I do to fix this?

Thank you for your help,

Best regards,

Claire

@alihamraoui
Member

Hi @ClairePt

Sorry for the delayed answer. Could you please provide the head of your FASTQ file so I can reproduce this issue?

Best regards,
Ali

alihamraoui added the bug label Dec 9, 2024
@ClairePt
Author

ClairePt commented Dec 9, 2024

Hello @alihamraoui ,
No problem, thank you for your answer. Below is the head of one of the FASTQ files that were basecalled using Dorado:

@429a1c46-216d-4cfd-a67f-cdcf32c77ed5 st:Z:2024-06-11T13:10:52.535+00:00 RG:Z:2712a6db1cae52ac66ba3db53192fa918793254c_dna_r10.4.1_e8.2_400bps_hac@v5.0.0 DS:Z:gpu:Quadro RTX 3000
TTTTTTTTTTTATTTTTTAGTAAAAAACGTAACATCTTGATTTCATTGATACCACATAAGTATTGAAGATCTTTATTATTACAAGAAACTCACTTCATCACATTGAGCTGACAAAATCAACAACATCAACCACATCCATGATTCTCGGTTTCGAATTCACCACTCTACAGTTCGATCTAATTTCACCTTGCTTCCCTGTTAATGGTGGAAAGGTTTCCATCCTTATCATTGCCTTAACGAACGCGTCGAAAAATTTCCCTTGACCATCAGCGTATGCCCGGACTAATGGGATAGTGTCTTTAGCATCAGGGCTTGAGAATAACTCTTGGTCGCTTTGGATAAGACCTTTGTTCTCTTTGAGAATTCACATAGTATTTGTTGTCAAAGAGTGTTGGCGTACGTAGATAAAAATGTTGCCCCAAGACACTTGGTTTCCATTACGCGGACATTGCTTTCTTAGTGTACTCGAGTATAAGGTGCTCAAGGCTTGGGTCGGGTTTACCGGTGCTACTGAAGTTGTAAAGCCGATCCATTATAAACTGACATTGGTTTTTGCCAAAGGTGTGACCACCGGAAAGAGCAACGAGATCCGAAGGAAGATCGAGTCCGACTTTTTTGAAGCTATCCTTAAGTTGTTTAAGGGTAAAGAATGGAGCTGGAAGGTTATCATTAGCAAGATCCATAAAACCTCTTAAGCTATCTCTTCTTCCACTTGGAACCATCCATGAAGGACCTCCCGCCAAAACGACAGATTCTTGAGCCGCGATTGCGAGCAAATCAGCATCCATGAAACTGTTCTCGGACATGCTTTCTCCACGGCAGCTTTCATTTCATCGATAACATCGAAACCTCGAGCCGACTTAGCGTTCCCAAACGCGTCCTTCTCCGTCCTAAACGATGTTGTGTTGTCTTAACAATATAGACGCATCACATCCATTAACGAAGCAGTCATGGAAATGAAGACGAAGGATGCTTCGCAGCGATGCGAGGGTCTGATCTCAGAGCTTTGACAATGGTTTTGGTAGCAATGTCGAAGACTTGAGGACAAGTTTTATCGTAAAAGGAAGGACTTAAGTTCAGCATGAGACAATGAGACTTGAAGAAGAATAAGAACAAATCCTAATTTTGTCAAAGAAAATGACTTTCTGATCTTGCTCTGACTCTAAGAGAATGTTTTGGTTATCGTTTAGTTTTGAGTGAGTCCCAGCAATATCAG
+
=?003<CBADH8,,,28@<))BGIKFCB+'''922022.<9=?>HPJHKGKJE::99;-'+32/'?:0/011211010==55399:>A<:77968:EFEIGEDFCDCBBADCD@998+*()--'&%%&(*55?>>==?>;,//0=7?3;FIINJKRLPJIRSJSSJIHI@DEIJIGF-('')1,+.65766666>>>BG+(()=@???GC@??@<:89=63224E=:77=75435/,-+*,,-/..1<D:IGHGHLJHHISLPFKJGFFLKSOD4271=<:HIIIHJHHHNSKH52//9;ACBCDEDAGFG;7888,,,,BDEHGGGHQSHPE@@@CLFFGHLGLMFIL1///99>BGE111/002444348777564))0/.//83345;:220232.-*)))))'(('%$$$$-+&&&+69:-3323BBDDFFA41276<:9+++6....1/***,45348<1&&&&)*5:2267999BA870/( @c2f2e3f7-429d-47fa-80fe-9841be2e53bf st:Z:2024-06-11T13:10:53.366+00:00 RG:Z:2712a6db1cae52ac66ba3db53192fa918793254c_dna_r10.4.1_e8.2_400bps_hac@v5.0.0 DS:Z:gpu:Quadro RTX 3000 ATGTTGTGGTGTACTGGTTACGAGTACGTATTGCTTTGGAGCTCTCGATTCCGTGGGACGAGCTCAACAAAGCAGCCGCCCCGTCCTACCTATTTAAAGTTTGAGAATAGGTCGAGGGCGTTGCGCCCGATGCCTCTAATCATTGGCTTTACCCGATAGAACTCGCGTCCGAGCTCCAGCTATCCTGAGGGAAACTTCGGAGGAACCAGCTACTAGATGGTTCGATTAGTATTTCGCCCCTATACCCAAGTCAGACGAACGATTTGCACGTCAGTATCGCTGCGGGCCTCCACCAGAGTTTCCTCTGGCTTCGCCCCGCTCAGGCATAGTTCACCATCTTTCGGGTCCCGACAGGCATGCTCACACTCGAACCCTTCTCAGAAGATCAAGGTCGGTCGGCGGTGCACCCGCGAGGGATCCCGCCAATCAGCTTCCTTGCGCCTTACGGGTTTACTCACCCGCTGACTCGCACACATGTAGACTCCTTGGTCCGTGTTTCAAGACGGGTCGAATGGGGAGCCCACAGGCCGACGCCCGGAGCACGCAGATGCCGAGGCACGCCGTAAAGGCGCGTGCTGCAGACCACGATCACGGTAGCGACGTCTCCGCGGGCGTATGAAGAGAGCCCGGGCTTAGGCCACCACCGTAATCCGCGTCGGTCCACGTCCCCGAATCGATCGGCGGACCGGATTTCTCCGTTCCGCATCCGACCGGGACGCGTCGCGGCCCCCATCCGCTTCCCTCCCGACAATTTCAAGCACTCTTTGACTCTCTTTTCAAAGTCACTTTCATCTTTTACCTCGCGGTACTTGTTCGCTATCGGTCTCGCCCATCTAGCCTTGGACGGAATTTACCGCCCGATTGGGGCTGCATTCCCAAACAACCCGACTCGTAGACAGCGCCTCGTGATGCGACAGGGTCCAGGGCACGACGGGGCTCTCACCCTCTCTCCGGCGCCCCTTTCCAGGGAACTTGAGCCCGGTCCGTCGCTGAGGACGCTTCTCCAGACTACAATTCGAACGCCGAGGACGTCCGATTTTCAAGCTGGGCTCTTCCCGGTTCTCGCCGTTACTAAGGGAATCCTTGTTAGTTTCTTTTCCTCCGCTTGTTGATATGCTTAAACTCAGCGGGTGATCCGCCCTAGGTTTAAGCATATCAACAAGCGGAGGAAAAGAAGAACTAACAAGGATTCCCTTAGTAACGGCAAGCGGGAAGGCCCGGCTTGAAAATCGGACGTCCTCCACGTACGGAGAAGCATCCTGAGCGACGGACCGGGCTCAAGTTCCCTGGAAGGTTCGGGAGAGAGCCCGTCGTGCAAGTATCACATCACGAG
GCGCTGTCTACAGTCGGGTTGTTTGGGAATGCAGCCCCAATCGGGCGTAATTCCGTGCAAAGCGGGCGAGACCGATAGCGAAGCACCGCGAGG + &&+-/(&%$$$&&%&&'''%$#$$%))117667;CBH1127540+)+,,1;956...09-*''*)),.7997>@@AB97789:CEJLFEFFCBCEKJSKMJICB>B><=C?D1102=@>553349,(?AEHLSIMDDEIMNJQPJGGHINIIHG(''(>>@ADJEFIGJIHINJKKLJMKIJHPKSRHGQFENFBA?<=<>?<6111D?@@DGGGHGHEHHHLBCBCC>>;<<<@DDGEABEJKHIF65())&)*365.3'(,,--5447@CFDFGGA==>GKHFGFD==>>?>?CB<=<512:5**+46=>6558NFJSIEHHEEDFKFHKEFGFFHHCA@<66:9:5,)&((%%%&&&74454//)))-/-1143-=:9:9?CFBC:7C:H<SLJFFHCD==ABDDDFJPSSRNIIRIGGMF:978**)222333488C?88<>GEDDKIHIKHOJJ:998A:88;<<?@=;=GHEC1../A>==:;<?JDFFSEDDCABFGGCB222IIHEAA@@da<;;;A9944430+''(+@5555;<9980..,+--/8;6>8<><985411133128<<D;==BE@7:::DGKIBBBBFHOOMSJFMHFEEIIIFJEEIJKS30009888B=<<GIJNSLKEDHFEG=;99??>5KIFMJHLLKFKKII?<//-00()8::<::;<><;723>>>AIIQSSKGSLGGGJISSLNLSKHIKHFG76:81+)()/4.(&%&.(0756678CFIKGMMIGEJBBIFCB/,++)(+,.//%%$%%%,,,734BFHSPHJKKOLGDHFR<;;;DHFMPGF<;;=<>A?EDIJIJJNCCDFDDB878>=@@A@AASJSQNIDCBCG>666?HF6443;==<::,,,;<?@?:('(,,4429@C8DEEGGHJ@@HFGFLLORLHJIGJHLILIHPDPDI?=999:ACGEFEHE899;GIFISGEEDRIIHEEFFFHIGPQLSJJJNONRAKJINFEFHESSPNISE@@402++((,.;;<=<>=;;>IJMIGGHKKJIF<?EDIIFDSIHH>HJHFSPISSMHISSLMIKHSLSRICBAAHLIFIHE7,970)411014'&&,,+,25>ADHDGDCCDGAAFF==<8,157:;9:==@?@:8423011001///'&%&%##$'''&'+(('')./7:<<520/-,,--..-))+(('''38::6))+0,**('(),345769879744444337551.+&'&&(&&''($##$%.++((%%$$$$$$(()())(()).123.,,,0002.-.0++++120,-/0./..064433<=75458764-,&'')))&%$$##$&$$%&&%%$&()-./)((''(''&$$$(''''
@0d45b1a7-60dd-4026-a4e2-e29abf867a10 st:Z:2024-06-11T13:11:11.629+00:00 RG:Z:2712a6db1cae52ac66ba3db53192fa918793254c_dna_r10.4.1_e8.2_400bps_hac@v5.0.0 DS:Z:gpu:Quadro RTX 3000
TTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAACAAATTTTGGATGATTGAAAACAGCAGAATCATTCCTATAACATTCACAATCTTTGACACCAACATTGTCTTCAACTCATCAAAGCATCGCTAAGCTCCAGACACGCCACGAAAGAAAGAGAAAAACAGATAGAAACCGTCGCGCATTAACCATCATTACTCCAACTTCTCGAGTTCTGACATCCTTTCGATGATCGCCCTTTTGCTTGTTTCTTGGAGCTCCCCAGCAGAATGAATTCATCGAGTATCAGATACACCTTTGTGGAAATTAAACACCAAATCTAGCTCACAGACATTGCTGAAGAAATGGTGTTTAATTTCCACAAGGTGTATCTGATACTCGATGAATTCTTGCTGGGGAGCTCCAAGAAACAAGCGGAGGGAGATCATCGAAAGGAATGGTTAATGCGTGCTAGTCCATCTGTTTTTCTCTTTTTCTTTCGTGGCCGTATCTGGAGCTTTAGCGATGCTTTGATGAGTTGAAGACAATGTTGGTGTCAAAGATTGTGAATGTTATAGGGAATGATTCTGCTGTTTTCAATCATCCAAAATTTGTTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
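
For readers following along, the failure can be reproduced without ToulligQC. The sketch below applies the expression shown at `fastq_extractor.py` line 360 in the traceback to the first header line of the FASTQ head above. Nothing here is ToulligQC's own test code; it only assumes that `first_line` holds the header exactly as pasted (with spaces between fields).

```python
# Header line copied from the first record pasted above.
first_line = (
    "@429a1c46-216d-4cfd-a67f-cdcf32c77ed5 "
    "st:Z:2024-06-11T13:10:52.535+00:00 "
    "RG:Z:2712a6db1cae52ac66ba3db53192fa918793254c_dna_r10.4.1_e8.2_400bps_hac@v5.0.0 "
    "DS:Z:gpu:Quadro RTX 3000"
)

# Same expression as the line reported in the traceback.
tokens = first_line.split(" ")[1:]
try:
    metadata = dict(x.split("=") for x in tokens)
    error = None
except ValueError as exc:
    error = str(exc)

# The first token has no "=", so split("=") returns a one-element list
# and dict() cannot build a (key, value) pair from it.
print(tokens[0].split("="))
print(error)
```

The header fields use `:` as a separator (`st:Z:...`, `RG:Z:...`, `DS:Z:...`), so none of them splits into the two parts that `dict()` requires.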

@alihamraoui
Member

Hi @ClairePt,

It seems that your FASTQ sequence name format is new to ToulligQC: the fields after the read ID look like SAM tags (e.g. `st:Z:...`) rather than `key=value` pairs. ToulligQC doesn't currently support this.
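
To illustrate what supporting both header styles could look like, here is a minimal sketch of a tolerant header parser. It is not the fix shipped in ToulligQC; the function name `parse_fastq_header`, the `SAM_TAG` pattern, and the Guppy-style example header are hypothetical. It assumes SAM-style fields have the form `TAG:TYPE:VALUE`, that the real file may separate fields with tabs, and that in space-separated text (as pasted in this thread) a stray token such as `RTX` belongs to the previous field's value.

```python
import re

# Hypothetical pattern for a SAM-style optional field: two-character tag,
# one-letter type code, then the value (which may itself contain ":").
SAM_TAG = re.compile(r"^([A-Za-z][A-Za-z0-9]):([AifZHB]):(.*)$")


def parse_fastq_header(line):
    """Return (read_id, metadata) for one FASTQ header line.

    Accepts both key=value fields and SAM-style TAG:TYPE:VALUE fields.
    """
    line = line.rstrip("\n")
    if not line.startswith("@") or len(line) == 1:
        raise ValueError("not a FASTQ header line")
    # Prefer tabs when present; otherwise fall back to whitespace.
    fields = line[1:].split("\t") if "\t" in line else line[1:].split()
    read_id, metadata, last_key = fields[0], {}, None
    for field in fields[1:]:
        match = SAM_TAG.match(field)
        if match:
            last_key = match.group(1)
            metadata[last_key] = match.group(3)
        elif "=" in field:
            last_key, value = field.split("=", 1)
            metadata[last_key] = value
        elif last_key is not None:
            # Token without a separator: treat it as the continuation of
            # the previous value (e.g. "Quadro" + "RTX" + "3000").
            metadata[last_key] += " " + field
    return read_id, metadata


dorado_header = (
    "@429a1c46-216d-4cfd-a67f-cdcf32c77ed5 "
    "st:Z:2024-06-11T13:10:52.535+00:00 "
    "RG:Z:2712a6db1cae52ac66ba3db53192fa918793254c_dna_r10.4.1_e8.2_400bps_hac@v5.0.0 "
    "DS:Z:gpu:Quadro RTX 3000"
)
# Made-up key=value header, for comparison only.
guppy_style_header = "@read-0001 runid=abc123 read=12 ch=3"

dorado_id, dorado_meta = parse_fastq_header(dorado_header)
guppy_id, guppy_meta = parse_fastq_header(guppy_style_header)
print(dorado_id, dorado_meta["st"], dorado_meta["DS"])
print(guppy_id, guppy_meta)
```

Splitting `key=value` fields with `split("=", 1)` also avoids a second failure mode of the original expression, where a value containing `=` would yield more than two parts.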

This issue will be fixed in the upcoming version 2.7.2, which will be available soon (hopefully next week).

Thanks!
Ali
