Odd compression on the R2.bz2 file for mrsa-illumina tutorial #4730

Closed
jennaj opened this issue Feb 16, 2024 · 8 comments · Fixed by #4732
@jennaj
Member

jennaj commented Feb 16, 2024

Tutorial: https://training.galaxyproject.org/training-material/topics/assembly/tutorials/mrsa-illumina/tutorial.html

Zenodo: https://zenodo.org/records/4534098

Ok --- https://zenodo.org/record/4534098/files/DRR187559_1.fastqsanger.bz2
Fails "as truncated data" with FastQC --- https://zenodo.org/record/4534098/files/DRR187559_2.fastqsanger.bz2

The second file R2 in bz2 format is tossing errors with FastQC at both usegalaxy.org and usegalaxy.eu.

If that same file is uncompressed but otherwise unchanged, it works fine with FastQC.

If that file is left bz2 compressed, it fails only in FastQC; the other steps of the tutorial's workflow handle it fine.

Compression problem? Doesn't seem to be actually truncated.

ORG https://usegalaxy.org/u/jen-galaxyproject/h/genome-assembly-of-mrsa-using-illumina-miseq-data-1
EU didn't run myself but it was reported here https://help.galaxyproject.org/t/fastqc-fails-in-mrsa-genome-assembly-tutorial/11703

@jennaj jennaj added the bug label Feb 16, 2024
@hexylena
Member

Ugh, this was always a nightmare; I had so many issues with the .bz2 files. Incredibly frustrating: the workflow was always very difficult about the format, and I never understood why.

a test of the file exits without error

$ bzip2 -t DRR187559_2.fastqsanger.bz2                             
bzip2 -t DRR187559_2.fastqsanger.bz2  7.43s user 0.01s system 96% cpu 7.723 total

and agreed not truncated

$ cat DRR187559_2.fastqsanger | tail
+DRR187559.451780 451780 length=209
CCCCCGGGGGGGGGGGGF@FGGGGGGGF9FFFGGGGGGFGGGGGGGGGGGGFGGGGGGGGGFFGGEFGGFD7FFGEFFGGGGGGGFGGGGGGGFFFGG<FGGGGGGGGFG9E+A<EF95=,FFFGGGGCFDA,@;D,,=B=,4@>=DFFA=EGC>8@FC,@FFC6,6@9DF@F@+0@,DFGA?,@,8?FF+68=>C9;FCFFFCCF**0
@DRR187559.451781 451781 length=165
GGTGTTGATAAGTAGCGTTCACCATTACTCGGCAATACTGTTACTACTGTTTTACCTTTTCCTAATTCTTTTGCTTTTTGAATGGCAGCATAAATCGCAGCACCTGATGAAATACCTGCTAAAATACCTTCCTCTTTAGCAACTCGACGAGACATTTCCATCGCT
+DRR187559.451781 451781 length=165
CCCCCGGGFGGGGGGGG@CGGGGGGGGGGGGGGGCEFFCFCCAFCFGGFGGEAFGEFC9FGCGAFGGGGFGG9<F9FGGGGGCF8EFGDFDFGCAD7FFGGGGGGGGFFCGDE<F8FFGGGFF9,A,ACFC=ECCBD@;EGFGED8?>+@EEGFA,,@@D,3@DC
@DRR187559.451782 451782 length=164
CCACCATAAGAAACACCAGTGTCTTGATTAATTCTATAATTAGATATTGATTTATCATTTAGTAATTTTTCTATTGTATTATAAATTTCTTTAAACTAGTTCATAATTTTTGTCAAAATGAAAGAATAATTTATTTTTTCTAGTTAAATTAATAATAAATAATA
+DRR187559.451782 451782 length=164
CCCCCGGGFF9CEFFA;FF<@<FGGFFFE<DGFGGGGGGGGCDFGGGCGG9FGGGFEGFGGGGGFGGFGGGE<@FAFGGGGGAEFGEFGFGG9@CFC,59EFGGG9AFDFE7EFG,EE,4AAFACGDGAFF9EF9;>E7>EAAEDGECDC>@FGG,@,9;;,,3

In the workflow and tutorial for Nanopore, the files are decompressed automatically because of this exact issue, so maybe we add that step to this tutorial too. Or do you have another suggestion for fixing it? bz2 should be supported :(
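
For anyone without the `bzip2` CLI at hand, the integrity test above can be approximated in Python. This is only a sketch using the standard-library `bz2` module; the function name `bz2_is_readable` and the chunk size are made up for illustration, and it is not claimed to be byte-for-byte equivalent to `bzip2 -t`.

```python
import bz2


def bz2_is_readable(path, chunk_size=1 << 20):
    """Return True if the whole bz2 file at `path` decompresses cleanly."""
    try:
        with bz2.open(path, "rb") as handle:
            # Read and discard the decompressed data chunk by chunk.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError):
        # OSError: invalid compressed data.
        # EOFError: the file ends before the end-of-stream marker.
        return False
    return True
```

Like `bzip2 -t`, Python's `bz2.open` reads across concatenated streams, so a file that passes here can still trip up a reader that only understands a single stream.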

@bebatut
Member

bebatut commented Feb 16, 2024

I can create a new Zenodo entry with data in a different format

@hexylena
Member

Or it's a FastQC problem: they can't actually handle this bz2, and we stop accepting bz2 in the FastQC wrapper.

See s-andrews/FastQC#48. We aren't using a concatenated file, but it could still come down to FastQC's support for standard bzip2 files and the Java implementation they're using, which they might switch away from.
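
To make the failure mode from that FastQC issue concrete, here is a small Python illustration of what happens when a reader stops at the first end-of-stream marker. It uses the standard-library `bz2` module, not FastQC's Java code, and the helper `first_stream_only` and the toy records are invented for the example.

```python
import bz2


def first_stream_only(data):
    """Decompress only the first bzip2 stream in `data`, as a naive reader would."""
    decompressor = bz2.BZ2Decompressor()
    # A BZ2Decompressor stops at the first end-of-stream marker; anything
    # after it is left untouched in `decompressor.unused_data`.
    return decompressor.decompress(data)


# Two independently compressed FASTQ-like records, concatenated byte for byte.
record_1 = b"@read1\nACGT\n+\nFFFF\n"
record_2 = b"@read2\nTTGA\n+\nFFFF\n"
concatenated = bz2.compress(record_1) + bz2.compress(record_2)

partial = first_stream_only(concatenated)  # only record_1
complete = bz2.decompress(concatenated)    # record_1 followed by record_2
```

A tool that behaves like `first_stream_only` sees valid but incomplete data, which would look like truncation even though the file itself is intact.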

@hexylena
Member

I recompressed with compression levels 1-9 and FastQC reads them all fine, up to and including 9, which is the level used by the existing file.

$ file *.bz2
1.bz2:                       bzip2 compressed data, block size = 100k
2.bz2:                       bzip2 compressed data, block size = 200k
3.bz2:                       bzip2 compressed data, block size = 300k
4.bz2:                       bzip2 compressed data, block size = 400k
5.bz2:                       bzip2 compressed data, block size = 500k
6.bz2:                       bzip2 compressed data, block size = 600k
7.bz2:                       bzip2 compressed data, block size = 700k
8.bz2:                       bzip2 compressed data, block size = 800k
9.bz2:                       bzip2 compressed data, block size = 900k
DRR187559_2.fastqsanger.bz2: bzip2 compressed data, block size = 900k
$ fastqc 9.bz2
application/x-bzip2
Started analysis of 9.bz2
Approx 5% complete for 9.bz2
Approx 10% complete for 9.bz2
...

but that also is exactly what's described in their issue:

The test would be that if you decompress the file you have to a raw fastq, and then re-compress that to a bz2 file, is it then able to be read correctly?

so we could upload a replacement for one of these, easily I guess
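
The decompress-then-recompress test quoted above can be sketched in Python as well. This assumes the standard-library `bz2` module; the function name `recompress_single_stream` and both paths are placeholders, not what was actually run for the replacement files.

```python
import bz2
import shutil


def recompress_single_stream(src_path, dst_path, compresslevel=9):
    """Rewrite a (possibly multi-stream) bz2 file as one bzip2 stream."""
    # bz2.open reads across all concatenated streams transparently, and a
    # single handle opened in "wb" mode writes exactly one stream.
    with bz2.open(src_path, "rb") as src, \
            bz2.open(dst_path, "wb", compresslevel=compresslevel) as dst:
        shutil.copyfileobj(src, dst)
```

The decompressed content is unchanged; only the container goes from many streams to one.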

@hexylena
Member

hexylena commented Feb 16, 2024

A bzip2 stream starts with the header BZh#, where # is the compression level (the same bytes can also turn up elsewhere in the stream, but they are always the header too). E.g. BZh2 appears once across my 10 files, only in the one compressed at level 2. BZh9 however...

$ grep BZh9 *.bz2 -c                                        
1.bz2:0
2.bz2:0
3.bz2:0
4.bz2:0
5.bz2:0
6.bz2:0
7.bz2:0
8.bz2:0
9.bz2:1
DRR187559_2.fastqsanger.bz2:259

this file is definitely concatenated! (but it's still very much the java bzip2 library's fault for not being able to read this when the other implementations can.)

edit: the source files: https://ddbj.nig.ac.jp/public/ddbj_database/dra/fastq/DRA008/DRA008776/DRX178031/ exhibit the exact issue, they are concatenated for some reason.
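
Since `grep -c` counts matching lines and the header bytes can also occur by chance inside compressed data, a more direct check is to count how many times decompression reaches an end-of-stream marker. A rough Python sketch with the standard-library `bz2` module follows; `count_bz2_streams` is a made-up name, and trailing non-bzip2 bytes would raise `OSError` rather than being handled.

```python
import bz2


def count_bz2_streams(path, chunk_size=1 << 20):
    """Count the complete bzip2 streams concatenated in the file at `path`."""
    streams = 0
    decompressor = bz2.BZ2Decompressor()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            while chunk:
                decompressor.decompress(chunk)  # decompressed output is discarded
                if not decompressor.eof:
                    break  # the current stream continues in the next chunk
                streams += 1
                # Bytes after the end-of-stream marker start the next stream.
                chunk = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
    return streams
```

A file written by a single `bzip2` run should give 1 here; any larger number means the file is a concatenation of streams.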

@hexylena
Member

I've found another use of this dataset in the wild and contacted the author, since he's a colleague in µbinfie. His tutorial uses Trimmomatic, and that also fails. So here is a re-issued record, https://zenodo.org/records/10669812, with rebuilt bzip2 files that should not experience this issue (confirmed with FastQC and Trimmomatic).

hexylena added a commit that referenced this issue Feb 16, 2024
@hexylena
Member

if any of y'all wanna approve #4732

@jennaj
Member Author

jennaj commented Feb 20, 2024

Thank you for digging into this more! So odd but glad solved now :)
