Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error converting a vcf #217

Closed
mufernando opened this issue May 21, 2024 · 3 comments · Fixed by #222
Closed

error converting a vcf #217

mufernando opened this issue May 21, 2024 · 3 comments · Fixed by #222
Labels
bug Something isn't working
Milestone

Comments

@mufernando
Copy link

I have a 75Mb VCF that I tried converting using vcf2zarr CLI.

Following the instructions, I first created the intermediate ICF using

vcf2zarr explode chrY_mcc-746-kray-filtered.vcf.gz chrY_mcc-746-kray-filtered.exploded

which worked out fine.

Then, I tried converting to the final vcfzarr using encode:

vcf2zarr encode chrY_mcc-746-kray-filtered.exploded chrY_mcc-746-kray-filtered.zarr -vv

This is the verbose output I get https://gist.github.com/mufernando/805a39a636c71b21c1b25a8ff6417e49

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/users/rodrigmu/micromamba/envs/marmoset/bin/vcf2zarr", line 8, in <module>
    sys.exit(vcf2zarr_main())
             ^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/bio2zarr/cli.py", line 331, in encode
    vcf2zarr.encode(
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/bio2zarr/vcf2zarr/vcz.py", line 934, in encode
    vzw.encode_all_partitions(
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/bio2zarr/vcf2zarr/vcz.py", line 898, in encode_all_partitions
    with core.ParallelWorkManager(num_workers, progress_config) as pwm:
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/bio2zarr/core.py", line 274, in __exit__
    wait_on_futures(self.futures)
  File "/home/users/rodrigmu/micromamba/envs/marmoset/lib/python3.12/site-packages/bio2zarr/core.py", line 97, in wait_on_futures
    raise exception
ValueError: Codec does not support buffers of > 2147483647 bytes

Ideas on what might be going on here? I'm happy to share the VCF.

@jeromekelleher
Copy link
Contributor

Thanks for the bug report @mufernando!

I think the problem is with your call_PL field:

DEBUG Initialised <zarr.core.Array '/call_PL' (24739, 746, 66) int32>

Your PL field leads to an inner dimension of 66! 🫢

So, each variant chunk of this array is 10000 * 746 * 66 * 4 which is more than 2147483647. The immediate bug here then is that we're not giving a usable error message. More generally, PL fields are a major problem (#185 and linked discussion) but we know how they should be dealt with.

In the short term, I'd suggest creating a schema and dropping the call_PL field (assuming you're not using it):

vcf2zarr mkschema  chrY_mcc-746-kray-filtered.exploded > schema.json
... edit JSON to remove delete the call_PL field
vcf2zarr encode chrY_mcc-746-kray-filtered.exploded -s schema.json chrY_mcc-746-kray-filtered.zarr

@jeromekelleher jeromekelleher added the bug Something isn't working label May 22, 2024
@jeromekelleher jeromekelleher added this to the 0.1.0 release milestone May 22, 2024
@mufernando
Copy link
Author

mufernando commented May 22, 2024

I confirmed removing the PL field it works! For future reference, you can remove it from the schema using jq:

vcf2zarr explode example.vcf.gz example.exploded
vcf2zarr mkschema example.exploded | jq 'del(.fields[] | select(.name == "call_PL"))' > example.json
vcf2zarr encode example.exploded example.zarr --schema example.json

@jeromekelleher
Copy link
Contributor

I just merged a fix which should raise a more helpful error message @mufernando - would you mind trying it out on your data please? The message will need some links to as-yet unwritten docs about PL fields, but hopefully it's pointing people in the right direction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants