bcftools query 1.18 prints new lines when looping through samples #1969

Closed
jamigo opened this issue Aug 1, 2023 · 5 comments


jamigo commented Aug 1, 2023

bcftools query 1.18 prints new lines when looping through samples.

Previous versions of bcftools query, with an explicit newline in the format:

$ bcftools query -f '[%TGT]\n' file.vcf
A/C
C/G
G/T

Previous versions of bcftools query, without a newline in the format:

$ bcftools query -f '[%TGT] ' file.vcf
A/C C/G G/T

bcftools query 1.18, with the same format without a newline:

$ bcftools query -f '[%TGT] ' file.vcf
A/C
C/G
G/T

Member

pd3 commented Aug 1, 2023

This is an intentional change

bcftools/NEWS

Lines 89 to 91 in f6a4ae6

* bcftools query
- Force newline character in formatting expression when not given explicitly

although it may need some fine tuning as I can see.

Why force the newline in the first place? In offline polls, not a single user said the explicit newline was a good thing, so I decided to insert it automatically; I could not think of a use case where printing the entire VCF as a single line would be beneficial. If you are aware of a use case where this is desired and preferred, I am happy to add a command line option to override the default.

As for the behavior in 1.18: whenever a newline is given explicitly, the program will not interfere with user formatting. When the newline is not given, it is inserted to avoid a very common error. Where it is inserted can be somewhat unpredictable, as I just found when testing this case:

$ bcftools query -f '%REF [ %GT]' rmme.vcf
G  0|0 0|0 0|0 0|0 0|1 0|1 0|1 0|1 1|0 1|0 1|0 1|0 1|1 1|1 1|1 1|1
G  0|0 0|1 1|0 1|1 0|0 0|1 1|0 1|1 0|0 0|1 1|0 1|1 0|0 0|1 1|0 1|1

but

$ bcftools query -f '[ %GT]' rmme.vcf | head -3
 0|0
 0|0
 0|0

So the program decides on its own whether to place the newline per-sample or per-site, depending on the context, i.e., whether the expression is site-oriented or sample-oriented.

This may not be the best behavior, I am open to a discussion.
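
A minimal sketch of how to sidestep the guesswork, relying only on the statement above that 1.18 leaves the format untouched when a newline is given explicitly. The two function names are hypothetical helpers, not part of bcftools; the only thing they pin down is where the newline goes: outside the brackets for one line per site, inside the brackets for one line per sample.

```shell
# Hypothetical helpers (names are illustrative, not part of bcftools).
# An explicit \n means bcftools does not have to guess where the newline goes.

# One line per site: the newline sits outside the per-sample brackets.
gt_per_site() {
    bcftools query -f '%REF[ %GT]\n' "$1"
}

# One line per sample: the newline sits inside the per-sample brackets.
gt_per_sample() {
    bcftools query -f '[ %GT\n]' "$1"
}
```

Usage would be `gt_per_site rmme.vcf` or `gt_per_sample rmme.vcf`, with `rmme.vcf` being the test file from the example above.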

Author

jamigo commented Aug 2, 2023

I must admit that having to write '\n' explicitly in the format was a slightly uncomfortable surprise years ago when I started using bcftools query, but I must also say that I have been taking advantage of it (or rather, of its absence when needed) for years when printing just a few genotypes from a VCF file.

I understand the rationale behind this decision, but as a programmer I wouldn't recommend letting the program decide whether to write a '\n' character depending on the site or sample orientation of the query. Instead, IMHO, I would definitely go for a program option.

In fact, my suggestion would be to keep the old default, where '\n' must be written explicitly, as this is the only way the code stays backward compatible with all previous bcftools versions (so older scripts wouldn't need to be rewritten), and to add an option that writes new lines automatically, as e.g. 'perl -l' does. If, for any reason, you still want to insert '\n' by default, then I would definitely consider adding an option to force the older behaviour if needed.

Right now, in the original example where I need to print only 3 genotypes in a row, the only thing I can do is pipe bcftools query's output to tr, as there is no longer a way to get the previous output directly from bcftools query:

$ bcftools query -f '[%TGT] ' file.vcf
A/C
C/G
G/T

$ bcftools query -f '[%TGT]' file.vcf | tr '\n' ' '
A/C C/G G/T
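
As a possible variant of the tr workaround above, the lines can also be joined with the POSIX paste utility. This is only a sketch: `genotypes_row` is a hypothetical helper name, and unlike tr it drops the trailing space and ends the row with a newline, so the output is not byte-identical to what older bcftools versions printed.

```shell
# Hypothetical helper: join the lines read from stdin into one
# space-separated row, terminated by a single newline.
genotypes_row() {
    paste -s -d ' ' -
}
```

It would be used as `bcftools query -f '[%TGT]' file.vcf | genotypes_row`.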

Thank you anyway for opening this issue for discussion.

Member

pd3 commented Aug 2, 2023

Backward compatibility was a big concern. However, the decision was to change the behavior anyway, as it is extremely unlikely that anyone is using expressions without a newline in automated pipelines; it's just too impractical for the vast majority of VCFs out there.

There are two counterarguments to the perl -l example. First, typing an extra \n is exactly the same amount of work as typing another option -l. Second, it would not fulfill the primary aim of the change: to protect users from accidentally streaming an entire VCF to their terminals, which unfortunately happens a lot.

However, I accept the proposal to add a backward compatibility option.

Also I agree that whenever the program does an automatic insertion of the newline, the behavior must be very clear and understandable, which currently it is not.

Therefore I propose new default behavior:

  • if the expression contains a newline character, do nothing
  • if there is no newline character and -N, --disable-automatic-newline is given, do nothing
  • if there is no newline character and -N is not given, insert newline at the end of the expression
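
The three rules above can be sketched as a small shell function. This is an illustration of the proposed decision logic only, not the bcftools source: `effective_format` is a hypothetical name, its second argument stands in for the -N flag, and it assumes the newline appears in the expression as the two-character escape `\n`.

```shell
# Hypothetical model of the proposed rules (not taken from bcftools).
# $1: formatting expression as typed, e.g. '[%TGT] '
# $2: 1 if -N / --disable-automatic-newline was given, otherwise 0 (default)
effective_format() {
    fmt=$1
    disable=${2:-0}
    case $fmt in
        *'\n'*)
            # Rule 1: the expression already contains \n, do nothing.
            printf '%s' "$fmt"
            ;;
        *)
            if [ "$disable" = 1 ]; then
                # Rule 2: no \n, but -N was given, do nothing.
                printf '%s' "$fmt"
            else
                # Rule 3: no \n and no -N, append \n to the expression.
                printf '%s\\n' "$fmt"
            fi
            ;;
    esac
}
```

For example, `effective_format '[%TGT] '` prints the expression with `\n` appended, while `effective_format '[%TGT] ' 1` prints it unchanged.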

Author

jamigo commented Aug 2, 2023

This sounds perfectly reasonable to me. Backwards compatibility would then only require adding '-N' if needed. Again, thank you for the discussion and for the great work.
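
A minimal sketch of what that could look like in an existing script, assuming a bcftools build that implements the -N, --disable-automatic-newline option proposed above. The wrapper name `query_no_auto_newline` is hypothetical; it simply passes -N ahead of whatever arguments the script already used.

```shell
# Hypothetical wrapper: run bcftools query with automatic newline
# insertion disabled, forwarding all other arguments unchanged.
# Assumes the installed bcftools supports -N as proposed in this thread.
query_no_auto_newline() {
    bcftools query -N "$@"
}
```

With it, the original example becomes `query_no_auto_newline -f '[%TGT] ' file.vcf`.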

pd3 closed this as completed in c7cbe0b Aug 3, 2023

Member

pd3 commented Aug 3, 2023

This is now modified in c7cbe0b, as discussed
