-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
faster/simpler version of transpose, with stream-oriented min and max #2679
Conversation
…n and max Define stream-oriented min/1 max/1 so that a much faster and much shorter implementation of `transpose` can be provided, while preserving symmetry between min and max. (Nothing in this change would preclude adding stream-oriented versions of min_by and max_by in the future.)
src/builtin.jq
Outdated
@@ -6,6 +7,8 @@ def sort_by(f): _sort_by_impl(map([f])); | |||
def group_by(f): _group_by_impl(map([f])); | |||
def unique: group_by(.) | map(.[0]); | |||
def unique_by(f): group_by(f) | map(.[0]); | |||
def min(s): reduce s as $x (first(s); if $x < . then $x else . end); | |||
def max(s): reduce s as $x (first(s); if $x > . then $x else . end); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using stream twice causes problem with input
. For example, users will expect echo 1 | jq -n 'min(input)'
to behave as same as echo 1 | jq -n '[input] | min'
, but the former yields break error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the thunk s
is side-effect-free, then this is OK, but a) we do have some side-effect-having builtins as @itchyny points out, b) if and when we add an FFI this will become more of a problem.
I suggest: def min(xs): reduce xs as $x ([]; if length == 0 then [$x] elif $x < .[0] then setpath([0]; $x) else . end) | select(length > 0)[0];
. Similarly for max/1
. We use this reduce xs as $x ([]; ...)
trick where if the reduction state ends up being the empty array then that means there were no xs (xs
was empty
).
BTW, I like the Haskell convention of using one letter for the item and "pluralizing" it to get the name of the stream we're iterating. xs
is the plural of "x" ($x
). I will be adopting that convention in my jq code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And not evaluating twice to produce the first $x
also improves performance even for side-effect-free xs
. Probably not much in most cases, but hey.
min(1,2,0.1) | ||
null | ||
0.1 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests with max, and also with empty
as its argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/builtin.jq
Outdated
| reduce range(0; $max) as $j | ||
([]; . + [reduce range(0;$length) as $i ([]; . + [ $in[$i][$j] ] )] ) | ||
end; | ||
def transpose: [range(0; max(.[]|length)) as $i | [.[][$i]]]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very readable. Thanks!
It turns out that the boxing technique is noticeably faster than using setpath. Also revise transpose to use max/0 as that is faster than using the new max(s).
@@ -6,6 +7,19 @@ def sort_by(f): _sort_by_impl(map([f])); | |||
def group_by(f): _group_by_impl(map([f])); | |||
def unique: group_by(.) | map(.[0]); | |||
def unique_by(f): group_by(f) | map(.[0]); | |||
# max(s) and min(s) use boxing technique for the sake of `input`: | |||
def max(s): | |||
reduce (s|[.]) as $x (null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version in my earlier comment didn't allocate an array for every $x
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but setpath can slow it down, as it did in my tests. Fixing input
would be so much nicer!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean "fixing input
would be so much nicer"?
As for setpath
... we're using it on a zero-or-one element array, so it should not allocate at all after the first time... I'll check that out later.
Add a test that min(null) yields null. |
@itchyny @nicowilliams - Thanks for your reviews. I've done some performance comparisons, and have updated this branch accordingly. Specifically, according to my one-platform timings:
So transpose now uses max/0, which still makes it MUCH faster than the old (1.6) transpose, but it does In the spirit of supporting stream-orientation, though, I propose Should there be an ER to change |
The last part can be .[], using select & .[0] is faster? |
@itchyny wrote:
But we have to take care of the case of |
That's rather surprising. |
@nicowilliams wrote:
As I mentioned somewhere, I was just using some simple test cases. As also mentioned, I'm hoping this kerfuffle will lead to a rethinking of |
There is no problem with |
def max(s): | ||
reduce (s|[.]) as $x (null; | ||
if . == null then $x | ||
else if $x > . then $x end # for speed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Omitting else clause is a new feature, so we shouldn't use it in builtins? I'm ok with this anyway.
@nicowilliams wrote:
That's an odd thing to say here, as the whole basis of this kerfuffle is the difference between the two. It would simplify so many things if However, I hold little hope that |
Both
There is no need for |
Before I added When there is no choice but to do multiple evaluation that should be documented in the builtin's docs. In this case there is an alternative to multiple evaluation. Now, a pure-functions-only jq was very useful, and a jq that has impure functions is a bit harder to use in some cases -like this one-, but the bit of impurities I added allow an online alternative to I admit that losing purity is sad, but mature functional programming languages generally have ways to have side-effects. The trick is generally to segregate those (like What I'd like to do about this at some point is have function attributes indicating whether a function is pure, and then we could have an |
@nicowilliams wrote:
Yes, I understand that my original proposal for min/1 and max/1 is fundamentally incompatible with |
@nicowilliams - FYPI: These are the u+s timings I got for nicomax (your original version), a setpath-free version thereof, and the "boxed" version. All three defs are shown below.
The curious thing is that when computing Using jq-1.6-226-g7d424fd-dirty
|
That's not the issue. The issue is that
Now, suppose you have a function
But now suppose that End of stream has nothing to do with anything. It's the side-effect of reading from the input stream. It's a lot like |
@nicowilliams - Rest assured, I understand that my original proposal for min/1 and max/1 is fundamentally incompatible with |
After all, what's the benefit of using reduce? On my laptop, the reduce definition of min runs 2.5 times slower than the following definition. def min(s): [s] | if . != [] then min else empty end; |
That's pretty funny, and sad. Surely there's some number of outputs of |
Well, and of course |
@itchyny wrote:
Indeed, in the case of And indeed, jq's But I thought it would be nice to leave the stream-oriented min and max in because, as @nicowilliams has often elaborated/explained/emphasized, stream-orientation is a goal in itself, e.g. to avoid memory issues. Consider also the potential short-circuiting possibilities. And the overhead of boxing is not so great. Postscript re gojq: It's interesting that Here are some timings, with apologies for the slightly cryptic presentation. (All the programs use gojq [] gojq max/0 gojq max using reduce naively |
This PR is superseded by #2758 Streaming versions of min/max can be the subject of a future PR. |
Define stream-oriented min/1 max/1 so that a much faster and much shorter implementation of
transpose
can be provided, while preserving symmetry between min and max.(Nothing in this change would preclude adding stream-oriented versions of min_by and max_by in the future.)
See #2472