We need persistent floating-point rounding mode control #1384
The best way to fit new rounding behavior into WebAssembly would be as new instructions (rather than an instruction prefix or a persistent rounding mode or something like that). Could you list the new rounding instructions you would like to see?
@tlively Fair enough. I've never been a huge fan of prefixes, either. To be clear, there's rounding in the sense of "round an existing float to an integer", and then rounding in the sense of "nudge the units-of-the-last-place (ULP) according to the infinitely precise result".

In the first case, we can get by with the usual suspects: (a) round nearest-or-even, (b) round toward negative infinity, (c) round toward positive infinity, and (d) chop (round toward zero).

In the second case, we can apply the same rounding policies, but to the ULP instead of the integer part. Implicitly, we first of all have to know the maximum error bounds on the infinitely precise result, because if they're too big, then the ULP is still undetermined. For arithmetic operations, this is straightforward and already done right in hardware. For operations which may produce irrational results, it's first necessary to know the error bounds on the truncated portion (due to series truncation). One hopes that the CPU designers did this right. I have serious doubts about that, but it might be correct for fsqrt, which AFAIK is your only such operation. You probably can't ever support other transcendental functions in a well-defined and cross-platform way, because the CPU designers probably didn't really nail the error terms. (Said functions would need to be implemented by the programmer in terms of your existing arithmetic operations. Slow, but well-defined, cross-platform, and futureproof. Lots of cores, lots of free cycles, no worries. The only other option would be to hope that all CPU vendors implement the same transcendental functions in exactly the same way. Things get really complicated if that happens to be true, because then they're all wrong in the same way, which is why I think the software-emulation approach is a necessary evil.)

Obviously, the first case can be done after the fact, because we just need to dispose of the fractional part. But in the second case, everything needs to happen in one instruction, because once an instruction completes, any extra bits of precision have been forgotten. So if you want individual instructions, you need stuff like "multiply and round the ULP toward negative infinity" and "square root and truncate everything after the ULP". The good news is that, if you do it this way, then it will fit like a glove with nVidia, which is probably more relevant going forward than AMD64. The bad news is that you'll have to employ some smarts to avoid touching the control word too often on the latter. And by the way, all your existing arithmetic instructions would now need to be suffixed with "and round nearest-or-even".

I'll pause here. Are we on the same page?
Yes, the philosophy here sounds good. We've had similar discussions over in the SIMD proposal. The next part is making sure that the proposed instructions have (reasonably) efficient lowerings on relevant architectures (at least x86/x86-64/ARM). I'm not sure how nvidia fits into the picture, though. WebAssembly doesn't typically run on GPUs. It would also be good to mention what kind of languages or applications would benefit from having these instructions, in order to motivate working on them.
"WebAssembly doesn't typically run on GPUs." -- It certainly does, in 2029. But allow me to suggest a potentially better solution for the second case. It seems to me that the only practical application for ULP rounding is interval math, full stop. Therefore, you could cut down the explosion of instructions and improve performance by providing native interval support. So "divide 64-bit floats" becomes "divide 64-bit float intervals". Yes, there are lots of corner cases, but by doing it this way, you would solve all those problems once and for all, allowing higher-level languages to translate interval types directly into WASM. I am entirely sympathetic to the need to avoid CISC nonsense, but intervals are so powerful and so universal that I think it's worth being a little CISCy in this case. The other reason to implement native intervals is that higher-level languages do not, to my knowledge, have instructions for ULP manipulation. In many languages, it wouldn't even mean anything because floats are undefined garbage anyway. But it's easy to imagine adding macros to support basic interval functions in popular languages, which then filter straight to the WASM layer. No chance for creating super subtle bugs due to misguided ULP manipulation. If I recall, an Arianne 5 rocket blew up back in the naughts due to floating-point imprecision which would have been prevented with intervals. And if I'm not mistaken, the safety of the Large Hadron Collider was proven with interval analysis, as well. But it's not just about safety. Computation can be exponentially accelerated, in some cases, using intervals, simply because lower precision is often acceptable, and the time savings propagate to many levels of the function tree. This facilitates much faster raytracing, for instance. Fluid dynamics stand to benefit as well. (This is all abstractly related to the Shrinking Bullseye algorithm.) I'm intrigued by your comments regarding SIMD, but I'll leave that for another thread. |
It's not clear to me why interval arithmetic should be a part of the WebAssembly ISA as opposed to a software library that compiles to existing WebAssembly instructions. Do native architectures support interval arithmetic directly? If not, there's probably no benefit to building it into WebAssembly.
These are the reasons that intervals should be supported in WASM:
If the only acceptable solution is 4 different rounding flavors of your existing primitives, then I'll take it, but the path to fewer bugs is just to implement intervals at the bottom. This could be extended to SIMD at some point, as well.
Wasm is generally designed as a minimal abstraction over hardware, so the design pressure is to expose (standardized) versions of what is in hardware, and not build the next higher-level concept unless it is impossible not to. In this case it would seem that rounding modes would be both closer to hardware and the more minimal addition. It's my understanding that the rounding mode control design in IEEE 754 is meant for this purpose, and has other purposes too, outside of interval arithmetic.
If you do it that way, then at least the compilation process would be straightforward, along with extension to SIMD. This also means that interval implementation will need to be done by porting these rounding-aware arithmetic primitives to higher-level languages, which is more of an uphill battle. But OK, what's the next step?
If I understand correctly, you have the same problem with high-level languages for native compilation, so I don't think that's a Wasm-specific problem that Wasm needs to solve.
@Brion Well, as I said above, I'm happy enough with the idea of just implementing rounding-aware arithmetic primitives in WASM. The general problem with ULP manipulation, whether you solve it with such primitives or full-blown interval support, is that WASM is likely to become the means by which distributed computing is made available to HLLs. So in effect, it is the hardware. You can't go directly from HLL to CPU/GPU machine code and have it run on anonymous targets all over the world. (Imagine trying to run a distributed computing network which mandates that you need nVidia XYZ GPU in order to even participate.) So you have to go to middleware first, of which the obvious (and perhaps only) choice is WASM. So if the HLL programmer wants a feature that is efficiently implementable in hardware, but WASM doesn't support it, then he's stuck doing nonperformant emulation.
@KloudKoder, can fused multiply-add help here?
@Yaffle Can you explain your reasoning as to why multiply-add would help? For instance, how would this accelerate the process of adding 2 intervals, given that presumably it's still going to do nearest-or-even?
it helps to find the error of the multiplication of two f64 numbers or the division of two f64 numbers.
@Yaffle Wait a second, are you talking about fma (which I just realized isn't actually in the spec, if the search function is to be believed) or fmax/fmin? Just starting with addition, we have:

(a, b) + (c, d) = (RoundNegative(a + c), RoundPositive(b + d))

How that would involve a multiplication, I don't presently see. And multiply-add takes 3 parameters, not 1. Clearly, I'm misinterpreting your notation.
@KloudKoder, fma, that is, fused multiply-add. I have updated my notation.
@Yaffle I think I finally understand your point. What you mean is that Intel's various FMA instructions compute a * b + c to infinite precision, and then round nearest-or-even. (They don't round after just computing the product.) This means, for one thing, that given product = FMUL(a, b), an FMA can compute error = ab - product exactly; the product is by definition off by no more than half a ULP, so you need only nudge the result by 1 ULP in order to obtain an upper bound on the error.

This is a valid approach, but ULP nudging requires intensive overhead: you need to first treat the result like an integer, then add your ULP (as an unsigned 1), then see if you wrapped into the exponent field, or perhaps created a NaN, etc., then potentially do a rollback and update the result to infinity or whatever. There are other ways to do this, but AFAIK they also involve painful overhead. And do note that I haven't demanded the tightest interval -- just a "tight enough" one for practical precision requirements. So do you have a fast way to nudge? If not, then unfortunately it's back to my original point, which is that intervals are nonperformant under the current ABI, relative to what's actually possible on real hardware.

Also, note that I've written FMA above, which alludes to the Intel instruction set. WebAssembly's fma, by contrast, doesn't seem to exist. Do you have a link? It doesn't come up in the doc search on the website.

Finally, from your link: "addition/subtraction/multiplication/division/sqrt including the rounding mode specification to the command itself" -- that's precisely what this proposal is all about. It's great to have that capability, but we need a direct way to exploit it. I'm happy to have my pessimism disproven...
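In C terms, the trick under discussion looks roughly like this (a minimal sketch; fma() is the C99 <math.h> function, and mul_error is a hypothetical helper name):

```c
#include <math.h>

/* Exact error of a rounded product via FMA: p = a * b is rounded
 * nearest-or-even, and fma(a, b, -p) then computes a*b - p with a single
 * rounding. Because the residual of a correctly rounded product is itself
 * exactly representable, the result is the exact error -- at most half a
 * ULP of p in magnitude. */
double mul_error(double a, double b, double *product) {
    double p = a * b;
    *product = p;
    return fma(a, b, -p);
}
```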
@KloudKoder, anyway, you are right that the overhead could be big.
@Yaffle I now see f64.fma under "Additional floating point operators". I'm willing to assume that it will be adopted as an instruction. And I definitely wouldn't demand the tightest interval, which is unlikely to offer any practical advantages. Now, looking at your alternative nudge: x + ABS(x) * (2**-53 + 2**-105), it's clear that the point of this is to add to the result (x) its own magnitude, shifted down so that its leading 1 appears in the ULP. I'm not entirely sure why we need the 2**-105 -- maybe to break the tie case of nearest-or-even in favor of a nonzero nudge -- but in any event, here's what we would need to do:
This does in fact work. It also disrupts the pipeline badly, resulting in a long series of essentially serialized instructions. This could be partially overcome by weaving together a whole bunch of interval operations which can occur in parallel. Thus it still seems that supporting instructions with builtin rounding modes is far cheaper. One could also just ignore the trend toward interval arithmetic, and thereby make WebAssembly less useful in scientific computing. That would be a shame, though. Thanks for helping to demonstrate the scope of this problem so thoroughly. Where to go from here?
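For reference, the nudge formula just discussed, as a minimal C sketch (nudge_up is a hypothetical name, and the zero, denormal, and infinity corner cases still need the separate handling described above):

```c
#include <math.h>

/* Upward nudge of a normal, nonzero f64 without integer manipulation:
 * |x| * 2^-53 lies between half a ULP and a full ULP of x, and the 2^-105
 * term breaks the nearest-or-even tie, so the sum lands at least one ULP
 * above x. */
double nudge_up(double x) {
    return x + fabs(x) * (0x1p-53 + 0x1p-105);
}
```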
What if I write the Intel/AMD support for you? Would that offer you maintainers enough motivation to port it to other platforms and make it part of the spec? We can argue over the specifics and corner cases, but there's no point in even discussing them if my proposal here isn't sufficiently compelling.
@KloudKoder thanks for offering to help implement :) Unfortunately there's a lot more than implementation that goes into getting a new proposal standardized, so this is unlikely to advance without more of the community showing interest. See one of my previous comments on why nice-to-have features often don't make the cut, even if they don't seem to have any technical downsides.
@tlively I read your previous comment and I understand why new opcodes are a hard sell. In this case, that's particularly vexing, because intervals have suffered this chicken-and-egg adoption problem since the day they were first proposed for use with IEEE floats. Basically, there's a great temptation to avoid using them in high-level languages because they're so slow, but no one wants to do the dirty work of optimizing their performance by creating intrinsic support because (surprise!) no one finds them appealing to work with (unless they're desperate for error containment).

At the end of the day, making intervals efficient and trivially accessible is a matter of ensuring data security -- just in the analog instead of the digital realm. We're wasting nVidia's (and perhaps others') great contribution of interval-ready floating-point instructions. One glimmer of hope is that there are already public specs on the expected behavior, available both as a comprehensive body of axioms and as a "diet" version for basic arithmetic. So at least some aspects of the expected behavior are already defined for us.
In this case adding interval arithmetic to WebAssembly wouldn't help that chicken-and-egg problem because it would not be any faster than an interval arithmetic library compiled to WebAssembly unless there is underlying hardware support. I recommend going the easier route of writing such a library for now, and if common hardware natively supports interval arithmetic in the future, it will make sense to revisit adding it to WebAssembly as well.
@tlively I get that, but it's not accurate. nVidia has hardware interval primitives, in the sense of floating-point instructions with builtin rounding modes. They aren't accessible via WebAssembly. The Intel way involves setting a control word first, which means you need to do a bunch of high limits in series, then a bunch of low limits, in order to minimize control word manipulation. Writing a library doesn't help, because this is something that needs to occur on the compiler level. For example, I could create macros in a library which do interval operations in AVX2 or X87, but unless I control the compiler, I can't get it to do those reorderings so as to avoid all the crippling control word manipulation.

You may see WebAssembly as a way to accelerate websites. While I don't disagree, I also see it as a global distributed computing platform just waiting to happen. Golem, for instance, has already enabled that in some hacky way, but it could be so much more if it ensured analog security by including this type of support. This is particularly important when attempting to ensure that results obtained via different platforms (including those not using WebAssembly) are consistent with one another, as floating-point results of complicated computations are seldom exactly equal. No intervals means this effectively can't happen, and WebAssembly never reaches its true potential. So there's a lot more to this than just "please make this fast".
Oh I see, I didn't realize this was already supported by Intel hardware. How widespread is the support?
It is pretty much ubiquitous on CPUs, not only on x86; it is what the standard C fenv functions (fesetround and friends) manipulate.
@penzn is correct. This type of rounding support has been present on Intel for decades. As he indicated, it's a terrible design, because changing the rounding mode requires a floating-point pipeline stall. nVidia had the foresight to integrate the rounding mode directly into its instructions, so for instance you can "add and round toward positive infinity" or "multiply and chop toward zero". What I'm advocating here -- if not comprehensive interval primitives -- is at least a set of opcodes and/or prefixes which will allow efficient access to the relevant nVidia instructions. Those opcodes and/or prefixes can then be at least somewhat efficiently mapped to Intel by grouping same-rounding-type instructions into dense sequences, so as to minimize rounding mode manipulation. However, this could be done later as a performance enhancement; the first release could just change rounding modes literally as specified by the WASM code.
I don't think I indicated that ;) It is the reality of the CPU FP environment, though maybe it could have been designed better.
If I understand things correctly, it exists on Arm as well, at least the same fenv functions work there. As usual, there might be discrepancies between the two on how different instruction groups interact with rounding mode. There is a discussion about better IEEE 754-2019 support in Future Features:
This is similar to the original idea for "relaxed SIMD" fpenv: #1401 (comment). A disadvantage of first-class interval support is that it would require runtime optimization on CPUs, while we don't yet have a viable GPU implementation where it could be organically supported. However, the idea in "Future Features" suffers from the same issue. An advantage of interval math is that it can reduce instruction count (two for the price of one), but to be useful it might need to be relatively high-level. For example, a usable interval math library requires more operations than listed above, and the additional operations might not be easy to implement without them also becoming instructions. Another issue is scheduling: to do a few low or high limits in a row, parts of different interval instructions would need to be scheduled together, which requires even more optimization in the runtime or a new execution model.
@penzn Yeah, my wording wasn't good. I meant that you mentioned the control word architecture, which in my opinion is a terrible design. The bottom line is that a modest percentage increase in native Intel instruction count due to changing the rounding mode results in a multifold explosion in execution time due to the pipeline stalls. As you clearly understand, the only partial mitigation for that would be to group instructions by rounding mode to the extent possible.

As much as I'd be thrilled to see complete interval support in WASM, it's unnecessary. The nVidia building blocks would be enough for practicality. (I wonder how the Apple M1 does it, for that matter.) I understand that there's no GPU support yet, but I would expect that to be in WASM's future evolution. Why do you think that interval support would need to be more high-level? Even transcendental functions (e.g. log) can be straightforwardly computed from basic interval arithmetic.

As to the Future Features proposal, it would be a great help toward efficient interval implementation. I am, however, opposed to anything that would set result flags (e.g. zero, infinity, NaN) unless that step can be bypassed somehow, such as by writing the flags result to a write-only trash register. The reason is that it adds major overhead to evaluate those flags, even on a hardware level. I suspect that modern FPU pipelines know when they can skip this step, such as when 2 floating-point instructions occur in a row, eliminating the need to set flag bits based on the results of the first one.
What do you think is the minimal set of interval operations which would be implementable and beneficial on CPU as well?
@penzn I can hardly think of a more relevant question for this thread. First of all, there are really 2 basic architecture choices:

1. A single set-rounding-mode instruction whose setting persists across subsequent floating-point instructions.
2. A separate rounding flavor (or prefix) of every roundable floating-point instruction.
I would gravitate strongly toward option #1 because (a) it will be simpler to optimize for Intel (even though theoretical performance will be identical in both cases) and (b) I don't think anyone wants a massive zoo of similar animals with slight differences in their stripes. So assuming that we go with this option, I would strongly caution against the use of any "set floating-point environment" (SFPE) instruction which simultaneously affects exception masking. That would burden rounding mode changes with lots of other logic. Separate these functions across different instructions. (SFPE instructions are OK for the sake of compatibility, but they should be clearly marked as nonperformant.) Critically, neither SFPE nor set-rounding-mode should accept variable inputs. (Imagine having to ask "What's the rounding mode?" prior to every instruction so you know what to execute on nVidia!) Then the question becomes: what is the minimal set of instructions to be worth discussing here? Probably:
Just for the sake of consistency, we might want the first 4 opcodes above to be assigned in a manner consistent with Intel's RC bits in the FPU control word.

#5 takes a floating-point (FP) value, then returns the smallest-magnitude FP value which, when added to the input with chop rounding, would change its value. (Implicitly, ULP(X) has the same sign as X.) If the input was either sign of zero, then the output shall be FLOAT_MIN (the smallest denormal) of the same sign. If the input was either sign of infinity, then the output shall be unchanged. ULP comes up all over the place in interval arithmetic, so there's a fundamental justification for this instruction. For example, it would be helpful in case one needs to multiply a ULP by another FP value in order to compute a compound error term, as would occur when efficiently computing (Y log X) via (truncated) Taylor series. It's also used all the time in order to know when to stop computing such a series.

#6 decrements the mantissa (without changing the sign), decrementing the exponent as well if the mantissa wraps. It may result in a denormal value. If the input was either sign of FLOAT_MIN, then the output shall be zero of the same sign. If the input was either sign of zero or infinity, then the result shall be unchanged.

#7 increments the mantissa (without changing the sign), incrementing the exponent as well if the mantissa wraps. If the input was either sign of zero, then the result shall be FLOAT_MIN of the same sign. If the input was either sign of infinity, then the result shall be the same. If the input was either sign of FLOAT_MAX, then the result shall be infinity of the same sign.

#8 returns the sign bit, which is important if there's a semantic difference between negative and positive zero, which would not be captured by a traditional "get sign of FP operand" instruction. Note that this deals with the sign bit, which in the aforementioned case actually differs from the (arithmetic) sign.

#9 is important for interval division. For example, the reciprocal of [2, 3] is [(1/3), (1/2)] (at least (1/3) and at most (1/2)). But the reciprocal of [-2, 3] is [(1/3), (-1/2)], meaning at most (-1/2) OR at least (1/3). (Inverted limits imply an open interval, wherein a finite segment is missing from the real line.) The reason is that [-2, 3] contains zero. Moreover, it contains points both to the left and to the right of zero, so the reciprocal approaches infinity of either sign. We need to know if the signs are opposite so we can determine whether or not to exchange the limits. Without this instruction, it's lots of branching with sign extraction just for the basic arithmetic function of division.

#10 comes up all the time, although good examples escape me at the moment. Without this instruction, it's a lot of pipeline stalls.

I don't mind if the above instructions are undefined for NaN operands, or if they behave as though the NaN is a normal FP value. No point in accelerating broken math. Of course, any undefined operation will potentially leak information about the underlying hardware, and will defeat the possibility of result verification through comparison.
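To make the cost concrete, here is a software rendering of #7 for f64 in plain C (a sketch under the corner-case rules above; nudge_away_from_zero is a hypothetical name). The bit-twiddling below is exactly the overhead a native instruction would remove:

```c
#include <stdint.h>
#include <string.h>

double nudge_away_from_zero(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    if ((bits & 0x7fffffffffffffffULL) == 0x7ff0000000000000ULL)
        return x;  /* either sign of infinity: unchanged */
    bits += 1;     /* mantissa + 1; a wrap carries into the exponent field,
                      which is exactly the desired behavior */
    memcpy(&x, &bits, sizeof bits);
    return x;      /* zero becomes FLOAT_MIN of the same sign; FLOAT_MAX
                      becomes infinity; NaNs are treated as ordinary bit
                      patterns, per the NaN allowance above */
}
```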
Speaking of experts in interval arithmetic, this is the video that got me thinking about the importance of analog security way back in 2007. The issues raised in the video are even more relevant today because, frankly, so little progress has been made. It's time to change that. The speaker is G. Bill Walster. I'll cut right to the chase, but it's worth watching the full video: https://www.youtube.com/watch?v=U_Cq_h1vrD0&t=680s I would reach out to him, but I'm not able to due to security constraints on my end. Perhaps one of you would kindly inform him of this thread. He would make an ideal reviewer.
Disregard my previous comment about denormals; I think we can fully embrace them here, since that is what Wasm does overall (also see WebAssembly/simd#2). The operations you've listed map to standard math operations, so we should probably just support them. I suspect CPU-side interval arithmetic libraries build on those today anyway.
I mean new instructions in addition to existing instructions in the spec. Let's say for multiplication we will need to add a new instruction. I think there is enough material for at least a CG discussion, to go over the operations in #1384 (comment) and decide which way of encoding the rounding mode is preferable. @KloudKoder, can you attend one of the meetings (see CG ones here)? You will need to create a Community Group account and dial in via Zoom -- the meetings are open to anyone with an interest in participating.
Thank you for the link -- the video is very informative! I am not sure what the W3C process is for requesting external reviews; we can bring this up as well.
@penzn OK on denormals. Horrible legacy, but let's continue with it if it already works in WASM.
How about this: instead of creating 3 new instructions for every existing roundable instruction, we just create 3 instruction prefixes (round negative, round positive, and chop). No prefix implies nearest-or-even, so it's backward compatible. Redundant prefixes, prefixes before nonroundable instructions, attempts to define return or jump targets immediately following a prefix, or code which terminates in a solitary prefix would all get busted by the linter at compile time. (I'm just wary of the current and prospective explosion in instruction definitions implied by 4 different rounding modes, especially when we consider the incessant flood of new SIMD that seems to be a reliable industry dynamic.) But this is not a red line for me. If the team wants to handle multiple instruction flavors in lieu of implementing prefixes, then so be it.
Link, please. Zoom is a giant security hole, but I'll see what I can do. I do understand the value proposition, though.
I just realized that we also need: NUDGE POSITIVE and NUDGE NEGATIVE (signed nudges, defined below).
The reason is that, when computing a Taylor series for the sake of estimating a transcendental function, one often encounters a situation wherein the error due to truncation is known to be smaller than a ULP in magnitude, but of unknown sign. In such cases, we need to subtract a ULP from the low limit and add a ULP to the high limit. (By definition, we don't need to consider rounding implications when adding or subtracting a ULP, because the result is exact, although some corner cases exist, e.g. the ULP of zero is epsilon, the smallest denormal.) This requires signed nudges rather than zeroward nudges.

To those who, like me, hate adding instruction definitions just for the sake of aesthetics, I will preemptively admit that all nudges are technically unnecessary, because one can simply load a ULP (instruction #5), then add it to or subtract it from some other float. But that requires explicit use of a precious register and several instructions. Furthermore, nudges are so fundamental that I believe they deserve their own instructions. It would not at all surprise me to see them in hardware as the semiconductor industry finally wakes up to the relevance of intervals in physical security. That would speed things up and also get rid of the need to waste time and a register evaluating the ULP.

Note also that nudges usually come in pairs. This means that, under the hood, there's an implicit "load ULP" occurring over and over again. Be careful! This isn't necessarily one and the same ULP, so despite appearances, they cannot in general be elided away.
Link to join the CG: https://www.w3.org/community/webassembly/. You need to add an agenda item (open a PR) to one of the CG meetings; 4/27 seems to be quite open. There are also instructions on getting an invite to the call in the meeting files: https://github.com/WebAssembly/meetings/tree/master/main/2021. Let me know if you have any questions, I'd be glad to help.
Separate operations allow for additional flexibility in how they can be implemented, though there are other considerations as well, for example performance gains (or lack thereof). At any rate, this can probably be turned into a proposal first, and then we can sort out the exact instruction list.
@penzn Thanks. I'm waiting on a reply to my CG signup request. I will post back to this issue when I have a meeting date.
@penzn FYI, still no reply to my request.
We might need to ask @dschuff about it. |
@penzn Thank you, it's in process now, pending the resolution of a technical issue with signing up. Will report back.
@penzn The person that was assisting me seems to have disappeared. As of now, I finally have a W3C account, but I'm wondering how to schedule a meeting. Do I make an edit to one of the MD files here, or what? https://github.com/WebAssembly/meetings/tree/master/main/2021 This has got to be the most awkward way of scheduling a meeting I've ever seen, but I get it: we're dealing with a standards body and have to be painfully formal.
@KloudKoder yep, that is correct -- edit one of the files and submit a PR with the changes. Did you join the community group? The link is https://www.w3.org/community/webassembly/join (it can be found on the CG page). I don't remember if you get added to CG meetings automatically or still need to email the chair. @dtig, is there space for this in the April 27 meeting, or would it be completely consumed by discussions on layering and linking? Also, if you have access to meeting invites, can you please take a look at @KloudKoder's situation?
@penzn Looks like we're set for May 11, 2021, assuming I did this PR properly?
I think you need to update the PR with how much time you are planning to take to present, or comment on it; then it will be official: WebAssembly/meetings#756 (comment)
Notes for the CG meeting today (nothing etched in stone):

LOAD ULP (LULP) Load the units-of-the-last-place of an f32 or f64. (This instruction works analogously on f32 and f64.) It's important for interval operations whose error bounds are a function of the last term computed in a Taylor series. So for instance one could multiply that last term by the ULP, then widen the interval by that much on each side. It's very expensive to compute with logic instructions and appears all over the place in routine interval calculations. There are a few different cases:
LOAD HALF ULP (LHULP) There are distinctly different roles for LULP and LHULP. LULP is used to compute final error bounds, but LHULP is used to test the current Taylor term for triviality. Once the current term becomes smaller than half a ULP, it's often known that the maximum error is either a full ULP, or the product of a full ULP and the current term. It has different corner cases due to discontinuity of exponential domains, which is why it's very expensive to compute from the programmer's level of abstraction.
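A hypothetical C rendering of LULP, to pin down the corner cases (load_ulp is an assumed name; nextafter() from <math.h> hides the exponent extraction a software version would otherwise need):

```c
#include <math.h>

/* ULP(x) with the sign of x: either sign of zero maps to the smallest
 * denormal of the same sign, and infinity passes through. Caveat: at
 * FLOAT_MAX this sketch overflows to infinity, where the instruction as
 * specified would return the true ULP (2^971 for f64). */
double load_ulp(double x) {
    if (isinf(x)) return x;
    double away = copysign(INFINITY, x);  /* step away from zero */
    return nextafter(x, away) - x;        /* exact: adjacent floats */
}
```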
NUDGE INWARD (NIN) This instruction is used to expand an open interval by moving one of its bounds inward toward zero by a ULP. (Open intervals are implied by their bounds being listed in reverse. They mean "everything on the real line except for this interval".) Note that [-0.0, -0.0], [+0.0, +0.0], and [-0.0, +0.0] all mean "only zero" while [+0.0, -0.0] means "everything except zero".
As you can see, it's quite expensive. NUDGE OUTWARD (NOUT) This instruction is used to expand (closed) intervals by moving one of their bounds outward from zero by a ULP.
NUDGE POSITIVE (NPOS) This instruction is used to increase an interval bound by the upper bound of an error term known to be positive but less than the absolute value of a ULP. This instruction may seem trivial, but it's not, because there are fast ways to add or subtract a ULP without going to the trouble of creating a ULP in its own register.
NUDGE NEGATIVE (NNEG) This instruction does the opposite of NPOS.
GET SIGN BIT (GSB) This is equivalent to returning the arithmetic sign of a float, except that it considers (-0.0) as a negative number. This is intended to help interval code figure out which way it needs to nudge. It's debatable whether this bit should be sent to an integer register or a flag. Perhaps a branch version is as good or better.

SET SIGN BIT (SSB) Copies the sign bit from an integer's bit zero, or perhaps the carry flag.

XOR SIGN BITS (XSB) XORs the sign bit of one float into the sign bit of another. Useful for avoiding lots of branches during interval inversion, such as can occur during interval division.

COPY SIGN BIT (CSB) Copies the sign bit of one float to the sign bit of another. (Is this the same as CopySign? I think we need an instruction for the sign bit, and another for the arithmetic sign.)

ROUND NEAREST-OR-EVEN, CHOP, ROUND NEGATIVE, ROUND POSITIVE Are these prefixes, instruction flavors, or part of a global variable describing rounding policy? Prefixes provide the most backward compatibility with the least future instruction expansion. Need to discuss.
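To nail down the intended semantics, here are the four sign-bit helpers as portable C sketches (all names hypothetical); note that CSB matches C's copysign(), while GSB differs from an arithmetic sign test precisely at (-0.0):

```c
#include <stdint.h>
#include <string.h>

#define SIGN_BIT (1ULL << 63)

static uint64_t bits_of(double x) { uint64_t b; memcpy(&b, &x, 8); return b; }
static double   of_bits(uint64_t b) { double x; memcpy(&x, &b, 8); return x; }

/* GSB: 1 for negative values including -0.0, else 0 */
int gsb(double x) { return (int)(bits_of(x) >> 63); }

/* SSB: overwrite the sign bit of x with bit zero of s */
double ssb(double x, int s) {
    return of_bits((bits_of(x) & ~SIGN_BIT) | ((uint64_t)(s & 1) << 63));
}

/* XSB: XOR the sign bit of y into the sign bit of x */
double xsb(double x, double y) {
    return of_bits(bits_of(x) ^ (bits_of(y) & SIGN_BIT));
}

/* CSB: copy the sign bit of y onto x (equivalent to copysign(x, y)) */
double csb(double x, double y) {
    return of_bits((bits_of(x) & ~SIGN_BIT) | (bits_of(y) & SIGN_BIT));
}
```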
I may not be able to make the meeting today, so I'm just getting my opinions out.
For the record, the presentation of this issue to the community group is being rescheduled, tentatively for June 8. This is due to cascading technical failures on my part which prevented me from participating today, despite a valiant attempt by the hosts to assist. I know how to prevent this from occurring next time.
@penzn @dschuff This was all discussed in the latest WASM CG meeting. I must say I was really impressed with the quality of the feedback and the ability of everyone to stay on topic. The bottom line was that the group wants to see some sort of performance comparison in a practical case. I've been tasked with creating an apples-to-apples comparison in some performant language (probably C), not necessarily compiling down to WASM. In particular, the experiment is suggested to involve 2 independently optimized operating modes: (1) an implementation involving the 4 rounding modes above (via rounding mode manipulation macros) and (2) the current state of affairs with nearest-or-even rounding, using whatever hacks necessary to make it produce binary-identical results. There are sort of 3 levels of difficulty:
I would say it's guaranteed that the harder the problem on this scale, the less the advantage of interval enhancements, simply because there's so much more overhead involved with sign inspection per operation. "Hard" would probably come in with marginal speedup, but the others are an open question. It's also worth noting that as the VM overhead approaches infinity, the performance ratio approaches the instruction count ratio. Perhaps I could also add fake VM overhead so we can see how this transition occurs as it worsens. I will get to work on all this in my spare time and report back when I have a test loop to share.
Thinking aloud: it seems to me that we could forget about the ULP (and its integer manipulation, which is really costly) if the following were more performant. When adding/subtracting intervals, just multiply the upper limit by (1+delta) and the lower limit by (1-delta), where delta is the ULP of 1.0 and thus constant. Then do the opposite. Sort the results (which should be possible in a single 4-way SIMD instruction). Take the most negative as the new lower limit and the most positive as the new upper limit. Or don't sort, and just take min and max in successive instructions. (Then again, we might be better off using that SIMD power to do ULP construction on 4 floats at once.)

For multiplication, compute all 4 possible products, then all 8 possible delta-scalings. Sort them and do the same thing as above. For division, it's basically the same unless either interval straddles zero, in which case we end up with open (as opposed to closed) intervals, and that's hell no matter what.
Sorry this has taken so long. Allow me to share a bit of my recent thinking. Let:

E = epsilon (smallest positive)

Now you need to multiply [A, B] by [C, D] (division being even more complicated):
As I suggested previously, the idea here is that you pick S just big enough to overcome the misdirecting "drag" induced by nearest-or-even in all cases, so you don't need any custom rounding modes at all. Genius, right? Unfortunately... it doesn't work! Consider the case where, say, the exact value of AC is just slightly less than E. Then nearest-or-even will bump it to exactly E. You try to correct this by multiplying by (1-S), but the nearest-or-even result is still E due to quantization error. The same holds for most denormals.

So in addition to the above, one could have an ongoing tracker which remembers the smallest nonzero absolute value ever generated. If it falls below some threshold, then you know that the (1-S) trick didn't work, so you need to do a rollback and compute everything all over again using some totally different algorithm. Not. Fun.

So where does this leave us? As far as I can see, we're still in dire need of custom rounding modes, because without them everything is just grotesque (ULP manipulation with exponent extraction and comparison, integer manipulation of floats while checking for wrap, checking for sufficient mantissa space to compensate for nearest-or-even drag, etc.). In other words, I'm still failing to see how to even construct a fair test of with-and-without custom rounding modes in some language and have it be anywhere near fair to the "without" case while still being roughly optimal in both cases. Thoughts?
Forgot to mention: one tempting way out of the (1-S) problem is just to add or subtract some multiple of E from the min and max, then take min and max again in order to deal with signs properly. If I'm not mistaken, this does in fact work. However, we then get into a situation where purely positive intervals can have products that intrude on the negatives. That opens a whole other can of worms, because generally such behavior would be presumed impossible, regardless of precision. It has real codepath effects when it then potentially gives rise to open intervals due to taking reciprocals (of closed intervals that span zero). We could avoid this issue, in turn, by testing limits and manually dealing with corner cases, but then I think we're back to "grotesque".
I would like some feedback from stakeholders on where the performance ratio threshold should be in order to justify passing these proposed instructions into the standard. I'm asking because it's been months, and we still have no proposal from anyone, including me, that even suggests less than a 2X performance ratio. Based on the best design so far, this would also seem to be a conservative lower bound. Implementing a test case seems rather challenging because, depending on the particular strategy, it might involve invoking SIMD builtins specific to a particular compiler, and in any event, some floating-point environment magic. Doing full floating-point emulation wouldn't require any of this, but (1) it would be a huge project in itself and (2) it probably wouldn't result in a realistic comparison.

My sense at this point, having gone down numerous rabbit holes only to arrive at grossly nonperformant alternative implementations, is that these proposed instructions, and above all rounding mode flavors, offer a performance advantage far greater than has been used to justify many other instructions that WASM employs and that have a basis in hardware. In other words, do we really need to go through the motions of a major demo project just to prove what is now, in my view, obvious? Of course, you can argue that no one would ever need interval arithmetic, but I think we're past that stage. The question at hand is: can it be done efficiently, via whatever creative tricks, under WASM as is? And if not, how much acceleration would compel the community to standardize the instructions in question? And finally, in light of that threshold, is it already obvious in the absence of an explicit demo that we will exceed it? We're not in 20% land and never were.
(Apologies for being late to this party...)
I wanted to point out, if this topic is still being considered, that AVX-512 introduces explicit rounding mode selection per instruction. So, when deciding what to do with WASM here, it is worth considering the possibility that typical x64 math code may slowly transition away from global rounding modes to per-instruction rounding modes, based on the adoption rate of AVX-512 over time. The way Intel exposed this in C is by adding additional versions of all the intrinsics with a "_round" suffix, which take an additional constant integer rounding mode enumerant. - Casey
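For illustration, the intrinsic form looks like this (a sketch assuming AVX-512F support; the rounding mode, together with the mandatory suppress-all-exceptions flag, is encoded into the instruction itself, so no MXCSR write occurs):

```c
#include <immintrin.h>

/* Lower and upper interval limits computed back to back with different
 * rounding modes, with no control-word traffic in between. */
__m512d add_round_down(__m512d a, __m512d b) {
    return _mm512_add_round_pd(a, b, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}
__m512d add_round_up(__m512d a, __m512d b) {
    return _mm512_add_round_pd(a, b, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
}
```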
@cmuratori Good contribution! And sorry for the delayed response. I'm honestly kind of exhausted from trying to demonstrate that water is wet. It seems increasingly inevitable that WASM will need to include support for rounding mode flavors of all applicable floating-point instructions. AVX-512 is -- thank you -- yet another reason. My problem is that I'm unable to prove a negative, in the sense that I can't prove conclusively that there's no creative way of abusing nearest-or-even instructions in order to emulate other rounding flavors. I think we're at the point, though, where it's far more likely that not having this support will result in a future development crunch, as the team struggles to catch up with real-world use cases, than that some genius will emerge from the woodwork and prove that all flavors are performance-equivalent to nearest-or-even.
Well, the problem for the development team is cross compatibility. The "correct" way for an instruction set to work is to specify the rounding mode per operation, but not all chips support that, so if WASM were to make that standard, what would it do on chips that don't have those instructions? It can't very well generate FP code that changes the MXCSR or equivalent every other instruction! So I think whatever happens here probably has to be an extension, like SIMD128. You would have to ask, "does this WASM platform support per-instruction rounding modes?" and if the answer is yes, then you run that stream. If the answer is no, then you do not. But I do think it would be smart to have that extension, because although I could be wrong, I get the sense that people realize now that global rounding modes were a bad idea :) - Casey
@cmuratori "what would it do on chips that don't have those instructions? It can't very well generate FP code that changes the MXCSR or equivalent every other instruction!" -- Correct, which is why I propose to (1) detect whether alternative rounding flavors are used at all (usually not) and optimize accordingly and (2) if they are used, then reorder instructions as much as possible, so as to minimize the need to touch the Rounding Control bits. At the end of the day, frankly, it's on the chip designers to suffer the consequences of their own bad architecture. Let's not encumber the smart ones in order to level the playing field for the dinosaurs. I don't think code should ever need to ask "is this supported" because WASM has to check for illegal instructions anyway, so it might as well just emulate as needed (especially in cases like this where emulation is of trivial complexity -- i.e. just change Rounding Control -- despite carrying a hefty performance penalty). I should emphasize that version 1.0 could just change Rounding Control with impunity and not bother regrouping instructions. The point is that such accelerations could happen later if the community ever cared enough for the dinosaurs (and maybe it wouldn't, which is just as well). It comes down to 2 things that aren't going away, namely: hardware support for native rounding flavors and the increasing employment of interval arithmetic in engineering. To clarify my previous post, I very much doubt that anyone will find a way to emulate alternative rounding flavors by some ingenious semantic abuse of nearest-or-even plus trivial overhead. If nothing else, the overhead required to properly deal with the denormal case in an alternative rounding flavor is nontrivial, as in, it easily doubles execution time. Ultimately, I think 5X execution time is realistic, although it's hard to pin down an average in practice. Perhaps it will soon be time to force a vote somehow. |
There's some history to this. Bear with me.
Typical floating-point rounding policies are: chop, round-negative, round-positive, and round-nearest-or-even. Intel made the dumb decision of encoding this policy in the floating-point control word, which means that if you want to change it frequently, you need to manipulate the control word (FLDCW) all the time, which is expensive.
nVidia realized that this was a nightmare for interval arithmetic, which has risen in prominence for scientific computing in the past several years, mainly because it requires constant changes in rounding policy (because, usually, the lower limit goes down while the upper limit goes up after every floating-point instruction). So nVidia did things the right way: they encoded the rounding policy inside the instruction itself.
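To make the cost concrete, one interval addition under the control-word model looks like this in C (a minimal sketch assuming a C99 <fenv.h> environment; interval_add is a hypothetical name, and every direction change is a fesetround() call, i.e. an FLDCW/MXCSR write on x86):

```c
#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* [a,b] + [c,d]: two rounding mode changes per operation, each of which
 * can stall the floating-point pipeline. */
void interval_add(double a, double b, double c, double d,
                  double *lo, double *hi) {
    fesetround(FE_DOWNWARD);
    *lo = a + c;              /* lower limit rounds toward -infinity */
    fesetround(FE_UPWARD);
    *hi = b + d;              /* upper limit rounds toward +infinity */
    fesetround(FE_TONEAREST); /* restore the default */
}
```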
Looking at your spec, it seems like round-nearest-or-even is the only supported mode in most cases. That makes interval math effectively impossible, which will only limit the utility of WASM for scientific applications.
Please allow us to prefix or otherwise alter floating-point instructions so as to specify the rounding mode. On Intel/AMD, you can just optimize the prefix away if it's the same one as before (and no intervening branch targets exist). If the prefix is never used, then everything is backward-compatible to round-nearest-or-even, so the control word never needs to be written; this is easily checked during code validation. But on nVidia, the prefix gets sucked into a bitfield in the instruction.
Please implement this. It will enable more, and more useful, applications than you might currently imagine.