16 bit multiplication always produces zero result #129

carlos4242 · 2019-01-29T23:10:13Z

Another one from the swift team. I'm surprised no one has found this yet.

@llvm.umul.with.overflow.i16(i16, i16) seems to be lowered to incorrect assembly. Specifically, it ends up moving the two 16 bit values into the top two bytes of 32 bit numbers, then calls __mulsi3 to multiply them together into a 32 bit result, from which it takes the top two bytes. This will always produce zero.

In my opinion, if it's using __mulsi3, it should put the two 16 bit numbers into the bottom two bytes of each input and take the bottom two bytes of the output. Any non-zero value in the top two bytes after the multiplication should then be interpreted as an overflow, and the flag set accordingly.
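The lowering proposed above can be sketched in C++ as follows. This only models the arithmetic and is not LLVM or libgcc code; the names MulResult16 and umulWithOverflow16 are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical result type: the 16-bit product plus the overflow flag.
struct MulResult16 {
    std::uint16_t value;
    bool overflow;
};

// Hypothetical model of the proposed lowering (not an LLVM or libgcc function).
constexpr MulResult16 umulWithOverflow16(std::uint16_t lhs, std::uint16_t rhs) {
    // Zero-extend: each operand sits in the bottom two bytes of a 32-bit value.
    const std::uint32_t wide =
        static_cast<std::uint32_t>(lhs) * static_cast<std::uint32_t>(rhs);
    // The bottom two bytes are the result; any set bit above them means overflow.
    return MulResult16{static_cast<std::uint16_t>(wide & 0xFFFFu), (wide >> 16) != 0};
}
```

With this model, 90 * 11 gives 990 with no overflow, while 0xFFFF * 2 gives 0xFFFE with the overflow flag set.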

Here's the llvm ir in a test case as a patch to llvm...

diff --git a/test/CodeGen/AVR/umul.with.overflow.i16-bug.ll b/test/CodeGen/AVR/umul.with.overflow.i16-bug.ll
new file mode 100644
index 00000000000..12c4030f943
--- /dev/null
+++ b/test/CodeGen/AVR/umul.with.overflow.i16-bug.ll
@@ -0,0 +1,39 @@
+; RUN: llc -O1 < %s -march=avr | FileCheck %s
+
+target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-apple-macosx10.9"
+
+%Vs6UInt16 = type <{ i16 }>
+%Sb = type <{ i1 }>
+
+define hidden void @_TF4main13setServoAngleFT5angleVs6UInt16_T_(i16) #0 {
+entry:
+  %adjustedAngle = alloca %Vs6UInt16, align 2
+  %1 = bitcast %Vs6UInt16* %adjustedAngle to i8*
+  %adjustedAngle._value = getelementptr inbounds %Vs6UInt16, %Vs6UInt16* %adjustedAngle, i32 0, i32 0
+  store i16 %0, i16* %adjustedAngle._value, align 2
+
+;print(unsignedInt: adjustedAngle &* UInt16(11))
+; breaks here
+  %adjustedAngle._value2 = getelementptr inbounds %Vs6UInt16, %Vs6UInt16* %adjustedAngle, i32 0, i32 0
+  %2 = load i16, i16* %adjustedAngle._value2, align 2
+
+  %3 = call { i16, i1 } @llvm.umul.with.overflow.i16(i16 %2, i16 11)
+  %4 = extractvalue { i16, i1 } %3, 0
+  %5 = extractvalue { i16, i1 } %3, 1
+
+  ; above code looks fine, how is it lowered?
+  %6 = call i1 @_TIF3AVR5printFT11unsignedIntVs6UInt1610addNewlineSb_T_A0_()
+  call void @_TF3AVR5printFT11unsignedIntVs6UInt1610addNewlineSb_T_(i16 %4, i1 %6)
+
+  ret void
+}
+
+declare void @_TF3AVR5printFT11unsignedIntVs6UInt1610addNewlineSb_T_(i16, i1) #0
+declare i1 @_TIF3AVR5printFT11unsignedIntVs6UInt1610addNewlineSb_T_A0_() #0
+
+; Function Attrs: nounwind readnone speculatable
+declare { i16, i1 } @llvm.umul.with.overflow.i16(i16, i16) #2
+
+attributes #0 = { "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "target-cpu"="core2" "target-features"="+ssse3,+cx16,+fxsr,+mmx,+x87,+sse,+sse2,+sse3" }
+attributes #2 = { nounwind readnone speculatable }

(Note that I haven't yet put in the FileCheck directives because I haven't decided/worked out what the compiler should be doing here yet!)

FYI, the original source, with debug statements in, was something like:

func setServoAngle(angle: UInt16) {
  var adjustedAngle: UInt16 = angle
  print(unsignedInt: adjustedAngle &* UInt16(11))
}

This produces the following assembly...

	.text
	.file	"<stdin>"
	.hidden	_TF4main13setServoAngleFT5angleVs6UInt16_T_ ; -- Begin function _TF4main13setServoAngleFT5angleVs6UInt16_T_
	.globl	_TF4main13setServoAngleFT5angleVs6UInt16_T_
	.p2align	1
	.type	_TF4main13setServoAngleFT5angleVs6UInt16_T_,@function
_TF4main13setServoAngleFT5angleVs6UInt16_T_: ; @_TF4main13setServoAngleFT5angleVs6UInt16_T_
; %bb.0:                                ; %entry
	push	r28
	push	r29
	push	r16
	push	r17
	in	r28, 61
	in	r29, 62
	sbiw	r28, 2
	in	r0, 63
	cli
	out	62, r29
	out	63, r0
	out	61, r28
	std	Y+1, r24
	std	Y+2, r25
	ldi	r20, 11
	ldi	r21, 0
	ldi	r18, 0
	ldi	r19, 0
	mov	r22, r18
	mov	r23, r19
	call	__mulsi3
	mov	r16, r24
	mov	r17, r25
	call	_TIF3AVR5printFT11unsignedIntVs6UInt1610addNewlineSb_T_A0_
	mov	r22, r24
	mov	r24, r16
	mov	r25, r17
	call	_TF3AVR5printFT11unsignedIntVs6UInt1610addNewlineSb_T_
	adiw	r28, 2
	in	r0, 63
	cli
	out	62, r29
	out	63, r0
	out	61, r28
	pop	r17
	pop	r16
	pop	r29
	pop	r28
	ret
.Lfunc_end0:
	.size	_TF4main13setServoAngleFT5angleVs6UInt16_T_, .Lfunc_end0-_TF4main13setServoAngleFT5angleVs6UInt16_T_
                                        ; -- End function

From what I know, __mulsi3 takes the 32 bit value in r18-r21, multiplies it by the 32 bit value in r22-r25 and stores the result in the 32 bit value r22-r25. (I'm not sure how it detects overflow.)

It should be multiplying the input value, e.g. 90 by 11 then printing out the result. Instead it always prints 0.

As I read this assembly, it's moving both input values into the top two bytes of 32 bit numbers, which looks broken. Either it should move them to the bottom two bytes or use a different function.
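To make that reading concrete, here is a small C++ model that replays the register moves from the listing. It is a sketch under two assumptions: that __mulsi3 behaves as described above (r22..r25 times r18..r21, result in r22..r25), and that the i16 argument arrives in r24:r25. RegisterFile, read32 and valuePrinted are illustrative names, not real APIs.

```cpp
#include <cstdint>

// Toy model of the AVR register file, only for following the listing above.
struct RegisterFile {
    std::uint8_t r[32];
};

// Read a 32-bit value whose lowest byte is in register `lowest`.
constexpr std::uint32_t read32(const RegisterFile& rf, int lowest) {
    return static_cast<std::uint32_t>(rf.r[lowest]) |
           (static_cast<std::uint32_t>(rf.r[lowest + 1]) << 8) |
           (static_cast<std::uint32_t>(rf.r[lowest + 2]) << 16) |
           (static_cast<std::uint32_t>(rf.r[lowest + 3]) << 24);
}

// Hypothetical helper: the value the generated code hands to print.
constexpr std::uint16_t valuePrinted(std::uint16_t angle) {
    RegisterFile rf{};
    rf.r[24] = static_cast<std::uint8_t>(angle & 0xFF);  // argument low byte in r24
    rf.r[25] = static_cast<std::uint8_t>(angle >> 8);    // argument high byte in r25
    rf.r[20] = 11;        // ldi r20, 11
    rf.r[21] = 0;         // ldi r21, 0
    rf.r[18] = 0;         // ldi r18, 0
    rf.r[19] = 0;         // ldi r19, 0
    rf.r[22] = rf.r[18];  // mov r22, r18
    rf.r[23] = rf.r[19];  // mov r23, r19
    // __mulsi3 as described above: only the low 32 bits of the product are kept.
    const std::uint32_t product = read32(rf, 22) * read32(rf, 18);
    // The result comes back in r22..r25; the code copies only r24 and r25.
    return static_cast<std::uint16_t>(product >> 16);
}
```

Both operands end up shifted left by 16 bits, so the product is shifted left by 32 bits and nothing survives in the 32 bit result, whatever the input.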

I'd love to investigate this but I'll need to get some pointers from people where this is all happening. I couldn't even find mulsi3 by grepping through llvm source code. No idea how this is made!

carlos4242 · 2019-01-30T11:32:46Z

I've put the debug output from compiling this into a gist: https://gist.github.com/carlos4242/e7bd5c8bba0d7fb94f02ce02f9ed5189.

This looks like the bit at fault...

Legalizing: t30: i16,i8 = umulo t2, Constant:i16<11>
Trying to expand node
Creating constant: t31: i16 = Constant<0>
Creating new node: t34: ch,glue = callseq_start t0, TargetConstant:i16<0>, TargetConstant:i16<0>
Creating new node: t35: ch,glue = CopyToReg t34, Register:i16 $r25r24, t2
Creating new node: t37: ch,glue = CopyToReg t35, Register:i16 $r23r22, Constant:i16<0>, t35:1
Creating new node: t39: ch,glue = CopyToReg t37, Register:i16 $r21r20, Constant:i16<11>, t37:1
Creating new node: t41: ch,glue = CopyToReg t39, Register:i16 $r19r18, Constant:i16<0>, t39:1
Creating new node: t42: ch,glue = CALL t41, TargetExternalSymbol:i16'__mulsi3', Register:i16 $r25r24, Register:i16 $r23r22, Register:i16 $r21r20, Register:i16 $r19r18, RegisterMask:Untyped, t41:1
Creating new node: t43: ch,glue = callseq_end t42, TargetConstant:i16<0>, TargetConstant:i16<0>, t42:1
Creating new node: t44: i16,ch,glue = CopyFromReg t43, Register:i16 $r25r24, t43:1
Creating new node: t45: i16,ch,glue = CopyFromReg t44:1, Register:i16 $r23r22, t44:2
Creating new node: t46: i16,i16 = merge_values t44, t45
Creating new node: t48: i8 = setcc t45, Constant:i16<0>, setne:ch
Successfully expanded node
 ... replacing: t30: i16,i8 = umulo t2, Constant:i16<11>
     with:      t44: i16,ch,glue = CopyFromReg t43, Register:i16 $r25r24, t43:1
      and:      t48: i8 = setcc t45, Constant:i16<0>, setne:ch

The registers being used look wrong.

I'm trying to debug through how it works. In lib/CodeGen/SelectionDAG/LegalizeDAG.cpp, line 3334 (on my version) seems to be one of the interesting areas...

  case ISD::UMULO:
  case ISD::SMULO: {
    EVT VT = Node->getValueType(0);
    EVT WideVT = EVT::getIntegerVT(*DAG.getContext(), VT.getSizeInBits() * 2);
    SDValue LHS = Node->getOperand(0);
    SDValue RHS = Node->getOperand(1);

As I understand it, this checks whether multiplying two i16 numbers is "legal", i.e. whether there's a native instruction for it, which there isn't. (AVR has an i8 * i8 -> i16 instruction but no i16 * i16 instruction.)
So it "expands" the node into a sequence of instructions and a call to __mulsi3.

Further down I can see...

      // Here we're passing the 2 arguments explicitly as 4 arguments that are
      // pre-lowered to the correct types. This all depends upon WideVT not
      // being a legal type for the architecture and thus has to be split to
      // two arguments.
      SDValue Ret;
      if(DAG.getDataLayout().isLittleEndian()) {
        // Halves of WideVT are packed into registers in different order
        // depending on platform endianness. This is usually handled by
        // the C calling convention, but we can't defer to it in
        // the legalizer.
        SDValue Args[] = { LHS, HiLHS, RHS, HiRHS };
        Ret = ExpandLibCall(LC, WideVT, Args, 4, isSigned, dl);
      } else {
        SDValue Args[] = { HiLHS, LHS, HiRHS, RHS };
        Ret = ExpandLibCall(LC, WideVT, Args, 4, isSigned, dl);
      }

Which looks like it's getting close to the issue. It's packing the 16 bit registers plus zeros into 4 arguments. The question is how it gets from here to actual register choice and where it goes wrong? (I'll continue investigating when I have time.)
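As a rough illustration of what that branch does, the sketch below builds the same four-element argument list from plain integers. LibcallArgs and splitForMulLibcall are hypothetical stand-ins for the SDValue array, and the high halves are assumed to be zero, which matches the Constant<0> in the debug output for the unsigned case.

```cpp
#include <cstdint>

// Hypothetical stand-in for the four SDValue arguments built by the legalizer.
struct LibcallArgs {
    std::uint16_t arg0;
    std::uint16_t arg1;
    std::uint16_t arg2;
    std::uint16_t arg3;
};

// Mirrors the if/else above: low half first on little endian, high half first otherwise.
constexpr LibcallArgs splitForMulLibcall(std::uint16_t lhs, std::uint16_t rhs,
                                         bool littleEndian) {
    // Assumption: unsigned multiply, so the high halves (HiLHS, HiRHS) are zero.
    const std::uint16_t hiLhs = 0;
    const std::uint16_t hiRhs = 0;
    return littleEndian ? LibcallArgs{lhs, hiLhs, rhs, hiRhs}
                        : LibcallArgs{hiLhs, lhs, hiRhs, rhs};
}
```

For the test case, the little endian branch yields the list 90, 0, 11, 0, which is the order seen in the CopyToReg nodes of the debug output.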

carlos4242 · 2019-02-03T00:34:32Z

OK, here's a bit more debug.

Tracking whether AVRISelLowering.cpp is responsible:



current breakpoints:
(lldb) br l
Current breakpoints:
1: file = 'LegalizeDAG.cpp', line = 3336, exact_match = 0, locations = 1, resolved = 1, hit count = 3
  1.1: where = llc`(anonymous namespace)::SelectionDAGLegalize::ExpandNode(llvm::SDNode*) + 36076 at LegalizeDAG.cpp:3336, address = 0x0000000102034bac, resolved, hit count = 3 

10: file = '/Users/carlpeto/llvm/llvm2/llvm-avr/build/llvm-patched/../../lib/CodeGen/SelectionDAG/LegalizeDAG.cpp', line = 3407, exact_match = 0, locations = 1, resolved = 1, hit count = 0
  10.1: where = llc`(anonymous namespace)::SelectionDAGLegalize::ExpandNode(llvm::SDNode*) + 40714 at LegalizeDAG.cpp:3407, address = 0x0000000102035dca, resolved, hit count = 0 

*** see the graph "Before ExpandLibCall"

SDValue SelectionDAGLegalize::ExpandLibCall(RTLIB::Libcall LC, EVT RetVT,
                                            const SDValue *Ops, unsigned NumOps,
                                            bool isSigned, const SDLoc &dl) {

...

break at
...
Breakpoint 11: where = llc`(anonymous namespace)::SelectionDAGLegalize::ExpandLibCall(llvm::RTLIB::Libcall, llvm::EVT, llvm::SDValue const*, unsigned int, bool, llvm::SDLoc const&) + 1540 at LegalizeDAG.cpp:2074, address = 0x00000001020557a4


  std::pair<SDValue,SDValue> CallInfo = TLI.LowerCallTo(CLI);


step in...

Process 16117 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step in
    frame #0: 0x000000010217b42c llc`llvm::TargetLowering::LowerCallTo(this=0x00000001068083b8, CLI=0x00007ffeefbf1f08) const at SelectionDAGBuilder.cpp:8418
   8415	std::pair<SDValue, SDValue>
   8416	TargetLowering::LowerCallTo(TargetLowering::CallLoweringInfo &CLI) const {
   8417	  // Handle the incoming return values from the call.
-> 8418	  CLI.Ins.clear();


*** see the graph "Before LowerCallTo"


Then ran down to just before the AVR specific bit...

(lldb) br s -l 8657
Breakpoint 12: where = llc`llvm::TargetLowering::LowerCallTo(llvm::TargetLowering::CallLoweringInfo&) const + 9388 at SelectionDAGBuilder.cpp:8657, address = 0x000000010217d8ac
(lldb) c
Process 16117 resuming
Process 16117 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 12.1
    frame #0: 0x000000010217d8ac llc`llvm::TargetLowering::LowerCallTo(this=0x00000001068083b8, CLI=0x00007ffeefbf1f08) const at SelectionDAGBuilder.cpp:8657
   8654	  }
   8655	
   8656	  SmallVector<SDValue, 4> InVals;
-> 8657	  CLI.Chain = LowerCall(CLI, InVals);
   8658	
   8659	  // Update CLI.InVals to use outside of this function.
   8660	  CLI.InVals = InVals;
Target 0: (llc) stopped.



*** see the graph "Before LowerCall"



Process 16117 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step in
    frame #0: 0x00000001008b42e0 llc`llvm::AVRTargetLowering::LowerCall(this=0x00000001068083b8, CLI=0x00007ffeefbf1f08, InVals=0x00007ffeefbf10d0) const at AVRISelLowering.cpp:1139
   1136	
   1137	SDValue AVRTargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
   1138	                                     SmallVectorImpl<SDValue> &InVals) const {
-> 1139	  SelectionDAG &DAG = CLI.DAG;
   1140	  SDLoc &DL = CLI.DL;
   1141	  SmallVectorImpl<ISD::OutputArg> &Outs = CLI.Outs;
   1142	  SmallVectorImpl<SDValue> &OutVals = CLI.OutVals;






Target 0: (llc) stopped.
(lldb) tbr -l 1297
Breakpoint 13: where = llc`llvm::AVRTargetLowering::LowerCall(llvm::TargetLowering::CallLoweringInfo&, llvm::SmallVectorImpl<llvm::SDValue>&) const + 6427 at AVRISelLowering.cpp:1297, address = 0x00000001008b5bcb
(lldb) c
Process 16117 resuming
Creating new node: t34: ch,glue = callseq_start t0, TargetConstant:i16<0>, TargetConstant:i16<0>
Creating new node: t35: ch,glue = CopyToReg t34, Register:i16 $r25r24, t2
Creating new node: t37: ch,glue = CopyToReg t35, Register:i16 $r23r22, Constant:i16<0>, t35:1
Creating new node: t39: ch,glue = CopyToReg t37, Register:i16 $r21r20, Constant:i16<11>, t37:1
Creating new node: t41: ch,glue = CopyToReg t39, Register:i16 $r19r18, Constant:i16<0>, t39:1
Creating new node: t42: ch,glue = CALL t41, TargetExternalSymbol:i16'__mulsi3', Register:i16 $r25r24, Register:i16 $r23r22, Register:i16 $r21r20, Register:i16 $r19r18, RegisterMask:Untyped, t41:1
Creating new node: t43: ch,glue = callseq_end t42, TargetConstant:i16<0>, TargetConstant:i16<0>, t42:1
Process 16117 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = one-shot breakpoint 13
    frame #0: 0x00000001008b5bcb llc`llvm::AVRTargetLowering::LowerCall(this=0x00000001068083b8, CLI=0x00007ffeefbf1f08, InVals=0x00007ffeefbf10d0) const at AVRISelLowering.cpp:1297
   1294	
   1295	  // Handle result values, copying them out of physregs into vregs that we
   1296	  // return.
-> 1297	  return LowerCallResult(Chain, InFlag, CallConv, isVarArg, Ins, DL, DAG,
   1298	                         InVals);
   1299	}
   1300	
Target 0: (llc) stopped.



*** see the graph "At end of AVRTargetLowering::LowerCall"

As I read this, the DAG suggests:
0 will be copied into the i16 pair r22,r23
%0 (the parameter for the function) will be copied into the i16 pair r24,r25
0 will be copied into the i16 pair r18,r19
11 will be copied into the i16 pair r20,r21

__mulsi3 will be called
... we haven't yet got into what we're doing with the result

The above is wrong, and it looks like the AVR code is at fault.

At end of AVRTargetLowering--LowerCall.pdf
before ExpandLibCall.pdf
before LowerCall.pdf
before LowerCallTo.pdf

carlos4242 · 2019-02-11T17:29:04Z

OK, I think I've found the root cause.

When legalizing the DAG, the code decides, based on the types involved, that it cannot lower the multiplication directly to machine code, so it expands it into a call to __mulsi3.

The code in question is in LegalizeDAG.cpp, SelectionDAGLegalize::ExpandNode(SDNode *Node).

It creates an array of four i16 parameters to pass to __mulsi3 then goes into standard external call lowering for AVR.

Because AVR is a little endian platform, it creates the arguments as:

(i16)multiplicand1, (i16)0, (i16)multiplicand2, (i16) 0

When these are lowered by AVRISelLowering.cpp, in the function analyzeStandardArguments, it walks the registers backwards, one two-register pair per i16 argument, as if these were normal function arguments.

(i16)multiplicand1 =>R24R25

(i16)0 => R22R23

(i16)multiplicand2 => R20R21

(i16) 0 => R18R19

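Which is the wrong order: __mulsi3 expects the low half of each 32 bit operand in the lower-numbered register pair (R22R23 and R18R19), but the multiplicands land in the upper pairs.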

So between the assumptions made in the ordering of arguments in SelectionDAGLegalize::ExpandNode, the fact that AVR reports as little endian and the way parameters are ordered in AVRISelLowering.cpp, analyzeStandardArguments, the register order is garbled. The information that these parameters actually represent two 32 bit numbers is being lost between those two functions.
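The effect of that register walk can be sketched with plain integers. This is a simplified model, not the code in AVRISelLowering.cpp: it assumes the mapping listed above (first argument to R25:R24, then R23:R22, R21:R20, R19:R18) and the __mulsi3 operand layout described earlier in the thread. MulOperands, operandsSeenByMulsi3 and productSeen are illustrative names.

```cpp
#include <cstdint>

// The two 32-bit operands as __mulsi3 would read them from the registers.
struct MulOperands {
    std::uint32_t a;  // R25:R22, with R22 holding the lowest byte
    std::uint32_t b;  // R21:R18, with R18 holding the lowest byte
};

// Hypothetical model of the backwards register walk:
// arg0 -> R25:R24, arg1 -> R23:R22, arg2 -> R21:R20, arg3 -> R19:R18.
constexpr MulOperands operandsSeenByMulsi3(std::uint16_t arg0, std::uint16_t arg1,
                                           std::uint16_t arg2, std::uint16_t arg3) {
    return MulOperands{(static_cast<std::uint32_t>(arg0) << 16) | arg1,
                       (static_cast<std::uint32_t>(arg2) << 16) | arg3};
}

// Low 32 bits of the product, which is all a 32-bit multiply returns.
constexpr std::uint32_t productSeen(std::uint16_t arg0, std::uint16_t arg1,
                                    std::uint16_t arg2, std::uint16_t arg3) {
    return operandsSeenByMulsi3(arg0, arg1, arg2, arg3).a *
           operandsSeenByMulsi3(arg0, arg1, arg2, arg3).b;
}
```

Passing the arguments in the current order (90, 0, 11, 0) makes the operands 90 << 16 and 11 << 16, so the product is 0. Passing each high half first (0, 90, 0, 11) makes them 90 and 11, and the product is 990.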

Fixes I can think of:

make AVR report as big endian: this would fix this one problem at a stroke but would probably cause loads of other issues, and to my understanding the processor is little endian in most people's way of understanding.
patch SelectionDAGLegalize::ExpandNode in case ISD::UMULO:, case ISD::SMULO: to do a more subtle check: rather than if(DAG.getDataLayout().isLittleEndian()), check some platform-specific flag, or have code that can be overridden somehow to reverse the order in which these four parameters are passed down to the ABI layer.
create some kind of hack in the call lowering that somehow knows how to lower these calls correctly (tricky?)
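The second option could look roughly like the sketch below: the order of the halves is decided by a separate per-target answer instead of the data layout's endianness. Everything here is hypothetical (TargetModel, its fields, HalfOrder, libcallHalfOrder and the two example targets); it is not LLVM's API, and it assumes an AVR-like target would ask for the high half first.

```cpp
// Hypothetical target description, not llvm::TargetLowering.
struct TargetModel {
    bool dataLayoutLittleEndian;   // what isLittleEndian() would report
    bool splitArgsAsLittleEndian;  // separate answer used only for split arguments
};

enum class HalfOrder { LowHalfFirst, HighHalfFirst };

// Decide the order of the two halves from the dedicated flag,
// deliberately ignoring the data layout endianness.
constexpr HalfOrder libcallHalfOrder(const TargetModel& target) {
    return target.splitArgsAsLittleEndian ? HalfOrder::LowHalfFirst
                                          : HalfOrder::HighHalfFirst;
}

// Example targets (assumptions for illustration only).
constexpr TargetModel kTypicalLittleEndianTarget{true, true};
constexpr TargetModel kAvrLikeTarget{true, false};
```

Under these assumptions a typical little endian target keeps its current behaviour, while the AVR-like target stays little endian for data but gets the high half first, so the low half lands in the lower-numbered register pair.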

I think at this point I can hack something for myself but a better fix will require community feedback.

carlos4242 · 2019-03-10T01:43:28Z

This is 'fixed' by avr-rust/llvm#9. But it might arguably not be the best way to fix it? Open to suggestions/advice...

carlos4242 · 2019-03-10T01:53:59Z

LABEL AS: has local patch

dylanmckay · 2019-05-16T12:07:23Z

Raised https://reviews.llvm.org/D62003 for upstreaming the patch.

…nvention endianess Summary: The endianess used in the calling convention does not always match the endianess of the target on all architectures, namely AVR. When an argument is too large to be legalised by the architecture and is split for the ABI, a new hook TargetLoweringInfo::shouldSplitFunctionArgumentsAsLittleEndian is queried to find the endianess that function arguments must be laid out in. This approach was recommended by Eli Friedman. Originally reported in avr-rust/rust-legacy-fork#129. Patch by Carl Peto. Reviewers: bogner, t.p.northover, RKSimon, niravd Subscribers: JDevlieghere, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D62003

…nvention endianess Summary: The endianess used in the calling convention does not always match the endianess of the target on all architectures, namely AVR. When an argument is too large to be legalised by the architecture and is split for the ABI, a new hook TargetLoweringInfo::shouldSplitFunctionArgumentsAsLittleEndian is queried to find the endianess that function arguments must be laid out in. This approach was recommended by Eli Friedman. Originally reported in avr-rust/rust-legacy-fork#129. Patch by Carl Peto. Reviewers: bogner, t.p.northover, RKSimon, niravd, efriedma Reviewed By: efriedma Subscribers: JDevlieghere, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D62003 git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@361222 91177308-0d34-0410-b5e6-96231b3b80d8

…nvention endianess Summary: The endianess used in the calling convention does not always match the endianess of the target on all architectures, namely AVR. When an argument is too large to be legalised by the architecture and is split for the ABI, a new hook TargetLoweringInfo::shouldSplitFunctionArgumentsAsLittleEndian is queried to find the endianess that function arguments must be laid out in. This approach was recommended by Eli Friedman. Originally reported in avr-rust/rust-legacy-fork#129. Patch by Carl Peto. Reviewers: bogner, t.p.northover, RKSimon, niravd, efriedma Reviewed By: efriedma Subscribers: JDevlieghere, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D62003 llvm-svn: 361222

dylanmckay · 2019-05-21T08:15:53Z

Fix upstreamed in r361222

Thanks for the patch Carl!

carlos4242 · 2019-05-21T09:06:12Z

Cool. :) Np. Glad to help.

carlos4242 mentioned this issue Feb 3, 2019

32 bit multiplication produces byte-swapped result #130

Open

shepmaster added the has-local-patch A patch exists but has not been applied to upstream LLVM label Apr 13, 2019

dylanmckay mentioned this issue May 16, 2019

LowerCallResult regression bugfix avr-rust/llvm#10

Closed

dylanmckay closed this as completed May 21, 2019
