Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Move Hub fails to raise TypeError #950

Closed
laurensvalk opened this issue Feb 13, 2023 · 13 comments
Closed

[Bug] Move Hub fails to raise TypeError #950

laurensvalk opened this issue Feb 13, 2023 · 13 comments
Labels
bug Something isn't working hub: movehub Issues related to the LEGO BOOST Move hub

Comments

@laurensvalk
Copy link
Member

laurensvalk commented Feb 13, 2023

Describe the bug
The Move Hub does not correctly parse some arguments.

This leads to undefined behavior, which may or may not lead to the crash in https://github.com/orgs/pybricks/discussions/949

To reproduce

from pybricks.pupdevices import Motor, Remote
from pybricks.parameters import Port, Direction, Stop, Button
from pybricks.tools import wait

motor = Motor(Port.C)

# This should raise a TypeError, not pass silently.
motor.run_time(300, 1000, Stop, wait=False)

wait(1000)

Expected behavior
Raise TypeError

@laurensvalk laurensvalk added bug Something isn't working hub: movehub Issues related to the LEGO BOOST Move hub labels Feb 13, 2023
@laurensvalk
Copy link
Member Author

laurensvalk commented Feb 13, 2023

Type checks are correct, but raising the error seems to fail.

Mysteriously, if I add a debug print, then it does raise where it should:

diff --git a/pybricks/util_mp/pb_obj_helper.c b/pybricks/util_mp/pb_obj_helper.c
index cf397d1d4..ce20cb965 100644
--- a/pybricks/util_mp/pb_obj_helper.c
+++ b/pybricks/util_mp/pb_obj_helper.c
@@ -123,6 +123,7 @@ mp_obj_t pb_obj_get_base_class_obj(mp_obj_t obj, const mp_obj_type_t *type) {
 void pb_assert_type(mp_obj_t obj, const mp_obj_type_t *type) {
     if (!mp_obj_is_type(obj, type)) {
         #if MICROPY_ERROR_REPORTING == MICROPY_ERROR_REPORTING_TERSE
+        mp_printf(&mp_plat_print, "Raises if debug print present?\n");
         mp_raise_TypeError(NULL);
         #else
         mp_raise_msg_varg(&mp_type_TypeError, MP_ERROR_TEXT("can't convert %s to %s"),

Does this get optimized away somehow?

@laurensvalk laurensvalk changed the title [Bug] Move Hub does not correctly verify arguments [Bug] Move Hub fails to raise TypeError Feb 13, 2023
@dlech
Copy link
Member

dlech commented Feb 13, 2023

Does this get optimized away somehow?

Indeed, that seems to be the case.

For future reference, the way to troubleshoot this is to run:

arm-none-eabi-objdump -D bricks/movehub/build/firmware.elf | code -

and search for the symbol in question. If it is not found, add __attribute__((noinline)) to the function and compile and dump again. If still not found, we know it is being optimized out.

@laurensvalk
Copy link
Member Author

Awesome, thanks for the hint to debug this in the future!

@laurensvalk
Copy link
Member Author

Adding an FYI for @jimmo in case this ever happens to upstream MicroPython.

tl;dr: If things like mp_raise_TypeError(NULL) ever go missing (causing weird behavior), check that LTO didn't optimize it out.

@jimmo
Copy link

jimmo commented Feb 14, 2023

That's very interesting! @projectgus FYI

I note that mp_raise_TypeError (and all the subsequent methods it calls) are annotated NORETURN (i.e. __attribute__((noreturn)) and the final method has an MP_UNREACHABLE (i.e. __builtin_unreachable()).

So it seems very surprising (to me at least) that the compiler thinks its allowed to optimise this out.

What GCC version are you using?

@laurensvalk
Copy link
Member Author

laurensvalk commented Feb 14, 2023

arm-none-eabi-gcc --version
arm-none-eabi-gcc (15:10.3-2021.07-4) 10.3.1 20210621 (release)

If you're interested in seeing the exact steps to reproduce the build let us know. Most of the steps are here, but you probably already have most dependencies. So just clone our repo prior to fixing this issue, e.g. this one, and run make movehub -j.

For context, this was only happening for our smallest build target, Move Hub with STM32F070RB6.

@projectgus
Copy link

Thanks for the heads-up @jimmo, you know I love some good compiler weirdness. 😁

I couldn't reproduce with gcc 12, but if I download the mentioned ARM toolchain 2021.07 compiler release then I get it straight away on the pre-fixed commit.

I had a quick look at gdb disassembly (building with -ggdb) and some preprocessor dumps (compiling with -E) and I can only agree this is a compiler bug:

  • There doesn't seem to be any inline asm nearby in the call tree for the compiler to misunderstand as side effect free, so it's kind of weird that __asm__("") fixes this at all.
  • I thought maybe the compiler infers mp_raise_TypeError(NULL); has no side-effects (despite "noreturn"), but moving the __asm__("") workaround inside mp_raise_TypeError() still causes pb_assert_type() to be optimised out. Other calls to mp_raise_TypeError() are linked into the firmware anyhow. So I think it's not that.
  • Maybe the optimiser is deciding the !mp_obj_is_type(obj, type) test in pb_assert_type() will always fail (or is UB) but I can't see how this could be the case, and it still wouldn't 100% explain why the __asm__("") workaround works as the dummy inline asm is nested inside that test.

So hopefully whatever the root cause bug is/was, it's been fixed in gcc. A quick look in their bug tracker didn't find anything, but I'm also not exactly sure what to look for there.

@dlech
Copy link
Member

dlech commented Feb 15, 2023

  • so it's kind of weird that __asm__("") fixes this at all.

FYI, this is workaround comes straight from the gcc docs on the noinline attribute at https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes

@projectgus
Copy link

I see, thanks @dlech. I think I had a partial understanding of this, but I hadn't seen this note and I now realise I had an out of date assumption about inline asm and side effects.[*]

BTW, do I have it right that none of the functions in this call stack are explicitly marked noinline? It just happens that the optimizer heuristic decides not to inline pb_assert_type()?

The documentation about this workaround says it is when a "function does not have side effects", so a precondition should be that gcc has decided pb_assert_type() has no side effects.[*] I don't see how it could have inferred this though (excluding a compiler bug).

Does that match your understanding?

[*] My new best guess of the mechanism is: Some older version of gcc would incorrectly mark functions whose side effects only happened via inline asm as "side effect free". Newer versions work around this by marking any function containing inline asm as having side effects, leading to this becoming the documented workaround for explicitly marking a function as having side effects.

@projectgus
Copy link

projectgus commented Feb 15, 2023

So, got curious enough to dig in linker intermediate dumps and learn a bit more gcc internals. I think this a gcc intermediate analysis bug.

Passing -fdump-tree-local-pure-const2[*] to the linker you get some dump files with names matching firmware.elf.ltrans?.*.local-pure-const2.

In the gcc 12 output, without the fix patch but with correct behaviour, one of the dumps includes:

;; Function pb_assert_type (pb_assert_type, funcdef_no=156, decl_uid=5225, cgraph_uid=449, symbol_order=11924)



 local analysis of pb_assert_type/11924
   scanning: _4 = (int) obj_3(D);
  scanning: _6 = _4 & 3;
  scanning: if (_6 != 0)
  scanning: _1 = MEM[(struct mp_obj_base_t *)obj_3(D)].type;
    Indirect ref read is not const
  scanning: if (_1 != type_5(D))
  scanning: mp_raise_TypeError (0B);
  scanning: return;
Function is locally looping.
Function is locally pure.
void pb_assert_type (void * obj, const struct mp_obj_type_t * type)
{
  const struct mp_obj_type_t * _1;
  int _4;
  int _6;

  <bb 2> [local count: 1073741824]:
  # DEBUG BEGIN_STMT
  # DEBUG o => obj_3(D)
  # DEBUG INLINE_ENTRY mp_obj_is_obj
  # DEBUG BEGIN_STMT
  _4 = (int) obj_3(D);
  _6 = _4 & 3;
  # DEBUG o => NULL
  if (_6 != 0)
    goto <bb 6>; [0.04%]
  else
    goto <bb 3>; [99.96%]

  <bb 6> [local count: 429496]:
  goto <bb 4>; [100.00%]

  <bb 3> [local count: 1073312329]:
  _1 = MEM[(struct mp_obj_base_t *)obj_3(D)].type;
  if (_1 != type_5(D))
    goto <bb 7>; [0.04%]
  else
    goto <bb 5>; [99.96%]

  <bb 7> [local count: 429324]:

  <bb 4> [local count: 858820]:
  # DEBUG D#96 => obj_3(D)
  # DEBUG D#97 => type_5(D)
  # DEBUG INLINE_ENTRY mp_ensure_not_fixed
  # DEBUG D#99 => D#98
  # DEBUG dict => D#99
  # DEBUG BEGIN_STMT
  mp_raise_TypeError (0B);

  <bb 5> [local count: 1072883005]:
  return;

}

I don't understand gcc RTL, but the thing to note here is the line Function is locally looping, which in gcc-land apparently means function has side effect and might never return. I'm pretty sure this is flagged because pb_assert_type contains a possible function call to a noreturn function (mp_raise_TypeError).

If we go back to gcc 10, still without the fix included, so we expect a miscompile:

; Function pb_assert_type (pb_assert_type, funcdef_no=252, decl_uid=4662, cgraph_uid=444, symbol_order=11888)



 local analysis of pb_assert_type/11888
   scanning: _4 = (int) obj_3(D);
  scanning: _7 = _4 & 3;
  scanning: if (_7 != 0)
  scanning: _1 = MEM[(struct mp_obj_base_t *)obj_3(D)].type;
    Indirect ref read is not const
  scanning: if (_1 != type_5(D))
  scanning: mp_raise_TypeError (0B);
  scanning: return;
Function is locally pure.
pb_assert_type (void * obj, const struct mp_obj_type_t * type)
{
  const struct mp_obj_type_t * _1;
  int _4;
  int _7;

  <bb 2> [local count: 1073741824]:
  # DEBUG BEGIN_STMT
  # DEBUG o => obj_3(D)
  # DEBUG INLINE_ENTRY mp_obj_is_obj
  # DEBUG BEGIN_STMT
  _4 = (int) obj_3(D);
  _7 = _4 & 3;
  # DEBUG o => NULL
  if (_7 != 0)
    goto <bb 6>; [0.04%]
  else
    goto <bb 3>; [99.96%]

  <bb 6> [local count: 429496]:
  goto <bb 4>; [100.00%]

  <bb 3> [local count: 1073312329]:
  _1 = MEM[(struct mp_obj_base_t *)obj_3(D)].type;
  if (_1 != type_5(D))
    goto <bb 7>; [0.04%]
  else
    goto <bb 5>; [99.96%]

  <bb 7> [local count: 429324]:

  <bb 4> [local count: 858820]:
  # DEBUG BEGIN_STMT
  mp_raise_TypeError (0B);

  <bb 5> [local count: 1072883005]:
  return;

}

Now the function is analysed only as "locally pure", not also as "locally looping". So gcc has incorrectly determined it will always return.

Add the fix patch "asm" line back in and rebuild:

;; Function pb_assert_type (pb_assert_type, funcdef_no=252, decl_uid=4662, cgraph_uid=446, symbol_order=11888)



 local analysis of pb_assert_type/11888
   scanning: _4 = (int) obj_3(D);
  scanning: _6 = _4 & 3;
  scanning: if (_6 != 0)
  scanning: _1 = MEM[(struct mp_obj_base_t *)obj_3(D)].type;
    Indirect ref read is not const
  scanning: if (_1 != type_5(D))
  scanning: __asm__ __volatile__("");
    volatile is not const/pure
  scanning: mp_raise_TypeError (0B);
  scanning: return;
Function is locally looping.
Function is locally pure.
Function can locally free.
pb_assert_type (void * obj, const struct mp_obj_type_t * type)
{
  const struct mp_obj_type_t * _1;
  int _4;
  int _6;

  <bb 2> [local count: 1073741824]:
  # DEBUG BEGIN_STMT
  # DEBUG o => obj_3(D)
  # DEBUG INLINE_ENTRY mp_obj_is_obj
  # DEBUG BEGIN_STMT
  _4 = (int) obj_3(D);
  _6 = _4 & 3;
  # DEBUG o => NULL
  if (_6 != 0)
    goto <bb 6>; [0.04%]
  else
    goto <bb 3>; [99.96%]

  <bb 6> [local count: 429496]:
  goto <bb 4>; [100.00%]

  <bb 3> [local count: 1073312329]:
  _1 = MEM[(struct mp_obj_base_t *)obj_3(D)].type;
  if (_1 != type_5(D))
    goto <bb 7>; [0.04%]
  else
    goto <bb 5>; [99.96%]

  <bb 7> [local count: 429324]:

  <bb 4> [local count: 858820]:
  # DEBUG D#139 => obj_3(D)
  # DEBUG D#140 => type_5(D)
  # DEBUG INLINE_ENTRY pb_assert_type
  # DEBUG D#143 => D#141
  # DEBUG obj => D#143
  # DEBUG D#144 => D#142
  # DEBUG type => D#144
  # DEBUG BEGIN_STMT
  __asm__ __volatile__("");
  # DEBUG BEGIN_STMT
  mp_raise_TypeError (0B);

  <bb 5> [local count: 1072883005]:
  return;

}

Now the function is "locally looping" again, but I think only because of the inline asm workaround...

I'm now 99% sure this is a gcc bug because in all three versions (including the version with the bug) the callee function mp_raise_TypeError is correctly identified as "locally looping":

;; Function mp_raise_TypeError (mp_raise_TypeError, funcdef_no=161, decl_uid=4313, cgraph_uid=132, symbol_order=1422) (executed once)



 local analysis of mp_raise_TypeError/1422
   scanning: mp_raise_msg (&mp_type_TypeError, msg_2(D));
Function is locally looping.
Function is locally pure.
__attribute__((noreturn))
mp_raise_TypeError (struct 
{
} * msg)
{
  <bb 2> [local count: 1073741824]:
  # DEBUG BEGIN_STMT
  mp_raise_msg (&mp_type_TypeError, msg_2(D));

}

... so there shouldn't be any way that a function which calls that function should not be marked as "looping" itself.

[*] I initially passed -fdump-rtl-all -fdump-tree-all -fdump-ipa-all but this results in a firehose of intermediate dumps.

@dlech
Copy link
Member

dlech commented Feb 15, 2023

BTW, do I have it right that none of the functions in this call stack are explicitly marked noinline?

That is correct. Although when I was investigating the problem I temporarily added to try to find the function in case it was being inlined. And luckily, this led me to the workaround. When I added __attribute__((noinline)) temporarily, the function was still optimized out.

It just happens that the optimizer heuristic decides not to inline pb_assert_type()?

Yes, that seems to be the case. The function is used in quite a few places, so it makes sense that it wouldn't be inlined.

Thanks for the additional analysis. Always handy to know the right magic gcc options to help debug stuff like this.

@projectgus
Copy link

Was discussing this with @jimmo and he dug up this gcc bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103052

This looks like it's probably either the root cause bug, or a bug whose fix happens to fix this case as well. Fix looks to be in gcc 10.4 onwards, but if using the ARM GNU Toolchains then the first one with the fix is likely their 11.3.Rel1 release.

@jimmo
Copy link

jimmo commented Feb 15, 2023

@tannewt @jepler @dhalbert -- Possibly RTYI

dlech added a commit to pybricks/pybricks-micropython that referenced this issue Feb 17, 2023
For unknown reasons, GCC 10 LTO is optimizing out pb_assert_type()
(presumably because it thinks the function has no side effects as
noinline has no effect). The GCC recommended workaround for such issues
is to include and empty inline assembly statement to prevent the
function from being optimized out.

Fixes: pybricks/support#950
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working hub: movehub Issues related to the LEGO BOOST Move hub
Projects
None yet
Development

No branches or pull requests

4 participants