SDTs optimised into tail-calls panic #476

FelixMcFelix · 2024-03-12T20:54:30Z

Noted while working on #462/#475 -- this is a tracking issue to understand this as its own problem. Today we are converting InnerFlowIds to flow_id_sdt_arg structs, which is moderately costly as it occurs many times per packet. This creates one or two (current, or before+after) stack-local variables which are referenced without issue.

Removing this and passing in either a *const InnerFlowId or converting to a uintptr_t (as we do with our other args) leads to known panics in two locations so far. From some dumps I've captured:

Periodic flow expiry

panic[cpu14]/thread=fffffe009270ac20:
BAD TRAP: type=e (#pf Page fault) rp=fffffe009270a090 addr=18 occurred in module "xde" due to a NULL pointer dereference


sched:
#pf Page fault
Bad kernel fault at addr=0x18
pid=0, pc=0xfffffffff44db7be, sp=0xfffffe009270a180, eflags=0x10246
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe>  cr4: 3406f8<smap,smep,osxsav,xmme,fxsr,pge,mce,pae,pse,de>
cr2: 18
cr3: 1a800000
cr8: 0

        rdi: fffffe69f79cacf0 rsi:                0 rdx:           110000
        rcx:                0  r8:                2  r9: fffffe009270a278
        rax:                0 rbx: fffffe69f79cacf0 rbp: fffffe009270a1d0
        r10:      8a2ba1a7901 r11:                6 r12:                4
        r13:                1 r14:                4 r15:                0
        fsb: fffffc7fef2d2a40 gsb: fffffe69deb02000  ds:                0
         es:                0  fs:                0  gs:                0
        trp:                e err:                0 rip: fffffffff44db7be
         cs:               30 rfl:            10246 rsp: fffffe009270a180
         ss:               38

fffffe0092709fa0 unix:die+c0 ()
fffffe009270a080 unix:trap+999 ()
fffffe009270a090 unix:cmntrap+e9 ()
fffffe009270a1d0 xde:_ZN4core3fmt9Formatter12pad_integral17hdace542c09befd8aE+18e ()
fffffe009270a260 xde:_ZN4core3fmt3num53_$LT$impl$u20$core..fmt..LowerHex$u20$for$u20$u16$GT$3fmt17hba211b57c0906999E+7a ()
fffffe009270a2f0 xde:_ZN4core3fmt5write17h5e760e4f19caf97dE+1b3 ()
fffffe009270a3b0 xde:_ZN67_$LT$smoltcp..wire..ipv6..Address$u20$as$u20$core..fmt..Display$GT$3fmt17h6e9c915dfd6e131fE+195 ()
fffffe009270a440 xde:_ZN4core3fmt5write17h5e760e4f19caf97dE+1b3 ()
fffffe009270a4a0 xde:_ZN44_$LT$$RF$T$u20$as$u20$core..fmt..Display$GT$3fmt17h9128c724cf7b20c1E+61 ()
fffffe009270a530 xde:_ZN4core3fmt5write17h5e760e4f19caf97dE+1b3 ()
fffffe009270a590 xde:_ZN59_$LT$opte_api..ip..IpAddr$u20$as$u20$core..fmt..Display$GT$3fmt17haa9991ea1a942307E+74 ()
fffffe009270a620 xde:_ZN4core3fmt5write17h5e760e4f19caf97dE+1b3 ()
fffffe009270a6c0 xde:_ZN69_$LT$opte..engine..nat..OutboundNat$u20$as$u20$core..fmt..Display$GT$3fmt17hdcf3ff64a1fb60dcE+5b ()
fffffe009270a800 xde:_ZN121_$LT$alloc..collections..btree..map..ExtractIf$LT$K$C$V$C$F$C$A$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h6364d8716d0dcc4eE+1fc ()
fffffe009270a980 xde:_ZN4opte6engine10flow_table18FlowTable$LT$S$GT$12expire_flows17hb96f1e0513b25d84E+17f ()
fffffe009270ab10 xde:expire_periodic+91 ()
fffffe009270ab50 genunix:periodic_execute+f5 ()
fffffe009270ac00 genunix:taskq_thread+2a6 ()
fffffe009270ac10 unix:thread_start+b ()

dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel

Update TCP state

panic[cpu3]/thread=fffffe00934eac20:
BAD TRAP: type=e (#pf Page fault) rp=fffffe00934e9aa0 addr=c occurred in module "xde" due to a NULL pointer dereference


sched:
#pf Page fault
Bad kernel fault at addr=0xc
pid=0, pc=0xfffffffff44ee31b, sp=0xfffffe00934e9b90, eflags=0x10286
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe>  cr4: 3406f8<smap,smep,osxsav,xmme,fxsr,pge,mce,pae,pse,de>
cr2: c
cr3: 1a800000
cr8: 0

        rdi:                2 rsi: fffffe6ba8fb8228 rdx: fffffe00934ea0e8
        rcx:               17  r8:          a000000  r9:                4
        rax: fffffe69e320cac0 rbx:                2 rbp: fffffe00934e9d50
        r10:               ff r11:                5 r12: fffffe69e8260e10
        r13: fffffe00934ea0e8 r14:               17 r15:                2
        fsb: fffffc7fef2d2a40 gsb: fffffe69de4bf000  ds:                0
         es:                0  fs:                0  gs:                0
        trp:                e err:                0 rip: fffffffff44ee31b
         cs:               30 rfl:            10286 rsp: fffffe00934e9b90
         ss:               38

fffffe00934e99b0 unix:die+c0 ()
fffffe00934e9a90 unix:trap+999 ()
fffffe00934e9aa0 unix:cmntrap+e9 ()
fffffe00934e9d50 xde:_ZN4opte6engine4port13Port$LT$N$GT$16update_tcp_entry17h116fa2f846532633E+3b ()
fffffe00934e9f20 xde:_ZN4opte6engine4port13Port$LT$N$GT$16update_tcp_entry17h116fa2f846532633E+25c ()
fffffe00934ea2e0 xde:_ZN4opte6engine4port13Port$LT$N$GT$7process17h132132694f056ce9E+3bc ()
fffffe00934ea950 xde:xde_rx+43c ()
fffffe00934ea9a0 mac:mac_promisc_dispatch_one+60 ()
fffffe00934eaa20 mac:mac_promisc_dispatch+83 ()
fffffe00934eaa80 mac:mac_rx_common+47 ()
fffffe00934eaae0 mac:mac_rx+c6 ()
fffffe00934eab20 mac:mac_rx_ring+2b ()
fffffe00934eab60 igb:igb_intr_rx_work+5c ()
fffffe00934eab80 igb:igb_intr_rx+15 ()
fffffe00934eabd0 apix:apix_dispatch_by_vector+8c ()
fffffe00934eac00 apix:apix_dispatch_lowlevel+29 ()
fffffe0093499a40 unix:switch_sp_and_call+15 ()
fffffe0093499aa0 apix:apix_do_interrupt+f3 ()
fffffe0093499ab0 unix:cmnint+c3 ()
fffffe0093499ba0 unix:i86_mwait+12 ()
fffffe0093499bd0 unix:cpu_idle_mwait+14b ()
fffffe0093499be0 unix:cpu_idle_adaptive+19 ()
fffffe0093499c00 unix:idle+a8 ()
fffffe0093499c10 unix:thread_start+b ()

Both occur some distance from the actual SDT: a format statement on the supposedly-valid &InnerFlowId, and a match on a &InnerFlowId respectively before the probe occurs. Removing the probe causes these callsites to behave/compile correctly. ~~Another SDT, layer-process-return shows a different variation:~~

NAME         DIR EPOCH    FLOW BEFORE                                 FLOW AFTER                                  LEN   RESULT
opte0        OUT 3        UDP,10.0.0.2:38231,10.0.0.1:10000           ,0.254.255.255:29,16.0.0.0:1980             133   Modified

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 11 a1 02 00 0a 00 00 02 0a 00 00 01 00 00 00 00  ................
        10: 00 00 00 00 02 00 00 03 02 11 00 0e cb 1f a1 fb  ................
        20: ff ff ff ff 57 95 10 27                          ....W..'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 90 69 1b 93 00 fe ff ff 10 00 00 00 00 00 00 00  .i..............
        10: c0 db 43 42 6a fe ff ff f4 33 a1 fb ff ff ff ff  ..CBj....3......
        20: 57 95 10 27 1d 00 bc 07                          W..'....

Flow_before is obviously valid, while flow_after appears to point elsewhere. The only obvious difference I'm aware of is that flow_after is obtained direct via Packet::flow, while flow_before is obtained and explicltly copied out before pkt is modified.

EDIT: The last case is caused by our uintptr_t untyped args making it easy to pass the wrong thing. The actual kernel panics still stand even with accurate types, however.

The text was updated successfully, but these errors were encountered:

`uintptr_t` maybe be easy to stick in a probe, but it is also a footgun. This fixdes the display for e.g. port-process-return, but does not affect the actual panics mentioned in #476.

FelixMcFelix · 2024-03-15T12:31:13Z

The actual panics all occur in sites where rustc has chosen to tail-call optimise the calls to the dtrace probes with a JMP rather than a CALL. There are 3 probes in #475 which are called in this manner:

uft-tcp-closed via Port::uft_tcp_closed
uft-invalidate via Port::uft_invalidate_probe
flow-expired via Port::flow_expired_probe (called in per-entry closure from FlowTable::expire_flows.

Another failure had an interesting stack trace that led me to this:

#pf Page fault
Bad kernel fault at addr=0x1408
pid=0, pc=0xfffffffff44e2d16, sp=0xfffffe00930d9ae0, eflags=0x10282
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe>  cr4: 3406f8<smap,smep,osxsav,xmme,fxsr,pge,mce,pae,pse,de>
cr2: 1408
cr3: 1a800000
cr8: 0

        rdi:                2 rsi: fffffe69e46ad678 rdx: fffffe00930da0f8
        rcx:               11  r8:                2  r9:             1400
        rax: fffffe6a20f3a480 rbx:                2 rbp: fffffe00930d9ce0
        r10:                2 r11:               ff r12:               11
        r13: fffffe00930da0f8 r14: fffffe69e46ad678 r15:               11
        fsb:                0 gsb: fffffe69ddfb4000  ds:               4b
         es:               4b  fs:                0  gs:              1c3
        trp:                e err:                0 rip: fffffffff44e2d16
         cs:               30 rfl:            10282 rsp: fffffe00930d9ae0
         ss:               38


fffffe00930d9900 unix:die+c0 ()
fffffe00930d99e0 unix:trap+999 ()
fffffe00930d99f0 unix:cmntrap+e9 ()
fffffe00930d9ce0 xde:_ZN4opte6engine4port13Port$LT$N$GT$16update_tcp_entry17had9d5dc9309bc5feE+36 ()
fffffe00930d9ef0 xde:_ZN4opte6engine4port13Port$LT$N$GT$16update_tcp_entry17had9d5dc9309bc5feE+353 ()
fffffe00930da290 xde:_ZN4opte6engine4port13Port$LT$N$GT$7process17hd560a8c179c6c2a1E+432 ()
fffffe00930da950 xde:xde_rx+632 ()
fffffe00930da9a0 mac:mac_promisc_dispatch_one+60 ()

The call made before Port::update_tcp_entry+353 is into Port::uft_tcp_closed, so why are we faulting in update_tcp_entry again? Seemingly because uft_tcp_closed comes immediately before update_tcp_entry in the built binary, and we are falling through back into it.

This can be seen in the periodic flow expiry case above: the ExtractIf closure calls flow_expired_probe, and we then fall through into an adjacent format impl. We certainly don't use that logic in flow expiry! On a more recent build things are moved around, so I get the following:

fffffe00927044e0 unix:die+d3 ()
fffffe00927045c0 unix:trap+855 ()
fffffe00927045d0 unix:cmntrap+e9 ()
fffffe00927046c0 xde:_ZN77_$LT$opte..engine..icmp..v4..MessageType$u20$as$u20$core..cmp..PartialOrd$GT$11partial_cmp17h0c9d2e455cd7b27fE+a ()
fffffe0092704800 xde:_ZN121_$LT$alloc..collections..btree..map..ExtractIf$LT$K$C$V$C$F$C$A$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h6364d8716d0dcc4eE+1fc ()
fffffe0092704980 xde:_ZN4opte6engine10flow_table18FlowTable$LT$S$GT$12expire_flows17hb96f1e0513b25d84E+17f ()
fffffe0092704b10 xde:expire_periodic+91 ()

I can confirm that ...ExtractIf...+1fc follows a CALL flow_expired_probe. This fn is adjacent to _ZN77_$LT$opte..engine..icmp..v4..MessageType ... <>::partial_cmp.

Both examined functions end in a tail-call to the relevant DTrace SDT.

FelixMcFelix · 2024-03-15T14:42:22Z

@citrus-it took a look at a recent coredump, focussing on uft_tcp_closed. When loaded, the binary contains NOP sleds as expected at the sites of the dtrace probes. uft_tcp_closed contains one call and one jump to the uft-tcp-closed probe. Unfortunately, the JMP is treated identically to the CALL:

_ZN4opte6engine4port13Port$LT$N$GT$14uft_tcp_closed17h126181eb75c38034E+0x25a:  popq   %rbp
_ZN4opte6engine4port13Port$LT$N$GT$14uft_tcp_closed17h126181eb75c38034E+0x25b:  nop
_ZN4opte6engine4port13Port$LT$N$GT$14uft_tcp_closed17h126181eb75c38034E+0x25c:  nop
_ZN4opte6engine4port13Port$LT$N$GT$14uft_tcp_closed17h126181eb75c38034E+0x25d:  nop
_ZN4opte6engine4port13Port$LT$N$GT$14uft_tcp_closed17h126181eb75c38034E+0x25e:  nop
_ZN4opte6engine4port13Port$LT$N$GT$14uft_tcp_closed17h126181eb75c38034E+0x25f:  nop

This does not include a RET, and we fall through to the next function along. Currently we can workaround this by instantiating any struct with impl Drop to prevent tailcall optimisation, but this feels risky.

luqmana · 2024-04-17T23:25:50Z

Rather than working around this in opte, this should probably be fixed in illumos. Opened upstream bug and potential fix.

luqmana · 2024-04-17T23:54:22Z

The binary on-disk indeed shows (via objdump) we have a tail call as we jmp to the probe "function":

0000000000000000 <opte::engine::flow_table::flow_expired_probe>:
[...snip...]
  88:   31 c9                   xor    %ecx,%ecx
  8a:   49 8b 7d 00             mov    0x0(%r13),%rdi
  8e:   49 8b 34 24             mov    (%r12),%rsi
  92:   4d 85 f6                test   %r14,%r14
  95:   75 be                   jne    55 <opte::engine::flow_table::flow_expired_probe+0x55>
  97:   45 31 c0                xor    %r8d,%r8d
  9a:   4c 89 ca                mov    %r9,%rdx
  9d:   48 83 c4 18             add    $0x18,%rsp
  a1:   5b                      pop    %rbx
  a2:   41 5c                   pop    %r12
  a4:   41 5d                   pop    %r13
  a6:   41 5e                   pop    %r14
  a8:   41 5f                   pop    %r15
  aa:   5d                      pop    %rbp
  ab:   e9 00 00 00 00          jmp    b0 <opte::engine::flow_table::flow_expired_probe+0xb0>

At runtime on stock illumos bits we see the code in-memory with NOPs replacing the last five bytes (jmp op code + 4 address bytes):

opte::engine::flow_table::flow_expired_probe+0x88:       xorl   %ecx,%ecx
opte::engine::flow_table::flow_expired_probe+0x8a:       movq   0x0(%r13),%rdi
opte::engine::flow_table::flow_expired_probe+0x8e:       movq   (%r12),%rsi
opte::engine::flow_table::flow_expired_probe+0x92:       testq  %r14,%r14
opte::engine::flow_table::flow_expired_probe+0x95:       jne    -0x42    <opte::engine::flow_table::flow_expired_probe+0x55>
opte::engine::flow_table::flow_expired_probe+0x97:       xorl   %r8d,%r8d
opte::engine::flow_table::flow_expired_probe+0x9a:       movq   %r9,%rdx
opte::engine::flow_table::flow_expired_probe+0x9d:       addq   $0x18,%rsp
opte::engine::flow_table::flow_expired_probe+0xa1:       popq   %rbx
opte::engine::flow_table::flow_expired_probe+0xa2:       popq   %r12
opte::engine::flow_table::flow_expired_probe+0xa4:       popq   %r13
opte::engine::flow_table::flow_expired_probe+0xa6:       popq   %r14
opte::engine::flow_table::flow_expired_probe+0xa8:       popq   %r15
opte::engine::flow_table::flow_expired_probe+0xaa:       popq   %rbp
opte::engine::flow_table::flow_expired_probe+0xab:       nop
opte::engine::flow_table::flow_expired_probe+0xac:       nop
opte::engine::flow_table::flow_expired_probe+0xae:       nop
opte::engine::flow_table::flow_expired_probe+0xaf:       nop

With the above patch, we now have a ret to catch us from falling through:

opte::engine::flow_table::flow_expired_probe+0xaa:       popq   %rbp
opte::engine::flow_table::flow_expired_probe+0xab:       nop
opte::engine::flow_table::flow_expired_probe+0xac:       nop
opte::engine::flow_table::flow_expired_probe+0xad:       nop
opte::engine::flow_table::flow_expired_probe+0xae:       nop
opte::engine::flow_table::flow_expired_probe+0xaf:       ret

For the non-tail call case, we continue to patch in only NOPs:

0000000000000000 <opte::engine::flow_table::flow_expired_probe>:
[...snip...]
 725:   eb 03                   jmp    72a <opte::engine::flow_table::flow_expired_probe+0x72a>
 727:   45 31 c0                xor    %r8d,%r8d
 72a:   48 8d 95 00 ff ff ff    lea    -0x100(%rbp),%rdx
 731:   e8 00 00 00 00          call   736 <opte::engine::flow_table::flow_expired_probe+0x736>
 736:   48 81 c4 08 01 00 00    add    $0x108,%rsp
 73d:   5b                      pop    %rbx
 73e:   41 5c                   pop    %r12
 740:   41 5d                   pop    %r13
 742:   41 5e                   pop    %r14
 744:   41 5f                   pop    %r15
 746:   5d                      pop    %rbp
 747:   c3                      ret

opte::engine::flow_table::flow_expired_probe+0x725:      jmp    +0x3     <opte::engine::flow_table::flow_expired_probe+0x72a>
opte::engine::flow_table::flow_expired_probe+0x727:      xorl   %r8d,%r8d
opte::engine::flow_table::flow_expired_probe+0x72a:      leaq   0xffffffffffffff00(%rbp),%rdx
opte::engine::flow_table::flow_expired_probe+0x731:      nop
opte::engine::flow_table::flow_expired_probe+0x732:      nop
opte::engine::flow_table::flow_expired_probe+0x733:      nop
opte::engine::flow_table::flow_expired_probe+0x734:      nop
opte::engine::flow_table::flow_expired_probe+0x735:      nop
opte::engine::flow_table::flow_expired_probe+0x736:      addq   $0x108,%rsp
opte::engine::flow_table::flow_expired_probe+0x73d:      popq   %rbx
opte::engine::flow_table::flow_expired_probe+0x73e:      popq   %r12
opte::engine::flow_table::flow_expired_probe+0x740:      popq   %r13
opte::engine::flow_table::flow_expired_probe+0x742:      popq   %r14
opte::engine::flow_table::flow_expired_probe+0x744:      popq   %r15
opte::engine::flow_table::flow_expired_probe+0x746:      popq   %rbp
opte::engine::flow_table::flow_expired_probe+0x747:      ret

FelixMcFelix · 2024-04-26T22:54:32Z

Closed by illumos#16480.

FelixMcFelix mentioned this issue Mar 15, 2024

Use a repr(C) flow ID to reduce repeated per-packet conversion costs #475

Merged

FelixMcFelix changed the title ~~SDTs directly using &InnerFlowId behave unexpectedly and/or panic~~ SDTs optimised into tail-calls panic Mar 20, 2024

FelixMcFelix added the bug label Mar 28, 2024

FelixMcFelix closed this as completed Apr 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SDTs optimised into tail-calls panic #476

SDTs optimised into tail-calls panic #476

FelixMcFelix commented Mar 12, 2024 •

edited

Loading

FelixMcFelix commented Mar 15, 2024 •

edited

Loading

FelixMcFelix commented Mar 15, 2024 •

edited

Loading

luqmana commented Apr 17, 2024

luqmana commented Apr 17, 2024

FelixMcFelix commented Apr 26, 2024

SDTs optimised into tail-calls panic #476

SDTs optimised into tail-calls panic #476

Comments

FelixMcFelix commented Mar 12, 2024 • edited Loading

FelixMcFelix commented Mar 15, 2024 • edited Loading

FelixMcFelix commented Mar 15, 2024 • edited Loading

luqmana commented Apr 17, 2024

luqmana commented Apr 17, 2024

FelixMcFelix commented Apr 26, 2024

FelixMcFelix commented Mar 12, 2024 •

edited

Loading

FelixMcFelix commented Mar 15, 2024 •

edited

Loading

FelixMcFelix commented Mar 15, 2024 •

edited

Loading