You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The description of this issue is mostly a copy of the email with the same title sent to the google group along with some comments.
As I was working on a patch to address issue #1087, I discovered that occasionally it would crash or hang while running a test tst-bsd-tcp1-zsndrcv.cc. The crash would look like this:
page fault outside application, addr: 0x0000000000000000
[registers]
PC: 0x0000000fc011ca5c <ifa_ifwithnet+152>
X00: 0x0000000000000012 X01: 0xffff90004088f4a0 X02: 0x0000000000000000
X03: 0xffff800040f668c0 X04: 0xffffa00040e778e8 X05: 0xffff800040f66950
X06: 0x0000000000000000 X07: 0x0000000000000000 X08: 0x0000000000000000
X09: 0x0000000000000000 X10: 0x0000000000000000 X11: 0x0000000000000000
X12: 0x0000000000004000 X13: 0x0000000000000000 X14: 0x00000000000000b0
X15: 0x0000000000000000 X16: 0x0000000000000001 X17: 0x0000000000000001
X18: 0x0000000055697ee1 X19: 0xffffa0004088e800 X20: 0xffff800040f66840
X21: 0xffff90004088f000 X22: 0x0000000fc0672838 X23: 0x0000000000000000
X24: 0x0000000000000000 X25: 0xffff90004088f4a0 X26: 0x0000000000000001
X27: 0x00000000ffffffff X28: 0x0000000000000002 X29: 0xffff800040f667a0
X30: 0x0000000fc011ca2c SP: 0xffff800040f667a0 ESR: 0x0000000096000006
PSTATE: 0x0000000020000345
Aborted
[backtrace]
0x0000000fc014c2e4 <mmu::vm_fault(unsigned long, exception_frame*)+724>
0x0000000fc017e4c8 <page_fault+100>
0x0000000fc017e254 <???+-1072176556>
0x0000000fc01234dc <???+-1072548644>
0x0000000fc0124a80 <in_pcbconnect_setup+192>
0x0000000fc0145920 <???+-1072408288>
0x0000000fc0145afc <???+-1072407812>
0x0000000fc0100bb8 <kern_connect+120>
0x0000000fc010de7c <linux_connect+60>
0x0000000fc010233c <connect+12>
0x00001000000376a0 <???+226976>
0x00001000000377e0 <???+227296>
0x0000000fc0257440 <thread_main_c+32>
0x0000000fc017d784 <???+-1072179324>
---------
(gdb) bt
#0 processor::wfi () at arch/aarch64/processor.hh:52
#1 processor::halt_no_interrupts () at arch/aarch64/processor.hh:52
#2 arch::halt_no_interrupts () at arch/aarch64/arch.hh:48
#3 osv::halt () at arch/aarch64/power.cc:24
#4 0x0000000fc00e9fcc in abort (fmt=fmt@entry=0xfc0481190 "Aborted\n") at runtime.cc:142
#5 0x0000000fc00a0654 in abort () at runtime.cc:106
#6 0x0000000fc014c2e8 in mmu::vm_sigsegv (ef=0xffff800040f66680, addr=0) at core/mmu.cc:1420
#7 mmu::vm_sigsegv (ef=0xffff800040f66680, addr=0) at core/mmu.cc:1414
#8 mmu::vm_fault (addr=0, addr@entry=1, ef=ef@entry=0xffff800040f66680) at core/mmu.cc:1442
#9 0x0000000fc017e4cc in page_fault (ef=0xffff800040f66680) at arch/aarch64/mmu.cc:50
#10 <signal handler called>
#11 0x0000000fc011ca5c in ifa_ifwithnet (addr=addr@entry=0xffff800040f66840, ignore_ptp=0) at bsd/sys/net/if.cc:1434
#12 0x0000000fc01234e0 in in_pcbladdr (inp=inp@entry=0xffffa000411c1600, faddr=faddr@entry=0xffff800040f669d8, laddr=laddr@entry=0xffff800040f669d0, cred=0x0)
at bsd/sys/netinet/in_pcb.cc:969
#13 0x0000000fc0124a84 in in_pcbconnect_setup (inp=inp@entry=0xffffa000411c1600, nam=nam@entry=0xffffa00040b046d0, laddrp=laddrp@entry=0xffff800040f66a30,
lportp=lportp@entry=0xffff800040f66a2e, faddrp=faddrp@entry=0xffffa000411c1664, fportp=fportp@entry=0xffffa000411c1654, oinpp=oinpp@entry=0xffff800040f66a38,
cred=cred@entry=0x0) at bsd/sys/netinet/in_pcb.cc:1056
#14 0x0000000fc0145924 in tcp_connect (tp=tp@entry=0xffffa00040e5e400, nam=nam@entry=0xffffa00040b046d0, td=<optimized out>) at bsd/sys/netinet/tcp_usrreq.cc:1089
#15 0x0000000fc0145b00 in tcp_usr_connect (td=<optimized out>, nam=0xffffa00040b046d0, so=<optimized out>) at bsd/sys/netinet/tcp_usrreq.cc:463
#16 tcp_usr_connect (so=<optimized out>, nam=0xffffa00040b046d0, td=<optimized out>) at bsd/sys/netinet/tcp_usrreq.cc:436
#17 0x0000000fc0100bbc in kern_connect (fd=fd@entry=4, sa=0xffffa00040b046d0) at bsd/sys/kern/uipc_syscalls.cc:377
#18 0x0000000fc010de80 in linux_connect (s=s@entry=4, name=name@entry=0xffff800040f66be8, namelen=namelen@entry=16) at bsd/sys/compat/linux/linux_socket.cc:684
#19 0x0000000fc0102340 in connect (fd=fd@entry=4, addr=addr@entry=0xffff800040f66be8, len=len@entry=16) at bsd/sys/kern/uipc_syscalls_wrap.cc:128
#20 0x00001000000376a4 in test_bsd_tcp1::tcp_client (this=0x2000001ffa98) at /home/wkozaczuk/projects/osv/tests/tst-bsd-tcp1-zrcv.cc:169
#21 0x00001000000377e4 in test_bsd_tcp1::run()::{lambda()#2}::operator()() const (__closure=<optimized out>)
at /home/wkozaczuk/projects/osv/tests/tst-bsd-tcp1-zrcv.cc:230
#22 std::__invoke_impl<void, test_bsd_tcp1::run()::{lambda()#2}&>(std::__invoke_other, test_bsd_tcp1::run()::{lambda()#2}&) (__f=...)
at /usr/include/c++/10/bits/invoke.h:60
#23 std::__invoke_r<void, test_bsd_tcp1::run()::{lambda()#2}&>(std::__is_invocable&&, (test_bsd_tcp1::run()::{lambda()#2}&)...) (__fn=...)
at /usr/include/c++/10/bits/invoke.h:153
#24 std::_Function_handler<void (), test_bsd_tcp1::run()::{lambda()#2}>::_M_invoke(std::_Any_data const&) (__functor=...)
at /usr/include/c++/10/bits/std_function.h:291
#25 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040c72e00) at core/sched.cc:1267
#26 sched::thread_main_c (t=0xffffa00040c72e00) at arch/aarch64/arch-switch.hh:162
#27 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) frame 11
#11 0x0000000fc011ca5c in ifa_ifwithnet (addr=addr@entry=0xffff800040f66840, ignore_ptp=0) at bsd/sys/net/if.cc:1434
1434 if (ifa->ifa_addr->sa_family != af)
(gdb) p ifa->ifa_addr
$1 = (bsd_sockaddr *) 0xffff80004088e800
(gdb) p ifa->ifa_addr->sa_family
$2 = 232 '\350'
(gdb) p af
$3 = 2
As you can see the fault happened in line 1434 of if.cc:
when dereferencing the field ifa_addr of the ifa pointer.
In this case, gdb actually showed ifa_addr with a valid non-zero address (which is interesting) but most of the time when I connected I would see ifa_addr equal to 0.
Sometimes this test would "hang", which I discovered was a tcp server thread left listening on a socket after tcp client thread terminated when its loop broke due to an error after calling connect() on a socket:
for (inti=0; i<ITERATIONS; i++) {
...
dbg_d("client: connecting to 127.0.0.1...");
if ( connect(s, (structsockaddr*)&raddr, sizeof(raddr)) <0 ) {
dbg_d("client: connect() failed!");
return-1;
}
}
Initially, I thought that these "hangs" or crashes were not really specific to TCP stack, but instead just random manifestations of some bug in my patch fixing the issue #1087. The patch was about changing preboot.S, boot.S, and core/mmu.cc to move aarch64 kernel code to a different much higher place in virtual memory so that the same loader.elf/kernel.elf could run correctly on both QEMU and firecracker. I speculated that somehow some of virt/phys or phys/virt translation routines have bugs, then possibly something wrong in the new assembly code updating early boot mapping page tables, or finally some dirty cache lines and some missing isb or dsb instruction. Nothing was helping and at some point, I found that strangely adding a nop instruction to premain() in loader.cc shifting the address of premain would somehow make this crash go away (or most likely I did not see it as often). Then I started debugging the TCP stack and code around the crash and "hang" to understand why exactly this was misbehaving. It had to do with some memory mapping bug I thought.
To make this long story short I eventually took a step back and tried to reproduce the exact same crashes/"hangs" on the original version of the code before the patch as it is on master now. Also during the research, I realized that the other 3 tst-bsd-tcp1-*cc tests were failing in a similar way.
All of these tests exercise basic socket API fundamentals where one thread acts as a TCP server listening on some port and another thread acting a client in a loop of 400 iterations connecting to the server, sending some data and reading it back as an echo, and closing the socket. The 3 -z* ones exercise the zero-copy API to send and/or receive data.
After I increased the number of loops to something much higher like 100K I started seeing the exact same errors after 5-10K of iterations with the 0-address faults being more dominant. So clearly the issue was not caused by my kernel memory layout changes but due to some old bug and somehow my patch changes made these more frequent. I also tried to reproduce it on x86_64 with the same number of iterations but to no avail.
Now a little by luck and little by deduction, I managed to realize that the page fault might be caused by something writing over/erasing the memory (stack overflow?) where the ifa_addr field of the bsd_ifaddr struct (ifa variable in the example above). Here is the struct defined in the bsd/sys/net/if_var.h (it will be quite important in this long email below):
/* * The bsd_ifaddr structure contains information about one address * of an interface. They are maintained by the different address families, * are allocated and attached when an address is set, and are linked * together so all addresses for an interface can be located. * * NOTE: a 'struct bsd_ifaddr' is always at the beginning of a larger * chunk of malloc'ed memory, where we store the three addresses * (ifa_addr, ifa_dstaddr and ifa_netmask) referenced here. */structbsd_ifaddr {
structbsd_sockaddr*ifa_addr; /* address of interface */structbsd_sockaddr*ifa_dstaddr; /* other end of p-to-p link */#defineifa_broadaddr ifa_dstaddr /* broadcast address interface */structbsd_sockaddr*ifa_netmask; /* used to determine subnet */structif_dataif_data; /* not all members are meaningful */structifnet*ifa_ifp; /* back-pointer to interface */TAILQ_ENTRY(bsd_ifaddr) ifa_link; /* queue macro glue */void (*ifa_rtrequest) /* check or clean routes (+ or -)'d */
(int, structrtentry*, structrt_addrinfo*);
u_shortifa_flags; /* mostly rt_flags for cloning */u_intifa_refcnt; /* references to this structure */intifa_metric; /* cost of going out this interface */int (*ifa_claim_addr) /* check if an addr goes to this if */
(structbsd_ifaddr*, structbsd_sockaddr*);
structmtxifa_mtx;
};
Obviously, finding what possibly erases this field would be like looking for a needle in the stack of hay. Then I theorized, that the memory allocated to hold the bsd_ifaddr may be freed somehow and used again for the next malloc and end up being written to by something completely else.
To that end, I captured the address of the bsd_ifaddr (for example 0xffffa00040882800) and then put a breakpoint or printf in core/mempool.cc:free() to check if something frees that address. And indeed before the crash or hang it would be caught in this check. Eventually, I managed to capture what calls the free():
#9 0x0000000fc0243e10 in memory::pool::free (this=0xffffa00040e94800, object=object@entry=0xffff80004088e800) at core/mempool.cc:334
#10 0x0000000fc0244010 in free (object=0xffff80004088e800) at core/mempool.cc:1897
#11 free (object=object@entry=0xffffa0004088e800) at core/mempool.cc:1871
#12 0x0000000fc011c568 in ifa_free (ifa=0xffffa0004088e800) at bsd/sys/net/if.cc:1204
#13 ifa_free (ifa=ifa@entry=0xffffa0004088e800) at bsd/sys/net/if.cc:1200
#14 0x0000000fc012c328 in ip_preprocess_packet (m=<optimized out>, protocol=@0xffff800040a47f5b: 192 '\300', hlen=@0xffff800040a47f5c: 20)
at bsd/sys/netinet/ip_input.cc:717
#15 0x0000000fc012c734 in ip_input (m=<optimized out>) at bsd/sys/netinet/ip_input.cc:770
#16 0x0000000fc0117c24 in netisr_process_workstream_proto (proto=1, nwsp=0xfc0671f20 <main_nws>) at bsd/sys/net/netisr.cc:576
#17 swi_net (arg=0xfc0671f20 <main_nws>) at bsd/sys/net/netisr.cc:615
#18 0x0000000fc0118f9c in netisr_osv_thread_wrapper (handler=0xfc0117b10 <swi_net(void*)>, arg=0xfc0671f20 <main_nws>) at bsd/sys/net/netisr1.cc:29
#19 0x0000000fc0119010 in operator() (__closure=<optimized out>, __closure=<optimized out>) at bsd/sys/net/netisr1.cc:39
#20 std::__invoke_impl<void, netisr_osv_start_thread(netisr_osv_handler_t, void*)::<lambda()>&> (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#21 std::__invoke_r<void, netisr_osv_start_thread(netisr_osv_handler_t, void*)::<lambda()>&> (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
#22 std::_Function_handler<void(), netisr_osv_start_thread(netisr_osv_handler_t, void*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
at /usr/include/c++/10/bits/std_function.h:291
#23 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860200) at core/sched.cc:1267
#24 sched::thread_main_c (t=0xffffa00040860200) at arch/aarch64/arch-switch.hh:162
#25 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
And here is where the most critical part of freeBSD networking code in bsd/sys/net/if.cc comes:
It seems that ifa_free() calls mtx_destroy() and free() when the ifa_refcnt field of bsd_ifaddr is 0 after refcount_release() is called. See code of refcount_release in bsd/sys/sys/refcount.h:
static __inline intrefcount_release(volatileu_int*count)
{
u_intold;
/* XXX: Should this have a rel membar? */old=atomic_fetchadd_int(count, -1);
KASSERT(old>0, ("negative refcount %p", count));
return (old==1);
}
BTW the XXX comment made me a little bit suspicious. But before we go there let me try to explain what I figured out about the ifa_refcnt field. The comment from the bsd_ifaddr definition states this:
u_intifa_refcnt; /* references to this structure */
So it seems to be some sort of reference count field. But what purpose does it serve? The simple grep of ifa_refcnt to see where it is used gives this list:
./bsd/sys/net/if_var.h: u_int ifa_refcnt; /* references to this structure */
./bsd/sys/net/if.cc: refcount_init(&ifa->ifa_refcnt, 1);
./bsd/sys/net/if.cc: refcount_acquire(&ifa->ifa_refcnt);
./bsd/sys/net/if.cc: if (refcount_release(&ifa->ifa_refcnt)) {
which in essence are the 3 functions from bsd/sys/net/if.cc I put couple of paragraphs above - ifa_init(), ifa_ref() and ifa_free(). I do not know for sure of the intention behind ifa_refcnt (I did study some FreeBSD commits) but based on its use by these 3 functions, it seems to act as a reference count to implement some sort of garbage collection mechanism around bsd_ifaddr structures to make sure that it is eventually getting free()-ed after no longer used/referenced. From what I figured the ifa_init() is called during this struct allocation and initialization and there are two places in bsd subtree:
if_attach_internal() function in bsd/sys/net/if.cc
in_control() when the SIOCSIFADDR, SIOCSIFNETMASK or SIOCSIFDSTADDR cmd passed in bsd/sys/netinet/in.cc
In essence, the ifa_init() sets the ifa_mtx field and sets ifa_refcnt of the bsd_ifaddr to 1.
It turns out that the ifa_ref() is then called in 33 places and ifa_free() is called in 55 places in the bsd subtree. The former increments and the latter decrements the ifa_refcnt counter.
So I started studying any changes to how these functions may have changed on OSv side of bsd tree and I have found this commit:
8ad9f53 - which is backport of the commit on freebsd side to remove some redundant (?) occurrences of ifa_ref() and ifa_free() from bsd/sys/netinet/ip_output.cc,
Then, I found this patch in FreeBSD codebase around refcount_release() - freebsd/freebsd-src@157d5e6 - which changes refcount_release() to look like this:
static __inline intrefcount_release(volatileu_int*count)
{
u_intold;
/* XXX: Should this have a rel membar? */atomic_thread_fence_rel();
old=atomic_fetchadd_int(count, -1);
KASSERT(old>0, ("negative refcount %p", count));
return (old==1);
if (old>1)
return (0);
/* * Last reference. Signal the user to call the destructor. * * Ensure that the destructor sees all updates. The fence_rel * at the start of the function synchronized with this fence. */atomic_thread_fence_acq();
return (1);
}
And then some other changes adding and removing ifa_ref() and ifa_free():
Why am I listing all these changes? Well, my theory (I think more than that) is that there is some old bug in the reference counting logic around ifa_refcnt in the bsd substree - either un-balanced (too many, too few) number of calls to ifa_ref()/ifa_free() or buggy refcount_release() function (see TODO comment) or combination of those that somehow only shows its ugly face on aarch64. Or possibly there is some race condition that makes calls to ifa_free() be executed before ifa_ref() even if they are balanced. Please note that this bug happens on 1 or more CPUs so it is not SMP-specific by any means.
If that is the case then ideally we would want to fix this bug by back-porting all these commits and possibly many more from the current FreeBSD TCP code. And I have actually tried to do that - 1st by backporting the refcount_release() patch which did not help. Then apply some extra changes to add/remove some calls to ifa_ref() and ifa_free(). This led to some new crashes now even on x86_64. All of this has made me think that bringing back all necessary changes from FreeBSD tree to fix this bug would be a very tedious and risky effort. We might make the situation even worse.
So instead I realized there is a simpler solution to this problem - eliminate buggy reference counting code like this:
The bsd_ifaddr is allocated only a handful of times (=5) and destroyed once during OSv runtime. Here is the list of places that I captured from gdb:
(gdb) hbr bsd/sys/net/if.cc:1187
Hardware assisted breakpoint 1 at 0xfc011bddc: bsd/sys/net/if.cc:1187. (2 locations)
(gdb) c
Continuing.
Breakpoint 1, if_attach_internal (vmove=0, ifp=0xffffa0004088e400) at bsd/sys/net/if.cc:1187
1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0 if_attach_internal (vmove=0, ifp=0xffffa0004088e400) at bsd/sys/net/if.cc:1187
#1 if_attach (ifp=ifp@entry=0xffff90004088f000) at bsd/sys/net/if.cc:519
#2 0x0000000fc011a390 in lo_clone_create (ifc=0xfc061d908 <lo_cloner>, unit=0, params=<optimized out>) at bsd/sys/net/if_loop.cc:116
#3 0x0000000fc011a0d0 in ifc_simple_create (ifc=0xfc061d908 <lo_cloner>, name=0xffff8000407a2d50 "lo0", len=16, params=0x0) at bsd/sys/net/if_clone.cc:572
#4 0x0000000fc0119360 in if_clone_createif (ifc=ifc@entry=0xfc061d908 <lo_cloner>, name=name@entry=0xffff8000407a2d50 "lo0", len=len@entry=16,
params=params@entry=0x0) at bsd/sys/net/if_clone.cc:171
#5 0x0000000fc0119458 in if_clone_createif (params=0x0, len=16, name=0xffff8000407a2d50 "lo0", ifc=0xfc061d908 <lo_cloner>) at bsd/sys/net/if_clone.cc:523
#6 ifc_simple_attach (ifc=0xfc061d908 <lo_cloner>) at bsd/sys/net/if_clone.cc:525
#7 0x0000000fc0119908 in if_clone_attach (ifc=ifc@entry=0xfc061d908 <lo_cloner>) at bsd/sys/net/if_clone.cc:306
#8 0x0000000fc011a420 in vnet_loif_init () at bsd/sys/net/if_loop.cc:125
#9 0x0000000fc00ec2dc in net_init () at bsd/net.cc:64
#10 0x0000000fc00dc99c in main_cont (loader_argc=<optimized out>, loader_argv=<optimized out>) at loader.cc:729
#11 0x0000000fc0257444 in sched::thread::main (this=0xfc0663e28) at core/sched.cc:1267
#12 sched::thread_main_c (t=0xfc0663e28) at arch/aarch64/arch-switch.hh:162
#13 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) c
Continuing.
Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa0004088e800) at bsd/sys/net/if.cc:1187
1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0 ifa_init (ifa=ifa@entry=0xffffa0004088e800) at bsd/sys/net/if.cc:1187
#1 0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034, data=data@entry=0xffff8000407a2d50 "lo0", ifp=ifp@entry=0xffff90004088f000, td=td@entry=0x0)
at bsd/sys/netinet/in.cc:388
#2 0x0000000fc0107554 in osv::start_if (if_name="lo0", ip_addr="127.0.0.1", mask_addr="255.0.0.0") at bsd/porting/networking.cc:113
#3 0x0000000fc00ec350 in net_init () at /usr/include/c++/10/bits/char_traits.h:322
#4 0x0000000fc00dc99c in main_cont (loader_argc=<optimized out>, loader_argv=<optimized out>) at loader.cc:729
#5 0x0000000fc0257444 in sched::thread::main (this=0xfc0663e28) at core/sched.cc:1267
#6 sched::thread_main_c (t=0xfc0663e28) at arch/aarch64/arch-switch.hh:162
#7 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) c
Continuing.
Breakpoint 1, if_attach_internal (vmove=0, ifp=0xffffa0004088ec00) at bsd/sys/net/if.cc:1187
1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0 if_attach_internal (vmove=0, ifp=0xffffa0004088ec00) at bsd/sys/net/if.cc:1187
#1 if_attach (ifp=ifp@entry=0xffff900040961000) at bsd/sys/net/if.cc:519
#2 0x0000000fc0110314 in ether_ifattach (ifp=0xffff900040961000, lla=lla@entry=0xffff90004096a248 "RT") at bsd/sys/net/if_ethersubr.cc:902
#3 0x0000000fc017790c in virtio::net::net (this=0xffff90004096a000, dev=...) at drivers/virtio-net.cc:304
#4 0x0000000fc0177cdc in aligned_new<virtio::net, virtio::virtio_device&> () at /usr/include/c++/10/new:175
#5 virtio::net::probe (dev=<optimized out>) at drivers/virtio-net.cc:931
#6 0x0000000fc016cfd8 in std::function<hw::hw_driver* (hw::hw_device*)>::operator()(hw::hw_device*) const (__args#0=<optimized out>, this=<optimized out>)
at drivers/driver.cc:39
#7 operator() (dev=0xffffa0004096f300, __closure=0x200000100b10) at drivers/driver.cc:40
#8 std::__invoke_impl<void, hw::driver_manager::load_all()::<lambda(hw::hw_device*)>&, hw::hw_device*> (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#9 std::__invoke_r<void, hw::driver_manager::load_all()::<lambda(hw::hw_device*)>&, hw::hw_device*> (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
#10 std::_Function_handler<void(hw::hw_device*), hw::driver_manager::load_all()::<lambda(hw::hw_device*)> >::_M_invoke(const std::_Any_data &, hw::hw_device *&&) (
__functor=..., __args#0=<optimized out>) at /usr/include/c++/10/bits/std_function.h:291
#11 0x0000000fc0169658 in std::function<void (hw::hw_device*)>::operator()(hw::hw_device*) const (__args#0=<optimized out>, this=0x200000100b10)
at /usr/include/c++/10/bits/std_function.h:617
#12 hw::device_manager::for_each_device(std::function<void (hw::hw_device*)>) (this=this@entry=0xffffa00040854b40, func=...) at drivers/device.cc:67
#13 0x0000000fc016cd18 in hw::driver_manager::load_all (this=this@entry=0xffffa00040854f80) at /usr/include/c++/10/new:175
#14 0x0000000fc017b7e8 in arch_init_drivers () at arch/aarch64/arch-setup.cc:240
#15 0x0000000fc00dcf7c in do_main_thread (_main_args=0xffffa000406fb7f0) at loader.cc:416
#16 0x0000000fc02b8eb4 in operator() (__closure=0xffffa00040861c00) at libc/pthread.cc:116
#17 std::__invoke_impl<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__f=...)
at /usr/include/c++/10/bits/invoke.h:60
#18 std::__invoke_r<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
at /usr/include/c++/10/bits/invoke.h:153
#19 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/10/bits/std_function.h:291
#20 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860c00) at core/sched.cc:1267
#21 sched::thread_main_c (t=0xffffa00040860c00) at arch/aarch64/arch-switch.hh:162
#22 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) c
Continuing.
Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at bsd/sys/net/if.cc:1187
1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0 ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at bsd/sys/net/if.cc:1187
#1 0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034, data=data@entry=0x200000100a70 "eth0", ifp=ifp@entry=0xffff900040961000, td=td@entry=0x0)
at bsd/sys/netinet/in.cc:388
#2 0x0000000fc0107554 in osv::start_if (if_name="eth0", ip_addr="0.0.0.0", mask_addr="255.255.255.0") at bsd/porting/networking.cc:113
#3 0x0000000fc00da234 in operator() (if_name="eth0", __closure=0x200000100db0) at /usr/include/c++/10/ext/new_allocator.h:79
#4 std::__invoke_impl<void, do_main_thread(void*)::<lambda(std::string)>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__f=...)
at /usr/include/c++/10/bits/invoke.h:60
#5 std::__invoke_r<void, do_main_thread(void*)::<lambda(std::string)>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__fn=...)
at /usr/include/c++/10/bits/invoke.h:153
#6 std::_Function_handler<void(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), do_main_thread(void*)::<lambda(std::string)> >::_M_invoke(const std::_Any_data &, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > &&) (__functor=..., __args#0=...)
at /usr/include/c++/10/bits/std_function.h:291
#7 0x0000000fc0107274 in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) const (__args#0="", this=0x200000100db0) at /usr/include/c++/10/bits/std_function.h:617
#8 osv::for_each_if(std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>) (func=...) at bsd/porting/networking.cc:32
#9 0x0000000fc00dcfd4 in do_main_thread (_main_args=0xffffa000406fb7f0) at /usr/include/c++/10/new:175
#10 0x0000000fc02b8eb4 in operator() (__closure=0xffffa00040861c00) at libc/pthread.cc:116
#11 std::__invoke_impl<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__f=...)
at /usr/include/c++/10/bits/invoke.h:60
#12 std::__invoke_r<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
at /usr/include/c++/10/bits/invoke.h:153
#13 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/10/bits/std_function.h:291
#14 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860c00) at core/sched.cc:1267
#15 sched::thread_main_c (t=0xffffa00040860c00) at arch/aarch64/arch-switch.hh:162
#16 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) c
Continuing.
Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at bsd/sys/net/if.cc:1187
1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0 ifa_init (ifa=ifa@entry=0xffffa00040b0fc00) at bsd/sys/net/if.cc:1187
#1 0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034, data=data@entry=0x200000100a70 "eth0", ifp=ifp@entry=0xffff900040961000, td=td@entry=0x0)
at bsd/sys/netinet/in.cc:388
#2 0x0000000fc0107554 in osv::start_if (if_name="eth0", ip_addr="0.0.0.0", mask_addr="255.255.255.0") at bsd/porting/networking.cc:113
#3 0x0000000fc00da234 in operator() (if_name="eth0", __closure=0x200000100db0) at /usr/include/c++/10/ext/new_allocator.h:79
#4 std::__invoke_impl<void, do_main_thread(void*)::<lambda(std::string)>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__f=...)
at /usr/include/c++/10/bits/invoke.h:60
#5 std::__invoke_r<void, do_main_thread(void*)::<lambda(std::string)>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__fn=...)
at /usr/include/c++/10/bits/invoke.h:153
#6 std::_Function_handler<void(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), do_main_thread(void*)::<lambda(std::string)> >::_M_invoke(const std::_Any_data &, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > &&) (__functor=..., __args#0=...)
at /usr/include/c++/10/bits/std_function.h:291
#7 0x0000000fc0107274 in std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) const (__args#0="", this=0x200000100db0) at /usr/include/c++/10/bits/std_function.h:617
#8 osv::for_each_if(std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>) (func=...) at bsd/porting/networking.cc:32
#9 0x0000000fc00dcfd4 in do_main_thread (_main_args=0xffffa000406fb7f0) at /usr/include/c++/10/new:175
#10 0x0000000fc02b8eb4 in operator() (__closure=0xffffa00040861c00) at libc/pthread.cc:116
#11 std::__invoke_impl<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__f=...)
at /usr/include/c++/10/bits/invoke.h:60
#12 std::__invoke_r<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
at /usr/include/c++/10/bits/invoke.h:153
#13 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/10/bits/std_function.h:291
#14 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040860c00) at core/sched.cc:1267
#15 sched::thread_main_c (t=0xffffa00040860c00) at arch/aarch64/arch-switch.hh:162
#16 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
(gdb) c
Continuing.
Breakpoint 1, ifa_init (ifa=ifa@entry=0xffffa00040960400) at bsd/sys/net/if.cc:1187
1187 mtx_init(&ifa->ifa_mtx, "bsd_ifaddr", NULL, MTX_DEF);
(gdb) bt
#0 ifa_init (ifa=ifa@entry=0xffffa00040960400) at bsd/sys/net/if.cc:1187
#1 0x0000000fc01219c8 in in_control (so=so@entry=0x0, cmd=1077963034, data=data@entry=0xffff800040d52d90 "eth0", ifp=ifp@entry=0xffff900040961000, td=td@entry=0x0)
at bsd/sys/netinet/in.cc:388
#2 0x0000000fc0107554 in osv::start_if (if_name="eth0", ip_addr="192.168.122.15", mask_addr="255.255.255.0") at bsd/porting/networking.cc:113
#3 0x0000000fc027b05c in dhcp::dhcp_interface_state::state_request (this=this@entry=0xffffa0004096bc40, dm=...) at /usr/include/c++/10/bits/char_traits.h:322
#4 0x0000000fc027c3d0 in dhcp::dhcp_interface_state::process_packet (this=0xffffa0004096bc40, m=m@entry=0xffffa00040c7c100) at core/dhcp.cc:633
#5 0x0000000fc027c540 in dhcp::dhcp_worker::dhcp_worker_fn (this=0xfc06aab30 <net_dhcp_worker>) at core/dhcp.cc:843
#6 0x0000000fc027c5e0 in operator() (__closure=<optimized out>, __closure=<optimized out>) at core/dhcp.cc:773
#7 std::__invoke_impl<void, dhcp::dhcp_worker::init()::<lambda()>&> (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#8 std::__invoke_r<void, dhcp::dhcp_worker::init()::<lambda()>&> (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
#9 std::_Function_handler<void(), dhcp::dhcp_worker::init()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
at /usr/include/c++/10/bits/std_function.h:291
#10 0x0000000fc0257444 in sched::thread::main (this=0xffffa00040b08800) at core/sched.cc:1267
#11 sched::thread_main_c (t=0xffffa00040b08800) at arch/aarch64/arch-switch.hh:162
#12 0x0000000fc017d788 in thread_main () at arch/aarch64/entry.S:116
Ideally, as a "good citizen" we would want to free what we allocated but given that even now these 5 calls to allocate bsd_ifaddr (mostly from osv::start_if()) are not balanced with corresponding ones to free bsd_ifaddr, possibly because we lack all necessary calls to osv::stop_if() (only one in core/dhcp.cc). But given there is only one occurrence of freeing bsd_ifaddr(), disabling reference counting logic would only lead to a single leak of memory (roughly 1024 bytes). These structures are long-lived and map to a handful of IPs/interfaces (lo, eth0, ...) OSv would have.
There is also extra benefit: the refcount_release()/refcount_acquire() functions called by ifa_free()/ifa_ref() are not free - they operate on atomics and probably lead to some contention in SMP case. I have not made any measurements but I would assume the tcp stack could become a little bit faster by not using atomics at all (there are 33 calls to ifa_ref() and 55 to ifa_free() in bsd subtree). For example, the refcount_acquire() calls this inline assembly:
After applying the patch (see above) to disable ifa_refcnt counting (ifa_ref(), ifa_free() do nothing) and running all 4 tst-bsd-tcp-* tests with 100K iterations on 1, 2, and 4 CPUs on both aarch64 and x86_64 I could not reproduce these crashes and "hang" scenarios.
The text was updated successfully, but these errors were encountered:
The description of this issue is mostly a copy of the email with the same title sent to the google group along with some comments.
As I was working on a patch to address issue #1087, I discovered that occasionally it would crash or hang while running a test
tst-bsd-tcp1-zsndrcv.cc
. The crash would look like this:As you can see the fault happened in line 1434 of if.cc:
when dereferencing the field
ifa_addr
of theifa
pointer.In this case, gdb actually showed
ifa_addr
with a valid non-zero address (which is interesting) but most of the time when I connected I would seeifa_addr
equal to 0.Sometimes this test would "hang", which I discovered was a tcp server thread left listening on a socket after tcp client thread terminated when its loop broke due to an error after calling connect() on a socket:
Initially, I thought that these "hangs" or crashes were not really specific to TCP stack, but instead just random manifestations of some bug in my patch fixing the issue #1087. The patch was about changing
preboot.S
,boot.S
, andcore/mmu.cc
to move aarch64 kernel code to a different much higher place in virtual memory so that the same loader.elf/kernel.elf could run correctly on both QEMU and firecracker. I speculated that somehow some of virt/phys or phys/virt translation routines have bugs, then possibly something wrong in the new assembly code updating early boot mapping page tables, or finally some dirty cache lines and some missing isb or dsb instruction. Nothing was helping and at some point, I found that strangely adding a nop instruction to premain() in loader.cc shifting the address of premain would somehow make this crash go away (or most likely I did not see it as often). Then I started debugging the TCP stack and code around the crash and "hang" to understand why exactly this was misbehaving. It had to do with some memory mapping bug I thought.To make this long story short I eventually took a step back and tried to reproduce the exact same crashes/"hangs" on the original version of the code before the patch as it is on master now. Also during the research, I realized that the other 3 tst-bsd-tcp1-*cc tests were failing in a similar way.
All of these tests exercise basic socket API fundamentals where one thread acts as a TCP server listening on some port and another thread acting a client in a loop of 400 iterations connecting to the server, sending some data and reading it back as an echo, and closing the socket. The 3 -z* ones exercise the zero-copy API to send and/or receive data.
After I increased the number of loops to something much higher like 100K I started seeing the exact same errors after 5-10K of iterations with the 0-address faults being more dominant. So clearly the issue was not caused by my kernel memory layout changes but due to some old bug and somehow my patch changes made these more frequent. I also tried to reproduce it on x86_64 with the same number of iterations but to no avail.
Now a little by luck and little by deduction, I managed to realize that the page fault might be caused by something writing over/erasing the memory (stack overflow?) where the ifa_addr field of the bsd_ifaddr struct (ifa variable in the example above). Here is the struct defined in the
bsd/sys/net/if_var.h
(it will be quite important in this long email below):Obviously, finding what possibly erases this field would be like looking for a needle in the stack of hay. Then I theorized, that the memory allocated to hold the
bsd_ifaddr
may be freed somehow and used again for the next malloc and end up being written to by something completely else.To that end, I captured the address of the
bsd_ifaddr
(for example0xffffa00040882800
) and then put a breakpoint or printf incore/mempool.cc:free()
to check if something frees that address. And indeed before the crash or hang it would be caught in this check. Eventually, I managed to capture what calls the free():And here is where the most critical part of freeBSD networking code in
bsd/sys/net/if.cc
comes:It seems that
ifa_free()
callsmtx_destroy()
andfree()
when theifa_refcnt
field ofbsd_ifaddr
is 0 afterrefcount_release()
is called. See code of refcount_release inbsd/sys/sys/refcount.h
:BTW the XXX comment made me a little bit suspicious. But before we go there let me try to explain what I figured out about the ifa_refcnt field. The comment from the bsd_ifaddr definition states this:
So it seems to be some sort of reference count field. But what purpose does it serve? The simple grep of
ifa_refcnt
to see where it is used gives this list:which in essence are the 3 functions from
bsd/sys/net/if.cc
I put couple of paragraphs above - ifa_init()
,ifa_ref()
andifa_free()
. I do not know for sure of the intention behindifa_refcnt
(I did study some FreeBSD commits) but based on its use by these 3 functions, it seems to act as a reference count to implement some sort of garbage collection mechanism aroundbsd_ifaddr
structures to make sure that it is eventually getting free()-ed after no longer used/referenced. From what I figured theifa_init()
is called during this struct allocation and initialization and there are two places in bsd subtree:if_attach_internal()
function inbsd/sys/net/if.cc
in_control()
when theSIOCSIFADDR
,SIOCSIFNETMASK
orSIOCSIFDSTADDR
cmd passed inbsd/sys/netinet/in.cc
In essence, the
ifa_init()
sets theifa_mtx
field and setsifa_refcnt
of thebsd_ifaddr
to 1.It turns out that the
ifa_ref()
is then called in 33 places andifa_free()
is called in 55 places in the bsd subtree. The former increments and the latter decrements theifa_refcnt
counter.So I started studying any changes to how these functions may have changed on OSv side of bsd tree and I have found this commit:
8ad9f53 - which is backport of the commit on freebsd side to remove some redundant (?) occurrences of ifa_ref() and
ifa_free()
from bsd/sys/netinet/ip_output.cc,Then, I found this patch in FreeBSD codebase around
refcount_release()
- freebsd/freebsd-src@157d5e6 - which changesrefcount_release()
to look like this:And then some other changes adding and removing
ifa_ref()
andifa_free()
:NET_EPOCH_CALL
in some related placesOn top of that the
ifa_ref()
andifa_free()
have been changed and enhanced with new related functions and look like this in the latest version (see https://github.com/freebsd/freebsd-src/blob/main/sys/net/if.c#L1759-L1795):The struct
ifaddr
has changed as well.Why am I listing all these changes? Well, my theory (I think more than that) is that there is some old bug in the reference counting logic around
ifa_refcnt
in the bsd substree - either un-balanced (too many, too few) number of calls toifa_ref()
/ifa_free()
or buggyrefcount_release()
function (see TODO comment) or combination of those that somehow only shows its ugly face on aarch64. Or possibly there is some race condition that makes calls toifa_free()
be executed beforeifa_ref()
even if they are balanced. Please note that this bug happens on 1 or more CPUs so it is not SMP-specific by any means.If that is the case then ideally we would want to fix this bug by back-porting all these commits and possibly many more from the current FreeBSD TCP code. And I have actually tried to do that - 1st by backporting the refcount_release() patch which did not help. Then apply some extra changes to add/remove some calls to
ifa_ref()
andifa_free()
. This led to some new crashes now even on x86_64. All of this has made me think that bringing back all necessary changes from FreeBSD tree to fix this bug would be a very tedious and risky effort. We might make the situation even worse.So instead I realized there is a simpler solution to this problem - eliminate buggy reference counting code like this:
The
bsd_ifaddr
is allocated only a handful of times (=5) and destroyed once during OSv runtime. Here is the list of places that I captured from gdb:Ideally, as a "good citizen" we would want to free what we allocated but given that even now these 5 calls to allocate
bsd_ifaddr
(mostly fromosv::start_if()
) are not balanced with corresponding ones to free bsd_ifaddr, possibly because we lack all necessary calls toosv::stop_if()
(only one incore/dhcp.cc
). But given there is only one occurrence of freeingbsd_ifaddr()
, disabling reference counting logic would only lead to a single leak of memory (roughly 1024 bytes). These structures are long-lived and map to a handful of IPs/interfaces (lo, eth0, ...) OSv would have.There is also extra benefit: the
refcount_release()
/refcount_acquire()
functions called byifa_free()
/ifa_ref()
are not free - they operate on atomics and probably lead to some contention in SMP case. I have not made any measurements but I would assume the tcp stack could become a little bit faster by not using atomics at all (there are 33 calls toifa_ref()
and 55 toifa_free()
in bsd subtree). For example, therefcount_acquire()
calls this inline assembly:This is actually a loop likely to happen in SMP contentious cases. See this article - https://cpufun.substack.com/p/atomics-in-aarch64?s=r.
After applying the patch (see above) to disable ifa_refcnt counting (
ifa_ref()
,ifa_free()
do nothing) and running all 4 tst-bsd-tcp-* tests with 100K iterations on 1, 2, and 4 CPUs on both aarch64 and x86_64 I could not reproduce these crashes and "hang" scenarios.The text was updated successfully, but these errors were encountered: