Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fsevents segfaulting under M1 mac #6151

Closed
emillon opened this issue Sep 15, 2022 · 11 comments · Fixed by #7312 or ocaml/opam-repository#23622
Closed

fsevents segfaulting under M1 mac #6151

emillon opened this issue Sep 15, 2022 · 11 comments · Fixed by #7312 or ocaml/opam-repository#23622
Labels

Comments

@emillon
Copy link
Collaborator

emillon commented Sep 15, 2022

Hi,

I'm trying to make the test suite work on M1 macs.

At the moment, the inline tests for fsevents are triggering a segfault.
I reduced that to the following piece of code:

open Stdune
module Event = Fsevents.Event

let main () =
  let dir = Temp.create Dir ~prefix:"fsevents_dune" ~suffix:"" in
  Sys.chdir (Path.to_string dir);
  let cwd = Sys.getcwd () in
  let fsevent_r = ref None in
  let fsevent =
    Fsevents.create ~paths:[ cwd ] ~latency:0.0 ~f:(fun _ ->
        Fsevents.stop (Option.value_exn !fsevent_r);
        raise Exit)
  in
  fsevent_r := Some fsevent;
  let runloop = Fsevents.RunLoop.in_current_thread () in
  Fsevents.start fsevent runloop;
  match Fsevents.RunLoop.run_current_thread runloop with
  | Ok () -> ()
  | Error e -> raise e

Here's the crash location:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xbeadde8d2dc8)
    frame #0: 0x00000001000f7d90 t.exe`dune_fsevents_callback(streamRef=<unavailable>, t=0x0000600000202dc0, numEvents=<unavailable>, eventPaths=1 element, eventFlags=0x000000010058c000, eventIds=0x0000000100588000) at fsevents_stubs.c:120:23 [opt]
   117    }
   118    v_res = caml_callback_exn(t->v_callback, v_events_xs);
   119    if (Is_exception_result(v_res)) {
-> 120      t->runloop->v_exn = Extract_exception(v_res);
   121      CFRunLoopStop(t->runloop->runloop);
   122    }
   123    CAMLdrop;
Target 0: (t.exe) stopped.
warning: t.exe was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) p/x t
(dune_fsevents_t *) $0 = 0x0000600000202dc0
(lldb) p/x t->runloop
(dune_runloop *) $1 = 0x0000beadde8d2dc0
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xbeadde8d2dc8)
  * frame #0: 0x00000001000f7d90 t.exe`dune_fsevents_callback(streamRef=<unavailable>, t=0x0000600000202dc0, numEvents=<unavailable>, eventPaths=1 element, eventFlags=0x000000010058c000, eventIds=0x0000000100588000) at fsevents_stubs.c:120:23 [opt]
    frame #1: 0x00000001bd6b3000 FSEvents`implementation_callback_rpc + 3560
    frame #2: 0x00000001bd6b2190 FSEvents`_Xcallback_rpc + 220
    frame #3: 0x00000001bd6b2088 FSEvents`FSEventsD2F_server + 72
    frame #4: 0x00000001bd6b51cc FSEvents`FSEventsClientProcessMessageCallback + 68
    frame #5: 0x00000001b6275858 CoreFoundation`__CFMachPortPerform + 260
    frame #6: 0x00000001b62458c8 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 60
    frame #7: 0x00000001b6245784 CoreFoundation`__CFRunLoopDoSource1 + 604
    frame #8: 0x00000001b6243c18 CoreFoundation`__CFRunLoopRun + 2372
    frame #9: 0x00000001b6242b34 CoreFoundation`CFRunLoopRunSpecific + 600
    frame #10: 0x00000001b62cfcc8 CoreFoundation`CFRunLoopRun + 64
    frame #11: 0x00000001000f78c8 t.exe`dune_fsevents_runloop_run(v_runloop=<unavailable>) at fsevents_stubs.c:59:3 [opt]
    frame #12: 0x0000000100127c2c t.exe`caml_c_call + 28
    frame #13: 0x0000000100068404 t.exe`camlFsevents__run_current_thread_481 + 52
    frame #14: 0x0000000100005e10 t.exe`camlDune__exe__T__code_end + 360
    frame #15: 0x0000000100005c98 t.exe`camlDune__exe__T__code_begin + 32
    frame #16: 0x0000000100002ae4 t.exe`caml_program + 4716
    frame #17: 0x0000000100127c9c t.exe`caml_start_program + 104
    frame #18: 0x000000010010226c t.exe`caml_startup_common(argv=0x0000000100068404, pooling=<unavailable>) at startup_nat.c:160:9 [opt]
    frame #19: 0x00000001001022e0 t.exe`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:167:10 [opt]
    frame #20: 0x00000001001022d8 t.exe`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:172:15 [opt]
    frame #21: 0x00000001001022d8 t.exe`caml_main(argv=<unavailable>) at startup_nat.c:179:3 [opt]
    frame #22: 0x0000000100102340 t.exe`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #23: 0x000000010038508c dyld`start + 520

Here, t->runloop cannot be written to. I'm not too sure about what happens with memory management in the fsevent stubs. Printing the value of t->runloop before and after caml_callback_exn shows a different value. One thing I also noticed is that in dune_fsevents_flush_sync, we're releasing the runtime before FSEventStreamFlushSync, but that function runs callbacks which can call out to ocaml code. Even if the runtime is re-acquired in dune_fsevents_callback, I'm not sure that's valid.

Can you have a look @rgrinberg ? Thanks!

@nojb
Copy link
Collaborator

nojb commented Sep 15, 2022

Do you get the segfault if you call Fsevents.stop outside of the callback (after Fsevents.run_current_thread returns)? I believe Fsevents.stop actually frees the associated data structure, so it shouldn't be done when the watcher is still in use.

@emillon
Copy link
Collaborator Author

emillon commented Sep 15, 2022

Yes, that makes the segfault go away. In the test, it's being done in a separate thread. Do you have an idea how to fix this? Move stop to finally in the test (but the test seems to observe that behavior) or moving the deallocation in a finalizer maybe?

@emillon
Copy link
Collaborator Author

emillon commented Sep 29, 2022

I managed to remove the segfault by moving caml_stat_free to a finalizer but the tests do not seem to agree between the committed version of expectations, M1 output and intel output. I'll open a PR with that to discuss.

@emillon
Copy link
Collaborator Author

emillon commented Oct 11, 2022

That PR is #6215

@gridbugs
Copy link
Collaborator

gridbugs commented Nov 1, 2022

Head up that I also encounter this segfault on an intel mac

@emillon
Copy link
Collaborator Author

emillon commented Feb 22, 2023

Ah, that's great. Can you try #6215? I'm not super familiar with these stubs but there's definitely something wrong with memory management in there.

@gridbugs
Copy link
Collaborator

gridbugs commented Mar 6, 2023

I tried #6215 and running the code from the top of the issues on an intel and m1 mac and I no longer get the seg fault (I do get Fatal error: exception Stdlib.Exit now but not sure if that's intentional).

@rgrinberg
Copy link
Member

Do you want to take over that PR and get it over the finish line?

@emillon
Copy link
Collaborator Author

emillon commented Mar 7, 2023

That would be appreciated. I can guide you but I don't have a mac handy at the moment so I can't easily work on it.

@gridbugs
Copy link
Collaborator

gridbugs commented Mar 8, 2023

Yeh sure

@gridbugs
Copy link
Collaborator

I did some digging today into why the crash happens when the fsevents callback calls Fsevents.stop.

The c function dune_fsevents_callback is called on each event, and invokes the user-provided callback (the ~f argument to Fsevents.create) before dereferencing a field of the dune_fsevents_t struct . The c function dune_fsevents_stop deallocates the dune_events_t struct and lucky for us the deallocator seems to corrupt its pointer fields. dune_fsevents_stop is called by the ocaml function Fsevents.stop. So when the fsevents callback calls Fsevents.stop dune_fsevents_callback tries to dereference a field of the dune_events_t after its pointer fields have been corrupted and a segfault happens.

This implies that we'd also see this crash if we ever started, stopped, and then re-started a Fsevents.t as stop causes the value to be deallocated but it's only allocated inside create, though is moot as the Fsevents.state state machine doesn't allow stopping and re-starting.

@emillon's fix (#6215) solves the problem by removing the call to caml_stat_free(t); from dune_fsevents_stop and registering a custom finalizer for the dune_fsevents_t object instead. I've copied to #7312 and rebased onto main.

emillon added a commit to emillon/opam-repository that referenced this issue Apr 4, 2023
…, dune-rpc, dune-rpc-lwt, dune-private-libs, dune-glob, dune-configurator, dune-build-info, dune-action-plugin and chrome-trace (3.7.1)

CHANGES:

- Fix segfault on MacOS when dune was being shutdown while in watch mode.
  (ocaml/dune#7312, fixes ocaml/dune#6151, @gridbugs, @emillon)

- Fix preludes not being recorded as dependencies in the `(mdx)` stanza (ocaml/dune#7109,
  fixes ocaml/dune#7077, @emillon).

- Pass correct flags when compiling `stdlib.ml`. (ocaml/dune#7241, @emillon)

- Handle "Too many links" errors when using Dune cache on Windows.  The fix in
  3.7.0 for this same issue was not effective due to a typo. (ocaml/dune#7472, @nojb)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment