Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[coq] Support for token-based interruption #509

Merged
merged 1 commit into from
May 28, 2024
Merged

[coq] Support for token-based interruption #509

merged 1 commit into from
May 28, 2024

Conversation

ejgallego
Copy link
Owner

@ejgallego ejgallego commented Jun 21, 2023

This very preliminary PR reifies the Coq API so we can pass an interruption / cancellation token around.

Two implementations of the API are provided:

  • one using the "classical" Control.interrupt variable present in Coq
  • one using memprof-limits by @gadmm

The code is very preliminary and will likely be rewritten to better accomodate the API, however the current version works, and testing is much appreciated! This is a delicate part of the system so testing is badly needed.

In the examples we had, memprof-limits works superb! In particular it works in all the examples where polling fails :D

I think this should allow us to get rid of Coq polling eventually.

Some open issues:

  • While token-based interruption works pretty well, there are a few corner cases where the user experience is not optimal. In particular, when a command that takes a lot of time to run (like a Qed) is interrupted (for example the user does a search), we need to restart it from scratch. That's a big PITA. We could solve this in OCaml 5 both with Domains or algebraic effects + continuations, however....
  • ... unfortunately memprof-limits is still not ported to OCaml 5, thus Support for interrupting another domain ocaml/ocaml#11411 is still open.

Also, it didn't take long for us to find a bug in Coq and how they handle resources, c.f. coq/coq#17760

Thanks to @gadmm for all the interest and help.

Also cc: @HazardousPeach

Bugs fixed when using the memprof-limits backend: #139 #484 #487

@ejgallego
Copy link
Owner Author

We could solve this in OCaml 5 both with Domains or algebraic effects + continuations, however....

Actually I'm unsure that could work, indeed semantics of async effects are quite a bit more complex than semantics for regular ones, IIUC (c.f. https://arxiv.org/abs/2003.02110 ) .

@gadmm , is it feasible to think that async interruptions here (for token or limits) could be handled in OCaml 5 as an effect that could be resumed?

@ejgallego
Copy link
Owner Author

I think this needs more testing, so I will postpone to 0.1.8 for now, as I'd like to release 0.1.7 ASAP.

@ejgallego ejgallego added this to the 0.1.8 milestone Jun 23, 2023
@gadmm
Copy link

gadmm commented Jun 23, 2023

If the question is about asynchronously pausing/restarting instead of immediately cancelling, you might be overthinking it as you can probably do this with (sys)threads using condition variables from an async callback (whether with signals & signal masks, or with an approach à la memprof-limits).

@ejgallego
Copy link
Owner Author

If the question is about asynchronously pausing/restarting instead of immediately cancelling

Yes, that's the question.

Usually we have a very slow Qed, which is busy with some large terms in conversion (so no polling available)

With memprof-limits I can now interrupt the slow Qed, which is good, however after the interruption, the manager may decide that it'd like to resume the Qed computation.

Let's assume for simplicity that all the computations here are pure (or properly guarded).

you might be overthinking it as you can probably do this with (sys)threads using condition variables from an async callback (whether with signals & signal masks, or with an approach à la memprof-limits).

Would you minding explaining a bit more how that would work in the above case?

I guess what you suggest is to have the async handler determine whether the computation should be interrupted or resumed, then act accordingly ? But I'm unsure what the limitations there would be, for example if the hanlder needs to do IO before resuming.

I was thinking of an interface more similar to effects, where the handler gets the continuation, which can be resumed or not at any later arbitrary point.

@gadmm
Copy link

gadmm commented Jun 23, 2023

Using condition variables, you let the OS deal with scheduling. In this scenario there is no "async handler" and no need for one.

There is only an async callback, that is called inside the thread to be paused/interrupted. Inside this async callback, it is possible to wait on a condition variable as with control_check below:

type thread_status = Running | Paused | Interrupted

let status = { mutable status : thread_status; mutex : Mutex.create (); cond : Condition.create () }

(* from thread 1 *)
let control_set new_status =
  Mutex.lock status.mutex;
  status.status <- new_status;
  if new_status <> Paused then Condition.broadcast status.cond;
  Mutex.unlock status.mutex

(* from an async callback in thread 2 *)
let control_check () =
  Mutex.lock status.mutex;
  while status.status = Paused do
    Condition.wait status.cond status.mutex
  done;
  let interrupted = (status.status = Interrupted) in
  Mutex.unlock status.mutex;
  if interrupted then raise Sys.Break

Note that in your current approach, this involve changing memprof-limits itself to do something like that.

@gadmm
Copy link

gadmm commented Jun 23, 2023

To really answer your question though:

is it feasible to think that async interruptions here (for token or limits) could be handled in OCaml 5 as an effect that could be resumed?

Currently, it is not possible to perform an effect inside an async callback and handle it outside of the callback. There is no fundamental obstacle to implementing this, but: 1) this requires some work on the runtime to radically change how async callbacks run (essentially remove the need to switch from OCaml to C when running async actions), 2) you would need to convince upstream that this is something which they would like to have.

@ejgallego ejgallego modified the milestones: 0.1.8, 0.1.9 Jul 7, 2023
@ejgallego ejgallego modified the milestones: 0.1.9, 0.3.0 Oct 2, 2023
@HazardousPeach
Copy link
Collaborator

Hey sorry, recently realized that I wasn't getting notified on github mentions. I'll test this right away, this seems super useful.

@ejgallego
Copy link
Owner Author

Hi @HazardousPeach , we actually did a lot of progress with this implementation; I'll push a new tree soon, after discussion with Guillaume (the author of the lib we use here) we'd like to do some significant changes.

I'll be happy to merge this upstream once it gets a bit of testing.

Note that for your use case, we have finally worked (after your feedback and others) a protocol that should work much much better than LSP, but it will use the coq-lsp engine underneath.

@HazardousPeach

This comment was marked as outdated.

ejgallego added a commit that referenced this pull request Mar 31, 2024
…ken.

- We now conditionally depend on `memprof-limits` on OCaml 4.x
- Each call that can be interrupted now additionally takes a `Token.t`

There is still more work to be done so this can be properly used.

cc: #509
@ejgallego ejgallego force-pushed the limits branch 2 times, most recently from 55db79f to 0925408 Compare April 2, 2024 15:39
ejgallego added a commit that referenced this pull request Apr 3, 2024
- We now conditionally depend on `memprof-limits` on OCaml 4.x
- Each call that can be interrupted now additionally can take a `Token.t`

There is still more work to be done so this can be properly used.

cc: #509
ejgallego added a commit that referenced this pull request Apr 3, 2024
- We now conditionally depend on `memprof-limits` on OCaml 4.x
- Each call that can be interrupted now additionally can take a `Token.t`

There is still more work to be done so this can be properly used.

cc: #509
ejgallego added a commit that referenced this pull request Apr 3, 2024
For now the interruption token is still the global `Control.interrupt`
variable, but we adapt the code to the token interface.

This PR was part of #509.
ejgallego added a commit that referenced this pull request Apr 3, 2024
For now the interruption token is still the global `Control.interrupt`
variable, but we adapt the code to the token interface.

This PR was part of #509.
ejgallego added a commit that referenced this pull request Apr 3, 2024
For now the interruption token is still the global `Control.interrupt`
variable, but we adapt the code to the token interface.

This PR was part of #509.
@ejgallego ejgallego modified the milestones: 0.3.0, 0.1.9 Apr 3, 2024
@HazardousPeach
Copy link
Collaborator

Huh, so once I switched back to the 4.x OCaml compiler, I started getting the Prelude Not Found error again. To be sure, I created an opam switch from scratch with the following script, from inside the coq-lsp directory:

#!/usr/bin/env bash
set -e
opam switch create coq-lsp ocaml-base-compiler.4.14.2
eval $(opam env --switch=coq-lsp --set-switch)
opam install vendor/coq/coq{-core,-stdlib,ide-server,}.opam -y
opam install vendor/coq-serapi -y
opam install . -y

but the error persists, even though the same commands worked fine on OCaml 5.x.

Any ideas?

@ejgallego
Copy link
Owner Author

@HazardousPeach is your vendor/coq repository clean?

opam has weird rules, so maybe running git clean -fxd in vendor/coq could help. It could also be that your opam version is old and has some pinning bugs / lack of features?

[Indeed opam has to correctly pick the updated opam file from vendor and only moderm opam do]

@ejgallego
Copy link
Owner Author

Also check that the coq-lsp.opam file contains the rm -r vendor which is critical.

@HazardousPeach
Copy link
Collaborator

HazardousPeach commented May 6, 2024

My vendor/coq is clean. My Opam version is 2.1.2, is that recent enough? But I didn't see rm -r vendor in coq-lsp.opam, and realized that my code was a bit behind (I hadn't updated since you added that I guess), so I fetched the latest pull request code, deleted the switch, and recreated it with the same script.

This failed the first time around because something wasn't right with how opam scheduled the package installs on the last command; the opam install of vendor/coq-core.opam (among others) first compiled without mprof-limits (I think), so when the opam install . ran, it decided to recompile coq-core with mprof-limits. But in doing so it first removed the old version, then tried to build coq-lsp, then installed the new version of coq-core. So the coq-lsp build failed since there was no coq-core for it to import at build time.

So I just ran opam install . again, and got the error:

# Error: Files <...>/.opam/coq-lsp/lib/coq-serapi/serlib/serlib.cmxa
#        and <...>/.opam/coq-lsp/lib/coq-core/lib/lib.cmxa
#        make inconsistent assumptions over implementation CErrors

This seems to be because when coq-core was rebuilt, coq-serapi was not, even though coq-serapi links against it at an OCaml level. Maybe some issue with how the vendored coq-serapi specifies dependencies?

From there I ran opam reinstall vendor/coq-serapi, and then opam install . again, and coq-lsp successfully installed.

After all that though, interrupts still aren't working, with the same behavior as when I had OCaml 5.x.

My current OCaml is 4.14.2, could that be too new for some reason? Or could something else be going on?

For reference, memprof-limits is at 0.2.1

@ejgallego
Copy link
Owner Author

Oh indeed sorry, the problem with memprof-limits is annoying, I will update the instructions.

Indeed you need the latest code. Thanks a lot for the testing, I guess the next step is for me to try to reproduce the issue, would it be possible to share a problematic example?

Thanks again!

@ejgallego
Copy link
Owner Author

@HazardousPeach note that examples in #139 #484 #487 work correctly with the new interruption backend, so maybe these are good tests on your side.

The fix for Coq that I was waiting has been merged, so I will complete the docs, tweak display and usability and close this PR soon.

@HazardousPeach
Copy link
Collaborator

HazardousPeach commented May 7, 2024

Hmm, so the example from #487 doesn't interrupt properly on my end, so maybe there's a problem with how my coq-lsp is set up. Could you post the OCaml and memprof-limits versions that you've tested with? Could also be something about the way I'm driving coq-lsp in python versus through VSCode.
To aid in reproduction, you should be able to reproduce exactly what I'm running by running

pip install git+https://github.com/HazardousPeach/coq_serapy.git

And then running this script:

https://pastebin.com/H5JW2HeR

That clones CompCert, builds it, starts a coq_serapy agent instance with coq-lsp as the backend, and then runs the commands from #487, checking the goals in between each. When it runs the final simpl, after a 30 second timeout, it rolls back the simple and sends a a new document without the simpl, and asks for goals again. On my side, that hangs, and it keeps rolling back a new sentence every 30 seconds.

@ejgallego
Copy link
Owner Author

ejgallego commented May 10, 2024

@HazardousPeach , I have no problem interrupting the example from regular coq-lsp, so I cannot reproduce.

Will check serapy , you can check with VSCode if things work out well in your end (don't forget to pass the parameter)

@ejgallego
Copy link
Owner Author

So indeed things don't work in SerAPY, trying to understand a bit more, however the way the code is structured now makes it pretty hard to debug (why is the code checking trace messages? Also, the way requests are handled seems strange)

@ejgallego
Copy link
Owner Author

ejgallego commented May 10, 2024

Hi @HazardousPeach , I did some more debug, and indeed, I cannot reproduce with the standard node LSP client.

You can find the example here: #626

If you run the recommended command, the sequence serapy is trying works as expected.

I wouldn't be surprised if this is down to some low-level IO Python vs node differences, so IMHO the next step to debug is to be able to display what serapy is sending in the wire.

I will stop this debug session for now, as I'm unsure how to best update the serapy code for this tracking.

@ejgallego
Copy link
Owner Author

ejgallego commented May 10, 2024

I did some further checks, and actually the server is working fine; it seems that for some reason serapy gets deadlocked.

The trace for `test.py is:

...
=> "textDocument/didChange"  { [version: 20] "... unfold map. simpl."
=> "proof/goals" { id : 19 } { [version N/A] }
    [process_queue]: Serving Request: textDocument/didChange
    [process_queue]: Serving Request: proof/goals
    [request]: postponing rq : 19
<= "textDocument/publishDiagnostics" { [version: 20], partial=true }
=> "textDocument/didChange"  { [version: 21] "... unfold map."
    [process_queue]: Serving Request: textDocument/didChange
<= {"jsonrpc":"2.0","id":19,"error":{"code":-32802,"message":"Request got old in server"}}
    [process_queue]: Serving Request: proof/goals
    [request]: postponing rq : 20
<= "textDocument/publishDiagnostics" { [version: 21], partial=true }
<= "textDocument/publishDiagnostics" { [version: 21], partial=false }
    [serving]: rq: 20 | {k:pos | l: 17, c: 11 v:  p: true}
=> "proof/goals" { id : 20 } { answer... }

and continues like this, with coq-lsp mostly idle between each didChange notification.

A wild guess is that serapy doesn't correctly recognized that the server cancelled request id 19 here, as the checking was interrupted before the position on the document was reached to serve the goal request.

Anyways after having a look at the code, I cannot recommend you folks implement a RL agent on top of LSP, it is just not a good fit; I'd look into using petanque.

@HazardousPeach
Copy link
Collaborator

Ah, huh. When I did my own tracing I didn't see those proof/goals responses, maybe my setup has some difference causing an issue. If so, then I wouldn't have had a chance to experience the correct cancelling behavior, so my handling code would be untested and not ready for the correct behavior.

I've got some other tasks I have to catch up on, but I should be able to get back to this later in the week. When I do, I'll re-implement the tracing on my end and see if the behavior still differs, and I'll look into using petanque as a backend instead. It certainly seems like an easier api for what I'm doing, and the ocaml-in-python api would make it even more attractive (though the readme says it doesn't work yet).

However, the RL is already implemented fully on top of coq_serapy with the coq-serapi backend, the only issue is that backtracking is too expensive for smart search to work as well as it should in coq-serapi (but we're under submission anyway). The coq-lsp backend support was meant to fix this issue, and the support is complete except for interruption; the RL algorithm functions fully on proof searches where interruption is not necessary. So, I'll have to assess whether it makes sense to build a petanque backend right away, since the coq-lsp logic is already fully implemented (except for handling interrupt responses properly).

Relevant to this, interruption in petanque works the same as interruption in coq-lsp, right? So, if there's an issue with my setup causing problems interrupting coq-lsp, then that same issue would affect petanque?

@ejgallego
Copy link
Owner Author

Ah, huh. When I did my own tracing I didn't see those proof/goals responses, maybe my setup has some difference causing an issue. If so, then I wouldn't have had a chance to experience the correct cancelling behavior, so my handling code would be untested and not ready for the correct behavior.

So far I've done all my tests on main and worked fine, your backend also passed the parameters correctly.

I had to hack the server to realize it was working fine, tho you should see that in the test case you posted, CPU usage is 0%, however, when you pass --int_backend=Coq, it should go to 100% CPU use.

It certainly seems like an easier api for what I'm doing

Yes, and it will be hugely more efficient. It's been on my mind to fix the problems you folks had on SerAPI for a while, but it took a long time to get the pre-requisites ready.

Now the main todo is multi-threading, I have a patch that could kind of work, however memprof-limits is not available in OCaml 5 :( However I think for RL-like applications, multi-threading in the server is a lower priority than for UI development.

and the ocaml-in-python api would make it even more attractive (though the readme says it doesn't work yet).

That would be amazing indeed, however I didn't manage to get it working, must be some silly thing. I wonder if @thierry-martinez had a chance to look into these problems.

I'll have to assess whether it makes sense to build a petanque backend right away, since the coq-lsp logic is already fully implemented (except for handling interrupt responses properly).

Indeed it is hard for me to asses what would make sense in your case, but I think the problems w.r.t. request will be similar both for petanque and LSP.

We have just considered using JSON-RPC as the official transport for petanque as a server, so in this case, it would share the request base with LSP. IMHO that's a very good thing.

In general, you need to handle all RPC calls as an async request, and check the error cases properly. JSON-RPC provides reasonable settings for this and seems popular enough.

Relevant to this, interruption in petanque works the same as interruption in coq-lsp, right? So, if there's an issue with my setup causing problems interrupting coq-lsp, then that same issue would affect petanque?

I think the issue in your backend was not due to interruptions per-se but due to the cancellation of an async request. Note that this can happen in many cases, not only in interruptions (for example the server can inspect its queue and decide it got enough work to do).

So yes, the issue is very likely to occur in petanque.

Both coq-lsp and petanque are kind of flexible w.r.t. interruptions, for now, they are set to interrupt when a new command has been received, but could do otherwise, including having an expliciit Interrupt command.

@ejgallego
Copy link
Owner Author

Note that Guillaume is coding a python backend for petanque, so maybe that could be plugged into your infra easily.

…s on a sane branch.

This will prevent frustation from users.
@ejgallego
Copy link
Owner Author

Last bit of support has been added.

@ejgallego ejgallego merged commit ae6ec4d into main May 28, 2024
13 checks passed
@ejgallego ejgallego deleted the limits branch May 28, 2024 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants