Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal tracing implementation (internal tracing PR 1 of 3) #12703

Merged

Conversation

tizoc
Copy link
Member

@tizoc tizoc commented Feb 23, 2023

This PR implements support for internal tracing. It is still a WIP but is already close to the final version.

Note that this is a new implementation, different from the one we have been deploying to our cluster, but it records enough information to allow for the reproduction of the same features provided by the original implementation.

This version is much simpler, and only records the flat, low-level trace without trying to make any sense of it. This will be done by an external consumer (unlike in the version we have been using so far that performs this inside the node, and exposes traces through new graphql queries). This external consumer doesn't need to belong to this repository, and can change without requiring changes to the node itself (except when new traces or metadata is required).

The interface and format are documented in https://github.com/openmina/mina/blob/feature/internal-tracing-develop/src/lib/internal_tracing/internal_tracing.mli

Impact:

  • Disabled by default.
  • Internal tracing will produce extra output files when enabled. Support for producing output to different targets (or using different strategies) is easy to add.
  • When making changes to code that contains trace recording calls, these calls will have to be taken into account, otherwise the quality of the traces may degrade.

Scope:

  • Current version focuses on block production and processing, but can be easily extended to also trace other parts of the system like storage/frontier related operations (original implementation traces that too but it is missing in this version) the SNARK and transaction pools, etc
  • There is preliminary support for proof production tracing being worked on, but right not that is not part of this PR.

Right now there are 3 commits separated to ease reviewing.

  • The first commit adds the library that implements internal tracing support
  • The second commit integrates it into the node by adding the CLI flag, mina client command, setup, etc.
  • The third commit adds trace recording calls to the code

I will keep adding commits on top but eventually I will merge them back accordingly to each one of those 3 commits.

Pending:

  • Proof tracing. @binier is working on this, it needs to be adapted to this new version of internal tracing (shouldn't be an issue). May be submitted as a separate PR later? (left for later)
  • Support for dynamically enabling/disabling tracing through the client (currently this can only be controlled at startup, by default internal tracing is disabled, and the -internal-tracing flag enables it).
  • Proper file handling (current version outputs the trace to stdout, final version will use logrotated files, and maybe support for streaming over a pipe)
  • Cleanup and revise checkpoints (some of the checkpoints are probably not required anymore and can me removed, others have to be revised to make sure that they are still correct after a rebase over develop)
  • Recover useful checkpoints lost when rebasing over develop (mostly related to transaction application, the model changed considerably in develop when compared to release/2.0.0)
  • Record any other events that may be useful (like status changes: bootstrap, synced, etc) (left for later)
  • Unit/integration tests? not sure how to best approach this, but ideally during the test suite traces would be collected and checked to ensure that they remain correct after changes made to the node. (left for later)

Checklist:

  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? List them

@tizoc tizoc force-pushed the feature/internal-tracing-develop branch from 604f9d3 to c2cd190 Compare February 23, 2023 19:32
@tizoc
Copy link
Member Author

tizoc commented Feb 24, 2023

Question, would it be a problem to update yojson to version 2.0.2? That would let me avoid a bunch of small allocations when handling the logged events by reusing a permanent buffer when rendering the JSON strings from the Yojson values.

The opam.export diff would look like this (notice that it upgrades the graphql libraries):

--- /home/bruno/projects/mina-protocol/mina/opam.export 2023-02-21 18:48:29.462164153 -0300
+++ opam.export 2023-02-24 09:56:21.981966163 -0300
@@ -1,4 +1,5 @@
 opam-version: "2.0"
+compiler: ["ocaml-base-compiler.4.14.0"]
 roots: [
   "alcotest.1.1.0"
   "angstrom.0.15.0"
@@ -12,16 +13,16 @@
   "cohttp-async.5.0.0"
   "core_extended.v0.14.0"
   "extlib.1.7.8"
-  "graphql-async.0.13.0"
+  "graphql-async.0.14.0"
   "graphql-cohttp.0.13.0"
-  "graphql-lwt.0.13.0"
+  "graphql-lwt.0.14.0"
   "graphql_ppx.1.2.2"
   "js_of_ocaml.4.0.0"
   "js_of_ocaml-ppx.4.0.0"
   "js_of_ocaml-toplevel.4.0.0"
   "lmdb.1.0"
   "menhir.20210419"
-  "merlin.4.5-414"
+  "merlin.4.7-414"
   "ocaml-base-compiler.4.14.0"
   "ocamlformat.0.20.1"
   "ocamlgraph.1.8.8"
@@ -38,7 +39,7 @@
   "rpc_parallel.v0.14.0"
   "sexp_diff_kernel.v0.14.0"
   "utop.2.9.1"
-  "yojson.1.7.0"
+  "yojson.2.0.2"
 ]
 installed: [
   "alcotest.1.1.0"
@@ -102,7 +103,7 @@
   "ctypes-foreign.0.4.0"
   "digestif.0.9.0"
   "domain-name.0.3.0"
-  "dot-merlin-reader.4.2"
+  "dot-merlin-reader.4.6"
   "dune.3.3.1"
   "dune-build-info.3.1.1"
   "dune-configurator.2.8.2"
@@ -115,10 +116,10 @@
   "fix.20201120"
   "fmt.0.8.6"
   "fpath.0.7.3"
-  "graphql.0.13.0"
-  "graphql-async.0.13.0"
+  "graphql.0.14.0"
+  "graphql-async.0.14.0"
   "graphql-cohttp.0.13.0"
-  "graphql-lwt.0.13.0"
+  "graphql-lwt.0.14.0"
   "graphql_parser.0.12.2"
   "graphql_ppx.1.2.2"
   "incremental.v0.14.0"
@@ -144,8 +145,9 @@
   "menhir.20210419"
   "menhirLib.20210419"
   "menhirSdk.20210419"
-  "merlin.4.5-414"
+  "merlin.4.7-414"
   "merlin-extend.0.6.1"
+  "merlin-lib.4.7-414"
   "mew.0.1.0"
   "mew_vi.0.5.0"
   "minicli.5.0.2"
@@ -253,7 +255,7 @@
   "uuseg.13.0.0"
   "uutf.1.0.2"
   "variantslib.v0.14.0"
-  "yojson.1.7.0"
+  "yojson.2.0.2"
   "zarith.1.7"
   "zarith_stubs_js.v0.14.1"
   "zed.3.1.0"

@tizoc tizoc force-pushed the feature/internal-tracing-develop branch 3 times, most recently from 3a8f0b3 to cfe1559 Compare February 24, 2023 19:56
@tizoc
Copy link
Member Author

tizoc commented Feb 24, 2023

Still need to revise the recorded checkpoints, and maybe add more documentation. API could change a little but I think it is ready for reviewing as-is. Removing draft status.

@tizoc tizoc marked this pull request as ready for review February 24, 2023 20:04
@tizoc tizoc requested a review from a team as a code owner February 24, 2023 20:04
@tizoc tizoc force-pushed the feature/internal-tracing-develop branch 3 times, most recently from f65863f to b773585 Compare February 25, 2023 19:02
@tizoc tizoc requested a review from a team as a code owner February 25, 2023 19:02
@psteckler
Copy link
Member

There's an existing logging mechanism, is this additional mechanism needed?

You could add a trace level to that mechanism, and recover the traces using logproc. Alternatively, use a trace keyword in logs, and grep for that.

@tizoc
Copy link
Member Author

tizoc commented Feb 27, 2023

(EDIT: see next comment, I may have misunderstood what you were suggesting)

Hello @psteckler, that is a good question. That approach was considered when switching to the current implementation, and the separate output channel is not strictly needed (except for the part that attaches the current block information to the async execution context and detects when the async scheduler switches to processing a different block, so that checkpoints can be properly associated to the correct block trace), but the current implementation has some advantages:

  • a logger is not required in scope (so if a new checkpoint is added deep into the call stack, there is no need to update the call-chain to add a logger). When adding a new checkpoint, the only requirement is that at some point, in the current Async execution context information about the current block gets attached.
  • the output is more compact
  • is easy to trace Rust code too (part of proof tracing involves this)
  • can be enabled/disabled dynamically (trace logging level has to be set at startup, but I guess this could be changed).
  • being independent of the logging mechanism is in itself useful. The output is not meant to be read by humans, but instead to be consumed and processed by external tools (in our case, it is used to reconstruct structured traces for each block so that we can display them in our UI, and also compare the results of different runs on CI):

image

@tizoc
Copy link
Member Author

tizoc commented Feb 27, 2023

Thinking a bit more about this, I may have mis-understood what you meant @psteckler.

You could add a trace level to that mechanism

I got confused by the above because that level exists already, and at some point I considered just adding more [%log trace] calls to the code and just process the logs.

But what you meant is that I could add a new separate level for these internal traces, correct? Then by registering a new log consumer with a custom processor and transport, I could do the same thing I am doing now. So I can still control the format, output target, etc, and also be able to dynamically toggle internal tracing without affecting regular logging.

I think that would solve many of the issues I mentioned above, but a big one that is not solved is the requirement for having a logger handle in the scope (but maybe there is a solution for that).

@psteckler
Copy link
Member

I'd forgotten about the existing trace level, but yes, you could add a new level that's treated specially.

Propagating the logger argument can be tedious in some cases, the compiler will help you find them, of course. In many cases, it's available in the CONTEXT argument.

@tizoc
Copy link
Member Author

tizoc commented Feb 27, 2023

I did a quick test by adding a dummy ~logger:(_ : Logger.t) argument to the trace recording functions and updated all the calls to see how much would break. Surprisingly to me, with the current checkpoints I had to update far fewer calls that I thought to make the logger available in the scopes where it was required and not present already.

I will investigate your suggested approach more, and if I don't find any serious blocker (but I think it should work fine), I will commit an alternative version based on that.

@tizoc tizoc force-pushed the feature/internal-tracing-develop branch from b773585 to c8fc3d2 Compare February 28, 2023 17:43
@tizoc
Copy link
Member Author

tizoc commented Feb 28, 2023

@psteckler hello. I just pushed a new version that uses the logging system as you suggested. A new spy level was added (trace is taken already) as the lowest priority, and a new consumer to process that level too.
The metadata, control, etc recording functions were replaced by special control strings that are interpreted accordingly.

checkpoints looks like this now:

 [%log spy] "Generate_next_state" ;
 let%bind next_state_opt =
   generate_next_state ~constraint_constants ~scheduled_time
     ~block_data ~previous_protocol_state ~time_controller
     ~staged_ledger:(Breadcrumb.staged_ledger crumb)
     ~transactions ~get_completed_work ~logger ~log_block_creation
     ~winner_pk:winner_pubkey ~block_reward_threshold
 in
 [%log spy] "Generate_next_state_done" ;

and special commands look like this:

[%log spy] "@block_metadata"
  ~metadata:
    [ ( "blockchain_length"
      , Mina_numbers.Length.to_yojson
        @@ Mina_block.blockchain_length
        @@ Breadcrumb.block breadcrumb )
    ] ;

The Internal_tracing.with_slot/with_state_hash calls remain the same.

@tizoc
Copy link
Member Author

tizoc commented Feb 28, 2023

Btw, a few things may change (probably not much) and I still have to document the new interface, but it would be very useful to get your opinion on this version. Once everything is ready and if we decide to go with this version I will re-organize the commits so that they follow the original structure:

  • implementation commit
  • integration commit
  • checkpoints commit.

@psteckler
Copy link
Member

psteckler commented Feb 28, 2023

@tizoc Thanks, I prefer this style.

There is still the one comment I had made re List.is_empty.

Would internal be a better name than spy? That's consistent with the internal-tracing idea.

@tizoc
Copy link
Member Author

tizoc commented Feb 28, 2023

@tizoc Thanks, I prefer this style.

Great! I am going to settle with this approach then. I will reorganize the commits once I am done verifying the checkpoints and making sure I am not missing anything important that the external trace processing tool needs.

There is still the one comment I had made re List.is_empty.

Sure, I will fix that in the final pass (I was avoiding any changes to the original code unless required to trace new checkpoints). (EDIT: just fixed it now instead)

Would internal be a better name than spy? That's consistent with the internal-tracing idea.

Yes, that works too.

@tizoc tizoc force-pushed the feature/internal-tracing-develop branch 5 times, most recently from b778ea1 to 5784a0b Compare March 6, 2023 12:27
@tizoc
Copy link
Member Author

tizoc commented Mar 31, 2023

@deepthiskumar pushed.

@tizoc
Copy link
Member Author

tizoc commented Mar 31, 2023

Oops, pushed to another PR's branch, pushed here now

@deepthiskumar
Copy link
Member

!ci-build-me

@tizoc
Copy link
Member Author

tizoc commented Apr 3, 2023

The docker image issue is fixed now, but I now see new failures that weren't there before.

@deepthiskumar
Copy link
Member

!ci-build-me

2 similar comments
@jurajselep
Copy link

!ci-build-me

@deepthiskumar
Copy link
Member

!ci-build-me

@tizoc tizoc force-pushed the feature/internal-tracing-develop branch 2 times, most recently from a8cf9db to 333323e Compare April 5, 2023 16:15
@tizoc tizoc force-pushed the feature/internal-tracing-develop branch from 333323e to abbd24d Compare April 5, 2023 16:34
@jurajselep
Copy link

!ci-build-me

@deepthiskumar
Copy link
Member

!ci-build-me

1 similar comment
@tizoc
Copy link
Member Author

tizoc commented Apr 7, 2023

!ci-build-me

@tizoc
Copy link
Member Author

tizoc commented Apr 7, 2023

hmmm, I got an invite to the organization, but seems I still cannot trigger the CI run myself

@jurajselep
Copy link

!ci-build-me

@tizoc
Copy link
Member Author

tizoc commented Apr 11, 2023

!ci-build-me

1 similar comment
@deepthiskumar
Copy link
Member

!ci-build-me

@deepthiskumar
Copy link
Member

@tizoc you need to be a public member. You should be able to change that in your settings

@tizoc
Copy link
Member Author

tizoc commented Apr 11, 2023

@deepthiskumar thank you!! Just tested it here and it worked now #12874 (comment)

@deepthiskumar
Copy link
Member

!approved-for-mainnet

@deepthiskumar deepthiskumar merged commit b0a03de into MinaProtocol:berkeley Apr 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants