
Internal tracing for prover and verifier (internal tracing PR 2 of 3) #12874

Merged

Conversation


@tizoc tizoc commented Mar 20, 2023

In addition to the tracing output generated by the first PR, this PR adds two new output files: one for the verifier process used by the transition frontier (other verifier processes are not currently traced), and one for the prover process used by the block producer. If the node is launched with internal tracing enabled, those two subprocesses are launched with internal tracing enabled too. When the mina client is used to toggle internal tracing, an RPC call is made to those subprocesses to toggle their internal tracing as well.

Explain your changes:

This PR implements internal tracing for the prover and verifier. It builds on top of #12703

Explain how you tested your changes:

By consuming the traces with https://github.com/openmina/internal-trace-consumer and running it on our cluster.

Example of a structured trace generated by the trace consumer:

https://gist.github.com/tizoc/aab4d1dcf9d6b1d6857b0ac8ce094f03

Checklist:

  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? List them
  • Closes #0000

@tizoc tizoc requested review from a team as code owners March 20, 2023 22:57
@tizoc tizoc force-pushed the feature/prover-internal-tracing branch 3 times, most recently from 7446d68 to e138ef3 Compare March 21, 2023 11:49
@mrmr1993 (Member) left a comment

This looks great! Left a few nits, but also: do you have any statistics around the additional overhead of adding these new log messages? I'm hoping that it's minimal, but if it isn't, we should call out that overhead explicitly in the CLI flag's / command's description.

| Error
| Faulty_peer
| Fatal
| Internal
Member

Nit: it might be cleaner to call this log level Checkpoint or similar; Internal feels somewhat non-specific.

@tizoc (Member Author)

The original name was "spy" (because it was short, and what it did was expose what happened internally so that an external consumer could process it). Then it was switched to "internal" based on feedback here: #12703 (comment)

Ideally I would have used "trace", but it is already taken. "checkpoint" works too (not 100% ideal, because some of the output is not specifically checkpoints, like metadata).

Just let me know and I will rename it.

Member

@psteckler what's your preference between Checkpoint and Internal?

@@ -309,6 +320,16 @@ let run ~logger ~trust_system ~verifier ~transition_reader
, time_received ) ;
return ()
| Error error ->
Internal_tracing.with_state_hash state_hash
@@ fun () ->
[%log internal] "Failure"
Member

Should this log message be more specific? Failure seems very broad.

@tizoc (Member Author)

Check the ~metadata argument; the reason for the failure is attached there.

Member

In the GCP Stackdriver view, this will show up as the message Failure, and you have to manually expand it to get the metadata. If there's a more descriptive message that could go here instead, that would help us interpret the tracing messages while looking at the raw logs.

@tizoc (Member Author) commented Mar 21, 2023

I see. Let me think about it for a bit, but I think something like Failure:<arbitrary output> (with :, \t, or some specific separator I can use to ignore the remaining text in the consumer) would do. Would that be good enough? (I don't know what it looks like in Stackdriver.)
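Purely for illustration, the consumer-side handling of such a separator could look like the sketch below (Python for brevity; the separator character and function name are assumptions, not part of this PR). The idea is that anything after the separator is human-readable detail that the consumer strips before matching the checkpoint name:

```python
def checkpoint_base_name(name: str, separator: str = ":") -> str:
    """Strip an optional human-readable suffix from a checkpoint name,
    so an annotated failure and a plain one map to the same checkpoint.
    Splitting at most once keeps any separator characters inside the
    free-form suffix intact."""
    return name.split(separator, 1)[0]

# An annotated failure and a plain checkpoint name both resolve cleanly:
annotated = checkpoint_base_name("Failure:Validation callback expired")
plain = checkpoint_base_name("Verify_blockchain_snarks")
```

This keeps the raw-log view descriptive while the consumer still sees a stable checkpoint identifier.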

@tizoc (Member Author) commented Mar 21, 2023

@mrmr1993 btw, just to make sure we are on the same page: the files produced by the internal tracing don't look like the JSON logs produced by the node; they just contain a sequence of events and control commands that look like this:

{"current_block":"3NL3n4UnfDEmbx2hap27Z3z7PU1D4pktAd9qmj2PvnCUAhnNT4rQ"}
["Produce_state_transition_proof",1679321390.6394472]
{"metadata":{"transactions_count":0}}
{"current_block":"3NLqqdkAQnbNorXKRFuWC9fz45jvdF1XUiwtseWzRbCB8vaxUpkS"}
{"block_metadata":{"blockchain_length":"2"}}
["External_block_received",1679321408.8380573]
["Initial_validation",1679321408.8467133]
["Verify_blockchain_snarks",1679321408.8499875]
{"metadata":{"count":1}}
{"current_block":"3NL3n4UnfDEmbx2hap27Z3z7PU1D4pktAd9qmj2PvnCUAhnNT4rQ"}
["Produce_chain_transition_proof",1679321409.389129]
["Produce_validated_transition",1679321409.3892083]
["Build_breadcrumb",1679321409.3894844]
{"block_metadata":{"coinbase_receiver":"B62qnzLUoQaPPM33X6JpVJUMq36QSibEgZep4U2idqFwxj3GXK8drXF","creator":"B62qnzLUoQaPPM33X6JpVJUMq36QSibEgZep4U2idqFwxj3GXK8drXF","global_slot":"3","previous_state_hash":"3NLSc2Y4MkCgPHGbfeuNX23Hpk5ZPs9LHgegf28DcMWWVKCxz9wo","slot":"3","winner":"B62qogjiboJ1tsGm42AtnWuxiL6yX3qDb2CGFVtTgD9iEjxGSEQ3uzm"}}
["Validate_staged_ledger_diff",1679321409.3895411]
["Prediff",1679321409.3908107]
["Verify_commands",1679321409.390832]
{"metadata":{"count":0}}
["Verify_commands_done",1679321415.2565856]
["Apply_diff",1679321415.2566586]
["Update_coinbase_stack",1679321415.2567432]
{"metadata":{"coinbases":1,"commands_count":0,"max_throughput":128,"proofs_waiting":0,"spots_available":128,"transactions":1,"works":0}}
["Update_ledger_and_get_statements",1679321415.2567675]
{"metadata":{"partition":"single"}}
["Update_ledger_and_get_statements_done",1679321415.2621703]
["Update_coinbase_stack_done",1679321415.2622075]
{"metadata":{"data_len":1,"is_new_stack":true,"transactions_len":1}}
["Check_for_sufficient_snark_work",1679321415.2622213]
{"metadata":{"free_space":128,"required_pairs":0,"slots":1,"work_count":0}}
["Check_zero_fee_excess",1679321415.262237]
["Fill_work_and_enqueue_transactions",1679321415.262251]
{"metadata":{"emitted_proof":false,"merge_jobs_created":0,"scan_state_added_works":0,"total_proofs":0}}
["Update_pending_coinbase_collection",1679321415.262411]

I mention this because I don't know what format Stackdriver expects. Obviously, if needed, another log processor+transport can be registered that outputs logs in the same format as the rest of the logs.

In any case, if it is useful I can add extra text after the failure (something like a comment, so it looks like ["Failure // Validation callback expired", 1679321415.43431])
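To make the line-oriented format above concrete: every line is standalone JSON, where a two-element array is a [checkpoint_name, unix_timestamp] pair and an object is a control/metadata entry (current_block, block_metadata, metadata). A minimal parsing sketch (Python for brevity; the actual consumer is the internal-trace-consumer repo linked earlier, and this is not its implementation):

```python
import json

def parse_trace_line(line: str):
    """Classify one line of the internal tracing output.

    A JSON array is a [checkpoint_name, unix_timestamp] checkpoint;
    a JSON object is a control or metadata entry such as
    current_block, block_metadata, or metadata.
    """
    entry = json.loads(line)
    if isinstance(entry, list):
        name, timestamp = entry
        return ("checkpoint", name, timestamp)
    return ("control", entry)

# Three lines taken verbatim from the sample trace above:
sample = [
    '{"current_block":"3NL3n4UnfDEmbx2hap27Z3z7PU1D4pktAd9qmj2PvnCUAhnNT4rQ"}',
    '["Produce_state_transition_proof",1679321390.6394472]',
    '{"metadata":{"transactions_count":0}}',
]
parsed = [parse_trace_line(line) for line in sample]
```

Since checkpoints carry only a name and a timestamp, a consumer derives durations by subtracting the timestamps of consecutive checkpoints within the same block context.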

@@ -329,6 +350,11 @@ let run ~logger ~trust_system ~verifier ~transition_reader
|> Protocol_state.hashes )
.state_hash
in
Internal_tracing.with_state_hash state_hash
@@ fun () ->
[%log internal] "Failure"
Member

Likewise here

@tizoc (Member Author)

Same as above; it is attached in ~metadata


tizoc commented Mar 21, 2023

@mrmr1993 thanks for the review!

do you have any statistics around the additional overhead of adding these new log messages? I'm hoping that it's minimal, but if it isn't, we should call out that overhead explicitly in the CLI flag's / command's description.

I don't have any, but if you point me to a good way to compare it I can do some measurements. I guess I would need an archive node to collect some blocks, and then try to replay them?

Note that enabling/disabling the tracing only has an effect on I/O; otherwise the logging functions are still called, and their parameters are still created. But in general, I don't think it should have a noticeable impact. Compared to regular logging, the output is quite a bit more compact (the examples in the gist are expanded, not what the node generates). Also, there is no tracing done inside loops, which is where even small overheads could add up and become meaningful.

Btw, please note that the core was implemented in PR #12703 (where the first 3 commits come from); this one adds the extra parts required for tracing the verifier and prover.

@tizoc tizoc force-pushed the feature/prover-internal-tracing branch 2 times, most recently from 2356c55 to b41e77c Compare March 22, 2023 19:58
@tizoc tizoc changed the base branch from develop to berkeley March 22, 2023 19:59

tizoc commented Mar 22, 2023

Rebased over berkeley and made it the target branch for the PR.

@tizoc tizoc force-pushed the feature/prover-internal-tracing branch from b41e77c to d55a134 Compare March 28, 2023 20:40
@deepthiskumar (Member):

!ci-build-me

@tizoc tizoc changed the title Internal tracing for prover and verifier Internal tracing for prover and verifier (internal tracing PR 2 of 3) Mar 30, 2023
@tizoc tizoc force-pushed the feature/prover-internal-tracing branch from d55a134 to 4fc99aa Compare March 30, 2023 01:13
@deepthiskumar (Member):

!ci-build-me


tizoc commented Mar 31, 2023

Just pushed a fix for the failing test (it was missing an update to the Verifier.create call)

@deepthiskumar (Member):

!ci-build-me

1 similar comment
@deepthiskumar (Member):

!ci-build-me

@tizoc tizoc force-pushed the feature/prover-internal-tracing branch 3 times, most recently from f025b49 to d153e5f Compare April 5, 2023 16:37

tizoc commented Apr 11, 2023

!ci-build-me

@tizoc tizoc force-pushed the feature/prover-internal-tracing branch from e073761 to 03383d3 Compare April 11, 2023 18:48

tizoc commented Apr 11, 2023

!ci-build-me


tizoc commented Apr 11, 2023

!ci-build-me


tizoc commented Apr 12, 2023

!ci-build-me

@tizoc tizoc force-pushed the feature/prover-internal-tracing branch from 5b0e430 to 448cd7d Compare April 18, 2023 18:12

tizoc commented Apr 18, 2023

!ci-build-me


tizoc commented Apr 20, 2023

!ci-build-me

1 similar comment

tizoc commented Apr 20, 2023

!ci-build-me

@deepthiskumar (Member):

!approved-for-mainnet

@deepthiskumar deepthiskumar merged commit bd2e8e3 into MinaProtocol:berkeley Apr 20, 2023
@tizoc tizoc deleted the feature/prover-internal-tracing branch April 20, 2023 23:31
8 participants