Internal tracing for prover and verifier (internal tracing PR 2 of 3) #12874
Conversation
Force-pushed from 7446d68 to e138ef3
This looks great! Left a few nits, but also: do you have any statistics around the additional overhead of adding these new log messages? I'm hoping that it's minimal, but if it isn't, we should call out that overhead explicitly in the CLI flag's / command's description.
src/lib/logger/impl.mli (outdated)

```diff
   | Error
   | Faulty_peer
   | Fatal
+  | Internal
```
Nit: it might be cleaner to call this log level `Checkpoint` or similar; `Internal` feels somewhat non-specific.
Original name was "spy" (because it was short, and what it did was to expose what happened internally so that an external consumer could process it). Then it was switched to "internal" based on feedback here: #12703 (comment)
Ideally I would have used "trace", but it is already taken. "checkpoint" works too (not 100% ideal, because some of the output, like metadata, is not specifically checkpoints).
Just let me know and I will rename it.
@psteckler what's your preference between `Checkpoint` and `Internal`?
```diff
@@ -309,6 +320,16 @@ let run ~logger ~trust_system ~verifier ~transition_reader
           , time_received ) ;
         return ()
     | Error error ->
+        Internal_tracing.with_state_hash state_hash
+        @@ fun () ->
+        [%log internal] "Failure"
```
Should this log message be more specific? `Failure` seems very broad.
Check the `~metadata` argument, the reason for the failure is attached there.
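For reference, a minimal sketch of what that call site could look like with the reason attached (the `reason` key and the string encoding are my own illustration here, not necessarily the PR's exact code):

```ocaml
(* Minimal sketch, not the PR's actual code: attach the failure reason via
   ~metadata so a consumer can recover it, while the checkpoint message
   itself stays "Failure". Assumes the context of the diff above. *)
Internal_tracing.with_state_hash state_hash
@@ fun () ->
[%log internal] "Failure"
  ~metadata:[ ("reason", `String (Error.to_string_hum error)) ]
```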
In the GCP stackdriver view, this will show up as the message `Failure`, and you have to manually expand to get the metadata. If there's a more descriptive message that could go here instead, that would help us interpret the tracing messages while looking at the raw logs.
I see. Let me think about it for a bit, but I think something like `Failure:<arbitrary output>` (with `:` or `\t` or some specific separator I can use to ignore the remaining text in the consumer) would do. Would that be good enough? (I don't know what it looks like in stackdriver.)
@mrmr1993 btw, just to make sure we are on the same page: the files produced by the internal tracing don't look like the JSON logs produced by the node; they just contain a sequence of events and control commands that look like this:
{"current_block":"3NL3n4UnfDEmbx2hap27Z3z7PU1D4pktAd9qmj2PvnCUAhnNT4rQ"}
["Produce_state_transition_proof",1679321390.6394472]
{"metadata":{"transactions_count":0}}
{"current_block":"3NLqqdkAQnbNorXKRFuWC9fz45jvdF1XUiwtseWzRbCB8vaxUpkS"}
{"block_metadata":{"blockchain_length":"2"}}
["External_block_received",1679321408.8380573]
["Initial_validation",1679321408.8467133]
["Verify_blockchain_snarks",1679321408.8499875]
{"metadata":{"count":1}}
{"current_block":"3NL3n4UnfDEmbx2hap27Z3z7PU1D4pktAd9qmj2PvnCUAhnNT4rQ"}
["Produce_chain_transition_proof",1679321409.389129]
["Produce_validated_transition",1679321409.3892083]
["Build_breadcrumb",1679321409.3894844]
{"block_metadata":{"coinbase_receiver":"B62qnzLUoQaPPM33X6JpVJUMq36QSibEgZep4U2idqFwxj3GXK8drXF","creator":"B62qnzLUoQaPPM33X6JpVJUMq36QSibEgZep4U2idqFwxj3GXK8drXF","global_slot":"3","previous_state_hash":"3NLSc2Y4MkCgPHGbfeuNX23Hpk5ZPs9LHgegf28DcMWWVKCxz9wo","slot":"3","winner":"B62qogjiboJ1tsGm42AtnWuxiL6yX3qDb2CGFVtTgD9iEjxGSEQ3uzm"}}
["Validate_staged_ledger_diff",1679321409.3895411]
["Prediff",1679321409.3908107]
["Verify_commands",1679321409.390832]
{"metadata":{"count":0}}
["Verify_commands_done",1679321415.2565856]
["Apply_diff",1679321415.2566586]
["Update_coinbase_stack",1679321415.2567432]
{"metadata":{"coinbases":1,"commands_count":0,"max_throughput":128,"proofs_waiting":0,"spots_available":128,"transactions":1,"works":0}}
["Update_ledger_and_get_statements",1679321415.2567675]
{"metadata":{"partition":"single"}}
["Update_ledger_and_get_statements_done",1679321415.2621703]
["Update_coinbase_stack_done",1679321415.2622075]
{"metadata":{"data_len":1,"is_new_stack":true,"transactions_len":1}}
["Check_for_sufficient_snark_work",1679321415.2622213]
{"metadata":{"free_space":128,"required_pairs":0,"slots":1,"work_count":0}}
["Check_zero_fee_excess",1679321415.262237]
["Fill_work_and_enqueue_transactions",1679321415.262251]
{"metadata":{"emitted_proof":false,"merge_jobs_created":0,"scan_state_added_works":0,"total_proofs":0}}
["Update_pending_coinbase_collection",1679321415.262411]
I mention this because I don't know what format stackdriver expects. Obviously, if needed, another log processor+transport can be registered that outputs logs in the same format as the rest of the logs.
In any case, if it is useful I can add extra text after the failure (something like a comment, so it looks like `["Failure // Validation callback expired", 1679321415.43431]`).
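To make the consumer side of that concrete, here is a rough sketch (assuming the `yojson` library; the type and function names are illustrative, not taken from the actual consumer) of how each trace line could be classified, including stripping such a `//` comment:

```ocaml
(* Sketch of a consumer-side classifier for the raw trace format shown
   above: each line is a standalone JSON value, either a two-element
   array for a checkpoint or an object for a control/metadata entry. *)
type entry =
  | Checkpoint of string * float             (* checkpoint name, timestamp *)
  | Control of (string * Yojson.Safe.t) list (* current_block, metadata, ... *)

(* Drop a trailing " // ..." comment, as proposed for failure reasons. *)
let strip_comment name =
  let sep = " // " in
  let len = String.length name and slen = String.length sep in
  let rec find i =
    if i + slen > len then name
    else if String.equal (String.sub name i slen) sep then String.sub name 0 i
    else find (i + 1)
  in
  find 0

let parse_line line =
  match Yojson.Safe.from_string line with
  | `List [ `String name; `Float ts ] -> Checkpoint (strip_comment name, ts)
  | `Assoc fields -> Control fields
  | _ -> failwith ("unrecognized trace line: " ^ line)
```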
```diff
@@ -329,6 +350,11 @@ let run ~logger ~trust_system ~verifier ~transition_reader
             |> Protocol_state.hashes )
             .state_hash
         in
+        Internal_tracing.with_state_hash state_hash
+        @@ fun () ->
+        [%log internal] "Failure"
```
Likewise here
Same as above, it is attached in `~metadata`.
@mrmr1993 thanks for the review!
I don't have any, but if you point me to a good way to compare I can do some measurements. I guess I would need an archive node to collect some blocks, and then try to replay them? Note that enabling/disabling the tracing only affects I/O; otherwise the logging functions are still called, and their arguments are still constructed. But in general, I don't think it should have a noticeable impact. Compared to regular logging, the output is quite a bit more compact (the examples in the gist are expanded and not what the node generates). Also, there is no tracing done inside loops, which is where even small overheads could add up and become meaningful. Btw, please note that the core was implemented in PR #12703 (which is where the first 3 commits come from); this one adds the extra parts required for tracing the verifier and prover.
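In case it helps, a quick way to put a rough bound on the per-checkpoint cost without a full replay setup could be a micro-benchmark like the following (pure sketch; the thunk below is a stand-in for whatever the real checkpoint call does, not the PR's code):

```ocaml
(* Micro-benchmark sketch: mean per-call cost of a thunk over n iterations.
   Requires the unix library for Unix.gettimeofday. *)
let mean_per_call_us ~n f =
  let t0 = Unix.gettimeofday () in
  for _ = 1 to n do
    f ()
  done ;
  (Unix.gettimeofday () -. t0) /. float_of_int n *. 1e6

let () =
  let n = 1_000_000 in
  (* Stand-in thunk: builds a checkpoint-like payload but does no I/O,
     mimicking the disabled-tracing case described above. Replace it with
     the real checkpoint call to measure the enabled case. *)
  let checkpoint () =
    ignore (Sys.opaque_identity ("Benchmark", Unix.gettimeofday ()))
  in
  Printf.printf "mean per call: %.3f us over %d calls\n"
    (mean_per_call_us ~n checkpoint) n
```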
Force-pushed from 2356c55 to b41e77c
Rebased over
Force-pushed from b41e77c to d55a134
!ci-build-me
Force-pushed from d55a134 to 4fc99aa
!ci-build-me
Just pushed a fix for the test that was failing (was missing an update to the …)
!ci-build-me
!ci-build-me
Force-pushed from f025b49 to d153e5f
!ci-build-me
Force-pushed from e073761 to 03383d3
!ci-build-me
!ci-build-me
!ci-build-me
Dynamic enable/disable of internal tracing for prover and verifier
…erifier and prover
Needed to be able to properly reconstruct traces containing checkpoints of multiple concurrent calls.
Force-pushed from 5b0e430 to 448cd7d
!ci-build-me
!ci-build-me
!ci-build-me
!approved-for-mainnet
In addition to the tracing output generated by the first PR, two new output files are added here: one for the verifier process used by the transition frontier (other verifier processes are not currently traced), and another for the prover process used by the block producer. If the node is launched with internal tracing enabled, those two subprocesses will be launched with internal tracing enabled too. When the mina client is used to toggle internal tracing, an RPC call will be made to those subprocesses to toggle internal tracing as well (see the sketch below).
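A sketch of that toggle fan-out (illustrative only: `Internal_tracing.set_enabled`, `Prover.set_internal_tracing`, and `Verifier.set_internal_tracing` are assumed names standing in for the daemon flag and the subprocess RPCs, not the PR's actual API):

```ocaml
open Async

(* Sketch: when internal tracing is toggled on the daemon, mirror the new
   setting in the traced subprocesses over their RPC connections. *)
let toggle_internal_tracing ~prover ~verifier ~enabled =
  (* Flip the daemon-local flag first... *)
  Internal_tracing.set_enabled enabled ;
  (* ...then forward the toggle to the prover and verifier processes. *)
  let%bind () = Prover.set_internal_tracing prover ~enabled in
  Verifier.set_internal_tracing verifier ~enabled
```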
Explain your changes:
This PR implements internal tracing for the prover and verifier. Builds on top of #12703
Explain how you tested your changes:
By consuming the traces with https://github.com/openmina/internal-trace-consumer and running it on our cluster.
Example of a structured trace generated by the trace consumer:
https://gist.github.com/tizoc/aab4d1dcf9d6b1d6857b0ac8ce094f03
Checklist: