Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added fault checking on every step in the net framework #338

Merged
merged 7 commits into from
Dec 11, 2018

Conversation

vkomenda
Copy link
Contributor

See #120.

Tests will fail if there is a step with a fault report referring to a correct node. This PR also adds storage for the fault log in the test node struct alongside output storage.

@vkomenda vkomenda requested review from mbr and afck November 13, 2018 17:42
Copy link
Collaborator

@afck afck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check in process_step looks good to me.
Not sure whether the store_step functionality should better be left to the individual test?

@vkomenda
Copy link
Contributor Author

vkomenda commented Nov 14, 2018

Not sure whether the store_step functionality should better be left to the individual test?

VirtualNet::new produces Steps internally but doesn't outputs those. It calls process_step from where the step can be stored.

We could output steps from new of course if it's better.

@afck
Copy link
Collaborator

afck commented Nov 14, 2018

I see; returning them from VirtualNet::new sounds like the right thing to do then, but maybe there's a reason why that's not the case? (@mbr)

@vkomenda
Copy link
Contributor Author

@mbr, your input would be welcome on this.

@vkomenda vkomenda force-pushed the vk-net-fault-check1 branch 2 times, most recently from 78a4e1f to a457e9b Compare November 19, 2018 14:35
@mbr
Copy link
Contributor

mbr commented Nov 19, 2018

The basic design philosophy behind VirtualNet is that it makes all the steps available and everything is passed back to the test (which may choose to ignore the faults, for example, when checking for the presence of them).

However, all Steps are passed through processing functions to copy out all messages, which are handled automatically by the network. The fact that outputs are collected is just a convenience bonus.

So, you're right, the fact that new does not pass out the Steps is probably a mistake. Furthermore, collecting faults like outputs is definitely a possibility, however we should not panic on these. At most, we could add an optional panic_on_fault convenience method. Ideally though, there'd be no panic, and we'd introduce a new error for the crank function and others.

@mbr
Copy link
Contributor

mbr commented Nov 19, 2018

So, to summarize, we actually need to solve three issues?

  1. Make new (and hopefully all other functions) return the generated Steps for inspection by the actual test.
  2. Collect FaultLogs as a convenience.
  3. Add a newoptional failure condition (might even be on by default) that causes an error if a fault is encountered.

@vkomenda
Copy link
Contributor Author

Thanks, @mbr. I'll get following through your points.

@vkomenda
Copy link
Contributor Author

@mbr, please check my response to your review comments.

Copy link
Contributor

@mbr mbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I might be missing something here. Most changes are just docs and smaller details I would be willing to merge without, another minor thing is the error style.

My biggeest question is whether the fault-checking logic is placed correctly, I feel like there could be a way that does this simpler and without altering internal APIs.

tests/net/err.rs Outdated
@@ -29,6 +34,10 @@ where
MessageLimitExceeded(usize),
/// The execution time limit has been reached or exceeded.
TimeLimitHit(time::Duration),
/// Fault encountered.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's a bit nitpicky, but I would really have a slightly longer comment here. Coming back to this after not touching the tests for a week or two, messages like "The algorithm run by the node produced a DistAlgorithm::Error while processing a message." (see above) are instantly clear, but "Fault encountered." is not. Ideally, this would mention when a fault could possibly encountered.

I believe the docs for CrankError::Crypto and CrankError::Algorithm could also use an extra half of a sentence.

tests/net/mod.rs Outdated
@@ -73,6 +72,8 @@ pub struct Node<D: DistAlgorithm> {
is_faulty: bool,
/// Captured algorithm outputs, in order.
outputs: Vec<D::Output>,
/// Collected fault log.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, in order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what you mean by collecting in order?

tests/net/err.rs Outdated
/// Fault encountered.
Fault(Fault<D::NodeId>),
/// Threshold cryptography error.
Crypto(crypto::error::Error),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nondescript, I believe this is similar to having a lot of Io(io::Error) errors (this was in the discussion about error styles we had a quarter earlier, when we adopted Nick's excellent style.

So maybe this should be a InitialKeyGeneration error, with /// The initial key generation for threshold cryptography failed. or similar.


message_count
.store_step(step);
if error_on_fault {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is the right place to perform the check that no faults occurred (I am willing to be convinced otherwise though). Performing it here means adding extra cruft to the process step method, I would assume a much more natural place is in the crank() method itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to check for faults whenever an algorithm makes a step. This goes outside the scope of crank. In fact, it matches calls to process_step one-to-one.

@@ -915,12 +979,16 @@ where

// All messages are expanded and added to the queue. We opt for copying them, so we can
// return unaltered step later on for inspection.
self.message_count = self.message_count.saturating_add(process_step(
match process_step(
&mut self.nodes,
receiver.clone(),
&step,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a &DaStep here, could we not leave the code as is and check for faults above or below the process_step call?

@@ -318,6 +343,9 @@ where
message_limit: Option<usize>,
/// Optional time limit.
time_limit: Option<time::Duration>,
/// Property to cause an error if a `Fault` is output from a correct node. By default,
/// encountering a fault leads to an error.
error_on_fault: bool,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ever want to turn that off?
I imagine that we'll always want any tests to fail if a correct node blames a correct node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a comment asking to have this option but I cannot find it now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By disabling errors on faults you would be able to see more faults and get stalling or divergence.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if that's really useful, but I'm fine with it, too.

tests/net/mod.rs Outdated
/// Property to cause an error if a `Fault` is output from a correct node. By default,
/// encountering a fault leads to an error.
///
/// The deault setting `true` can be changed using this function.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: default

Copy link
Collaborator

@afck afck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.
Feel free to merge once @mbr's comments have been addressed.

@@ -318,6 +343,9 @@ where
message_limit: Option<usize>,
/// Optional time limit.
time_limit: Option<time::Duration>,
/// Property to cause an error if a `Fault` is output from a correct node. By default,
/// encountering a fault leads to an error.
error_on_fault: bool,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if that's really useful, but I'm fine with it, too.

@vkomenda
Copy link
Contributor Author

Now we have a maintenance task: the crossbeam dependency of the old example has to be updated or the example should be removed.

@vkomenda vkomenda merged commit c1c7fff into master Dec 11, 2018
@vkomenda vkomenda deleted the vk-net-fault-check1 branch December 11, 2018 08:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants