Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix telemetry rpc getting stuck if all nodes have bandwidth set to 0 #3643

Merged
merged 1 commit into from
Feb 2, 2022

Conversation

pwojcikdev
Copy link
Contributor

@pwojcikdev pwojcikdev commented Jan 3, 2022

I encountered this bug when I was doing some testing on private network, where all nodes have bandwidth set to 0. In that case the 'bandwidths' set is empty and std::next() has undefined behavior (gets stuck in infinite loop).

@thsfs
Copy link
Contributor

thsfs commented Jan 3, 2022

what are the steps to reproduce this issue?

@thsfs thsfs requested review from thsfs and clemahieu January 3, 2022 20:42
@pwojcikdev
Copy link
Contributor Author

  1. Create network of at least 10 nodes with config option 'bandwidth_limit = 0'
  2. Try to call 'telemetry' rpc action
  3. RPC never completes, IO thread stuck in infinite loop

@zhyatt zhyatt added this to the V24.0 milestone Jan 3, 2022
@zhyatt zhyatt added the bug label Jan 3, 2022
@thsfs thsfs self-assigned this Jan 3, 2022
@dsiganos
Copy link
Contributor

dsiganos commented Jan 4, 2022

I think this bug is caused by this if statement, which removes the zero bandwidth values.

if (telemetry_data.bandwidth_cap != 0)
{
	bandwidths.insert (telemetry_data.bandwidth_cap);
}

If all bandwidth_cap are set 0 then counts inside strip_outliers_and_sum when calculating bandwidth_sum will be empty.

Therefore, this should be reproducible by a unit test with 10 nodes having bandwidth set to 0.

@fikumikudev are you willing to write such a unit test?

@dsiganos
Copy link
Contributor

dsiganos commented Jan 4, 2022

Also, I do not think this bug is fully fixed by the proposed change. The function strip_outliers_and_sum expects to receive a list of at least 2 elements to operate correctly and this check only checks for empty list.

@pwojcikdev
Copy link
Contributor Author

pwojcikdev commented Jan 4, 2022

Yes, it looks like my fix is not complete. std::next will move the iterator past the end of a container, there are no checks there. I also think that this bug could potentially affect mainnet even, if someone set up a lot of nodes with bandwidth_limit = 0, so that num_either_side_to_remove is greater than the number of nodes with bandwidth limit set to non zero value, then any node that handles telemetry rpc would have all its io threads stuck, effectively taking down that node.

@pwojcikdev
Copy link
Contributor Author

There are two ways to fix this I can think of, first is to rewrite strip_outliers_and_sum to properly respect container bounds, the second is to not skip zero bandwidth values, but instead replace them with something predefined, like 1Gbps maybe?

@dsiganos
Copy link
Contributor

dsiganos commented Jan 6, 2022

I think improving strip_outliers_and_sum to handle any number of elements is the right way to do it.

@thsfs thsfs assigned dsiganos and unassigned thsfs Jan 19, 2022
@pwojcikdev pwojcikdev force-pushed the fix-telemetry-rare-case branch from bf67968 to af495ea Compare January 24, 2022 22:29
@pwojcikdev pwojcikdev force-pushed the fix-telemetry-rare-case branch from af495ea to 1aee398 Compare January 24, 2022 22:54
@pwojcikdev
Copy link
Contributor Author

@dsiganos I fixed the strip_outliers_and_sum logic and added a unit test for this bug case. Please check if it looks good to you.

@dsiganos
Copy link
Contributor

dsiganos commented Jan 24, 2022

Yeap, I'll look at it now. I pushed a PR to your election scheduler RPC PR branch.
pwojcikdev#1

thank you for your contributions, they are great!

@dsiganos dsiganos merged commit 4d07daf into nanocurrency:develop Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants