flaky TestSharesAvailable_Full #787

Closed · Wondertan opened this issue Jun 1, 2022 · 18 comments · Fixed by #804
Labels: bug (Something isn't working)

Comments
@Wondertan (Member) commented Jun 1, 2022:

[screenshot of the failing TestSharesAvailable_Full CI run, showing a byzantine error at Index 16]

Wondertan added the bug label on Jun 1, 2022

@Wondertan (Member, Author) commented:

We need to figure out how to extract the data from the randomized test run that already happened and how to reproduce the failure, as it seems like it may be something regarding rsmt2d correctness.
cc @liamsi @adlerjohn

@liamsi (Member) commented Jun 1, 2022:

ref: celestiaorg/rsmt2d#68

@liamsi (Member) commented Jun 1, 2022:

A simple while loop, plus dumping the shares when the test fails (e.g. by temporarily introducing a global var for that), could get the job done in a hacky way. I'm currently running `while go test -v -test.run TestSharesAvailable_Full; end` for this and will dump any test vector here. The failure is easier to reproduce with a 32x32 square, but that yields a massive test vector.
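
For illustration, a minimal sketch of that global-var hack (package and names here are hypothetical, not the actual test code):

```go
package share_test

import (
	"encoding/base64"
	"testing"
)

// debugShares is the temporary global mentioned above: the code under test
// assigns the shares it operated on here before returning an error.
var debugShares [][]byte

// dumpShares logs the captured shares so a failing run can be replayed
// later as a fixed test vector.
func dumpShares(t *testing.T) {
	t.Helper()
	for i, sh := range debugShares {
		t.Logf("share[%d] = %s", i, base64.StdEncoding.EncodeToString(sh))
	}
}
```

Combined with the shell loop above, the first failing iteration leaves its shares in the test log.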

@liamsi (Member) commented Jun 1, 2022:

These shares were used in one failing test: https://gist.github.com/liamsi/34bb1269e311392184581c0191c1cfab
(I dumped those by temporarily using a global var when the test failed.)

@liamsi (Member) commented Jun 1, 2022:

I've used those shares as a test vector in the test above instead of RandFillDAG, but it passes without any problem 🤔

@adlerjohn (Member) commented:

Yeah, there's no way to pinpoint any bugs without shares to repro.

@liamsi (Member) commented Jun 2, 2022:

Yeah, the problem is that even with the shares above (with which the test failed), I cannot repro the failure. So the bug must either really be in the decoding in rsmt2d (unlikely), or in the Retrieve method itself (more likely, because goroutines are almost the only source of non-determinism here).

@liamsi (Member) commented Jun 2, 2022:

If I run my modified test with the same test vector, it also passes more than 99% of the time, but rarely fails with:

```
2022-06-02T02:41:52.420+0200	ERROR	share	share/full_availability.go:35	availability validation failed	{"root": "cXf4EqqtzHAJxmfZmEG91DaH+4mtJ6SppB2A3kq2/iQ=", "err": "byzantine error. isRow:false, Index:16"}
    full_availability_test.go:39: 
        	Error Trace:	full_availability_test.go:39
        	Error:      	Received unexpected error:
        	            	byzantine error. isRow:false, Index:16
        	Test:       	TestSharesAvailable_Full_TC
```

So it is not really reproducible via the input. The only interesting thing is that it fails at the same Index, namely 16, as in Hlib's screenshot. Maybe additionally printing a stack trace whenever we encounter an unexpected byzantine error could help here.
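
A sketch of that idea, assuming the error can be wrapped where it first surfaces (the helper name is hypothetical):

```go
package share

import (
	"fmt"
	"runtime/debug"
)

// wrapWithStack attaches the current goroutine's stack trace to an
// unexpected byzantine error so the failing call site shows up in the logs.
func wrapWithStack(err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("%w\nstack:\n%s", err, debug.Stack())
}
```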

@liamsi (Member) commented Jun 2, 2022:

Another hint: if I try with 32 instead of 16, the error happens much more often. And the Index is always (or at least every time I tried) the original square size (i.e. the ODS width), here 32, and always of a column. So it looks like the first parity share in a column (in the last quadrant) is what causes the problem.

@liamsi (Member) commented Jun 2, 2022:

@Wondertan, how do I print out the debug logs during tests? Specifically this one:

```go
log.Debugw("requesting quadrant", "axis", q.source, "x", q.x, "y", q.y, "size", size)
```
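
(For reference, assuming the logger here is ipfs/go-log/v2, which celestia-node uses: the level can be raised per subsystem from a test file, or globally via the GOLOG_LOG_LEVEL environment variable — a sketch, not necessarily the project's convention.)

```go
package share_test

import (
	logging "github.com/ipfs/go-log/v2"
)

func init() {
	// Raise the "share" subsystem (the logger name seen in the error output
	// above) to debug level so log.Debugw lines appear under `go test -v`.
	_ = logging.SetLogLevel("share", "debug")
}
```

Alternatively: `GOLOG_LOG_LEVEL=debug go test -v -run TestSharesAvailable_Full`.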

@liamsi (Member) commented Jun 2, 2022:

My hypothesis is that the bug is somewhere in

```go
eds, err := ses.retrieve(ctx, q)
```

and not in rsmt2d.

It could also simply have something to do with the last quadrant, but then I would expect us to see this more often.

@liamsi (Member) commented Jun 2, 2022:

Also, the test works more reliably when I always fetch from the first quadrant, by commenting out this line:

```go
// shuffle quadrants to be fetched in random order
rand.Shuffle(len(quadrants), func(i, j int) { quadrants[i], quadrants[j] = quadrants[j], quadrants[i] })
```

Conversely, if I use the last quadrant (by not shuffling and explicitly selecting it, as sketched below) with width 32, the test reliably fails.
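
A hypothetical fragment of that experiment (`quadrants` and `ses.retrieve` are from the real retrieval code; the selection line is the only change):

```go
// rand.Shuffle(len(quadrants), func(i, j int) { quadrants[i], quadrants[j] = quadrants[j], quadrants[i] })
q := quadrants[len(quadrants)-1] // force the last, all-parity quadrant
eds, err := ses.retrieve(ctx, q)
```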

Wondertan moved this from TODO to In Progress in Celestia Node on Jun 7, 2022
Wondertan self-assigned this on Jun 7, 2022

@Wondertan (Member, Author) commented Jun 7, 2022:

@liamsi, I also believe this is something with the retrieval logic and not with rsmt2d. I am now able to reproduce the issue in my reconstruction tests in #702. Moreover, they always fail with the same error, which was not the case before.

Now I am trying to understand which PR caused the regression. My first suspect was #738, but it was merged after the issue appeared, so it is not it. Continuing the investigation; now checking #730.

@Wondertan (Member, Author) commented:

Some alpha: celestiaorg/rsmt2d#83 fixes the bug completely.

@Wondertan (Member, Author) commented:

The current workaround is to reimport the square on each reconstruction attempt.
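
Roughly, the workaround looks like this (a sketch only: the rsmt2d signatures are assumed from that era, and `repair` stands in for the real reconstruction step, which in the real code also gathers more shares between attempts):

```go
package share

import "github.com/celestiaorg/rsmt2d"

// retrieveWithReimport rebuilds the square from the raw shares before every
// reconstruction attempt, so a failed attempt cannot leave partially written
// state behind in a shared in-memory square.
func retrieveWithReimport(
	shares [][]byte,
	codec rsmt2d.Codec,
	treeFn rsmt2d.TreeConstructorFn,
	repair func(*rsmt2d.ExtendedDataSquare) error,
	maxAttempts int,
) (*rsmt2d.ExtendedDataSquare, error) {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		// Import a fresh square for this attempt instead of reusing the
		// mutated one from the previous attempt.
		eds, err := rsmt2d.ImportExtendedDataSquare(shares, codec, treeFn)
		if err != nil {
			return nil, err
		}
		if lastErr = repair(eds); lastErr == nil {
			return eds, nil
		}
	}
	return nil, lastErr
}
```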

@Wondertan (Member, Author) commented:

Mainly, we need the share to be set like this, so that it is written into the axis slice.

@Wondertan (Member, Author) commented:

I was able to deflake the test and reproduce the issue with a 100% chance, and then to see how reimporting fixes it.

@adlerjohn (Member) commented:

What is the bug specifically?

Repository owner moved this from In Progress to Done in Celestia Node on Jun 9, 2022