
Calls for block/transaction information sometimes don't return #21430

Closed
wbt opened this issue Aug 7, 2020 · 13 comments · Fixed by #25457
@wbt
Contributor

wbt commented Aug 7, 2020

This is intended to be a migration of this issue originally filed on web3. However, I dove all the way into the W3CWebSocket used by web3 and found that the requests are being sent but the responses are not being received from geth at all, so I don't think it's a web3 issue.

Additionally, I have an odd observation from Geth which might help in debugging.

System information

Geth version: 1.9.18
OS & Version: Windows 10

Steps to reproduce the behaviour

Follow the setup here. Use a WebSocket URL when connecting to geth (in truffle-config), and in migrations/3_increment_counter.js you can set batchSize to 100 without an issue. This should produce a blockchain with >100K transactions in just over 1000 blocks, with most blocks containing 100 transactions (sometimes a batch is split across a couple of blocks).

The expected and actual behavior are as listed there. If the requests are not going to be answered, some error is expected; failing that, at least some logging and documentation about how to configure the threshold would be better than the status quo of silent failure.

Now here's the twist. If I make all three of the following changes (from the code as of the 1.9.18 release tag), I get the expected results. The changes are all Geth console logging and shouldn't affect the correctness of operation, yet in combination these three log statements change the outcome. The changes are:

  • In internal/ethapi/api.go, in the GetBlockByNumber function, add log.Info("In api.go GetBlockByNumber", "number", number); at the start of the function.
  • In core/blockchain.go, in the GetBlock function, add log.Info("In getBlock for number", "number", number); at the start of the function.
  • In core/blockchain.go, in the GetBlock function, add log.Info("Returning from getBlock (cached) for number", "number", number); at the start of the if block that checks the cache.

The third one gets lots of use because of how the transaction generation and querying process works; if yours isn't doing that you could add a similar log statement (without "cached") just before the return statement.
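For readers who want to see the shape of the change, here is a toy, runnable stand-in for the instrumented lookup path. Everything in it (the chain type, the map-backed storage, log.Printf) is a stand-in for illustration only; the real code lives in internal/ethapi/api.go and core/blockchain.go and uses go-ethereum's log.Info.

```go
package main

import (
	"fmt"
	"log"
)

// chain is a toy stand-in for the blockchain with a block cache in front
// of the database, mirroring where the three log statements go.
type chain struct {
	cache map[uint64]string // recently touched blocks
	db    map[uint64]string // everything else
}

// GetBlock mirrors core/blockchain.go's GetBlock, with log points 2 and 3.
func (c *chain) GetBlock(number uint64) string {
	log.Printf("In getBlock for number number=%d", number)
	if b, ok := c.cache[number]; ok {
		log.Printf("Returning from getBlock (cached) for number number=%d", number)
		return b
	}
	return c.db[number]
}

// GetBlockByNumber mirrors the internal/ethapi/api.go handler, with log point 1.
func (c *chain) GetBlockByNumber(number uint64) string {
	log.Printf("In api.go GetBlockByNumber number=%d", number)
	return c.GetBlock(number)
}

func main() {
	c := &chain{
		cache: map[uint64]string{1: "block-1"},
		db:    map[uint64]string{2: "block-2"},
	}
	fmt.Println(c.GetBlockByNumber(1), c.GetBlockByNumber(2))
}
```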

Comment out any of those three and Geth only returns data for roughly 325 blocks and 0 transactions out of an expected 100K of each. Include all three, and you get the expected 100K block and 100K transactions returned.

I suspect that there may be some buffer or limit that is being overwhelmed, causing some glitch that leads Geth to stop sending responses to queries for at least a little while, and that these log statements slow down processing just enough that the issue threshold is never exceeded. However, I'm not yet quite sure where that threshold is set.

@GriffGreen

We were thinking this issue was with web3... but maybe it's geth...? web3/web3.js#3671

@GregTheGreek
Contributor

@GriffGreen I don't think those two issues are related...

@GriffGreen

GriffGreen commented Aug 9, 2020

Cool :-D Thx! @GregTheGreek we are scrambling over here chasing the problem down :-D We tried both the Consensys abi-decoder and ethers.js and neither has worked... but now we see that they both use ethers.js's abi encoder! So we made an issue there ethers-io/ethers.js#992

@GregTheGreek
Contributor

@GriffGreen amazing! I did a little debugging and noticed that as well... best of luck!

@MariusVanDerWijden
Member

@wbt I've been trying to repro the issue today but I'm not sure what the problem is.
The only problem I encounter is in truffle migrate

Counter value estimate: 202471
Returned error: already known

which can be mitigated by yarn truffle migrate -f 3 as you wrote.

The successful completions tick by normally.

Successful completions:  308798
Successful completions:  308799
So far have received 308799 events and successfully queried for 308799 blocks and 308799 transactions.

Am I missing something?
I'm using geth 1.9.20.

@MariusVanDerWijden
Member

I've now run it for 2.4 million contract calls across 570,000 blocks and I haven't seen a problem yet.

@ricmoo

ricmoo commented Sep 8, 2020

I've added support in ethers 5.0.12 for legacy Solidity 0.4 external event data.

For more details, please see ethers-io/ethers.js#891.

@wbt
Contributor Author

wbt commented Sep 8, 2020

@ricmoo Thanks for your efforts on that other issue around log parsing, which definitely caused us a lot of problems. However, I don't think this issue is one of them: here the blockchain node isn't sending anything back to web3 at all, rather than returning an incorrectly formatted log that is then mishandled in parsing.

@MariusVanDerWijden It is possible that your computer is significantly more powerful than mine; maybe try running it with a compute-intensive background job? Slower is also a possibility, if your machine's resource limits cause the requests to be spaced out enough to "fix" the issue, similar to what happened when I added those three log statements. I'm not quite sure what to suggest there.

I also noticed what I think is the same issue when subscribing to events, where there are many distinct events, some of which have many instances, and the subscription goes back to the beginning of the chain: sometimes I didn't even get the events back.

I wound up implementing a workaround which limits the number of requests outstanding to the blockchain node at any given time. This mostly works but requires a dedicated node and/or enough other controls so that requests from other clients/processes, combined with the limited one, remain under the yet-unknown limit for the node as a whole. It's not an elegant or complete solution, just usually practical for this particular application. I would like to see a better solution as described in the OP above.

@holiman
Contributor

holiman commented May 27, 2021

I wound up implementing a workaround which limits the number of requests outstanding to the blockchain node at any given time.

If that solves it, it sounds like the node was overloaded and there was an RPC timeout. According to @fjl, RPC timeouts do not issue a proper error response, which is something that could be improved.

@holiman
Contributor

holiman commented May 27, 2021

Interesting.

func (api *PublicDebugAPI) TimeMeOut(ctx context.Context) (string, error) {
	log.Info("TimeMeOut sleeping for 10 minutes...")
	time.Sleep(time.Minute * 10)
	log.Info("TimeMeOut waking up!")
	return "Good morning", nil
}

When invoked, the node says:

INFO [05-27|11:36:46.314] TimeMeOut sleeping for 10 minutes... 
INFO [05-27|11:46:46.315] TimeMeOut waking up! 

On the caller-side (taken at a different time):

date && echo "{\"method\": \"debug_timeMeOut\", \"params\": [], \"id\": \"123\"}" | nc -U /home/user/.ethereum/geth.ipc  && date
Thu May 27 12:02:26 CEST 2021
{"jsonrpc":"2.0","id":"123","result":"Good morning"}
Thu May 27 12:12:26 CEST 2021

So the IPC connection does not time out, and it does return the data.

For http, however

$ date && curl --data '{"method":"debug_timeMeOut","params":[],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545 ; date
Thu May 27 12:50:37 CEST 2021


curl: (52) Empty reply from server
Thu May 27 13:00:46 CEST 2021

It doesn't time out early either, but when the handler finally replies after the 10 minutes, the client just gets an empty reply. cc @fjl

@holiman
Copy link
Contributor

holiman commented Jun 3, 2021

diff --git a/internal/ethapi/api.go b/internal/ethapi/api.go
index 02368aac14..ea7e0347fb 100644
--- a/internal/ethapi/api.go
+++ b/internal/ethapi/api.go
@@ -1792,11 +1792,11 @@ func (api *PublicDebugAPI) GetBlockRlp(ctx context.Context, number uint64) (stri
        return fmt.Sprintf("%x", encoded), nil
 }
 
-func (api *PublicDebugAPI) TimeMeOut(ctx context.Context) (bool, error) {
-       log.Info("TimeMeOut sleeping for 10 minutes...")
-       time.Sleep(time.Minute * 10)
+func (api *PublicDebugAPI) TimeMeOut(ctx context.Context, seconds uint64) (string, error) {
+       log.Info("TimeMeOut sleeping...", "seconds", seconds)
+       time.Sleep(time.Second * time.Duration(seconds))
        log.Info("TimeMeOut waking up!")
-       return false, nil
+       return "Oll korrekt!", nil
 }
 
 // TestSignCliqueBlock fetches the given block number, and attempts to sign it as a clique header with the

After 30 seconds, we hit it:

[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[25],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"result":"Oll korrekt!"}
[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[30],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
curl: (52) Empty reply from server

@holiman
Contributor

holiman commented Jun 3, 2021

The parameter that decides this is h.server.WriteTimeout in node/rpcstack.go. Setting it to 1 second makes the issue easier to debug.

@holiman
Contributor

holiman commented Jun 3, 2021

I don't think we need to set a timeout there. The jsonCodec has its own internal timeout for the writes.

	defaultWriteTimeout = 10 * time.Second // used if context has no deadline
