Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getLogs seems to be missing events when run at head #12078

Closed
4 of 7 tasks
ashwinphatak opened this issue Jun 10, 2024 · 19 comments
Closed
4 of 7 tasks

getLogs seems to be missing events when run at head #12078

ashwinphatak opened this issue Jun 10, 2024 · 19 comments
Assignees
Labels

Comments

@ashwinphatak
Copy link

Checklist

  • This is not a security-related bug/issue. If it is, please follow please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I did not make any code changes to lotus.

Lotus component

  • lotus Ethereum RPC
  • lotus FVM - Lotus FVM interactions
  • FEVM tooling
  • Other

Lotus Version

Daemon:  1.27.0-dev+mainnet+git.20640c062+api1.5.0
Local: lotus version 1.27.0-dev+mainnet+git.20640c062

Repro Steps

We don't yet have steps to reproduce. The issue showed up in long running downstream processes and only seems to happen once in a while.

Describe the Bug

So far we've found at least two instances when lotus didn't return matching events during a getLogs RPC call at head (https://filfox.info/en/tipset/3976539). The same calls returned data when made later (couple of days). These missing logs were only discovered in retrospect because the downstream service started throwing errors when processing later events. The downstream service code remained unchanged during these tests.

This screenshot shows a db table during the original run, during which the block which was processed returned no matching events:

image

This screenshot shows the result of running the same getLogs call after a couple of days:

image

To confirm this issue exists with getLogs only at HEAD, we've left the service running with changes to stay 16 blocks behind head at all times. It's been running for more than two days so far without similar problems.

Tooling

We're using etherjs to make RPC calls

Configuration Options

config.toml


[Fevm]
  # EnableEthRPC enables eth_ rpc, and enables storing a mapping of eth transaction hashes to filecoin message Cids.
  # This will also enable the RealTimeFilterAPI and HistoricFilterAPI by default, but they can be disabled by config options above.
  #
  # type: bool
  # env var: LOTUS_FEVM_ENABLEETHRPC
  #EnableEthRPC = false

  # EthTxHashMappingLifetimeDays the transaction hash lookup database will delete mappings that have been stored for more than x days
  # Set to 0 to keep all mappings
  #
  # type: int
  # env var: LOTUS_FEVM_ETHTXHASHMAPPINGLIFETIMEDAYS
  #EthTxHashMappingLifetimeDays = 0

run script:

export LOTUS_CHAINSTORE_ENABLESPLITSTORE="true"
export LOTUS_EVENTS_ENABLEACTOREVENTSAPI="true"
export LOTUS_INDEX_ENABLEMSGINDEX="true"
export LOTUS_FEVM_ENABLEETHRPC="true"
export LOTUS_FEVM_ENABLEETHHASHTOFILECOINCIDMAPPING="true"

./bin/lotus daemon
@ashwinphatak ashwinphatak added the kind/bug Kind: Bug label Jun 10, 2024
@rjan90 rjan90 added this to FilOz Jun 10, 2024
@rjan90 rjan90 moved this to 📌 Triage in FilOz Jun 10, 2024
@eshon
Copy link
Contributor

eshon commented Jun 10, 2024

Is this possibly a reorg issue?

Update: @ashwinphatak says the block hash is the same

@rvagg
Copy link
Member

rvagg commented Jun 11, 2024

We're using etherjs to make RPC calls

@ashwinphatak are you able to provide details of the actual call being made? "the block hash is the same". The API gives the ability to specific a precise tipset with the blockHash property, but you'd have to have that tipset from another source and it only gives you events for that tipset only. Alternatively there's the fromBlock and toBlock properties that take a hex encoded epoch number, or "latest" or "earliest".

The behaviour is very different whether you're using blockHash or not. If you're using blockHash, where are you getting the value from before making the call? Perhaps this is a question for etherjs but I'm not familiar with how it works unfortunately.

If it's not using a blockHash, then unfortunately you're probably experiencing reorg/finality issues as @eshon suggested. The hashed epoch in fromBlock or toBlock could be pointing to an entirely different tipset and you have no assurances prior to finality @ 900 epochs, but it's almost always much quicker than that.

@ashwinphatak
Copy link
Author

ashwinphatak commented Jun 11, 2024

Here is the equivalent curl request:

curl -X POST -H "Content-Type: application/json" \
  --data '{"method":"eth_getLogs","params":[{
    "fromBlock": "0x3CAD5B",
    "toBlock": "0x3CAD5B",
    "address": ["0xc35DADB65012eC5796536bD9864eD8773aBc74C4","0xF4d73326C13a4Fc5FD7A064217e12780e9Bd62c3","0x1d1375281265e4Dd496d90455f7C82f4fbd85cC2","0xDaD662D11C33F92E19bB3536eE3e0c833a6a8E0F","0x46188244C463198249c3A3F45c1FdDc6633CD64E","0xF16f045Ec3a1Ef8a002cA093A971EDee0F2F7b25","0xb803896a112eB8988E7B9541c86596cdFAD68D93","0x46Bc9C94b030Adbfdc1EF2725184f6911069f186","0x510Ce4716D33C9d460d2DFf37c9DbfA12daebA22","0xd234D741371dadEB6085B828b1B06768b9DaE5F9","0xB29eB187b3721e2138F6188a384E21D8b084A7Dd","0x4d09aF86521461733d2c484BD676710d3774288E","0x728cFED2232883dAFBE4165c47BFC30cFfB77207","0xB902b8CA0505CC98fC4D82dE0e22E09781aD1570","0x62b623cA3f3323EcdedAA2c5fD5AD403E56c130a","0xFb89125D2361871F1E31eb062c0EA5b87234e11f","0x514559E26782D40AeC0b779e3c76aB44cb7baC78","0x5347a028cfbBeB9983E6c5af3D98d2FF43765905","0xF87E12802E6Ba3FfF1ad0ef97Be7c3BFF2904EeB","0xB9494535d3174dDcaAfEB8D1C5543ee6929Ca325","0x0f23e3D469374225bd1763f7Ed95c7ec94A6c391","0x1DCD0396e92d0c92355f99e60054c0d8467b1DAe","0x5B2B4Be7C22c308e3612EFBa2BD85b945B8489f7","0x913c7C9d6307AD5311D5c023086E0658Cf901B7b","0xAC3B888Bd4adE788BFd60dAeb88EA524E39D1376","0x07d61E795A67eeeb9bb3a8066CEf1719A429339c","0xD02D9A237f3E099970e87d2DC5aC9e462405b81f","0x802C126d8fbBa47d6Ab66E2803b5Db1eB9c86732","0x6c4Bc5239c3EE99C9eADEf92b235F98629024955","0x21d31f071A6336Ee4b25c9487839e458C4d23CD1","0x41d81e6F66590FB6F3315a4A622dc5570C2B91c1","0xaFf6Af999c145E9ea5742c3DCC9C3ece21D293b1","0x2f6E3E6A49958cbd7241640530d7Dcb44258EADD","0x9FFfb60134604A1649F4D85b423728C6781FF981","0xa453055366EA33d54d56bd72941CD5DDd38d21d1","0x298f93B06821995a704072a49aeF187De7337fBF","0xa15AF06Bf8a82D388903E0Fa27ca6950915168e3","0x29F1eCFB0F320aA36B362ea1Bb5DaB8B4894256a","0x094FE771ae75037df778791585a27f6aFf8A1F07","0x82C480ec151ef0e3FF398288f7D327A70692F8fD","0x2fAF35aEE2e9e7856742f3E33862f45078ffbaF3"],
    "topics":[["0x783cca1c0412dd0d695e784568c96da2e9c22ff989357a2e8b1d9b2b4e6b7118","0x3067048beee31b25b2f1681f88dac838c8bba36af25bfb2b7cf7473a5847e35f","0x26f6a048ee9138f2c0ce266f322cb99228e8d619ae2bff30c67f8dcf9d2377b4","0x40d0efd1a53d60ecbf40971b9daf7dc90178c3aadc7aab1765632738fa8b8f01","0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef","0x98636036cb66a9c19a37435efc1e90142190214e8abeb821bdba3f2990dd4c95","0xc42079f94a6350d7e6235f29174924f928cc2ac818eb64fed8004e115fbcca67","0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde","0x0c396cd989a39f4459b5fa1aed6a9a8dcdbc45908acfd67e028cd568da98982c","0xbdbdb71d7860376ba52b25a5028beea23581364a40522f6bcfb86bb1f2dca633"]]
  }],"id":1,"jsonrpc":"2.0"}' \
  <RPC-ENDPOINT-HERE>

If it's not using a blockHash, then unfortunately you're probably experiencing reorg/finality issues as @eshon suggested. The hashed epoch in fromBlock or toBlock could be pointing to an entirely different tipset and you have no assurances prior to finality @ 900 epochs, but it's almost always much quicker than that.

Just so I understand this correctly, are you saying that a query like the curl request above is only reliable after 900 epochs?

Edit: We index the events by the block hash returned by getLogs, so effectively when processing, we're always using the block hash.

@eshon
Copy link
Contributor

eshon commented Jun 11, 2024

"HEAD-20 epochs is generally considered safe ( this is also what WindowPoST is based on )"

HEAD-20 is 10 minutes

Hard finality in Filecoin is HEAD-900 or 16.67 hours

But 10 minutes is also apparently statistically ok to use - @rvagg please confirm!

(This will change when Fast Finality comes to Filecoin soon.)

@rvagg
Copy link
Member

rvagg commented Jun 11, 2024

Yes, so 900 is the point at which clients consider the chain strictly canonical and irreversible. It's the "I never want to even think about chain reorgs" option. But there's a gradient from there to 0.

I'm going to have to try and find someone better qualified who can give a firmer answer about the probabilistic gradient between 900 and 0, but it gets significantly less probable that you'll encounter a reorg as you move backward from zero. Apparently some large users are going with 100 as a "safe-enough" and I think some might be going with 30 as an option. It's going to depend on what risk you're willing to deal with vs the recency you need. Ultimately, unless you're going to with 900+ then you really should be writing your application (and this goes for almost any blockchain app) with the assumption that consensus is a bit squishy and that you may have to deal with head reorganisation. You can sample from the head, but you should probably go back at some point and re-sample that same spot and make sure that what you have is canonical.

I was shown this today: https://docs.axelar.dev/learn/txduration#casper-friendly-finality-gadget-eg-eth-pos Alexar's definition of finality across various chains. They're putting Filecoin at 100.

The F3 project, which is currently under heavy development, targeting deployment later this year, will radically change that, the FIP has a bunch of really interesting background on how Filecoin does finality that might be worth a read:
https://github.com/filecoin-project/FIPs/blob/master/FIPS/fip-0086.md - just note though that users will always need to be prepared to fall back to EC (expected consensus) with its 900 epoch rule in cases where F3 may have trouble—it's a mechanism on top of EC to greatly improve the guarantees for the normal case, but with fallbacks for the just-in-case.

In other APIs, such as GetActorEventsRaw, and SubscribeActorEventsRaw, we communicate reorgs for events via a reverted boolean such that you should be able to use it to update your local state to match what the node you're talking to has had to do to arrive at the canonical chain with its peers. e.g. if you subscribe via that API then you'll get a stream of events as the node experience them including when it discovers it's not on the canonical chain so needs to walk back and send you events again but with reverted=true so you can update your local state. But of course that can all be tricky to juggle locally but it's an unfortunate fact of consensus that it can take time to get it right. See here for some discussion of an alternative approach we're looking at for doing a "my last query got me to head X, I now want to get to head Y (perhaps 'latest'), show me the diff in events, including reverts, that get me there", and having the node walk the path, possibly with some backward steps, for you.

@jsoares
Copy link
Member

jsoares commented Jun 11, 2024

But 10 minutes is also apparently statistically ok to use - @rvagg please confirm!

Have a look at FRC-0089. The ECFC hasn't been implemented in lotus, so you can't use that directly, but results show 30 epochs is generally very safe, though how safe depends on system conditions. We'll have a paper soon with more extensive evaluation.

@ashwinphatak
Copy link
Author

Using lotus, how can I check/confirm if block hash 0x8d5612b14d72fc9a47d59836812ac15438c9b01f956e9c905732c5c855f75b11 is canonical?

@AFDudley
Copy link

@rvagg do you understand how completely unlike Ethereum this behavior is? We understand finality and reorg safety just fine. The problem is Filecoin does a very poor job of pretending to be compatible when it's not. FEVM is acutely NOT EVM-compatible, given it half exposes it's finality model that is compatible with neither PoW or PoS Ethereum.

@AFDudley
Copy link

I'm sorry folks this is pretty insane, you provide an Etheruem blockhash but the data it represents is mutable? How is that even possible? How can we pretend this is anything but a deep violation of system invariants. If what Rod is saying is true how could that blockhash have been valid in both cases?

@aarshkshah1992
Copy link
Contributor

aarshkshah1992 commented Jun 11, 2024

@AFDudley

"I'm sorry folks this is pretty insane, you provide an Etheruem blockhash but the data it represents is mutable"

What do you mean here when you say that the "data represented by a blockhash is mutable" ?

The blockhash pointed to by an epoch is mutable -> the same epoch can point to a different block if there is a re-org and so point to different events which is what you are observing here

However, the data/events a blockhash points to do not get mutated. The problem here is that you are fetching events by specifying a certain epoch and then indexing those events by the hash of the block. However, if the block at that epoch gets re-orged, you will see a different set of events for the same epoch.

@AFDudley
Copy link

The issue is we have a workaround that works for geth-based systems that doesn't work for Lotus. Basically, with geth if you get a blockhash from a blocknumber and then immediately call getLogs for that number you'll get the logs for the blockhash at that number. With Lotus, it appears that no logs are returned in that case, not because of reorg, but because the logs have yet to be processed. This hypothesis fits the data I have at the moment better than a reorg happened in the interval of time between the calls. if the logs were non-zero then we could positively assert a reorg. but because we don't have a blockhash with the 0 result, we can't confirm a reorg, this is an issue with the Ethereum "spec".

@AFDudley
Copy link

AFDudley commented Jun 11, 2024

#12078 (comment) digresses this conversation in a confusing direction. Please positively assert to us that the block number we provided was reorg'ed instead of everyone just assuming it was. We have other reorgs in our database that don't exhibit this behavior. Furthermore, we should also see the block with zero logs that was reorg'ed out. Thanks.

EDIT: There was not a fork at that block height:
https://filecoin.blockscout.com/block/0x8d5612b14d72fc9a47d59836812ac15438c9b01f956e9c905732c5c855f75b11
https://filecoin.blockscout.com/blocks?tab=reorgs&page=2&next_page_params=%257B%2522block_number%2522%253A3980402%252C%2522items_count%2522%253A50%257D

Again maybe Lotus reporting 0 is correct behavior but it was not doing that because of a reorg.

@eshon
Copy link
Contributor

eshon commented Jun 11, 2024

It looks like eth_getBlockByNumber has these params to identify reorg safety, wondering what Lotus does for safe?

@ZenGround0
Copy link
Contributor

I think this is what @aarshkshah1992 is working on here: https://filecoinproject.slack.com/archives/CP50PPW2X/p1718124022999279

@rjan90 rjan90 moved this from 📌 Triage to 🐱Todo in FilOz Jun 11, 2024
@ashwinphatak
Copy link
Author

EDIT: There was not a fork at that block height:

This is consistent with what the downstream processes (watchers) saw. The watcher code is fully reorg aware.

The first screenshot above already shows it found only one block at height 3976539. The following screenshot shows the reorgs it ran into and how it correctly pruned out the non-canonical blocks.

image

@aarshkshah1992
Copy link
Contributor

Closed by #12080. Will go out in the next Lotus release.

@ashwinphatak
Copy link
Author

Closed by #12080. Will go out in the next Lotus release.

Has this fix made it into a release yet? If not, when is it expected?

I believe this issue exists even when getLogs is called with block hash. Can you confirm?

@rvagg
Copy link
Member

rvagg commented Jun 26, 2024

Didn't make it into 1.27.1 but we discussed a follow up with a 1.27.2 to get this in. No ETA at the moment because there's so much nv23 activity at the moment focused on 1.28.0.
If you were super keen you could try out the 1.28.0 rc, it has most of what's on master including this. But this has a database migration that would make downgrading a node difficult if you wanted to go backwards.

And off the top of my head I believe this deals with the block hash case too.

@aarshkshah1992
Copy link
Contributor

@ashwinphatak Yes, this fix will work for blockhash as well. Let me see if we can get this in an RC for you.

@rjan90 rjan90 moved this from ⌨️In Progress to 🎉 Done in FilOz Jun 30, 2024
@rjan90 rjan90 moved this from 🎉 Done to ☑️Done (Archive) in FilOz Jun 30, 2024
@ashwinphatak ashwinphatak mentioned this issue Jul 10, 2024
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: ☑️ Done (Archive)
Development

No branches or pull requests

9 participants