Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getHistoryInterpreter: AcquireFailurePointTooOld #434

Closed
erikd opened this issue Dec 10, 2020 · 4 comments
Closed

getHistoryInterpreter: AcquireFailurePointTooOld #434

erikd opened this issue Dec 10, 2020 · 4 comments
Labels
bug Something isn't working priority high issues/PRs that MUST be addressed. The release can't happen without this;

Comments

@erikd
Copy link
Contributor

erikd commented Dec 10, 2020

If the node state is deleted and then node and db-sync started together then sometimes we get a error:

getHistoryInterpreter: AcquireFailurePointTooOld

Would be nice to find a neat work around for this.

@erikd
Copy link
Contributor Author

erikd commented Dec 10, 2020

Currently, when we hit this issue, db-sync will retry (possibly multiple times) and eventually work correctly.

@ArturWieczorek
Copy link
Contributor

@erikd I am using:

db-sync:

git log -1
commit e9474143ba6ce07c2f6d498ee80178c007d68a28 (HEAD -> release/7.1.x, origin/release/7.1.x)
Author: Duncan Coutts <[email protected]>
Date:   Fri Dec 11 16:07:07 2020 +0000

    Version 7.1.2

node: 1.24.2 commit 400d18092ce604352cf36fe5f105b0d7c78be074

Steps for building/running:

node:

git clone https://github.com/input-output-hk/cardano-node
cd cardano-node
git checkout tags/1.24.2
nix-build -A scripts.mainnet.node -o mainnet-node-local

# Start node:
./mainnet-node-local

db-sync:

git clone https://github.com/input-output-hk/cardano-db-sync
cd cardano-db-sync
git checkout release/7.1.x
nix-build -A cardano-db-sync-extended -o db-sync-node-extended

# Create DB:
PGPASSFILE=config/pgpass-mainnet scripts/postgresql-setup.sh --createdb
All good!

# Start db-sync:
PGPASSFILE=config/pgpass-mainnet db-sync-node-extended/bin/cardano-db-sync-extended \
--config config/mainnet-config.yaml \
--socket-path ../cardano-node/state-node-mainnet/node.socket \
--schema-dir schema/ \
--state-dir ledger-state/mainnet 

This is how it ends for me after running db-sync for several minutes (both node and db-sync started around the same time, there is maybe 2 sec delay between starting those components ) - after displaying error it just stops without restart:

[db-sync-node:Info:43] [2020-12-14 10:15:48.36 UTC] insertByronBlock: slot 380000, block 379948, hash d32a3695c15ff1a87fccae2100948f4ff19423136f98cc1a6aeaaf206c5f069b
[db-sync-node:Info:43] [2020-12-14 10:16:17.93 UTC] insertByronBlock: slot 385000, block 384948, hash 5266a9ca0a5195dcb4368968f35900bcf8d1f81a6fb53a93a115bd1e87eefe1b
[db-sync-node:Error:46] [2020-12-14 10:16:19.52 UTC] recvMsgRollForward: FatalError {fatalErrorMessage = "getHistoryInterpreter: AcquireFailurePointTooOld"}
[db-sync-node:Error:43] [2020-12-14 10:16:19.53 UTC] runDBThread: AsyncCancelled
[db-sync-node:Error:38] [2020-12-14 10:16:19.53 UTC] ChainSyncWithBlocksPtcl: FatalError {fatalErrorMessage = "getHistoryInterpreter: AcquireFailurePointTooOld"}
[db-sync-node.Mux:Info:35] [2020-12-14 10:16:19.53 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "state-node-mainnet/node.socket"} event: State: Dead
[db-sync-node.Mux:Notice:35] [2020-12-14 10:16:19.53 UTC] Bearer on ConnectionId {localAddress = LocalAddress "", remoteAddress = LocalAddress "state-node-mainnet/node.socket"} event: Miniprotocol MiniProtocolNum 5 InitiatorDir terminated with exception FatalError {fatalErrorMessage = "getHistoryInterpreter: AcquireFailurePointTooOld"}
[db-sync-node.Subscription:Error:33] [2020-12-14 10:16:19.53 UTC] [String "Application Exception: LocalAddress \"../cardano-node/state-node-mainnet/node.socket\" FatalError {fatalErrorMessage = \"getHistoryInterpreter: AcquireFailurePointTooOld\"}",String "SubscriptionTrace"]
cardano-db-sync-extended: FatalError {fatalErrorMessage = "getHistoryInterpreter: AcquireFailurePointTooOld"}
[db-sync-node.ErrorPolicy:Error:4] [2020-12-14 10:16:19.53 UTC] [String "ErrorPolicyLocalNodeError (ApplicationExceptionTrace (FatalError {fatalErrorMessage = \"getHistoryInterpreter: AcquireFailurePointTooOld\"}))",String "ErrorPolicyTrace",String "LocalAddress \"../cardano-node/state-node-mainnet/node.socket\""]
artur@artur-desktop:~/Projects/db-sync-7-1/cardano-db-sync$

However when I built locally docker image using this version of db-sync and used that image in docker-compose file from GQL repo, db-sync restarted shortly after displaying above error...

@kderme
Copy link
Contributor

kderme commented Dec 14, 2020

Looking at the logs above hash 5266a9ca0a5195dcb4368968f35900bcf8d1f81a6fb53a93a115bd1e87eefe1b this is epoch 17. This is the epoch with a record number of transactions for the Byron era. In my tests the error appears between epochs 16-20, which have many transactions in general. This can't be a coincidence. Looking at logs it takes 10-15s between recvMsgRollForward and when the block is finally added to the DB during this period. This probably shows some big congestion at the DB queue, I haven't verified this with logs for the queue size though.

evalSlotDetails always fails for EBBs and a new interpreter is acquired. When there is a congestion and a new interpreter is required, the AcquireFailurePointTooOld is usually triggered.

I'm not sure if there is any way to actually fix the race condition without changes at ouroboros-netowrk. On our side, we can try to take the interpreter as fast as possible after recvMsgRollForward runs. I found very suspicious that I sometimes get Error: Switchboard's queue full, dropping log items! warnings. Digging deeper, I found this from iohk-monitoring

instance IsEffectuator Switchboard a where
    effectuate switchboard item = do
        let writequeue :: TBQ.TBQueue (LogObject a) -> LogObject a -> IO ()
            writequeue q i = do
                    nocapacity <- atomically $ TBQ.isFullTBQueue q
                    if nocapacity
                    then handleOverflow switchboard
                    else atomically $ TBQ.writeTBQueue q i

        sb <- readMVar (getSB switchboard)

        if (sbRunning sb) == SwitchboardRunning
            then writequeue (sbQueue sb) item
            else TIO.hPutStrLn stderr "Error: Switchboard is not running, dropping log items!"
        
    handleOverflow _ = TIO.hPutStrLn stderr "Error: Switchboard's queue full, dropping log items!"

It seems there is another race condition here: First it checks if the queue is full and then writes on it. But These are not atomic, so the queue may get full in between, so the thread will sleep.

I need to have a better look on exactly how these issues trigger the error.

@vhulchenko-iohk vhulchenko-iohk added bug Something isn't working priority high issues/PRs that MUST be addressed. The release can't happen without this; labels Jan 6, 2021
@erikd
Copy link
Contributor Author

erikd commented Jan 21, 2021

This should have been fixed in PR #495 , specifically, with commit dad6e36 .

@erikd erikd closed this as completed Jan 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority high issues/PRs that MUST be addressed. The release can't happen without this;
Projects
None yet
Development

No branches or pull requests

4 participants