set OMG.State GenServer timeout to 10s #1517

Merged · 6 commits · Jul 24, 2020

Conversation

@achiurizo (Contributor) commented May 18, 2020

Overview

Increasing the GenServer timeouts for OMG.State, per a conversation with @kevsul. The concern is that the current timeout on the GenServer is too low (the 5-second default), so increasing it should help with testing.

Changes

Increases the timeout to 10 seconds for the OMG.State GenServer

Testing

N/A

@InoMurko (Contributor)

I didn't find any data that would imply increasing timeouts would resolve an issue. Can you point me to logs? What is the issue we're addressing?

The problem with just raising the timeout is that the excessive water you're pouring into a glass needs to go somewhere.

In our case, State is responsible for handling a bunch of Ethereum event handlers, and if someone is waiting for a synchronous response for N seconds, that blocks other processes.

@kevsul (Contributor) commented May 19, 2020

Here's the original issue
https://github.com/omisego/security-issues/issues/26

The problem is that there are two different timeouts involved and they're not in sync. The call chain is as follows (a rough sketch of the mismatch appears after the list):

  1. The client calls transaction.submit via the HTTP API.
  2. Phoenix middleware calls exec on the OMG.State GenServer with a default timeout of 5 seconds.
  3. OMG.State calls the DB GenServer with a timeout of 10 minutes.
  4. If that takes more than 5 seconds, the Phoenix server times out and returns a 500 error to the client.
  5. Meanwhile, the DB call returns before the 10 minutes are up and the tx is successfully submitted.
  6. However, the client thinks that their transaction has not been submitted, which can cause retries, "utxo not found" errors, and much confusion.
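
A rough sketch of the mismatch, using hypothetical module names (`TimeoutMismatch`, `SlowDB`), not the actual OMG code:

```elixir
# Hypothetical sketch: the caller relies on GenServer.call/2 (5_000 ms default),
# while the server itself waits on a slower resource with a 10-minute timeout,
# so the work can finish long after the caller has already given up.
defmodule TimeoutMismatch do
  use GenServer

  @ten_minutes 10 * 60 * 1000

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def init(nil), do: {:ok, nil}

  # Step 2: the HTTP-facing call uses the default 5-second timeout.
  def submit(tx), do: GenServer.call(__MODULE__, {:exec, tx})

  # Step 3: the server blocks on the DB process with a much longer timeout.
  def handle_call({:exec, tx}, _from, state) do
    result = GenServer.call(SlowDB, {:write, tx}, @ten_minutes)
    # Step 5: this reply can be produced after the original caller has already
    # exited with a timeout, so the tx succeeds while the client saw a 500.
    {:reply, result, state}
  end
end
```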

> the excessive water you're pouring into a glass needs to go somewhere.

In this case, I think the 10 minute timeout is soaking it up 😄

That 5-second timeout wasn't a deliberate choice; it just happens to be the default for a GenServer call. The problem is that it's the bottleneck right now. The idea is to remove it and see where the next bottleneck is.

In any case, we really shouldn't tell the client that a tx has failed when it has in fact succeeded.

I believe that in a future implementation of the childchain we can solve this with an asynchronous API, but I don't think that will happen in this version.
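
On that point, a hedged sketch (not from this PR; `SubmitSketch` is hypothetical) of treating a call timeout as "status unknown" rather than a failure:

```elixir
# Hypothetical sketch: a timed-out GenServer.call exits with {:timeout, ...};
# catching that exit lets the API report "status unknown" instead of claiming
# failure, since the server may still complete the work.
defmodule SubmitSketch do
  def submit(tx, fees) do
    try do
      {:ok, GenServer.call(OMG.State, {:exec, tx, fees})}
    catch
      :exit, {:timeout, _} ->
        {:error, :submission_status_unknown}
    end
  end
end
```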

@InoMurko (Contributor) commented May 19, 2020

Maybe I'm reading this incorrectly, but looking at the code, that doesn't seem to be the case.
Your points 3, 4 and 5 are not reflected in the codebase.

OMG.State exec goes like this:

```elixir
def handle_call({:exec, tx, fees}, _from, state) do
  db_utxos =
    tx
    |> Transaction.get_inputs()
    |> fetch_utxos_from_db(state)

  state
  |> Core.with_utxos(db_utxos)
  |> Core.exec(tx, fees)
  |> case do
    {:ok, tx_result, new_state} ->
      {:reply, {:ok, tx_result}, new_state}

    {tx_result, new_state} ->
      {:reply, tx_result, new_state}
  end
end

@spec fetch_utxos_from_db(list(OMG.Utxo.Position.t()), Core.t()) :: UtxoSet.t()
defp fetch_utxos_from_db(utxo_pos_list, state) do
  utxo_pos_list
  |> Stream.reject(&Core.utxo_processed?(&1, state))
  |> Enum.map(&utxo_from_db/1)
  |> UtxoSet.init()
end

defp utxo_from_db(input_pointer) do
  # `DB` query can return `:not_found` which is filtered out by following `is_input_pointer?`
  with {:ok, utxo_kv} <- DB.utxo(Utxo.Position.to_input_db_key(input_pointer)),
       do: utxo_kv
end
```

RocksDB

```elixir
def utxo(utxo_pos, server_name \\ @server_name) do
  GenServer.call(server_name, {:utxo, utxo_pos})
end
```

Reading from the DB has a 5-second timeout (the GenServer.call default).
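
For reference, the per-call timeout is the optional third argument of GenServer.call/3; a sketch of what an explicit value would look like here (not the actual change in this PR):

```elixir
# Sketch only: passing an explicit timeout instead of relying on the
# 5_000 ms default of GenServer.call/2.
@timeout 10_000

def utxo(utxo_pos, server_name \\ @server_name) do
  GenServer.call(server_name, {:utxo, utxo_pos}, @timeout)
end
```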

@InoMurko (Contributor)

> In any case, we really shouldn't tell the client that a tx has failed when it has in fact succeeded.

This is absolutely true. But with the current design of a large singleton OMG State, there's no way around this. Childchain solves this, hopefully. :P @achiurizo

@coveralls commented May 19, 2020

Coverage remained the same at 78.384% when pulling f9c76b1 on achiurizo/genserver-timeouts into 7e4a231 on master.

@kevsul (Contributor) commented May 19, 2020

Ah, you're right @InoMurko, I was getting mixed up with a separate issue where I was seeing the "Reading UTXO set, this might take a while. Allowing #{inspect(@ten_minutes)} ms" log.

But then I don't understand why the tx would actually get submitted, unless the entire OMG.State exec call takes more than 5 seconds, rather than just the fetch_utxos_from_db part.

It has to update the DB with the new tx as well, but it looks like it doesn't wait for that to complete.
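
For context, a hedged illustration (hypothetical module and message shapes, not the OMG code) of why the exec path wouldn't block on the DB write: a cast returns immediately, while a call blocks until the reply or the timeout:

```elixir
defmodule DbClientSketch do
  # Fire-and-forget: returns :ok immediately; the write happens asynchronously.
  def persist_async(server, db_updates) do
    GenServer.cast(server, {:multi_update, db_updates})
  end

  # Blocks the caller until the server replies, or exits after the default 5_000 ms.
  def read_blocking(server, utxo_pos) do
    GenServer.call(server, {:utxo, utxo_pos})
  end
end
```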

@InoMurko (Contributor)

Entirely possible. There are around 10 processes that call State on the same interval (8 seconds); they're all making synchronous calls, and they might be eating away at those 5 seconds.

@kevsul (Contributor) commented May 28, 2020

After further testing I've realised that this error happens when a tx is submitted at the same time that a block is being formed. If a combination of State.form_block and State.exec takes longer than 5 seconds, then an error is returned to the client, despite the tx being submitted correctly.

Increasing the timeout does solve the problem right now, as it allows more time for the block to be formed, although if forming a block takes longer than 10 seconds then we'll get the error again. But if that happens, then we've probably got bigger problems.
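
A hedged sketch of why exec waits behind form_block (hypothetical `SerializedStateSketch`, not the OMG implementation): a GenServer processes messages one at a time, so a slow form_block eats into the exec caller's timeout:

```elixir
defmodule SerializedStateSketch do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  def init(state), do: {:ok, state}

  def handle_call(:form_block, _from, state) do
    # simulate slow block forming
    Process.sleep(6_000)
    {:reply, :block_formed, state}
  end

  def handle_call({:exec, _tx}, _from, state) do
    # fast on its own, but queued behind :form_block
    {:reply, :ok, state}
  end
end

# If :form_block is already in flight, this exits after the default 5 s:
#   GenServer.call(SerializedStateSketch, {:exec, tx})
# while a 10 s timeout, as in this PR, survives the 6 s block-forming delay:
#   GenServer.call(SerializedStateSketch, {:exec, tx}, 10_000)
```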

I think it's fine as a quick fix; we can solve it properly in another version of the childchain.

What do you think? @InoMurko @achiurizo

@boolafish (Contributor)

Wanted to check the status of this PR: is this still a going-to-🛳 PR?
Lots of Sentry issues are complaining about the OMG.State timeout. I don't know if this is the best approach or not, but I've flagged those as waiting for this PR...

@achiurizo merged commit 522fb32 into master on Jul 24, 2020
@achiurizo deleted the achiurizo/genserver-timeouts branch on July 24, 2020
achiurizo added a commit that referenced this pull request on Jul 27, 2020
@unnawut added the chore label (Technical work that does not affect service behaviour) on Aug 28, 2020