set OMG.State GenServer timeout to 10s #1517

Merged · 6 commits · Jul 24, 2020

Conversation

@achiurizo (Contributor) commented May 18, 2020

Overview

Increasing the GenServer timeouts for OMG.State, per a conversation with @kevsul. The concern is that the current timeout on the GenServer is too low (the 5-second default), so increasing it should help with testing.

Changes

Increases the timeout to 10 seconds for the OMG.State GenServer

Testing

N/A

@InoMurko (Contributor)

I didn't find any data that would imply increasing timeouts would resolve an issue. Can you point me to logs? What is the issue we're addressing?

The problem with just raising the timeout is that the excessive water you're pouring into a glass needs to go somewhere.

In our case, State is responsible for handling a bunch of Ethereum event handlers, and if someone is waiting for a synchronous response for N seconds, that blocks other processes.

@kevsul (Contributor) commented May 19, 2020

Here's the original issue
https://github.com/omisego/security-issues/issues/26

The problem is that there are two different timeouts involved and they're not in sync. The call chain is as follows (a rough sketch of the mismatch appears after the list):

  1. The client calls transaction.submit via the HTTP API.
  2. Phoenix middleware calls exec on the OMG.State GenServer with a default timeout of 5 seconds.
  3. OMG.State calls the DB GenServer with a timeout of 10 minutes.
  4. If that takes more than 5 seconds, the Phoenix server times out and returns a 500 error to the client.
  5. Meanwhile, the DB call returns before the 10 minutes are up and the tx is successfully submitted.
  6. However, the client thinks that their transaction has not been submitted, which can cause retries, "utxo not found" errors, and much confusion.
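
A rough sketch of the mismatch, using hypothetical module names (`TimeoutMismatch`, `SlowDB`), not the actual OMG code:

```elixir
# Hypothetical sketch: the caller relies on GenServer.call/2 (5_000 ms default),
# while the server itself waits on a slower resource with a 10-minute timeout,
# so the work can finish long after the caller has already given up.
defmodule TimeoutMismatch do
  use GenServer

  @ten_minutes 10 * 60 * 1000

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def init(nil), do: {:ok, nil}

  # Step 2: the HTTP-facing call uses the default 5-second timeout.
  def submit(tx), do: GenServer.call(__MODULE__, {:exec, tx})

  # Step 3: the server blocks on the DB process with a much longer timeout.
  def handle_call({:exec, tx}, _from, state) do
    result = GenServer.call(SlowDB, {:write, tx}, @ten_minutes)
    # Step 5: this reply can be produced after the original caller has already
    # exited with a timeout, so the tx succeeds while the client saw a 500.
    {:reply, result, state}
  end
end
```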

> the excessive water you're pouring into a glass needs to go somewhere.

In this case, I think the 10 minute timeout is soaking it up 😄

That 5-second timeout wasn't a deliberate choice; it just happens to be the default for a GenServer call. The problem is that it's the bottleneck right now. The idea is to remove it and see where the next bottleneck is.

In any case, we really shouldn't tell the client that a tx has failed when it has in fact succeeded.

I believe that in a future implementation of the childchain we can solve this with an asynchronous API, but I don't think that will happen in this version.
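
On that point, a hedged sketch (not from this PR; `SubmitSketch` is hypothetical) of treating a call timeout as "status unknown" rather than a failure:

```elixir
# Hypothetical sketch: a timed-out GenServer.call exits with {:timeout, ...};
# catching that exit lets the API report "status unknown" instead of claiming
# failure, since the server may still complete the work.
defmodule SubmitSketch do
  def submit(tx, fees) do
    try do
      {:ok, GenServer.call(OMG.State, {:exec, tx, fees})}
    catch
      :exit, {:timeout, _} ->
        {:error, :submission_status_unknown}
    end
  end
end
```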

@InoMurko (Contributor) commented May 19, 2020

Maybe I'm reading this incorrectly, but looking at the code, that doesn't seem to be the case.
Your points 3, 4 and 5 are not reflected in the codebase.

OMG.State exec goes like this:

```elixir
def handle_call({:exec, tx, fees}, _from, state) do
  db_utxos =
    tx
    |> Transaction.get_inputs()
    |> fetch_utxos_from_db(state)

  state
  |> Core.with_utxos(db_utxos)
  |> Core.exec(tx, fees)
  |> case do
    {:ok, tx_result, new_state} ->
      {:reply, {:ok, tx_result}, new_state}

    {tx_result, new_state} ->
      {:reply, tx_result, new_state}
  end
end

@spec fetch_utxos_from_db(list(OMG.Utxo.Position.t()), Core.t()) :: UtxoSet.t()
defp fetch_utxos_from_db(utxo_pos_list, state) do
  utxo_pos_list
  |> Stream.reject(&Core.utxo_processed?(&1, state))
  |> Enum.map(&utxo_from_db/1)
  |> UtxoSet.init()
end

defp utxo_from_db(input_pointer) do
  # `DB` query can return `:not_found` which is filtered out by following `is_input_pointer?`
  with {:ok, utxo_kv} <- DB.utxo(Utxo.Position.to_input_db_key(input_pointer)),
       do: utxo_kv
end
```

RocksDB

```elixir
def utxo(utxo_pos, server_name \\ @server_name) do
  GenServer.call(server_name, {:utxo, utxo_pos})
end
```

Reading from the DB has a 5-second timeout (the GenServer.call default).
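
For reference, the per-call timeout is the optional third argument of GenServer.call/3; a sketch of what an explicit value would look like here (not the actual change in this PR):

```elixir
# Sketch only: passing an explicit timeout instead of relying on the
# 5_000 ms default of GenServer.call/2.
@timeout 10_000

def utxo(utxo_pos, server_name \\ @server_name) do
  GenServer.call(server_name, {:utxo, utxo_pos}, @timeout)
end
```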

@InoMurko (Contributor)

> In any case, we really shouldn't tell the client that a tx has failed when it has in fact succeeded.

This is absolutely true. But with the current design of a large singleton OMG State, there's no way around this. Childchain solves this, hopefully. :P @achiurizo

@coveralls commented May 19, 2020

Coverage remained the same at 78.384% when pulling f9c76b1 on achiurizo/genserver-timeouts into 7e4a231 on master.

@kevsul (Contributor) commented May 19, 2020

Ah, you're right @InoMurko, I was getting mixed up with a separate issue where I was seeing the "Reading UTXO set, this might take a while. Allowing #{inspect(@ten_minutes)} ms" log.

But then I don't understand why the tx would actually get submitted, unless the entire OMG.State exec call takes more than 5 seconds, rather than just the fetch_utxos_from_db part.

It has to update the DB with the new tx as well, but it looks like it doesn't wait for that to complete.
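
For context, a hedged illustration (hypothetical module and message shapes, not the OMG code) of why the exec path wouldn't block on the DB write: a cast returns immediately, while a call blocks until the reply or the timeout:

```elixir
defmodule DbClientSketch do
  # Fire-and-forget: returns :ok immediately; the write happens asynchronously.
  def persist_async(server, db_updates) do
    GenServer.cast(server, {:multi_update, db_updates})
  end

  # Blocks the caller until the server replies, or exits after the default 5_000 ms.
  def read_blocking(server, utxo_pos) do
    GenServer.call(server, {:utxo, utxo_pos})
  end
end
```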

@InoMurko (Contributor)

Entirely possible. There are around 10 processes that call State on the same interval (8 seconds); they're all making synchronous calls, and they might be eating away at those 5 seconds.

@kevsul (Contributor) commented May 28, 2020

After further testing I've realised that this error happens when a tx is submitted at the same time that a block is being formed. If a combination of State.form_block and State.exec takes longer than 5 seconds, then an error is returned to the client, despite the tx being submitted correctly.

Increasing the timeout does solve the problem right now, as it allows more time for the block to be formed, although if forming a block takes longer than 10 seconds then we'll get the error again. But if that happens, then we've probably got bigger problems.
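
A hedged sketch of why exec waits behind form_block (hypothetical `SerializedStateSketch`, not the OMG implementation): a GenServer processes messages one at a time, so a slow form_block eats into the exec caller's timeout:

```elixir
defmodule SerializedStateSketch do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  def init(state), do: {:ok, state}

  def handle_call(:form_block, _from, state) do
    # simulate slow block forming
    Process.sleep(6_000)
    {:reply, :block_formed, state}
  end

  def handle_call({:exec, _tx}, _from, state) do
    # fast on its own, but queued behind :form_block
    {:reply, :ok, state}
  end
end

# If :form_block is already in flight, this exits after the default 5 s:
#   GenServer.call(SerializedStateSketch, {:exec, tx})
# while a 10 s timeout, as in this PR, survives the 6 s block-forming delay:
#   GenServer.call(SerializedStateSketch, {:exec, tx}, 10_000)
```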

I think it's fine as a quick fix; we can solve it properly in another version of the childchain.

What do you think? @InoMurko @achiurizo

@boolafish (Contributor)

Wanted to check the status of this PR: is this still a going-to-🛳 PR?
Lots of Sentry issues are complaining about the OMG.State timeout. I don't know if this is the best approach or not, but I've flagged those as waiting for this PR...

@achiurizo merged commit 522fb32 into master on Jul 24, 2020
@achiurizo deleted the achiurizo/genserver-timeouts branch on July 24, 2020
achiurizo added a commit that referenced this pull request on Jul 27, 2020
@unnawut added the chore label (Technical work that does not affect service behaviour) on Aug 28, 2020