
:ets.lookup error from Cachex Overseer Table #388

Closed
probably-not opened this issue Nov 3, 2024 · 20 comments · Fixed by #392
@probably-not

I'm running code that looks like the following:

Cachex.fetch(:mycache, %CacheKey{}, &wrapped_fetch/1)

And this is throwing the following error:

{:EXIT, {:badarg, [{:ets, :lookup, [:cachex_overseer_table, :mycache], [error_info: %{cause: :id, module: :erl_stdlib_errors}]}, {Cachex.Services.Overseer, :retrieve, 1, [file: ~c"lib/cachex/services/overseer.ex", line: 90]}, {Cachex.Services.Overseer, :with, 2, [file: ~c"lib/cachex/services/overseer.ex", line: 148]}, {Cachex, :fetch, 4, []}]}}

Cachex is started in my application's supervisor with the following spec:

import Cachex.Spec

...
# In the children list:
      {
        Cachex,
        [
          :mycache,
          [
            router: router(module: Cachex.Router.Ring, options: [monitor: true])
          ]
        ]
      }
...

I don't see any other errors - so I expect that all the tables have been started properly. I'm not sure why I would get this issue, but because it is uncaught it's triggering a full crash on every request in my server.

I've currently reverted the use of Cachex in my code due to this error - but I'm happy to help debug this if needed.

@whitfin whitfin added the discuss label Nov 3, 2024
@whitfin
Owner

whitfin commented Nov 3, 2024

This has most likely happened because you don't have :cachex registered as an application in your project, somehow. Are you maybe on an older Elixir version?

You need to make sure that it's included in :applications or :extra_applications, or that it has been started with Application.ensure_all_started(:cachex).
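For example, a minimal sketch (the :logger entry is just a placeholder for whatever your project already lists):

# mix.exs - make sure :cachex is started alongside your application
def application do
  [
    extra_applications: [:logger, :cachex]
  ]
end

# ...or start it manually before the first cache call:
Application.ensure_all_started(:cachex)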

The main overseer table is initialized globally, rather than as part of your single cache supervision tree.

@probably-not
Author

I'm on the latest Elixir (1.17), so I assumed it's started as part of the applications since it has a runtime (like all the other applications I have in my dependencies). I don't see anything in the Cachex docs that says I have to explicitly add it... is this still necessary on later Elixir versions?

@whitfin
Owner

whitfin commented Nov 4, 2024

It shouldn't be necessary, no, but I have no other explanation...

If you start your app via iex -S mix and then do the lookup yourself, I guess this reproduces the issue?

$ iex -S mix
Compiling 56 files (.ex)
Generated cachex app
Erlang/OTP 26 [erts-14.2.5.1] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Interactive Elixir (1.16.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :ets.lookup(:cachex_overseer_table, :mycache)
[]

As I said though, this table is constructed by the main application - are you sure there's nothing explicitly disabling that? Nobody else has ever reported this, which makes me feel it's something about your application causing it to not start properly.
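One other quick check worth running (a minimal sketch, nothing project-specific): confirm that the :cachex application itself shows up as started. If everything is wired up correctly, this should return true.

iex(2)> Application.started_applications() |> Enum.any?(fn {app, _desc, _vsn} -> app == :cachex end)
true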

@probably-not
Author

Interesting - I'm not sure why my application specifically would have an issue here - I don't have any issue with other things that start globally. Incidentally, switching to the Local router strategy seemed to solve this (although it may have been solved purely because the app was redeployed). I'll try to dig into the logs and see whether there was some other crash that took down the Overseer process during startup... could there be a crash in the Overseer or in Eternal somewhere that would cause the table not to be found?

I had to revert this in our production environment (which is where it was happening), but I don't mind re-enabling it in the coming days (I just need to find a time when the impact will be at its lowest) to help debug this. If you have any specific things you want me to run to narrow down the issue, let me know; I'm happy to help.

@jcartwright

Hi there! I am seeing a similar issue with our distributed configuration that we just deployed to our staging environment on k8s. We are currently on Elixir 1.14.5 and Erlang/OTP 25.

In the application's Supervisor:

      {Cachex,
       [
         name: Core.Helpers.Cache.domain_cache(),
         hooks: cachex_hooks(),
         router:
           router(
             module: Cachex.Router.Ring,
             options: [
               monitor: true
             ]
           )
       ]}

Getting this error when trying to access Cachex.keys/1 or Cachex.size/1:

iex([email protected])3> Cachex.keys(:domain_cache)
** (FunctionClauseError) no function clause matching in Cachex.Router.result_merge/2

    The following arguments were given to Cachex.Router.result_merge/2:

        # 1
        {:EXIT,
         {:badarg,
          [
            {:ets, :select, [:domain_cache, [{{:_, :"$1", :"$2", :"$3", :"$4"}, [{:orelse, {:==, :"$4", nil}, {:>, {:+, :"$3", :"$4"}, 1731599051817}}], [:"$1"]}]],
             [error_info: %{cause: :id, module: :erl_stdlib_errors}]},
            {Cachex.Actions.Keys, :execute, 2, [file: 'lib/cachex/actions/keys.ex', line: 30]}
          ]}}

        # 2
        []

    (cachex 4.0.2) lib/cachex/router.ex:142: Cachex.Router.result_merge/2
    (elixir 1.14.5) lib/enum.ex:2468: Enum."-reduce/3-lists^foldl/2-0-"/3
    (cachex 4.0.2) lib/cachex/router.ex:284: Cachex.Router.route_cluster/3

iex([email protected])4> Cachex.size(:domain_cache)
** (FunctionClauseError) no function clause matching in Cachex.Router.result_merge/2

    The following arguments were given to Cachex.Router.result_merge/2:

        # 1
        :undefined

        # 2
        2

    (cachex 4.0.2) lib/cachex/router.ex:142: Cachex.Router.result_merge/2
    (elixir 1.14.5) lib/enum.ex:2468: Enum."-reduce/3-lists^foldl/2-0-"/3
    (cachex 4.0.2) lib/cachex/router.ex:284: Cachex.Router.route_cluster/3

@whitfin
Owner

whitfin commented Nov 14, 2024

This is very odd.

Are you both using the same Elixir/OTP versions? I can't reproduce this, at all. If we can somehow narrow it down, obviously happy to fix/resolve it!

For @jcartwright I would like to double check:

Cachex.keys(:domain_cache) vs Cachex.keys(Core.Helpers.Cache.domain_cache())?

@jcartwright

Core.Helpers.Cache.domain_cache() just returns the atom :domain_cache. In the console I just type the atom because it's shorter.

I think @probably-not was on 1.17, so we're not on the same versions.

@whitfin
Owner

whitfin commented Nov 14, 2024

@jcartwright and just so I know, sounds like this is an upgrade? Were you using Cachex with no issues previously? I'm trying to determine if this is a new issue. It most likely is, given this is the first time in 10 years it's been reported... but it's unclear whether it was introduced in Cachex v4.x or it's some other external thing moving (Elixir/OTP versions, for example).

@jcartwright

@whitfin we are already running Cachex in production, but we are implementing "distributed" for the first time. Previously, we just used short TTLs to avoid needing a complex strategy. We are already on 4.x but added the router() config just now.

@probably-not
Author

probably-not commented Nov 14, 2024

I am on 1.17.2 in my production env - so not the same version. Cachex version 4.0.2 if that helps.

We were only seeing it when we added the router to enable distribution; using the Local strategy solved the issue. Our caches aren't that big at the moment, so it isn't a problem for now. However, it does limit being able to use Cachex for anything larger that we would want distribution for.
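For reference, the reverted spec looks roughly like this (a minimal sketch of the Local router setup, same cache as in my first comment):

import Cachex.Spec

...
# In the children list - same cache, routed locally only:
      {
        Cachex,
        [
          :mycache,
          [
            router: router(module: Cachex.Router.Local)
          ]
        ]
      }
...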

@jcartwright

@whitfin I'd be happy to screenshare since you're not able to reproduce.

@whitfin
Owner

whitfin commented Nov 14, 2024

Okay, sounds good, thank you both!

I can probably look this over this weekend and try to narrow it down; it sounds somehow as though it's related to distribution... although that doesn't really make sense to me.

It seems like the cache is dying on one specific node in the cluster. Does it work as expected if you use the Ring router, but set nodes: [node()] (i.e. without monitor)? Just to rule out the routing logic itself.

I guess something is crashing the underlying table somehow when routing is enabled... but a) that shouldn't be possible, so I'm not sure I believe it and b) how is it that the CI tests aren't reproducing this either?

Note to self: both of your reproduction cases have monitor: true, that seems like a good place to begin!

@jcartwright

With the change in cluster / monitor config I am no longer seeing errors. As I suspect is obvious, the cluster is effectively local only now.

      {Cachex,
       [
         name: Core.Helpers.Cache.domain_cache(),
         hooks: cachex_hooks(),
         router:
           router(
             module: Cachex.Router.Ring,
             options: [
               # monitor: true
               nodes: [node()]
             ]
           )
       ]}
iex(core@10.244.2.44)2> Cachex.size(:domain_cache)
{:ok, 5}

iex(core@10.244.2.4)5> Cachex.size(:domain_cache)
{:ok, 6}

@whitfin
Owner

whitfin commented Nov 14, 2024

@jcartwright yeah, it is local only - but the logic is still flowing through the Ring router and all surrounding code, it's just selecting from a list of 1 instead of a list of N. So this would imply that there's an issue either with monitor: true or with the case where there is more than one node.

I wonder if you can try the same test with monitoring disabled, but manually adding more than 1 node? That would help rule out monitoring as the cause. If not, no worries - thank you for testing the case above!
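Roughly like this, for clarity (a minimal sketch; the second node name is just a placeholder for whichever other node is actually in your cluster):

      {Cachex,
       [
         name: Core.Helpers.Cache.domain_cache(),
         hooks: cachex_hooks(),
         router:
           router(
             module: Cachex.Router.Ring,
             options: [
               # no monitor: true here; list cluster members explicitly
               nodes: [node(), :"core@other-node"]
             ]
           )
       ]}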

@jcartwright

I can do whatever you need to try and narrow in and rule out issues.

@jcartwright

So I reverted to the implicit node detection and it is behaving as expected. The cache appears to be distributed across my two nodes. Guess it might be something specific to the monitor: true config.

iex(core@10.244.1.158)3> Cachex.Router.connected()
[:"[email protected]", :"[email protected]", :"[email protected]"]

iex(core@10.244.2.34)15> Cachex.Router.connected()
[:"[email protected]", :"[email protected]", :"[email protected]"]

The Cachex keys/1 and size/1 functions are a bit slow and things are a little flaky. After a few minutes I started getting:

iex(core@10.244.1.158)12> Node.list()
[:"[email protected]"]
iex(core@10.244.1.158)13> get_order("00001bd4-0c9a-430f-9127-2a689dea0437", alternate_id: "5534603542573")
:nodedown

@whitfin
Owner

whitfin commented Nov 24, 2024

Hi @probably-not & @jcartwright!

Thanks to @KiKoS0, this should be resolved in the latest patch version on Hex.pm. Our feeling that it was related to monitor: true was indeed correct; a stupid mistake on my part. I'm sorry I didn't find time to look at this sooner, but hopefully you shouldn't see any further issues here.

Thank you all for your help on this!

@probably-not
Author

@whitfin and @KiKoS0 huge thanks here! I won't be able to validate this until next week unfortunately, but I will report back on whether the bug was solved. Thanks!

@jcartwright

Great news! I also won't be in a position to confirm right away, but will make it a top priority. Thanks for getting a fix in for this, @KiKoS0.

@jcartwright

@whitfin it appears this fix is good in our environment. I'll continue testing and let you know if anything else comes up.
