
:ets.lookup error from Cachex Overseer Table #388

Closed
probably-not opened this issue Nov 3, 2024 · 20 comments · Fixed by #392
@probably-not

I'm running code that looks like the following:

Cachex.fetch(:mycache, %CacheKey{}, &wrapped_fetch/1)

And this is throwing the following error:

{:EXIT, {:badarg, [{:ets, :lookup, [:cachex_overseer_table, :mycache], [error_info: %{cause: :id, module: :erl_stdlib_errors}]}, {Cachex.Services.Overseer, :retrieve, 1, [file: ~c"lib/cachex/services/overseer.ex", line: 90]}, {Cachex.Services.Overseer, :with, 2, [file: ~c"lib/cachex/services/overseer.ex", line: 148]}, {Cachex, :fetch, 4, []}]}}

Cachex is started in my application's supervisor with the following spec:

import Cachex.Spec

...
# In the children list:
      {
        Cachex,
        [
          :mycache,
          [
            router: router(module: Cachex.Router.Ring, options: [monitor: true])
          ]
        ]
      }
...

I don't see any other errors - so I expect that all the tables have been started properly. I'm not sure why I would get this issue, but because it is uncaught it's triggering a full crash on every request in my server.

I've currently reverted the use of Cachex in my code due to this error - but I'm happy to help debug this if needed.

@whitfin whitfin added the discuss label Nov 3, 2024
@whitfin
Owner

whitfin commented Nov 3, 2024

This has most likely happened because you don't have :cachex registered as an application in your project, somehow. Are you maybe on an older Elixir version?

You need to make sure that it's included in :applications or :extra_applications, or that it has been started with Application.ensure_all_started(:cachex).
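For example, a minimal sketch (the :logger entry is just a placeholder for whatever your project already lists):

# mix.exs - make sure :cachex is started alongside your application
def application do
  [
    extra_applications: [:logger, :cachex]
  ]
end

# ...or start it manually before the first cache call:
Application.ensure_all_started(:cachex)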

The main overseer table is initialized globally, rather than as part of your single cache supervision tree.

@probably-not
Author

I'm on the latest Elixir (1.17), so I assumed it's started as part of the applications since it has a runtime (like all the other applications I have in my dependencies). I don't see anything in the Cachex docs that says I have to explicitly add it... is this still necessary on later Elixir versions?

@whitfin
Owner

whitfin commented Nov 4, 2024

It shouldn't be necessary, no, but I have no other explanation...

If you start your app via iex -S mix and then do the lookup yourself, I guess this reproduces the issue?

$ iex -S mix
Compiling 56 files (.ex)
Generated cachex app
Erlang/OTP 26 [erts-14.2.5.1] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]

Interactive Elixir (1.16.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :ets.lookup(:cachex_overseer_table, :mycache)
[]

As I said though, this table is constructed by the main application - are you sure there's nothing explicitly disabling that? Nobody else has ever reported this, which makes me feel it's something about your application causing it to not start properly.
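One other quick check worth running (a minimal sketch, nothing project-specific): confirm that the :cachex application itself shows up as started. If everything is wired up correctly, this should return true.

iex(2)> Application.started_applications() |> Enum.any?(fn {app, _desc, _vsn} -> app == :cachex end)
true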

@probably-not
Author

Interesting - I'm not sure why my application specifically would have an issue here - I don't have any issue with other things that start globally. Incidentally, switching to the Local router strategy seemed to solve this (although it may have been solved purely because the app was redeployed). I'll try to dig into the logs and see whether there was some other crash that took down the Overseer process during startup... could there be a crash in the Overseer or in Eternal somewhere that would cause the table not to be found?

I had to revert this in our production environment (which is where it was happening), but I don't mind re-enabling it in the coming days (I just need to find a time when the impact will be at its lowest) to help debug this. If you have any specific things you want me to run to narrow down the issue, let me know; I'm happy to help.

@jcartwright

Hi there! I am seeing a similar issue with our distributed configuration that we just deployed to our staging environment on k8s. We are currently on Elixir 1.14.5 and Erlang/OTP 25.

In the application's Supervisor:

      {Cachex,
       [
         name: Core.Helpers.Cache.domain_cache(),
         hooks: cachex_hooks(),
         router:
           router(
             module: Cachex.Router.Ring,
             options: [
               monitor: true
             ]
           )
       ]}

Getting this error when trying to access Cachex.keys/1 or Cachex.size/1:

iex([email protected])3> Cachex.keys(:domain_cache)
** (FunctionClauseError) no function clause matching in Cachex.Router.result_merge/2

    The following arguments were given to Cachex.Router.result_merge/2:

        # 1
        {:EXIT,
         {:badarg,
          [
            {:ets, :select, [:domain_cache, [{{:_, :"$1", :"$2", :"$3", :"$4"}, [{:orelse, {:==, :"$4", nil}, {:>, {:+, :"$3", :"$4"}, 1731599051817}}], [:"$1"]}]],
             [error_info: %{cause: :id, module: :erl_stdlib_errors}]},
            {Cachex.Actions.Keys, :execute, 2, [file: 'lib/cachex/actions/keys.ex', line: 30]}
          ]}}

        # 2
        []

    (cachex 4.0.2) lib/cachex/router.ex:142: Cachex.Router.result_merge/2
    (elixir 1.14.5) lib/enum.ex:2468: Enum."-reduce/3-lists^foldl/2-0-"/3
    (cachex 4.0.2) lib/cachex/router.ex:284: Cachex.Router.route_cluster/3

iex([email protected])4> Cachex.size(:domain_cache)
** (FunctionClauseError) no function clause matching in Cachex.Router.result_merge/2

    The following arguments were given to Cachex.Router.result_merge/2:

        # 1
        :undefined

        # 2
        2

    (cachex 4.0.2) lib/cachex/router.ex:142: Cachex.Router.result_merge/2
    (elixir 1.14.5) lib/enum.ex:2468: Enum."-reduce/3-lists^foldl/2-0-"/3
    (cachex 4.0.2) lib/cachex/router.ex:284: Cachex.Router.route_cluster/3

@whitfin
Owner

whitfin commented Nov 14, 2024

This is very odd.

Are you both using the same Elixir/OTP versions? I can't reproduce this, at all. If we can somehow narrow it down, obviously happy to fix/resolve it!

For @jcartwright I would like to double check:

Cachex.keys(:domain_cache) vs Cachex.keys(Core.Helpers.Cache.domain_cache())?

@jcartwright

Core.Helpers.Cache.domain_cache() just returns the atom :domain_cache. In the console I just type the atom because it's shorter.

I think @probably-not was on 1.17, so we're not on the same versions.

@whitfin
Owner

whitfin commented Nov 14, 2024

@jcartwright and just so I know, sounds like this is an upgrade? Were you using Cachex with no issues previously? I'm trying to determine if this is a new issue. It most likely is, given this is the first time in 10 years it's been reported... but it's unclear whether it was introduced in Cachex v4.x or it's some other external thing moving (Elixir/OTP versions, for example).

@jcartwright

@whitfin we are already running Cachex in production, but we are implementing "distributed" for the first time. Previously, we just used short TTLs to avoid needing a complex strategy. We are already on 4.x but added the router() config just now.

@probably-not
Author

probably-not commented Nov 14, 2024

I am on 1.17.2 in my production env - so not the same version. Cachex version 4.0.2 if that helps.

We were only seeing it when we added the router to enable distribution; using the Local strategy solved the issue. Our caches aren't that big at the moment, so it isn't a problem for now. However, it does limit being able to use Cachex for anything larger that we would want distribution for.
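For reference, the reverted spec looks roughly like this (a minimal sketch of the Local router setup, same cache as in my first comment):

import Cachex.Spec

...
# In the children list - same cache, routed locally only:
      {
        Cachex,
        [
          :mycache,
          [
            router: router(module: Cachex.Router.Local)
          ]
        ]
      }
...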

@jcartwright

@whitfin I'd be happy to screenshare since you're not able to reproduce.

@whitfin
Owner

whitfin commented Nov 14, 2024

Okay, sounds good, thank you both!

I can probably look this over this weekend and try to narrow it down; it sounds somehow as though it's related to distribution... although that doesn't really make sense to me.

It seems like the cache is dying on one specific node in the cluster. Does it work as expected if you use the Ring router, but set nodes: [node()] (i.e. without monitor)? Just to rule out the routing logic itself.

I guess something is crashing the underlying table somehow when routing is enabled... but a) that shouldn't be possible, so I'm not sure I believe it and b) how is it that the CI tests aren't reproducing this either?

Note to self: both of your reproduction cases have monitor: true, that seems like a good place to begin!

@jcartwright

With the change in cluster / monitor config I am no longer seeing errors. As I suspect is obvious, the cluster is effectively local only now.

      {Cachex,
       [
         name: Core.Helpers.Cache.domain_cache(),
         hooks: cachex_hooks(),
         router:
           router(
             module: Cachex.Router.Ring,
             options: [
               # monitor: true
               nodes: [node()]
             ]
           )
       ]}
iex(core@10.244.2.44)2> Cachex.size(:domain_cache)
{:ok, 5}

iex(core@10.244.2.4)5> Cachex.size(:domain_cache)
{:ok, 6}

@whitfin
Owner

whitfin commented Nov 14, 2024

@jcartwright yeah, it is local only - but the logic is still flowing through the Ring router and all surrounding code, it's just selecting from a list of 1 instead of a list of N. So this would imply that there's an issue either with monitor: true or with the case where there is more than one node.

I wonder if you can try the same test with monitoring disabled, but manually adding more than 1 node? That would help rule out monitoring as the cause. If not, no worries - thank you for testing the case above!
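Roughly like this, for clarity (a minimal sketch; the second node name is just a placeholder for whichever other node is actually in your cluster):

      {Cachex,
       [
         name: Core.Helpers.Cache.domain_cache(),
         hooks: cachex_hooks(),
         router:
           router(
             module: Cachex.Router.Ring,
             options: [
               # no monitor: true here; list cluster members explicitly
               nodes: [node(), :"core@other-node"]
             ]
           )
       ]}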

@jcartwright

I can do whatever you need to try and narrow in and rule out issues.

@jcartwright

So I reverted to the implicit node detection and it is behaving as expected. The cache appears to be distributed across my two nodes. Guess it might be something specific to the monitor: true config.

iex(core@10.244.1.158)3> Cachex.Router.connected()
[:"[email protected]", :"[email protected]", :"[email protected]"]

iex(core@10.244.2.34)15> Cachex.Router.connected()
[:"[email protected]", :"[email protected]", :"[email protected]"]

The Cachex keys/1 and size/1 functions are a bit slow and things are a little flaky. After a few minutes I started getting:

iex(core@10.244.1.158)12> Node.list()
[:"[email protected]"]
iex(core@10.244.1.158)13> get_order("00001bd4-0c9a-430f-9127-2a689dea0437", alternate_id: "5534603542573")
:nodedown

@whitfin
Owner

whitfin commented Nov 24, 2024

Hi @probably-not & @jcartwright!

Thanks to @KiKoS0, this should be resolved in the latest patch version on Hex.pm. Our feeling that it was related to monitor: true was indeed correct; a stupid mistake on my part. I'm sorry I didn't find time to look at this sooner, but hopefully you shouldn't see any further issues here.

Thank you all for your help on this!

@probably-not
Author

@whitfin and @KiKoS0 huge thanks here! I won't be able to validate this until next week unfortunately, but I will report back on whether the bug was solved. Thanks!

@jcartwright

Great news! I also won't be in a position to confirm right away, but will make it a top priority. Thanks for getting a fix in for this, @KiKoS0.

@jcartwright

@whitfin it appears this fix is good in our environment. I'll continue testing and let you know if anything else comes up.
