:ets.lookup error from Cachex Overseer Table #388
This has most likely happened because you don't have the :cachex application started. You need to make sure that it's included in your application's dependencies so that it starts alongside your app. The main overseer table is initialized globally, rather than as part of your single cache supervision tree.
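To illustrate the suggestion (the version requirement and the explicit startup call below are illustrative; Mix normally starts dependency applications automatically, so this mostly guards against a `runtime: false` dep or a manually curated `:applications` list):

```elixir
# mix.exs - make sure the dependency is not excluded from the runtime
defp deps do
  [
    # no `runtime: false` here, so the :cachex application starts with yours
    {:cachex, "~> 4.0"}
  ]
end
```

```elixir
# Belt-and-braces check at boot (usually redundant, as Mix starts
# dependency applications automatically):
{:ok, _started} = Application.ensure_all_started(:cachex)
```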
I'm on the latest Elixir (1.17), so I assumed that it's started as part of the applications, since it has a runtime (like all the other applications in my dependencies). I don't see anywhere in the Cachex docs that says I have to explicitly add it... is this still necessary as of the later Elixir versions?
It shouldn't be necessary, no, but I have no other explanation... If you start your app via `iex -S mix`, do you see the same thing?

As I said though, this table is constructed by the main application - are you sure there's nothing explicitly disabling that? Nobody else has ever reported this, which makes me feel it's something about your application causing it to not start properly.
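One quick way to check that theory on the affected node, using only the standard `Application` API (the app name is the only assumption here):

```elixir
# Run in a remote iex session on the node showing the error:
Application.started_applications()
|> Enum.any?(fn {app, _desc, _vsn} -> app == :cachex end)
# => true means the :cachex application (and its global table) started
```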
Interesting - I'm not sure why my application specifically would have an issue here - I don't have any issue with other things that start globally. Incidentally, switching to the Local router made the error stop.

I had to revert this in our production environment (which was where it was happening), but I don't mind returning to it in the coming days (just need to find a time where the impact will be at its lowest) in order to help try and debug this. If you have any specific things that you want me to run to try and narrow down the issue, let me know, I'm happy to help.
Hi there! I am seeing a similar issue with our distributed configuration that we just deployed to our staging environment on k8s. We are currently on Elixir 1.14.5 and Erlang/OTP 25. In the application's Supervisor:
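A sketch of the likely shape of that child spec (names are illustrative, not verbatim; the commenter's fuller config appears later in the thread):

```elixir
# In the application's supervision tree - a Ring-routed cache with
# node monitoring enabled (illustrative reconstruction):
import Cachex.Spec

children = [
  {Cachex,
   [
     name: :domain_cache,
     router:
       router(
         module: Cachex.Router.Ring,
         options: [monitor: true]
       )
   ]}
]

Supervisor.start_link(children, strategy: :one_for_one)
```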
Getting the `:ets.lookup` error from the Cachex Overseer table when trying to access the cache.
This is very odd. Are you both using the same Elixir/OTP versions? I can't reproduce this at all. If we can somehow narrow it down, I'm obviously happy to fix/resolve it! For @jcartwright, I would like to double check a few things about your setup.
I think @probably-not was on 1.17, so we're not on the same versions.
@jcartwright and just so I know, it sounds like this is an upgrade? Were you using Cachex with no issues previously? I'm trying to determine whether this is a new issue. It most likely is, given this is the first time in 10 years it's been reported... but it's unclear whether it was introduced in Cachex v4.x or whether some other external piece moved (Elixir/OTP versions, for example).
@whitfin we are already running Cachex in production, but we are implementing "distributed" for the first time. Previously, we just used short TTLs to avoid needing a complex strategy. We are already on 4.x, but added the distributed router for the first time.
I am on 1.17.2 in my production env - so not the same version. Cachex version 4.0.2, if that helps. We were only seeing it when we added the router to enable distribution; using the Local strategy solved the issue. Our caches aren't that big in any case, so it isn't a problem for now. However, it does limit being able to use Cachex for anything larger that we would want distribution for.
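For context, the "Local strategy" here means falling back to Cachex's default local router instead of the Ring one - roughly (cache name illustrative):

```elixir
import Cachex.Spec

# Local routing: every node serves its own independent cache,
# with no cross-node key routing at all.
{Cachex,
 [
   name: :domain_cache,
   router: router(module: Cachex.Router.Local)
 ]}
```

Since local routing is the default, omitting the `:router` option entirely should be equivalent.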
@whitfin I'd be happy to screenshare since you're not able to reproduce.
Okay, sounds good, thank you both! I can probably look this over this weekend and try to narrow it down; it sounds somehow as though it's related to distribution... although that doesn't really make sense to me. It seems like the cache is dying on one specific node in the cluster. Does it work as expected if you use the Ring router, but set the node list manually instead of using monitoring?

I guess something is crashing the underlying table somehow when routing is enabled... but a) that shouldn't be possible, so I'm not sure I believe it, and b) how is it that the CI tests aren't reproducing this either?

Note to self: both of your reproduction cases appear to have monitoring enabled.
With the change in cluster / monitor config I am no longer seeing errors. As I suspect is obvious, the cluster is effectively local only now.

```elixir
{Cachex,
 [
   name: Core.Helpers.Cache.domain_cache(),
   hooks: cachex_hooks(),
   router:
     router(
       module: Cachex.Router.Ring,
       options: [
         # monitor: true
         nodes: [node()]
       ]
     )
 ]}
```

```elixir
iex(core@10.244.2.44)2> Cachex.size(:domain_cache)
{:ok, 5}
iex(core@10.244.2.4)5> Cachex.size(:domain_cache)
{:ok, 6}
```
@jcartwright yeah, it is local only - but the logic is still flowing through the Ring router and all the surrounding code; it's just selecting from a list of 1 instead of a list of N. So this would imply that there's an issue either in the router itself or in the monitoring.

I wonder if you can try the same test with monitoring disabled, but manually add more than 1 node? That would help rule out monitoring as the cause. If not, no worries - thank you for testing the case above!
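Concretely, that test would look something like the following (node names are placeholders taken from the pod addresses above):

```elixir
import Cachex.Spec

# Ring router with no `monitor: true` - instead a static list of
# more than one node, to isolate monitoring from the router logic.
{Cachex,
 [
   name: :domain_cache,
   router:
     router(
       module: Cachex.Router.Ring,
       options: [
         nodes: [:"core@10.244.2.44", :"core@10.244.2.4"]
       ]
     )
 ]}
```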
I can do whatever you need to try and narrow in and rule out issues. |
So I reverted to the implicit node detection and it is behaving as expected. The cache appears to be distributed across my two nodes. Guess it might be something specific to the monitor option.

```elixir
iex(core@10.244.1.158)3> Cachex.Router.connected()
[:"[email protected]", :"[email protected]", :"[email protected]"]

iex(core@10.244.2.34)15> Cachex.Router.connected()
[:"[email protected]", :"[email protected]", :"[email protected]"]
```

The Cachex router, however, still reports a node that `Node.list()` no longer sees, and calls routed to it fail:

```elixir
iex(core@10.244.1.158)12> Node.list()
[:"[email protected]"]

iex(core@10.244.1.158)13> get_order("00001bd4-0c9a-430f-9127-2a689dea0437", alternate_id: "5534603542573")
:nodedown
```
Hi @probably-not & @jcartwright! Thanks to @KiKoS0, this should be resolved in the latest patch version on Hex.pm. Our feeling that it was related to the monitoring appears to have been correct.

Thank you all for your help on this!
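For anyone landing here later: picking up the fix should just be a dependency bump, since any 4.x requirement matches a patch release (the requirement below is illustrative):

```elixir
# mix.exs
defp deps do
  [
    {:cachex, "~> 4.0"}
  ]
end
```

followed by `mix deps.update cachex`.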
Great news! I won't be in a position to confirm right away either, but I will make it a top priority. Thanks for getting a fix in for this @KiKoS0
@whitfin it appears that this fix is good in our environment. I'll continue testing and let you know if anything else comes up.
I'm running code that looks like the following:
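A minimal sketch of the shape of that call (the cache name and `load/1` helper are illustrative stand-ins, not the original code):

```elixir
defmodule MyApp.Orders do
  # Illustrative read-through fetch; `load/1` stands in for the real loader.
  def get(key) do
    case Cachex.fetch(:my_cache, key, fn k -> {:commit, load(k)} end) do
      {:ok, value} -> {:ok, value}      # cache hit
      {:commit, value} -> {:ok, value}  # populated via the fallback
      {:error, reason} -> {:error, reason}
    end
  end

  defp load(_key), do: %{} # placeholder
end
```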
And this is throwing the `:ets.lookup` error from the Cachex Overseer table, as in the title of this issue.
Cachex is started in my application's supervisor with the following spec:
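Given the note-to-self above that both reproduction cases had monitoring enabled, the spec was presumably a Ring-routed cache of roughly this shape (name illustrative, not verbatim):

```elixir
import Cachex.Spec

# A Ring-routed cache with node monitoring enabled.
{Cachex,
 [
   name: :my_cache,
   router:
     router(
       module: Cachex.Router.Ring,
       options: [monitor: true]
     )
 ]}
```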
I don't see any other errors - so I expect that all the tables have been started properly. I'm not sure why I would get this issue, but because it is uncaught it's triggering a full crash on every request in my server.
I've currently reverted the use of Cachex in my code due to this error - but I'm happy to help debug if you need anything here.