Dalli sometimes returns incorrect values #956
Thanks for submitting. I am somewhat skeptical that this is an actual issue, in part because the reproduction code provided uses `Timeout.timeout`. Wrapping I/O code in timeout blocks like that (as opposed to using built-in connection timeout settings) is an anti-pattern and should never be used in production code. What's happening here is that you're treating the Timeout as if it were non-fatal to the underlying connection (which it absolutely is not) and then reusing the connection. So of course the socket is corrupt. Note that if this were an open, read, or write timeout generated at the connection level, it would get properly translated into a Dalli error and handled there. If you'd like to submit an example that manifests the issue without using `Timeout.timeout`, I'll take another look.
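To illustrate the distinction being drawn here, a hedged sketch (the server address, key, and timeout values are placeholders; `socket_timeout` is Dalli's documented connection-level timeout option):

```ruby
require "dalli"
require "timeout"

# Connection-level timeout: handled inside Dalli's I/O layer and surfaced as a
# Dalli error, without leaving the connection half-read.
client = Dalli::Client.new("localhost:11211", socket_timeout: 1)

# The anti-pattern being described: an external interrupt can fire after the
# request bytes are written but before the response is read, leaving stale
# bytes on the socket while the client object still looks healthy.
begin
  Timeout.timeout(0.5) { client.get("some_key") }
rescue Timeout::Error
  # The connection may now hold an unread response.
end
```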
Thanks for your response. I agree my examples are arbitrary - their purpose is to demonstrate that an inopportune timeout is sufficient to reproduce the issue, not that one should actually be wrapping individual cache calls in their own timeouts. In a more realistic production scenario, it could be something like rack-timeout around the entire request. I hear you that Timeout is fundamentally dicey; you are absolutely correct about its risks.

My argument is that this is still worth addressing as defense-in-depth, because this specific failure mode is so severe. Even if we grant that Timeout is a total anti-pattern and nobody should use it, it is worth thinking about levels of risk. There are more and less catastrophic outcomes possible with timeouts, and a caching system returning incorrect values is among the worst outcomes I can imagine, due to the potential for uncontrolled data exposure.

In practice, request timeouts are still used in complex systems because it is not possible to rely entirely on low-level I/O and library timeouts: the total request time is what matters, not any individual operation, and killing any process that takes too long has its own downsides. Given that request timeouts are commonly used, this is far from a purely theoretical risk; it is a failure mode that could - and has - occurred in production systems.

As with any defense-in-depth argument, it is also worth considering unknown-unknowns. Unicorn and Puma have a variety of signals, and it is hard to be certain none of them could ever result in interrupting a process partway through. It goes without saying that in complex, high-traffic systems, weird failure cases are guaranteed to happen eventually.

Further, the defense-in-depth solution I have proposed is an opt-in argument to use an already existing method in the memcached spec - far from radical - complemented with an approach which locks around the combined write and read components for get_multi. Are there serious downsides to exploring either? Otherwise the balance of risk seems highly favorable to addressing the issue, even though it is a tail risk. I am happy to refine the PRs and add tests as necessary.
So this is not a response to my objection, and it is a misunderstanding of the phrase "defense in depth."

Your "realistic" example represents a serious misunderstanding. With proper handling of an abandoned request, the client that was in use when the timeout fired is discarded and never used again - for example, connections are checked out of a connection pool per request and the pool regenerates anything abandoned mid-request. To cover the Timeout case, the connection simply must not be reused.

So if the client is not reused, the error cannot occur. The error is that there's a misguided attempt to recover the connection and keep using it after the interrupt.

Production systems contain lots of bugs. Whoever wrote this one doesn't understand how a request lifecycle should work. If the application discards the request, then the application discards the resources (database connections, API connections, socket connections) that are currently in use and allows its connection pools to regenerate them. If it doesn't do that, then it's simply built wrong.

Honestly, your response has just firmed up my original suspicion that this is a non-issue. It demonstrates a misunderstanding of how to use persistent resources in request-driven systems. Software systems are not magic, they have rules, and guarding against supposed "unknown unknowns" is not only not feasible, but actively counterproductive. Moreover, your examples are not "unknown unknowns" - they are active misuse. If a developer chooses to actively misuse a library, it's not the responsibility of the library to save them from themselves.

Closing.
I'm not sure I understand your point about connection pooling, because not all Dalli configurations involve a connection pool.

For an unthreaded server like Unicorn, your claim that the connection will never be used again is absolutely not true. There is one connection per process, and the process survives the timeout, so the next request uses the same connection. That's what happens with a default Rails <> Unicorn <> Dalli configuration anyway, so it is still a real danger worth addressing. I would very much be open to alternative approaches that could prevent a connection from being used after a timeout in a non-pooled configuration.

I am sorry if this conversation has frustrated you, but I do not understand why you are being so dismissive, or resorting to unnuanced assertions that whoever built the system doesn't understand how things should work. You are arguing that only active developer misuse can cause this issue - I can demonstrate to you that a default configuration with rack-timeout, which many, many Rails apps use, is sufficient to cause the issue. If the default setups are "misuse" in your view, that is perhaps a valid problem, but it doesn't make the scenario worth ignoring to preserve unsafe defaults.
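For concreteness, a minimal sketch of the per-process pattern being described - this is not Rails' or Unicorn's actual code; the module name and address are made up:

```ruby
require "dalli"

# One memoized client per worker process, reused for every request that the
# process handles. If a request is interrupted mid-read, the next request in
# the same process inherits the same, possibly corrupted, connection.
module ProcessCache
  def self.client
    @client ||= Dalli::Client.new("localhost:11211")
  end
end
```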
Hi @petergoldstein, sorry that we got off on the wrong foot last week. I've come to better appreciate your points about timeouts and request/resource management. It's clear to me now why Timeout.timeout is unsafe, and any request timeout which does not fully restart the process would be playing with fire, even if library/application code tries to make it safer. mperham/connection_pool#75 (comment) was another interesting example of this. So obviously you're right that the library isn't responsible for protecting against timeout misuse; the fundamental safety problem there is the timeouts on their own. I'm carrying that lesson into some other Rails codebases I'm involved with, so I'm grateful you took the time to engage with me and explain it.

However, I'm now wondering if my misguided timeout examples were just a red herring, because they don't help to explain why I and others saw this issue occur right after upgrading. I found some other historical issues in this repo referencing the same or similar behavior, and I spoke directly with an engineer at another company who had mentioned seeing the same behavior a few years ago and who shared that it happened again recently.

I'm curious if you can think of any reason the version upgrade process itself might cause issues? E.g. subtle formatting differences between versions, such that interacting with entries set by the previous version might have unexpected behavior? Just a shot in the dark there. I'm following the advice above by putting the Dalli version in our cache namespace to effectively bust the cache when upgrading, but I'm nonetheless curious about what could be going on. I also see that there have been issues in the past related to reading get_multi responses, addressed in #321, which seems similar enough to what I've seen that it's worth mentioning.

Would any of those additional reports influence your evaluation of whether it could be worth exploring some simple logic in Dalli's read path to verify that a returned value actually corresponds to the requested key? Thanks again for your insights!
Honestly, none of the above makes much sense to me. So no.
Okay, that tracks. For any readers of this thread who do share my concern about the multiple independent reports of Dalli returning incorrect values after upgrading - or for the next team who runs into it and comes looking for a solution - one possible approach is our fork's get implementation, which enforces that only values for the correct key will ever be returned: aha-app@f6da276
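As a wrapper-level illustration of that idea (this is not the fork's actual code; the class name is made up), one can lean on the fact that the pipelined getter uses getkq, which echoes keys back, so get_multi's result hash is keyed by what the server actually returned:

```ruby
require "dalli"

# Hypothetical wrapper: only hand back a value if the response is attributed
# to the exact key we requested. A stale response belonging to another key
# simply won't match the lookup.
class KeyVerifyingCache
  def initialize(client)
    @client = client
  end

  def get(key)
    @client.get_multi(key)[key]
  end
end

cache = KeyVerifyingCache.new(Dalli::Client.new("localhost:11211"))
cache.get("user:42:profile") # => value for that key, or nil
```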
This seems like a valid issue to me. There could be many ways to corrupt a socket; even if Dalli is a memcached library where speed is critical, we should still value correctness over speed.
@petergoldstein If you'd like me to take this one and shepherd it through the PR process, I'd be happy to take it off your plate.
@mperham I'm sorry, but I wholly disagree that this is a valid issue. There's no "there" here. No demonstrated issue of any actual corruption. Just a bizarre search through over a decade of past issues in search of something, anything, that might match the pattern. The original supposed problem - which was basically a simple thread-safety issue - isn't even in the mix anymore. There's an "agreement" here that the original problem is simply user error, but the submitter is still asserting that there's a need based on... what exactly?

Not to mention the architecture issues with the submitted PR, which I noted in my comments (in a much more subdued form).

I closed this quite deliberately, and held my tongue a bit in my responses. But to be clear, I firmly believe that merging this code would be an ethical breach inconsistent with my role as maintainer.
@petergoldstein Forgive me if I'm misunderstanding, but his issue seems to be: a Dalli connection can be corrupted in such a way that a request for B can result in the response for A being returned. A timeout signal is one method of corruption, and he gave sample code that shows it. I certainly try to avoid Timeout usage myself, but I still try to make my gems as reliable as possible even in the face of its usage. Timeout is directly to blame for this grossness in connection_pool. Socket buffer corruption is a common issue with many network client libraries, and that's the purpose of response verification like the key echoed back by memcached's getk - it lets the client check that what it is reading belongs to the request it sent.
Sorry for chiming in - I know it's annoying for maintainers to get lots of responses on closed issues like this. I'd just like to propose an alternative solution to cleanly handle asynchronous interrupts. What redis-client does is track whether a request is still in flight: a counter is bumped before a request is written and decremented once its response has been fully read, and if the counter is non-zero when a new request starts, the connection is assumed to hold unread responses from an aborted request and is discarded. Perhaps such an implementation would be more acceptable to you? I also wholeheartedly agree that Timeout.timeout is heavily discouraged, but it still ends up being used relatively frequently in production.
@mperham No, I don't quite agree that's what's going on here. What was initially reported was a case of pure user error. A non-thread-safe Dalli client was passed around after a Timeout. Unsurprisingly, it was corrupted. Feel free to read my comment above - I think it's a correct summary. I don't believe there's a demonstrated case of that corruption happening outside of user error.

A solution was proposed that adds another method to the Client interface and replaces use of memcached's get with a key-echoing variant. Despite an agreement that the original problem was user error, the proposed solution was nonetheless viewed as a solution to some set of supposed corruption issues dug out of the depths of Dalli issues (3 over a decade of issues). And this one thing he heard from a guy he works with, that they always change the cache key when they upgrade Dalli versions (which would have nothing to do with socket corruption, but hey...), so obviously it's all the same problem. And so the code should obviously be merged... Well, no.

Now, on to your comments. It is certainly possible for a socket to become corrupt. It can happen for reasons other than user error, and it's nice to have tooling that mitigates that. It's even nicer if that mitigation helps in cases of user error - assuming the tradeoffs aren't too onerous. In my view, given the lack of frequency of the issue, the tradeoffs have to be pretty small in terms of maintainability, API coherency, and performance. That said, that doesn't mean a mitigating PR is not possible. It would just need to meet that tradeoff threshold. As you note, in the memcached case the protocol does give the client ways to verify what it is reading.
@casperisfine I'd also be open to something like you're proposing, although I think it might be more complex here because of memcached pipelining (a naive port of the Redis code seems unlikely to work). But if you want to take a swing at it, that'd be fine too.
So I had a quick look, and indeed it will have to work a bit differently, but it should be doable. Unlike Redis, with memcached pipelining you don't know how many responses to expect, but you do know when you are done. So I think incrementing by one before sending the pipelined command, and then decrementing by one once the pipeline is done reading, would work. But again, I just had a quick look.
I'm at RubyKaigi for about a week, so @cornu-ammonis if you feel like trying this, help yourself; otherwise I'll take a stab at it once I'm back.
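A rough sketch of that counter-based guard, under stated assumptions (a toy connection class with made-up method names and a stubbed protocol reader - not Dalli's actual implementation):

```ruby
require "socket"

# Toy connection that tracks how many pipelined batches still have unread
# responses. If a previous call was aborted (e.g. by Timeout.timeout) before
# its responses were consumed, the next call detects this and reconnects
# instead of reading someone else's leftovers.
class GuardedConnection
  def initialize(host, port)
    @host = host
    @port = port
    @socket = TCPSocket.new(host, port)
    @pending = 0
  end

  def pipelined(requests)
    reconnect! if @pending > 0        # stale bytes from an aborted request
    @pending += 1                     # mark "responses owed" before writing
    requests.each { |req| @socket.write(req) }
    responses = read_until_terminator
    @pending -= 1                     # only once everything has been consumed
    responses
  end

  private

  def read_until_terminator
    # Placeholder: a real client parses responses until the terminating reply
    # (for memcached pipelining, the response to a trailing sentinel request).
    []
  end

  def reconnect!
    @socket.close
    @socket = TCPSocket.new(@host, @port)
    @pending = 0
  end
end
```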
Fix: petergoldstein#956 While it's heavily discouraged, `Timeout.timeout` ends up being used relatively frequently in production, so ideally it's better to try to handle it gracefully. This patch is inspired by redis-rb/redis-client@5f82254: before sending a request we increment a counter, and once we fully read the response(s), we decrement it. If the counter is not 0 when we start a request, we know the connection may have unread responses from a previously aborted request, and we automatically discard it.
Proposed patch: #963
Fix: #956 While it's heavily discouraged, `Timeout.timeout` ends up being used relatively frequently in production, so ideally it's better to try to handle it gracefully. This patch is inspired by redis-rb/redis-client@5f82254: before sending a request we flip a flag, and once we fully read the response(s), we flip it back. If the flag is not `false` when we start a request, we know the connection may have unread responses from a previously aborted request, and we automatically discard it. Co-authored-by: Jean Boussier <[email protected]>
This is causing catastrophic corruption in our app. The fix is hugely appreciated, but can you kindly release a new version? :)
@rgaufman if it's really catastrophic, surely you can run it as a git gem right away?
This is what I'm doing, but a release would be much appreciated, as this seems like quite a critical issue that can cause data corruption, information leaks, etc.
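For anyone in the same position, running the patched code ahead of a release looks something like this (a sketch; the branch name shown is an assumption - pin whatever ref you have actually verified):

```ruby
# Gemfile
# Point Bundler at the repository until the fix ships in a released gem.
gem "dalli", git: "https://github.com/petergoldstein/dalli.git", branch: "main"
```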
Hi, thanks for all your work maintaining this repo! I am a security engineer investigating some incorrect Rails caching behavior observed after upgrading to the latest version from 2.7.x, and have identified a mechanism consistent with my findings:
There is a race condition in the pipelined getter implementation which allows for socket corruption. The key problem is that a lock (`request_in_progress`) is acquired on the connection only after sending the getkq requests; if something happens between sending those requests and acquiring the connection lock, the socket could be left in a state which is both corrupt and (superficially) ready to accept future requests.

In the corrupted state, subsequent operations which result in data being read from the socket will have incorrect behavior, because they expect that the next bytes on the socket are the responses to their own requests, but the unread getkq responses will precede their own responses. This could cause the client to return incorrect values for given keys, because the writing and reading operations are out of sync / "off-by-x", where x is the number of responses to the getkq requests.
Suggested fix
I propose we should instead acquire a connection lock before writing the getkq requests to the socket. This means that if something happens after writing the getkq requests but before reading their responses, the connection will be left in the locked (`request_in_progress`) state and will refuse to accept additional requests (resetting instead). I am opening a PR which attempts to do this.

Open question
Note that, reading the 2.7.x code, it seems it could have had the same issue. I am unsure whether something else changed to make this failure case more likely, or whether it was just a coincidence. The time window for a timeout/failure leading to this bug is narrow, so this would certainly be a rare event with randomness at play; nonetheless, I observed the incorrect behavior not long after upgrading to 3.2.3, which is suspicious, and I would like to be able to explain what, if anything, changed.
Demo/POC
In a Rails console, I used the following steps to reproduce this issue:
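The original console snippet did not survive in this copy of the thread; the sketch below reconstructs the kind of steps described (key names, values, and the timeout duration are assumptions, and the repro is timing-dependent, so it may take several attempts). The observed output is preserved on the line after the block.

```ruby
require "dalli"
require "timeout"

client = Dalli::Client.new("localhost:11211")
client.set("test1", "test1result1")
client.set("test2", "test2result2")

# Interrupt a pipelined get after the getkq requests are written but before
# their responses are read, leaving unread responses on the socket.
begin
  Timeout.timeout(0.00001) { client.get_multi("test1", "test2") }
rescue Timeout::Error
  # The connection now looks usable but is "off-by-x".
end

# A later read on the same connection consumes the stale getkq responses and
# can return the value belonging to a different key:
client.get("test2")
```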
=> "test1result1"