Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

standalone -REDIRECT handles special case of MULTI context #895

Merged
merged 6 commits into from
Aug 30, 2024

Conversation

soloestoy
Copy link
Member

@soloestoy soloestoy commented Aug 12, 2024

In standalone mode, when a -REDIRECT error occurs, special handling is required if the client is in the MULTI context.

We have adopted the same handling method as the cluster mode:

  1. If a command in the transaction encounters a REDIRECT at the time of queuing, the execution of EXEC will return an EXECABORT error (we expect the client to redirect and discard the transaction upon receiving a REDIRECT). That is:

    MULTI    ==>  +OK
    SET x y  ==>  -REDIRECT
    EXEC     ==>  -EXECABORT
    
  2. If all commands are successfully queued (i.e., QUEUED results are received) but a redirect is detected during EXEC execution (such as a primary-replica switch), a REDIRECT is returned to instruct the client to perform a redirect. That is:

    MULTI    ==>  +OK
    SET x y  ==>  +QUEUED
    failover
    EXEC     ==>  -REDIRECT
    

@soloestoy soloestoy requested review from PingXie and madolson August 12, 2024 09:45
Copy link

codecov bot commented Aug 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.44%. Comparing base (4a9b4f6) to head (72812a6).
Report is 16 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #895      +/-   ##
============================================
+ Coverage     70.41%   70.44%   +0.02%     
============================================
  Files           114      114              
  Lines         61736    61742       +6     
============================================
+ Hits          43474    43495      +21     
+ Misses        18262    18247      -15     
Files with missing lines Coverage Δ
src/server.c 88.59% <100.00%> (+0.01%) ⬆️

... and 8 files with indirect coverage changes

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, rejectCommand seem better in this case.

@zuiderkwast
Copy link
Contributor

In cluster, EXEC returns -EXECABORT if any command in the transaction returned MOVED instead of QUEUED. There's a test case here: https://github.com/valkey-io/valkey/blob/8.0.0-rc1/tests/unit/cluster/transactions-on-replica.tcl#L48

test "MULTI-EXEC with write operations is MOVED" {
    $replica MULTI
    catch {$replica HSET h b 4} err
    assert {[string range $err 0 4] eq {MOVED}}
    catch {$replica exec} err
    assert {[string range $err 0 8] eq {EXECABORT}}
}

I think standalone should behave similarly, i.e.

MULTI    ==>  +OK
SET x y  ==>  -REDIRECT
EXEC     ==>  -EXECABORT

@soloestoy
Copy link
Member Author

soloestoy commented Aug 13, 2024

@zuiderkwast you are right, the scenario you mentioned is when EXEC is actually executed, but there is another scenario where EXEC will not be executed:

MULTI    ==>  +OK
SET x y  ==>  +QUEUED
slot {x} is migrated to other node
EXEC     ==>  -MOVED

Currently, this PR also adopts this method to return -REDIRECT. Do you think we should consider changing it to return -EXECABORT for both cluster and standalone?

@hwware
Copy link
Member

hwware commented Aug 13, 2024

I think follow the rejectCommand logic is fine, but my concern is if we should add or update the error message for this case. In this pr, although we update the code flow to rejectCommand style, the error message still keep "REDIRECT"

@zuiderkwast
Copy link
Contributor

zuiderkwast commented Aug 13, 2024

I'm not sure.

The documented behaviour is that EXEC should not fail in this case. It should just return the errors (redirects) within the array returned by EXEC. This is documented on the valkey-transactions man page:

   Errors inside a transaction
       During  a  transaction  it  is possible to encounter two kinds of
       command errors:

       • A command may fail to be queued, so there may be an  error  be‐
         fore EXEC is called.  For instance the command may be syntacti‐
         cally  wrong  (wrong  number  of arguments, wrong command name,
         ...), or there may be some critical condition like  an  out  of
         memory  condition (if the server is configured to have a memory
         limit using the maxmemory directive).

       • A command may fail after EXEC is called, for instance since  we
         performed an operation against a key with the wrong value (like
         calling a list operation against a string value).

       The  server  will detect an error during the accumulation of com‐
       mands.  It will then refuse to execute the transaction  returning
       an error during EXEC, discarding the transaction.

       Errors  happening after EXEC instead are not handled in a special
       way: all the other commands will be executed even if some command
       fails during the transaction.

       This is more clear on the protocol level.  In the following exam‐
       ple one command will fail when executed even  if  the  syntax  is
       right:

              Trying 127.0.0.1...
              Connected to localhost.
              Escape character is '^]'.
              MULTI
              +OK
              SET a abc
              +QUEUED
              LPOP a
              +QUEUED
              EXEC
              *2
              +OK
              -WRONGTYPE Operation against a key holding the wrong kind of value

It's not very good that some commands can succeed and some other commands can fail in the same transaction, but that seems to be the case in the example with -WRONGTYPE.

For redirect after queueing, I would prefer that a client has some possibility to retry the transaction on the correct node. If we implement a good behaviour and document it, it is not unrealistic for clients to implement redirect of transactions. A transaction is usually sent in a pipeline anyway. (It can be a problem if WATCH is used though. WATCH is typically not in the same pipeline, as the user normally needs to check some value after WATCH-ing a key.)

@gmbnomis
Copy link
Contributor

The documented behaviour is that EXEC should not fail in this case. It should just return the errors (redirects) within the array returned by EXEC. This is documented on the valkey-transactions man page:

Looking at the test cases in multi.tcl the MOVED case mentioned by @soloestoy is not the only case in which EXEC may return an error without issuing an error for a queued command. The general idea seems to be: if something fundamental has changed on the server and we wouldn't accept the queued commands anymore at EXEC time, then let's abort the transaction.

In standalone mode, this happens e.g. when the role changes after write commands have already been queued successfully (EXECABORT Transaction discarded because of: READONLY You can't write against a read only replica.)

   The  server  will detect an error during the accumulation of com‐
   mands.  It will then refuse to execute the transaction  returning
   an error during EXEC, discarding the transaction.

At least, the behavior I described above is not contradicting this sentence (when read in isolation) 😉

For redirect after queueing, I would prefer that a client has some possibility to retry the transaction on the correct node. If we implement a good behaviour and document it, it is not unrealistic for clients to implement redirect of transactions.

Yes, in general I think that pointing the client into the right direction is helpful. Although a MOVED or REDIRECT response to EXEC is somewhat surprising, it is the clearest indication of what to do next, IMHO.

It can be a problem if WATCH is used though. WATCH is typically not in the same pipeline, as the user normally needs to check some value after WATCH-ing a key.

This depends on the abstraction the client offers. Looking at redis-py for example, watch seems to be part of the pipeline and there is even a convenience function transaction which handles all the retry/watch boilerplate. (That being said, I have no idea whether following a redirect response would actually be easy in that client)

@PingXie
Copy link
Member

PingXie commented Aug 14, 2024

I don't think the documentation accounts for the "redirection" case, which is understandable since the MULTI command predates the cluster. So if the documentation was never updated after the introduction of cluster (along with the redirection behavior), we are effectively entering an undefined territory.

Ignoring the fact that this is the current cluster behavior, I would still think returning MOVED for EXEC makes sense in general, in the cluster mode, because it allows the client to respond to a topology change more promptly (as opposed to waiting for a periodical topology refresher timer to go off).

I would also agree that, ideally, standalone should follow suit and align with the cluster's behavior. I am not sure if the standalone client has this concept of "topology refresh" and if not, even though the "MOVED" error could be confusing, it would still be treated as "abort" effectively; but if yes, "MOVED" would be a better error IMO.

@soloestoy
Copy link
Member Author

test "MULTI-EXEC with write operations is MOVED" {
    $replica MULTI
    catch {$replica HSET h b 4} err
    assert {[string range $err 0 4] eq {MOVED}}
    catch {$replica exec} err
    assert {[string range $err 0 8] eq {EXECABORT}}
}

@PingXie Do you think we should also modify the response for cluster to return MOVED when executing EXEC in this situation?

p.s. actually, we originally returned MOVED, the change to return EXECABORT was not intentional, it was a side effect of this redis/redis@e2046b3, and the test case was also adapted after redis/redis#7881.

@PingXie
Copy link
Member

PingXie commented Aug 16, 2024

@PingXie Do you think we should also modify the response for cluster to return MOVED when executing EXEC in this situation?

Yes. That would be my preference (and the reason being it gives more chance to the client to update its view of the cluster topology).

@soloestoy
Copy link
Member Author

current fix is:
image

option 1:
image

option 2:
image

@valkey-io/core-team please vote, keep current fix or pick 1 or 2?

p.s. option 2 is difficult to achieve for the cluster, as it is unable to check the slot to which the error should be attributed if an error occurs during enqueuing.

@zuiderkwast
Copy link
Contributor

I vote for option 1.

✔️ The client will get a redirect either from one of the commands inside the transaction (instead of QUEUED) or from EXEC, so a smart client can figure this out.

✔️ EXEC consistently returns -EXECABORT if any command returned error instead of QUEUED. Redirects are errors in this sense.

@enjoy-binbin
Copy link
Member

thanks for doing this pic, it look clear. i vote for the option 1, we keep the old cluster way and also make standlone to be the same way

@hwware
Copy link
Member

hwware commented Aug 28, 2024

option 1 is my choice, Thanks @soloestoy

@soloestoy
Copy link
Member Author

It looks like we've reached a consensus chose option 1. @PingXie, if you don't object, I'll go ahead : )

Before make the changes, I think it's better to merge #961 first. If an error occurs within the transaction, we will clear the commands and only retain the flag. This way, both the cluster and standalone instances can directly allow the EXEC command to return -EXECABORT, making the implementation much simpler. @zuiderkwast please take a look.

@PingXie
Copy link
Member

PingXie commented Aug 29, 2024

It looks like we've reached a consensus chose option 1. @PingXie, if you don't object, I'll go ahead : )

I still like option 2's experience but option 1 works for me too and I can see it being the safer option as it doesn't change the existing behavior on clusters.

We should also update the documentation for both cluster and standalone after closing this PR.

@madolson
Copy link
Member

Option 1 also seems OK to me.

@zuiderkwast zuiderkwast added the needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. label Aug 29, 2024
@soloestoy
Copy link
Member Author

soloestoy commented Aug 29, 2024

After #961, we don't need to modify any code to achieve option 1, we just need to add test cases : )

Signed-off-by: zhaozhao.zz <[email protected]>
Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After #961, we don't need to modify any code to achieve option 1, we just need to add test cases : )

LGTM. so since we reset the mstate (actually is the cmd_flags), the exec can go directly to EXECABORT (it did not pass the CLIENT_CAPA_REDIRECT check), that is new to me.

@soloestoy soloestoy merged commit 743f5ac into valkey-io:unstable Aug 30, 2024
47 checks passed
madolson pushed a commit that referenced this pull request Sep 2, 2024
In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.

We have adopted the same handling method as the cluster mode:

1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  -REDIRECT
    EXEC     ==>  -EXECABORT
    ```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  +QUEUED
    failover
    EXEC     ==>  -REDIRECT
    ```

---------

Signed-off-by: zhaozhao.zz <[email protected]>
madolson pushed a commit that referenced this pull request Sep 3, 2024
In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.

We have adopted the same handling method as the cluster mode:

1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  -REDIRECT
    EXEC     ==>  -EXECABORT
    ```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  +QUEUED
    failover
    EXEC     ==>  -REDIRECT
    ```

---------

Signed-off-by: zhaozhao.zz <[email protected]>
@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Sep 4, 2024
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…#895)

In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.

We have adopted the same handling method as the cluster mode:

1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  -REDIRECT
    EXEC     ==>  -EXECABORT
    ```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  +QUEUED
    failover
    EXEC     ==>  -REDIRECT
    ```

---------

Signed-off-by: zhaozhao.zz <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…#895)

In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.

We have adopted the same handling method as the cluster mode:

1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  -REDIRECT
    EXEC     ==>  -EXECABORT
    ```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  +QUEUED
    failover
    EXEC     ==>  -REDIRECT
    ```

---------

Signed-off-by: zhaozhao.zz <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…#895)

In standalone mode, when a `-REDIRECT` error occurs, special handling is
required if the client is in the `MULTI` context.

We have adopted the same handling method as the cluster mode:

1. If a command in the transaction encounters a `REDIRECT` at the time
of queuing, the execution of `EXEC` will return an `EXECABORT` error (we
expect the client to redirect and discard the transaction upon receiving
a `REDIRECT`). That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  -REDIRECT
    EXEC     ==>  -EXECABORT
    ```
2. If all commands are successfully queued (i.e., `QUEUED` results are
received) but a redirect is detected during `EXEC` execution (such as a
primary-replica switch), a `REDIRECT` is returned to instruct the client
to perform a redirect. That is:

    ```
    MULTI    ==>  +OK
    SET x y  ==>  +QUEUED
    failover
    EXEC     ==>  -REDIRECT
    ```

---------

Signed-off-by: zhaozhao.zz <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
soloestoy added a commit to valkey-io/valkey-doc that referenced this pull request Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. release-notes This issue should get a line item in the release notes
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

8 participants