Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: DB selection support for proxy-bridged cluster #2642

Closed
baronwangr opened this issue Feb 7, 2024 · 8 comments · Fixed by #2646
Closed

Feature Request: DB selection support for proxy-bridged cluster #2642

baronwangr opened this issue Feb 7, 2024 · 8 comments · Fixed by #2646

Comments

@baronwangr
Copy link

Description:
Some of the Proxy-bridged cluster can support multiple db selection.
In order to give full support of advanced features, the proxy-bridged cluster worked as a single node cluster that:

  • support "Cluster" command, if user emit "cluster nodes" it will return its IP with full slot range [0, 16383]
  • "cluster_enabled" is set to 1 in info, with corresponding "redis_mode" a cluster
  • it supports mget/mset series and multi-key commands (non-atomic)

in that mode, it can give the compatibility at best effort and user could write same code for standalone mode or cluster mode, especially when user scales standalone mode to cluster, with db selection data inside.

An unhandled timeout exception was encountered when doing db selection in cluster mode in some early version of the library, and we tried the latest package 2.7.17 but also failed.

Unhandled exception. StackExchange.Redis.RedisTimeoutException: Timeout performing HSET (10000ms), inst: 0, qu: 0, qs: 0, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: 172.123.35.71:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: 47e4280ce570, PerfCounterHelperkeyHashSlot: 11933, IOCP: (Busy=0,Free=1000,Min=32,Max=1000), WORKER: (Busy=0,Free=32767,Min=32,Max=32767), v: 2.2.79.4591

the case can be reproduced by using code below,

        // singleton connection
        ConnectionMultiplexer cm = RedisConnSingleton.getRedisConn();
        var db = cm.GetDatabase();
        

        for (int i = 1; i <= 100; i++)
        {
            string key = i.ToString();
            string value = i.ToString();

            // after switch db selection, the GetDatabase returned a unexpect handle so failed to send
            db.GetDatabase(i); 

            db.StringSet(key, value);
            String ret = db.StringGet(key);
            Console.WriteLine("get key: " + ret);

            Thread.Sleep(1000);
        }    
  • Hope to support db-selection in cluster mode
  • Is there a way of getting correct cluster link to send raw "select" command? Please advise.

Thanks,
Baron

@mgravell
Copy link
Collaborator

mgravell commented Feb 7, 2024

Some of the Proxy-bridged cluster can support multiple db selection.

Which, exactly? You say something can be repro'd, but: we can't repro anything without knowing what server configuration is required. Is this redislabs/redis-cluster-proxy (which seems to be pretty idle in terms of GitHub)? Or is this some other setup that spoofs the redis-cluster feature on top of a different shard model? How does this cluster bridge identify itself? This is an important question because "normal" cluster doesn't support multiple databases, so we disable the feature in the client.

Perhaps alternatively; if everything is going to be routed via the proxy, maybe we don't even need to know.that it is in cluster mode - deferring all of that to the proxy, in which case we need to know how it identified itself simply so we can ignore the cluster aspect.

However, the most interesting thing for me here is that you say you're getting a timeout when changing database, but: I would expect GetDatabase to fail if we detected cluster mode - so this seems to be a different failure mode, and not one that the library is forcing. Alternatively, if we didn't identify a cluster, I would expect everything to work already.

But: I can do precisely zero to investigate any of this without knowing exactly what server/proxy setup you have here.

@baronwangr
Copy link
Author

baronwangr commented Feb 7, 2024

Hi, thanks for the quick response! It could be late at night...
Alibaba cloud Redis supports proxy mode, which begins from 2.8 in static sharding mode (earlier) and now a bridge for native cluster, it can support multi db selection well.

Most of the Redis SDK doesn't support standalone-cluster auto-nego well so cloud vendors are spending times on it. It looks that AWS is supporting unsharded-cluster mode to support user scale their Redis recently, meanwhile alibaba uses proxy to scale.

The issue happened when user scale a standalone mode to proxy-bridged cluster. It worked like:

  1. The cloud vendor, provide master-replica pair of Redis for standalone mode
  2. The Redis pair has a Virtual IP (allocated from user's virtual switch) so user could access the VIP:6379(for example) to access the standalone Redis. (Redis master's service IP:Port is using NAT to bind to VIP:6379)
  3. when user are submit a scale, the managing system is producing a new cluster, and begin to transfer data to the cluster (with proxies).
  4. when data transfer is done, trying to set RO mode or pause the write client to wait for the in-flight write to finish. When it is done, it rebinds the VIP:6379 to proxies(NAT with Load-balance mode). so the scale is done.

For user land, when VIP is switched, all links to VIP will be killed, so user app with SDK will encounter a link bounce.

We are still investigating the timeout issue when switch is done, current debug shows when switch to a cluster mode, some exceptions could be handled inside silently so user get a timeout. It is still unknown that if the client is re-configured into a cluster mode because in wireshark we saw "SELECT" went out and "+OK" returned.

Yes I totally agree that the rejection of SELECT in cluster mode is also a good option with fast failure. In test we found that SELECT can be sent but not well handled. @yangbodong22011(Jedis Reviewer) is working on it now, we will provide test environment.

It is also hoped that the db selection in cluster should not be a vendor specific SDK feature, we just try to find a way to enable it gracefully in cluster mode. Proxy can be fixed to adapt to auto-negotiation way, which we can propose to the Redis community. @soloestoy

We will keep you updated about the debugging, hope you get us well.

Thanks again.

@lyq2333
Copy link

lyq2333 commented Feb 7, 2024

@mgravell Hi, sorry for the bother. I think maybe I find the reason. The variable Source is in InternalFailure state but the function TryPushMessageToBridgeSync returns WriteResult.Success in ExecuteSyncImpl. Finally, source will wait until syncTimeout and return a timeout error.

image

The reason for this is that we catch the exception but return WriteResult.Success in WriteMessageToServerInsideWriteLock.
image

I'm new here. Is this by design or a bug?

@mgravell
Copy link
Collaborator

mgravell commented Feb 7, 2024

If result was Success but we failed internally, then that is odd and probably a bug; however, I'm not sure we actually wrote the select, because the inner exception message shown comes from the code that would have generated that command, however: without a test environment to test against, it is very hard to investigate directly. For this to work, we would need some way to identify that this is a server that supports this usage.

If someone can provide a server endpoint to test against, we can say more.

@yangbodong22011
Copy link

@mgravell I sent you the test environment address via slack, please check it.

mgravell added a commit that referenced this issue Feb 11, 2024
1: don't treat trivial clusters as clusters - Alibaba uses this config
2: report synchronous failures immidiately and accurately
@mgravell
Copy link
Collaborator

Thanks for providing a suitable validation environment; this isn't a "normal" redis configuration, and a reasonable argument could be made that this is a bug in their proxy (it shouldn't positively respond to cluster nodes, IMO), but: it now works!

mgravell added a commit that referenced this issue Feb 13, 2024
* fix #2642

1: don't treat trivial clusters as clusters - Alibaba uses this config
2: report synchronous failures immidiately and accurately

* instead of using node count, use explicit tracking of the DB count

* release notes
@mgravell
Copy link
Collaborator

fix merged; should have new release soon; preview is on myget, 2.7.20 or higher from nuget feed https://www.myget.org/F/stackoverflow/api/v3/index.json

@baronwangr
Copy link
Author

It passed our test and work well now.
Thank @mgravell for the support!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants