[core] Refactor how we store data in redis. #46861
Conversation
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
ping @rkooo567
can you link the original issue?
There's no issue; it's from a community PR #46706
Looks good in general. Many minor comments.
Besides, how do we test them now?
@@ -141,7 +141,7 @@ KEYS *
# 1) "864b004c-6305-42e3-ac46-adfa8eb6f752"

# Step 6.5: Check the value of the key.
-HGETALL 864b004c-6305-42e3-ac46-adfa8eb6f752
+SCAN 0 MATCH 864b004c-6305-42e3-ac46-adfa8eb6f752*
cc @kevin85421 I think this is a breaking change for KubeRay users. Would this be okay?
Do we expect users to directly operate ray redis?
I think this change is fine. My only concern is about cleaning up Redis. @rynewang will provide me with a Docker image, and I will manually test it.
KubeRay cleans up Redis by calling cleanup_redis_storage, which I reimplemented, so it should be OK. I'm giving @kevin85421 a Docker image to test.
It's not easy to get a Docker image. Let's merge it and then I will use the nightly image to test.
Also cc @edoakes: are we using the function to clean up Redis, or the raw command?
Yeah, I checked with @edoakes before; we also use cleanup_redis_storage.
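To make that concrete: under the new layout there is no longer a single HASH to DEL, so any raw-command cleanup has to SCAN for every per-table key under the cluster's namespace prefix and delete each one. Below is a minimal hiredis sketch of that shape, reusing the illustrative namespace UUID from the KubeRay docs; the connection details and key format are assumptions, and cleanup_redis_storage remains the supported path.

```cpp
#include <hiredis/hiredis.h>

#include <cstdio>
#include <string>

int main() {
  redisContext *ctx = redisConnect("127.0.0.1", 6379);
  if (ctx == nullptr || ctx->err) return 1;

  // Every per-table HASH for a cluster starts with the external storage
  // namespace (the value below is the example UUID from the KubeRay docs).
  const std::string pattern = "864b004c-6305-42e3-ac46-adfa8eb6f752*";

  long long cursor = 0;
  do {
    // SCAN is cursor-based and non-blocking, unlike KEYS.
    auto *reply = static_cast<redisReply *>(redisCommand(
        ctx, "SCAN %lld MATCH %s COUNT 100", cursor, pattern.c_str()));
    if (reply == nullptr) break;
    if (reply->type != REDIS_REPLY_ARRAY || reply->elements != 2) {
      freeReplyObject(reply);
      break;
    }
    cursor = std::stoll(reply->element[0]->str);
    // Each matched key is now a whole per-table HASH; DEL it outright.
    for (size_t i = 0; i < reply->element[1]->elements; ++i) {
      const char *key = reply->element[1]->element[i]->str;
      printf("deleting %s\n", key);
      freeReplyObject(redisCommand(ctx, "DEL %s", key));
    }
    freeReplyObject(reply);
  } while (cursor != 0);

  redisFree(ctx);
  return 0;
}
```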
-      match_pattern,
-      "COUNT",
-      std::to_string(batch_count)};
+  std::vector<std::string> args = {"HSCAN", table_name_, std::to_string(cursor_.value())};
why does it not use GenRedisKey?
The ctor of RedisScanner accepts the table name, and all its callers use GenRedisKey, so we are fine.
Do you want a more typed setup, e.g. a struct representing {external_storage_namespace, table_name}?
Do you want a more typed setup, e.g. a struct representing {external_storage_namespace, table_name}?

Yeah, this would be great. I am also confused why this doesn't include external_storage_namespace (that's why I was asking why we don't use GenRedisKey).
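For reference, a minimal sketch of what such a typed setup could look like, assuming a key is exactly the pair {external_storage_namespace, table_name}. The PR later adds real RedisKey and RedisMatchPattern types; their actual definitions may well differ from this.

```cpp
#include <string>

// Hypothetical typed key: the {external_storage_namespace, table_name}
// pair, serialized in exactly one place so no caller can forget the
// namespace or get the concatenation order wrong.
struct RedisKey {
  std::string external_storage_namespace;
  std::string table_name;

  // Render the concrete key string sent to Redis. The "@" separator is
  // an assumption for illustration, not the real GenRedisKey output.
  std::string ToString() const {
    return external_storage_namespace + "@" + table_name;
  }
};

// Hypothetical typed MATCH pattern, so scans can only be built from an
// explicit prefix. A real version would also escape the Redis glob
// metacharacters (* ? [ ]) appearing inside the prefix.
struct RedisMatchPattern {
  static RedisMatchPattern Prefix(const std::string &prefix) {
    return RedisMatchPattern{prefix + "*"};
  }
  std::string escaped;
};
```

The point of the struct-plus-factory shape is that a raw string can no longer be passed where a key or pattern is expected.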
Signed-off-by: Ruiyang Wang <[email protected]>
@rkooo567 PTAL again
Signed-off-by: Ruiyang Wang <[email protected]>
Looks good. Last couple of comments (please make sure to resolve)!
Regarding Redis, can you sync with @kevin85421 and make sure we recommend using cleanup_redis_storage, not the raw Redis command? If we have any recommendation that uses a raw Redis command, I think we should have a loud release note.
Signed-off-by: Ruiyang Wang <[email protected]>
Rewrote most of the file to use typed RedisCommand, RedisKey and RedisMatchPattern. PTAL! @rkooo567
Signed-off-by: Ruiyang Wang <[email protected]>
looks very good!
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
I think the test failure is real (i.e., it cannot be fixed by increasing the timeout).
Signed-off-by: Ruiyang Wang <[email protected]>
It IS real. It happens when you have a large BatchDelete, where a bad move on the callback function means the callback is never called. Fixed, and reverted the test relaxations.
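For context, the bug class being described here, reduced to a standalone sketch (illustrative only, not the actual Ray code): a batched operation that std::moves its completion callback on every chunk, leaving a moved-from std::function for all chunks after the first.

```cpp
#include <algorithm>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Sketch of a batched delete that splits work into fixed-size chunks.
// BUG: std::move runs on every iteration, so after the first chunk
// `callback` is a moved-from (empty) std::function and completion is
// silently dropped for any batch larger than one chunk.
void BatchDelete(const std::vector<std::string> &keys,
                 std::function<void(size_t)> callback) {
  constexpr size_t kChunkSize = 2;
  size_t deleted = 0;
  for (size_t i = 0; i < keys.size(); i += kChunkSize) {
    const size_t end = std::min(i + kChunkSize, keys.size());
    deleted += end - i;
    const bool is_last_chunk = (end == keys.size());
    // Only the last chunk should take ownership (or the callback should
    // be shared); moving here empties `callback` after chunk one.
    auto cb = std::move(callback);
    if (is_last_chunk && cb) {
      cb(deleted);
    }
  }
}

int main() {
  // Five keys with a chunk size of two means three chunks: the callback
  // was moved away in chunk one, so nothing is ever printed.
  BatchDelete({"a", "b", "c", "d", "e"},
              [](size_t n) { std::cout << "deleted " << n << "\n"; });
  return 0;
}
```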
@@ -136,12 +136,22 @@ export REDIS_POD=$(kubectl get pods --selector=app=redis -o custom-columns=POD:m
kubectl exec -it $REDIS_POD -- redis-cli -a "5241590000000000"

# Step 6.4: Check the keys in Redis.
+# Note: the schema changed in Ray 2.36.0. Previously we used a single HASH table,
This needs to be updated to 2.38.0.
stamp
Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Hi, just spotted this - wondering if corresponding changes are being made to KubeRay to ensure that it is able to clean up all of these new tables in the GCS cleanup job that runs when a cluster is terminated? Alternatively, it would be nice if these tables had expirations set (and regularly refreshed) internally so that KubeRay didn't need to clean them up. See ray-project/kuberay#959.
Yes, KubeRay calls a Ray function to clean up the tables, and that function is updated accordingly. Table expiration is something separate; yes, it's a good thing to have, though not in this PR.
Awesome, thank you!
Ray GCS Fault Tolerance relies on an external Redis. We write all data for a Ray cluster into a single Redis HASH. Within the HASH, we use a Redis field prefix to represent a Ray table. This is OK for single-field CRUD, but for "get all entries for a Ray table" we would have to do a full-table HSCAN and filter out other Ray tables by key prefix. This slows things down, and the calls to get_all_node_info and get_all_actor_info are in fact fairly common.
This PR updates the Redis usage by creating multiple Redis HASHes, one for each Ray table. Single-field CRUD operations are largely the same, but "get all" operations now become a simple HSCAN on the target HASH only. This saves a lot of compute.
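To illustrate the difference in the read path, here is a minimal hiredis sketch of "get all entries for one Ray table" under the new layout: HSCAN walks only that table's HASH, with no client-side prefix filtering. The HASH name below is an assumption (the real key format is produced by GenRedisKey), as are the connection details.

```cpp
#include <hiredis/hiredis.h>

#include <cstdio>
#include <string>

// New layout: one HASH per Ray table, so reading a whole table is a
// cursor-driven HSCAN over that HASH alone. Under the old layout this
// required HSCANning the single shared HASH and discarding every field
// whose prefix named a different table.
static void PrintAllEntries(redisContext *ctx, const std::string &table_hash) {
  long long cursor = 0;
  do {
    auto *reply = static_cast<redisReply *>(redisCommand(
        ctx, "HSCAN %s %lld COUNT 100", table_hash.c_str(), cursor));
    if (reply == nullptr) break;
    if (reply->type != REDIS_REPLY_ARRAY || reply->elements != 2) {
      freeReplyObject(reply);
      break;
    }
    cursor = std::stoll(reply->element[0]->str);
    // HSCAN replies alternate field, value, field, value, ...
    for (size_t i = 0; i + 1 < reply->element[1]->elements; i += 2) {
      printf("%s -> %zu bytes\n",
             reply->element[1]->element[i]->str,
             (size_t)reply->element[1]->element[i + 1]->len);
    }
    freeReplyObject(reply);
  } while (cursor != 0);
}

int main() {
  redisContext *ctx = redisConnect("127.0.0.1", 6379);
  if (ctx == nullptr || ctx->err) return 1;
  // Hypothetical per-table HASH name: namespace plus table name.
  PrintAllEntries(ctx, "864b004c-6305-42e3-ac46-adfa8eb6f752@ACTOR");
  redisFree(ctx);
  return 0;
}
```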
Closes #46706.