Make objectives work with vertical distributed and federated learning #9002
Conversation
I'm not quite convinced that only one of the participants can have access to labels. We will have to maintain lots of these conditions in the future.
src/objective/quantile_obj.cu
Outdated
@@ -35,7 +35,7 @@ class QuantileRegression : public ObjFunction {
   bst_target_t Targets(MetaInfo const& info) const override {
     auto const& alpha = param_.quantile_alpha.Get();
     CHECK_EQ(alpha.size(), alpha_.Size()) << "The objective is not yet configured.";
-    CHECK_EQ(info.labels.Shape(1), 1) << "Multi-target is not yet supported by the quantile loss.";
+    CHECK_LE(info.labels.Shape(1), 1) << "Multi-target is not yet supported by the quantile loss.";
In which case can this be 0? It would be great if we could make sure it's greater than or equal to 1 and only allow the first dimension (rows) to be zero.
Fixed.
We have to make some assumptions for vertical federated learning, and assuming labels are only available on worker 0 is the least restrictive. We can look at the two scenarios: if every participant has labels, then worker 0 trivially has them; if only one participant has labels, that participant can be assigned rank 0. So assuming labels on worker 0 would support a broader set of use cases.
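To make the assumption concrete, here is a minimal sketch of how an objective could guard label access under a vertical split. The names ValidateLabels and LabelCols, and the stubbed MetaInfo and collective::GetRank, are illustrative stand-ins, not the PR's actual code:

```cpp
#include <iostream>
#include <stdexcept>

// Stubbed stand-ins for XGBoost's collective and MetaInfo APIs,
// reduced to what this single-process demo needs.
namespace collective {
int GetRank() { return 0; }  // this worker's rank in the communicator
}  // namespace collective

struct MetaInfo {
  bool is_column_split{true};  // vertical split: features are partitioned
  long label_cols{1};          // stands in for info.labels.Shape(1)
  bool IsColumnSplit() const { return is_column_split; }
  long LabelCols() const { return label_cols; }
};

// Under a vertical split only rank 0 is assumed to hold labels, so an
// empty label matrix on any other rank is valid rather than an error.
void ValidateLabels(MetaInfo const& info) {
  bool const expect_labels =
      !info.IsColumnSplit() || collective::GetRank() == 0;
  if (expect_labels && info.LabelCols() == 0) {
    throw std::runtime_error("worker 0 is expected to hold the labels");
  }
}

int main() {
  MetaInfo info;
  ValidateLabels(info);  // rank 0 with one label column: passes
  std::cout << "label check passed on rank " << collective::GetRank() << "\n";
  return 0;
}
```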
Initial review. I'm a bit concerned about the generality of this dispatching. Is there a way to hide it in the collective instead of exposing it to the main algorithms?
Looking for ideas, @RAMitchell @hcho3.
@@ -167,8 +170,10 @@ class QuantileRegression : public ObjFunction {
     common::Mean(ctx_, *base_score, &temp);
     double meanq = temp(0) * sw;

-    collective::Allreduce<collective::Operation::kSum>(&meanq, 1);
-    collective::Allreduce<collective::Operation::kSum>(&sw, 1);
+    if (info.IsRowSplit()) {
Maybe we should make a wrapper function for Allreduce so that newcomers to the code base can understand these conditions? I think we need some separation between communication logic and machine learning algorithm implementation. I'm open to any ideas; at the moment it's quite difficult for a "machine learning person" to add a new algorithm and make sure it works correctly under different distributed system requirements. I can see that these conditions are necessary for metrics as well, since they require labels. Is there a way to make this less intrusive by abstracting the logic under collective instead of exposing it to the machine learning part?
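For discussion's sake, a minimal sketch of what such a wrapper might look like. The name GlobalSum and its signature are hypothetical; the sketch only reuses the Allreduce and info.IsRowSplit() calls that already appear in the diff above, and it assumes the surrounding XGBoost headers:

```cpp
// Hypothetical helper, not XGBoost's actual API: centralizes the
// "skip the reduction under a column-wise split" condition so each
// objective does not have to repeat it.
template <typename T>
void GlobalSum(MetaInfo const& info, T* values, std::size_t n) {
  if (info.IsRowSplit()) {
    // Row-wise split: each worker holds a subset of rows, so the
    // partial sums must be combined across workers.
    collective::Allreduce<collective::Operation::kSum>(values, n);
  }
  // Column-wise split: every worker already sees all rows, so the
  // local value is already the global value and no communication
  // is needed.
}
```

The two call sites in the diff above would then collapse to GlobalSum(info, &meanq, 1) and GlobalSum(info, &sw, 1), keeping the distributed-mode condition out of the objective implementation.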
The problem is that Allreduce is a low-level primitive, but column-wise split and vertical federated learning are higher-level semantic changes, so they don't always map to each other. At the level of the communicator, I don't think we can determine whether we should skip a particular Allreduce call, even if we know what distributed/federated environment we are in.
Maybe we can come up with a clever mechanism to make this cleaner. Perhaps as a follow-up?
src/objective/adaptive.h
Outdated
                       leaf_values.size() * sizeof(bst_float), 0);
   auto i = 0;
   auto& tree = *p_tree;
   for (auto nid = 0; nid < tree.NumNodes(); nid++) {
The mapping between GetNodes and nid is not clear to me. There's an update-leaf-value function called by both the host and the device after calculating the quantiles, and it has access to the leaf values; consider moving the logic there.
Also, please consider abstracting this logic away from the main algorithm. I find it difficult to consider all possible combinations of conditions.
Done.
> Maybe we can come up with a clever mechanism to make this cleaner. Perhaps as a follow-up?
Yep, looking forward to your solution.
Went through all the objectives and made sure they work with both column-split distributed training and vertical federated learning.