In-memory inputs for column split and vertical federated learning #9619
Comments
Helps with #9472 |
Let's focus on the federated learning use case and remove the data splitting in XGB entirely |
Sounds good. We'll standardize on the second approach, i.e. each worker only provides its own set of columns that are 0-indexed, and the global DMatrix is a union of all worker columns, re-indexed based on worker ranks. |
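A minimal sketch of the re-indexing described above, assuming each worker reports how many columns it holds and that global indices are assigned in rank order. The helper names `global_column_offsets` and `to_global_index` are hypothetical, used only for illustration; they are not XGBoost APIs.

```python
from itertools import accumulate


def global_column_offsets(cols_per_worker):
    """Global index of each worker's local column 0, in rank order.

    Hypothetical helper: cols_per_worker[r] is the column count on rank r.
    """
    # Exclusive prefix sum: rank r starts after all columns of ranks < r.
    return [0] + list(accumulate(cols_per_worker))[:-1]


def to_global_index(rank, local_col, cols_per_worker):
    """Map a worker's 0-indexed local column to its global column index."""
    if not 0 <= local_col < cols_per_worker[rank]:
        raise IndexError("local column out of range for this worker")
    return global_column_offsets(cols_per_worker)[rank] + local_col


# Three workers holding 3, 2, and 4 columns respectively.
cols = [3, 2, 4]
offsets = global_column_offsets(cols)
second_col_on_rank_2 = to_global_index(2, 1, cols)
```

With this layout, worker 0's columns keep their local indices, and every later worker's columns are shifted by the total column count of the lower ranks.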
Excellent!
|
@trivialfis another question is about labels, weights, and other metadata. When doing column split distributed training (non-federated), we assume this data is available on every worker. When loading data, do we also assume this information is loaded into every worker? If not, we'd have to broadcast it from, say, worker 0. |
I think this is a fair assumption. |
We've recently added support for column-wise data split (feature parallelism) and vertical federated learning (#8424), but the user interface in Python is limited to text inputs and numpy arrays (#9365) only. We'd like to support other in-memory formats such as scipy sparse matrices, pandas DataFrames, cudf, and cupy.
One question is the meaning of passing in data_split_mode=COL. There are potentially two interpretations:
1. data_split_mode=COL would load the whole DMatrix, then split it by column according to the size of the cluster. The columns are split evenly into world_size slices, with each worker's rank determining which slice it gets. This is the approach currently used by the text inputs for feature parallel distributed training, but not for vertical federated learning.
2. Each worker loads only its own columns, and the global DMatrix is a union of all the columns from all the workers, with column indices re-indexed starting from worker 0. This is the approach currently used for vertical federated learning.
Now that we want to support more in-memory inputs, it probably makes more sense to standardize on the second approach, since it seems wasteful to construct a DMatrix in memory and then slice it by column.
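For comparison, the first interpretation (splitting a fully loaded matrix evenly by rank) can be sketched as below. This is only an illustration: `column_slice` is a hypothetical helper, and giving the leftover columns to the lowest ranks is an assumption; the issue does not say how XGBoost handles a column count that is not divisible by world_size.

```python
def column_slice(num_cols, world_size, rank):
    """Return the half-open range (start, end) of columns for `rank`.

    Hypothetical helper. Assumption: when num_cols is not divisible by
    world_size, the first `num_cols % world_size` ranks get one extra column.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    base, extra = divmod(num_cols, world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


# 10 columns split across 3 workers.
slices = [column_slice(10, 3, r) for r in range(3)]
```

Every worker has to materialize all `num_cols` columns before keeping only its slice, which is the overhead the second approach avoids.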