add dataset 'adult' for vertical_fl #423
Conversation
Looks good to me. Please see the inline comments. Thanks!
@@ -28,7 +28,7 @@
     'subreddit', 'synthetic', 'ciao', 'epinions', '.*?vertical_fl_data.*?',
     '.*?movielens.*?', '.*?cikmcup.*?', 'graph_multi_domain.*?', 'cora',
     'citeseer', 'pubmed', 'dblp_conf', 'dblp_org', 'csbm.*?', 'fb15k-237',
-    'wn18'
+    'wn18', 'adult_fl'
No need for '_fl'
Since this dataset will also be used for the xgb model, I wonder if it will cause a conflict after merging the xgb PR.
IMO, we can keep one adult dataset but use different preprocessors/splitters according to the methods.
Yep, I'll add some options in the .yaml file to decide whether to preprocess the data or not.
logger = logging.getLogger(__name__)


class Adult:
Please provide a docstring here, thx!
We should add the source of the Adult dataset (e.g., link, reference) here.
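For example, a docstring along these lines could cover both points (the UCI link and the KDD'96 reference are the standard sources for Adult; the listed arguments are only illustrative, not the PR's actual signature):

class Adult:
    """Adult (Census Income) dataset for vertical federated learning.

    Source: UCI Machine Learning Repository,
    https://archive.ics.uci.edu/ml/datasets/adult
    Reference: Kohavi, R. "Scaling Up the Accuracy of Naive-Bayes
    Classifiers: a Decision-Tree Hybrid", KDD 1996.

    Args (illustrative):
        root (str): directory where the raw files are stored.
        args (dict): extra options, e.g. whether to normalize features.
    """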
@@ -101,7 +101,7 @@ def callback_funcs_for_encryped_gradient_v(self, message: Message):
         en_u, en_v_B = message.content
         input_x = self.sample_data(index=self.batch_index)
         en_v_A = en_u * input_x
-        en_v = np.concatenate([en_v_A, en_v_B], axis=-1)
+        en_v = np.concatenate([en_v_B, en_v_A], axis=-1)
@xieyxclack Please help to check this change, thx!
Client A used to hold the label; for convenience, I now let client B hold the label, so these two values should be swapped here.
It is a little tricky for me... maybe we should add a cfg item to specify which client owns the labels? (add a TODO later)
Since for the xgb model we are able to set the number of clients and partition the data, for convenience the data partition program always lets the last client hold the label. Maybe later, for the final version, we can add a cfg item as you suggest? (I will add a TODO)
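A minimal sketch of what such a cfg-driven order could look like (concat_partial_gradients and the label_owner option are hypothetical names, not part of this PR):

import numpy as np

def concat_partial_gradients(en_v_a, en_v_b, label_owner='B'):
    # Put the label owner's block where the protocol expects it; the order is
    # driven by a config value instead of being hard-coded in the worker.
    if label_owner == 'A':
        return np.concatenate([en_v_a, en_v_b], axis=-1)
    return np.concatenate([en_v_b, en_v_a], axis=-1)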
for file in self.raw_file:
    if self._check_existence(file):
        logger.info(file + " files already exist")
        # print("Files already downloaded and verified")
Remove this line.
test_set = combined_set[train_set.shape[0]:]
return train_set, test_set

# standardization
This comment might be wrong. I guess it should be # normalization.
Yes, I'll fix it. By the way, as I said above, this dataset may also be used for the xgb model, which does not need normalization or standardization, so I wonder whether it is suitable to process the data here.
You could use an arg to control how the data is processed, e.g.:
if cfg.data.args['is_norm']:
    self.norm()
    ...
else:
    ...
Good idea!
_range = np.max(data) - np.min(data)
return (data - np.min(data)) / _range

# normalization
See the comment above.
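For reference, a minimal sketch of the two operations being mixed up here (the min-max body matches the _range snippet above; the z-score version is what "standardization" usually means):

import numpy as np

def min_max_normalize(data):
    # rescales values into [0, 1] -- this is what '# normalization' should label
    _range = np.max(data) - np.min(data)
    return (data - np.min(data)) / _range

def standardize(data):
    # zero mean, unit variance (z-score) -- what '# standardization' usually refers to
    return (data - np.mean(data)) / np.std(data)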
    self._get_data()

base_folder = 'adult'
Move the class attribute behind __init__.
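A minimal sketch of the suggested placement (the constructor body is only a placeholder, not the PR's actual code):

class Adult:
    def __init__(self, root, args=None):
        # placeholder body, just to illustrate the layout
        self.root = root
        self.args = args or {}

    # class attribute placed after __init__, as suggested in the review
    base_folder = 'adult'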
LGTM, please refer to the inline comments, thx!
        return data, config
    else:
        raise ValueError('You must provide the data file')
    if generate:
Maybe use if ... elif ... else here.
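A rough sketch of that restructuring (the function name is a placeholder and the branch bodies are elided, since only the control flow is being discussed):

def load_adult_data(config, data_file=None, generate=False):
    if data_file is not None:
        data = ...  # load from the provided file, as in the PR
        return data, config
    elif generate:
        data = ...  # generate/partition the data
        return data, config
    else:
        raise ValueError('You must provide the data file')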
@@ -0,0 +1,39 @@
+use_gpu: False
+
Maybe remove the blank line to make it consistent with other yaml files.
LGTM
Add one dataset 'adult' and rewrite a general dataloader for vertical FL.