
add dataset 'adult' for vertical_fl #423

Merged: 13 commits into alibaba:master on Nov 10, 2022

Conversation

@qbc2016 (Collaborator) commented on Nov 8, 2022

Add one dataset, 'adult', and rewrite a general dataloader for vertical FL.

@xieyxclack added the enhancement (New feature or request) label on Nov 9, 2022
@rayrayraykk (Collaborator) left a comment


Looks good to me. Please see the inline comments. Thanks!

@@ -28,7 +28,7 @@
'subreddit', 'synthetic', 'ciao', 'epinions', '.*?vertical_fl_data.*?',
'.*?movielens.*?', '.*?cikmcup.*?', 'graph_multi_domain.*?', 'cora',
'citeseer', 'pubmed', 'dblp_conf', 'dblp_org', 'csbm.*?', 'fb15k-237',
- 'wn18'
+ 'wn18', 'adult_fl'
Collaborator

No need for '_fl'

Collaborator Author

Since this dataset will also be used for the XGB model, I wonder if it will cause a conflict after the XGB PR is merged.

Collaborator

IMO, we can keep one adult dataset but use different preprocessors/splitters depending on the method.

Collaborator Author

Yep, I'll add some options to the .yaml file to decide whether or not to preprocess the data.

logger = logging.getLogger(__name__)


class Adult:
Collaborator

Please provide a docstring here, thx!

Collaborator

We should add the source of Adult dataset (e.g., link, reference) here.
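
A minimal sketch of what such a docstring could look like; the argument names are illustrative and not taken from this PR, while the UCI page and Kohavi (1996) are the usual source and reference for Adult:

class Adult:
    """Adult (Census Income) dataset for vertical FL.

    Source: UCI Machine Learning Repository,
        https://archive.ics.uci.edu/ml/datasets/adult
    Reference: Kohavi, R., "Scaling Up the Accuracy of Naive-Bayes
        Classifiers: a Decision-Tree Hybrid", KDD 1996.

    Args:
        root (str): directory where the raw 'adult.data'/'adult.test'
            files are stored (illustrative name).
        args (dict): extra options, e.g. whether to normalize the
            features (illustrative name).
    """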

@@ -101,7 +101,7 @@ def callback_funcs_for_encryped_gradient_v(self, message: Message):
en_u, en_v_B = message.content
input_x = self.sample_data(index=self.batch_index)
en_v_A = en_u * input_x
- en_v = np.concatenate([en_v_A, en_v_B], axis=-1)
+ en_v = np.concatenate([en_v_B, en_v_A], axis=-1)
Collaborator

@xieyxclack Please help to check this change, thx!

Collaborator Author

Client A used to hold the label; for convenience, I now let client B hold the label, so these two values should be switched here.

Collaborator

It is a little tricky for me... maybe we should add a cfg item to specify which client owns the labels? (Add a TODO later.)

Collaborator Author

Since for the XGB model we can set the number of clients and partition the data, for convenience I always let the last client hold the label in the data partition program. Maybe later, for the final version, we can add a cfg item as you suggest? (I will add a TODO.)
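
A rough sketch of what such a cfg-driven switch could look like; the label_owner option and the helper name are hypothetical, and only the concatenation order with client B holding the labels matches this PR:

import numpy as np

def concat_encrypted_parts(en_v_A, en_v_B, label_owner='B'):
    # 'label_owner' is the hypothetical cfg item discussed above ('A' or 'B');
    # the label holder's block goes first, matching the order used in this PR
    if label_owner == 'B':
        return np.concatenate([en_v_B, en_v_A], axis=-1)
    elif label_owner == 'A':
        return np.concatenate([en_v_A, en_v_B], axis=-1)
    raise ValueError('unknown label owner: ' + str(label_owner))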

for file in self.raw_file:
if self._check_existence(file):
logger.info(file + " files already exist")
# print("Files already downloaded and verified")
Collaborator

Remove this line.

test_set = combined_set[train_set.shape[0]:]
return train_set, test_set

# standardization
Collaborator

This comment might be wrong. I guess it should be #normalization.

Collaborator Author

Yes, I'll fix it. By the way, as I said above, this dataset may also be used for the XGB model, which does not need normalization or standardization, so I wonder whether it is suitable to process the data here.

Collaborator

You could use an arg to control the processing of the data:

if cfg.data.args['is_norm']:
    self.norm()
    ...
else:
    ...

Collaborator Author

Good idea!

_range = np.max(data) - np.min(data)
return (data - np.min(data)) / _range

# normalization
Collaborator

See above (same naming issue).
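
For reference, a short sketch contrasting the two transforms the comments mix up, plus the arg-controlled switch suggested above; the is_norm / is_standardize keys are illustrative, following the reviewer's snippet rather than the actual code:

import numpy as np

def min_max_normalize(data):
    # normalization: rescale values into [0, 1]
    _range = np.max(data) - np.min(data)
    return (data - np.min(data)) / _range

def standardize(data):
    # standardization: zero mean, unit variance (z-score)
    return (data - np.mean(data)) / np.std(data)

def maybe_preprocess(data, args):
    # 'args' would come from cfg.data.args; the key names are illustrative
    if args.get('is_norm', False):
        return min_max_normalize(data)
    elif args.get('is_standardize', False):
        return standardize(data)
    return data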


self._get_data()

base_folder = 'adult'
Collaborator

Move the class attribute behind __init__.

@xieyxclack (Collaborator) left a comment


LGTM, please refer to the inline comments, thx!

return data, config
else:
raise ValueError('You must provide the data file')
if generate:
Collaborator

Maybe use if ... elif ... else here.
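
A rough sketch of the suggested control flow; the function and parameter names are placeholders, and only the if/elif/else shape follows the suggestion:

def load_or_generate(data_file, generate, load_fn, generate_fn):
    # load_fn / generate_fn stand in for the real loading and generation
    # code, which is not shown in this excerpt
    if data_file is not None:
        return load_fn(data_file)    # returns (data, config)
    elif generate:
        return generate_fn()         # returns (data, config)
    else:
        raise ValueError('You must provide the data file')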

@@ -0,0 +1,39 @@
use_gpu: False

Collaborator

Maybe remove the blank line to make it consistent with other yaml files.

@xieyxclack (Collaborator) left a comment


LGTM

@rayrayraykk merged commit 16b5c25 into alibaba:master on Nov 10, 2022
@qbc2016 deleted the dev_vertical_data branch on November 10, 2022 08:43