Model accuracy differs between distributed and single-machine training with identical parameters #273
Comments
I am currently running distributed training with a single PS and multiple workers. When every worker loads the full dataset, as follows:

train_table = os.path.join(data_folder, 'train_table')
test_table = os.path.join(data_folder, 'test_table')
valid_table = os.path.join(data_folder, 'valid_table')
node_table = os.path.join(data_folder, node_table)
edge_table = os.path.join(data_folder, edge_table)
g = gl.Graph() \
.node(node_table, node_type=node_type, decoder=gl.Decoder(labeled=True, attr_types=['float'] * args.input_dim, attr_delimiter=":")) \
.edge(edge_table, edge_type=(node_type, node_type, edge_type), decoder=gl.Decoder(weighted=True), directed=False) \
.node(train_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TRAIN) \
.node(valid_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.VAL) \
.node(test_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TEST)

With the loading code above, the results are close to those of single-machine training.

train_table = os.path.join(data_folder, 'train_table')
test_table = os.path.join(data_folder, 'test_table')
valid_table = os.path.join(data_folder, 'valid_table')
node_table = os.path.join(data_folder, node_table + str(task_index))
edge_table = os.path.join(data_folder, edge_table + str(task_index))
g = gl.Graph() \
.node(node_table, node_type=node_type, decoder=gl.Decoder(labeled=True, attr_types=['float'] * args.input_dim, attr_delimiter=":")) \
.edge(edge_table, edge_type=(node_type, node_type, edge_type), decoder=gl.Decoder(weighted=True), directed=False) \
.node(train_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TRAIN) \
.node(valid_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.VAL) \
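.node(test_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TEST)

With the loading code above, where only node_table and edge_table carry a task_index suffix, the discrepancy reported in this issue appears.

train_table = os.path.join(data_folder, 'train_table_' + str(task_index))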
test_table = os.path.join(data_folder, 'test_table_' + str(task_index))
valid_table = os.path.join(data_folder, 'valid_table_' + str(task_index))
node_table = os.path.join(data_folder, node_table + str(task_index))
edge_table = os.path.join(data_folder, edge_table + str(task_index))
g = gl.Graph() \
.node(node_table, node_type=node_type, decoder=gl.Decoder(labeled=True, attr_types=['float'] * args.input_dim, attr_delimiter=":")) \
.edge(edge_table, edge_type=(node_type, node_type, edge_type), decoder=gl.Decoder(weighted=True), directed=False) \
.node(train_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TRAIN) \
.node(valid_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.VAL) \
.node(test_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TEST)
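
The three loading variants in this comment differ only in which table paths receive a `task_index` suffix. As a minimal sketch for comparing them side by side, the helper below builds the paths for each variant. The function name `table_paths` and the `mode` labels are hypothetical and are not part of graphlearn; the suffix conventions are copied from the snippets above.

```python
import os


def table_paths(data_folder, node_table, edge_table, task_index, mode):
    """Return the table paths one worker would load.

    mode is a hypothetical label (not a graphlearn option):
      'full'  - every worker loads the same, unsharded files
      'graph' - only the node/edge tables carry the task_index suffix
      'all'   - node/edge and train/valid/test tables are all suffixed
    """
    if mode not in ('full', 'graph', 'all'):
        raise ValueError('unknown mode: %r' % (mode,))
    # Node/edge files are suffixed directly, e.g. node_table + '0'.
    graph_suffix = '' if mode == 'full' else str(task_index)
    # Train/valid/test files use an underscore, e.g. 'train_table_0'.
    split_suffix = '_' + str(task_index) if mode == 'all' else ''
    return {
        'node': os.path.join(data_folder, node_table + graph_suffix),
        'edge': os.path.join(data_folder, edge_table + graph_suffix),
        'train': os.path.join(data_folder, 'train_table' + split_suffix),
        'valid': os.path.join(data_folder, 'valid_table' + split_suffix),
        'test': os.path.join(data_folder, 'test_table' + split_suffix),
    }
```

Printing the result for each `mode` and each worker makes it easy to confirm which files a given worker actually reads before comparing loss curves.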
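Using the reference code in graphlearn v1.1.0, I moved the model-training part of train_supervised into the worker task of dist_train to test a distributed supervised learning job. The training dataset is ogbn-arxiv; for distributed training, the nodes and edges are each split evenly into two files, and the cluster is configured as 2 PS and 2 workers. All other code and model hyperparameters are unchanged. In distributed training the loss drops to about 1.6 and then starts oscillating, whereas on a single machine it drops to about 1. How can this be resolved?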
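
For reference, an even split of a node or edge file like the one described above could be produced along these lines. This is only a sketch under stated assumptions: it assumes each table is a text file whose first line is a header that must be repeated in every shard, and `split_table` is a hypothetical helper, not a graphlearn function. The issue does not say how the files were actually split.

```python
def split_table(lines, num_shards):
    """Split table lines round-robin into num_shards lists.

    Assumes lines[0] is a header row; it is copied into every shard.
    """
    if num_shards < 1:
        raise ValueError('num_shards must be >= 1')
    if not lines:
        return [[] for _ in range(num_shards)]
    header, rows = lines[0], lines[1:]
    shards = [[header] for _ in range(num_shards)]
    for i, row in enumerate(rows):
        # Row i goes to shard i mod num_shards, so sizes differ by at most 1.
        shards[i % num_shards].append(row)
    return shards
```

With `num_shards=2`, each data row lands in exactly one shard, so the two shards together cover the full table without duplicates.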