fix the state of message for evaluation #470

Merged (1 commit) on Dec 9, 2022

@xieyxclack xieyxclack commented Dec 9, 2022

Fixes #462. The example below shows the modified output order.

[image: output logs from the example run]

Preliminary

  • The provided example is part of the output logs from running fedavg_convnet2_on_femnist.yaml (simulation mode).
  • I set eval.freq=10, so the server performs evaluation after the 9th training round (rounds are numbered 0-9).
  • In simulation mode, messages are handled one at a time without interruption.
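The relationship between eval.freq and the evaluation round can be sketched as follows. This is a minimal illustration and not the project's actual code: the helper `should_evaluate` and its condition are assumptions, chosen only to reproduce the behaviour described above (with eval.freq=10 and rounds counted from 0, evaluation happens after round 9).

```python
def should_evaluate(round_idx: int, eval_freq: int) -> bool:
    """Hypothetical check: evaluate once every `eval_freq` rounds.

    Rounds are 0-indexed, so with eval_freq=10 the first evaluation
    happens at the end of round 9.
    """
    return eval_freq > 0 and (round_idx + 1) % eval_freq == 0


# Rounds (out of the first 30) after which the server would evaluate.
eval_rounds = [r for r in range(30) if should_evaluate(r, 10)]
print(eval_rounds)  # [9, 19, 29]
```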

Observations

From the figure we can observe that:

  • In the 1st part, the logs show clients performing local training and printing the results. These logs are generated from the clients' perspective, at the end of each local training process.
  • In the 2nd part, the logs show the server starting evaluation at the end of the 9th round. In the implementation, the server broadcasts evaluate messages to all clients. However, these messages are not handled at this moment, because the server's current handling operation has not finished and cannot be interrupted. These logs are generated from the server's perspective.
  • In the 3rd part, immediately after broadcasting the evaluate messages, the server broadcasts the training request messages that start a new training round (i.e., the 10th round). The server then finishes its handling operation, by which point some clients have received two messages from the server: evaluate (at the end of the 9th round) and the training request for the 10th round.
  • Each client handles the evaluate and/or training request messages one by one. When handling the evaluate message, a client does not print any results locally; it sends the evaluation metrics to the server. When handling the training request, the client prints the training results, as shown in the 4th part of the example, and sends the updated model to the server after training. Thus, although only the training logs are visible in the 4th part, the clients also handle the evaluate message here (and return the metrics to the server). These logs are generated from the clients' perspective.
  • In the 5th part, after receiving the evaluation metrics (for the 9th round), the server prints the evaluation results.
  • In the 6th part, after receiving the updated models (for the 10th round), the server performs federated aggregation and starts a new training round (i.e., the 11th round).
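The observations above can be reproduced with a small toy simulation. This is only a sketch under simplifying assumptions, not the project's implementation: it uses a single FIFO queue shared by all participants, hypothetical message names ("evaluate", "train", "metrics", "model"), and id 0 for the server. It shows why the clients' training logs for the 10th round appear before the server's evaluation results for the 9th round.

```python
from collections import deque
from typing import List


def simulate(num_clients: int = 2) -> List[str]:
    """Toy model of simulation mode: messages are handled strictly one by one."""
    queue = deque()  # single shared FIFO queue (an assumption of this sketch)
    logs = []
    metrics_received = 0
    models_received = 0

    # The server finishes round 9. Both broadcasts happen inside ONE
    # uninterrupted handler, so no client reacts until it returns.
    logs.append("server: start evaluation at end of round 9")
    for cid in range(1, num_clients + 1):
        queue.append(("evaluate", cid, 9))
    logs.append("server: start training round 10")
    for cid in range(1, num_clients + 1):
        queue.append(("train", cid, 10))

    while queue:
        msg_type, target, rnd = queue.popleft()
        if msg_type == "evaluate":
            # Clients print nothing here; metrics go back to the server (id 0).
            queue.append(("metrics", 0, rnd))
        elif msg_type == "train":
            logs.append(f"client {target}: training results of round {rnd}")
            queue.append(("model", 0, rnd))
        elif msg_type == "metrics":
            metrics_received += 1
            if metrics_received == num_clients:
                logs.append(f"server: evaluation results of round {rnd}")
        elif msg_type == "model":
            models_received += 1
            if models_received == num_clients:
                logs.append(
                    f"server: aggregate round {rnd}, start training round {rnd + 1}"
                )
    return logs


logs = simulate()
for line in logs:
    print(line)
```

In this sketch the evaluate messages are still handled before the training requests; only the printed output is reordered, because the clients log during training while the server logs evaluation results only after all metrics have arrived.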

Summary

In summary, although the logs show the server's evaluation results for the 9th round printed after the clients' training results for the 10th round, the messages are handled in the correct order, which matches our expectation:

Clients locally train at the 9th round (part 1)
-> Server starts evaluation at the end of the 9th round (part 2)
-> Server starts training for the 10th round (part 3)
-> Clients perform evaluation of the 9th round, and clients perform local training for the 10th round (part 4)
-> Server merges and prints the evaluation results of the 9th round (part 5)
-> Server starts training for the 11th round (part 6)

@xieyxclack xieyxclack requested a review from joneswong December 9, 2022 04:23
@xieyxclack
Collaborator Author

@joneswong, please check whether the above explanation is clear enough to resolve the confusion.

Collaborator

@joneswong joneswong left a comment

approved.

@joneswong joneswong added the enhancement New feature or request label Dec 9, 2022
@joneswong joneswong merged commit caa0611 into alibaba:master Dec 9, 2022
@xieyxclack xieyxclack deleted the fix_output_order branch April 3, 2023 14:38
Successfully merging this pull request may close these issues.

Inappropriate output order