fix the state of message for evaluation #470

xieyxclack · 2022-12-09T04:23:07Z

Fix #462. I provide an example below to show the modified output order.

Preliminary

The provided example is an excerpt of the output logs from running fedavg_convnet2_on_femnist.yaml in simulation mode.
I set eval.freq=10, so the server performs evaluation after the 9th training round (rounds 0-9).
In simulation mode, the messages are handled one by one without interruption.
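As a small illustration of the eval.freq arithmetic above, the sketch below lists the 0-indexed rounds after which evaluation would be triggered. The helper `rounds_with_evaluation` is hypothetical and is not part of the codebase; it assumes evaluation runs once every `eval_freq` completed rounds and ignores any special handling of the final round.

```python
def rounds_with_evaluation(total_rounds, eval_freq):
    """Return the 0-indexed rounds after which evaluation is triggered.

    Assumption (not taken from the codebase): evaluation runs once every
    `eval_freq` completed training rounds.
    """
    return [r for r in range(total_rounds) if (r + 1) % eval_freq == 0]


# With eval.freq=10, the first evaluation follows round 9 (rounds 0-9).
print(rounds_with_evaluation(total_rounds=30, eval_freq=10))  # [9, 19, 29]
```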

Observations

From the figure we can observe that:

In the 1st part, the logs show that the clients perform local training and print the results. Note that these logs are generated from the perspective of the clients, at the end of their local training processes.
In the 2nd part, the logs show that the server starts the evaluation at the end of the 9th round. In the implementation, the server broadcasts evaluate messages to all the clients. However, these evaluate messages are not handled at this moment, because the server's current handling operation has not finished yet and cannot be interrupted. These logs are generated from the perspective of the server.
In the 3rd part, immediately after broadcasting the evaluate messages, the server broadcasts the training request messages to start a new training round (i.e., the 10th round). The server then finishes its handling operation. At this point, some of the clients have received two messages from the server: evaluate (sent at the end of the 9th round) and the training request for the 10th round.
Each client handles the evaluate and/or training request messages one by one. When handling the evaluate message, the client does not print any results locally; it sends the evaluation metrics to the server. When handling the training request, the client prints the training results, as shown in the 4th part of the provided example, and sends the updated model to the server after training. Thus, although only the training results appear in the logs of the 4th part, the clients also handle the evaluate message there (and return the metrics to the server). Note that these logs are generated from the perspective of the clients.
In the 5th part, after receiving the evaluation metrics (for the 9th round), the server prints the evaluation results.
In the 6th part, after receiving the updated models (for the 10th round), the server performs federated aggregation and starts a new training round (i.e., the 11th round).

Summary

In summary, although the logs show the evaluation results of the 9th round (printed by the server) after the training results of the 10th round (printed by the clients), the messages are handled in the correct order, which matches our expectation:

Clients train locally in the 9th round (part 1)
-> Server starts evaluation at the end of the 9th round (part 2)
-> Server starts training for the 10th round (part 3)
-> Clients perform evaluation of the 9th round, and clients perform local training for the 10th round (part 4)
-> Server merges and prints the evaluation results of the 9th round (part 5)
-> Server starts training for the 11th round (part 6)
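
The ordering above can be reproduced with a toy simulation. This is a minimal sketch under simplifying assumptions, not the project's actual implementation: it uses a single FIFO queue for all messages, and the function name `simulate`, the message type names, and the log strings are all made up for illustration.

```python
from collections import deque


def simulate(num_clients=2):
    """Replay parts 2-6 with one FIFO queue (all names are illustrative)."""
    log = []
    # Each message is (receiver, msg_type, client_id, round).
    queue = deque()

    # Parts 2 and 3: a single, uninterruptible server handler enqueues
    # both broadcasts before any client gets to handle a message.
    log.append("server: start evaluation at end of round 9")
    for cid in range(1, num_clients + 1):
        queue.append(("client", "evaluate", cid, 9))
    log.append("server: start training round 10")
    for cid in range(1, num_clients + 1):
        queue.append(("client", "train", cid, 10))

    metrics_received = 0
    updates_received = 0
    while queue:
        receiver, msg_type, cid, rnd = queue.popleft()
        if receiver == "client" and msg_type == "evaluate":
            # Part 4: clients evaluate silently and reply with metrics.
            queue.append(("server", "metrics", cid, rnd))
        elif receiver == "client" and msg_type == "train":
            # Part 4: clients print training results, then send the model.
            log.append(f"client {cid}: training results of round {rnd}")
            queue.append(("server", "model_update", cid, rnd))
        elif msg_type == "metrics":
            # Part 5: the server prints once all metrics have arrived.
            metrics_received += 1
            if metrics_received == num_clients:
                log.append(f"server: evaluation results of round {rnd}")
        elif msg_type == "model_update":
            # Part 6: the server aggregates once all updates have arrived.
            updates_received += 1
            if updates_received == num_clients:
                log.append(f"server: aggregate, start training round {rnd + 1}")
    return log


for line in simulate():
    print(line)
```

In this sketch the evaluation results of round 9 are logged after the clients' training results of round 10, even though the evaluate messages were sent and handled first, because the clients' metrics replies are queued behind the training requests.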

xieyxclack · 2022-12-09T06:50:44Z

@joneswong, please check whether the explanations above are clear enough to resolve the confusion.

joneswong

approved.

fix the state of message for evaluation

3b02e3e

xieyxclack requested a review from joneswong December 9, 2022 04:23

joneswong approved these changes Dec 9, 2022

joneswong added the enhancement New feature or request label Dec 9, 2022

joneswong merged commit caa0611 into alibaba:master Dec 9, 2022

xieyxclack deleted the fix_output_order branch April 3, 2023 14:38
