[profiling]add show_straggler argument to log_summary() #3579

Merged: 17 commits, Jun 30, 2023

@delock delock commented May 19, 2023

In tensor-parallel inference, the straggler effect is one of the factors that limits scaling efficiency. Between any two consecutive allreduce calls, one worker may run slower than the others; that worker is the straggler for that interval. Different intervals (each between two allreduces) may have different stragglers.

When profiling, the straggler effect usually shows up as part of the communication latency. Faster workers have to wait for the straggler during communication, which inflates the measured communication latency. Part of that latency is time spent waiting for the straggler to start; the rest is the time the communication itself takes.

It would be helpful to quantify the straggler effect in a workload. This PR extends communication logging by adding an optional argument to log_summary:

log_summary(show_straggler=True)

When show_straggler is set to True, communication logging prints an additional table that breaks latency down into communication latency and straggler effect. Communication latency is defined as the time from the moment all ranks have started the communication to the moment the communication finishes. Straggler effect is defined as the time from the moment the current rank started the communication to the moment all ranks have started it.
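The two definitions can be illustrated with a minimal sketch. This is not the code in this PR: the function name `split_latency`, the timestamps, and the assumption that every rank sees the same finish time are all hypothetical, chosen only to show how one collective call splits into the two parts.

```python
def split_latency(start_times, finish_time):
    """Split one collective call into communication latency and per-rank straggler time.

    start_times: time (ms) at which each rank entered the collective (hypothetical data)
    finish_time: time (ms) at which the collective finished, assumed equal on all ranks
    """
    # The collective can only make progress once the slowest rank has arrived.
    all_started = max(start_times)
    # Communication latency: from "all ranks started" to "finished".
    comm_latency = finish_time - all_started
    # Straggler effect per rank: how long that rank waited for the last one to arrive.
    straggler = [all_started - t for t in start_times]
    return comm_latency, straggler


starts = [10.0, 10.5, 12.0, 10.25]  # rank 2 is the straggler in this call
comm, waits = split_latency(starts, finish_time=13.0)
print(comm)   # communication latency shared by all ranks
print(waits)  # waiting time seen by each rank; 0.0 for the straggler itself
```

In this sketch the straggler rank itself reports zero straggler time, while the fastest rank reports the largest value, because it waited the longest.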

Below is an example:

Comm. Op            Message Size        Count               Total Latency(ms)   Avg Latency(ms)     tput_avg (Gbps)     busbw_avg (Gbps)
broadcast
                    4.94 KB             1                   0.22                0.22                0.18                0.18
all_reduce
                    32.0 KB             3472                1281.64             0.34                1.71                0.86
                    1.0 MB              112                 178.83              1.51                11.17               5.58
log_summary_barrier
                    0B                  1                   1.33                1.33                0.00                0.00
_______________________________
Breakdown with straggler effect
-------------------------------
Comm. Op            Message Size        Count               Total comm lat(ms)  Total straggler(ms) Avg comm lat(ms)    Avg straggler(ms)
broadcast
                    4.94 KB             1                   0.22                0.00                0.22                0.00
all_reduce
                    4.0 B               1                   0.10                0.00                0.10                0.00
                    32.0 KB             3472                966.38              315.26              0.24                0.08
                    1.0 MB              112                 178.52              0.32                1.51                0.00
log_summary_barrier
                    0B                  1                   0.12                1.21                0.12                1.21

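The column 'Total straggler(ms)' shows the straggler time. In this case, for the 32 KB allreduce, the real communication latency is 966.38 ms and the straggler effect is 315.26 ms; together they account for the 1281.64 ms total latency reported in the first table. The straggler effect is therefore about 25% of the measured latency, which is significant for this workload.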

In this PR, straggler effect is defined as:

sum(latency)-sum(allreduce(latency, OP=MIN))
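A minimal sketch of this formula, assuming per-call latencies have been recorded on every rank. The cross-rank `allreduce(latency, OP=MIN)` is replaced here by an in-process element-wise minimum, so the function name `straggler_effect` and the sample latencies are hypothetical and this is not the PR's implementation.

```python
def straggler_effect(latencies_per_rank, rank):
    """Apply sum(latency) - sum(allreduce(latency, OP=MIN)) for one rank.

    latencies_per_rank[r][i] is the latency (ms) rank r measured for the i-th call.
    """
    # Stand-in for allreduce(latency, OP=MIN): element-wise minimum across ranks.
    # The rank that arrived last measures the shortest latency, which is the
    # communication time without any waiting.
    min_latency = [min(per_call) for per_call in zip(*latencies_per_rank)]
    total_comm = sum(min_latency)
    # Whatever this rank measured beyond the minimum is time spent waiting.
    total_straggler = sum(latencies_per_rank[rank]) - total_comm
    return total_comm, total_straggler


latencies = [
    [3.0, 1.0, 2.5],  # rank 0
    [1.0, 1.5, 2.0],  # rank 1
]
total_comm, wait_rank0 = straggler_effect(latencies, rank=0)
_, wait_rank1 = straggler_effect(latencies, rank=1)
print(total_comm, wait_rank0, wait_rank1)
```

Taking the minimum per call rather than per rank matters because the straggler can change from one call to the next, as it does between the first and second calls in the sample data.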

@delock delock requested a review from mrwyattii as a code owner May 21, 2023 09:02
@delock delock changed the title add show_straggler argument to log_summary() [profiling]add show_straggler argument to log_summary() May 31, 2023
@tjruwase tjruwase merged commit 5d1124f into deepspeedai:master Jun 30, 2023
3 participants