Add pooling mode to device bench #1194
Conversation
Summary: This is an attempt to optimize the nobag forward kernel for tables whose embedding dimension is smaller than or equal to 32. I was exploring this because some of our production models have embedding_dim = 32. The optimization yields a 10%~30% speedup for small embedding dims and could be applied to other kernels. However, it's worth noting that a 10% speedup on one kernel has barely any effect on overall training speed, so I'm fine either way on whether this optimization gets accepted; I'm mainly sharing the idea to save others from repeating the work :)

The main rationale is that the current implementation uses all 32 threads in a warp to load one embedding vector, which means that when the embedding dim is smaller than 128, some threads in the warp do nothing but wait. This PR splits the threads into groups, e.g. for embedding_dim=32 each group has 8 threads, and lets each group (rather than a full warp) process one embedding vector.

The performance gain from this trick is benchmarked with #1194:

```
python bench/split_table_batched_embeddings_benchmark.py device --pooling=none --iters=1000 --embedding-dim=$EMB_DIM
```

The figures are:

| embedding dim | 4 | 8 | 16 | 32 |
|---------------|---|---|----|----|
| before (us) | 136 | 138 | 142 | 154 |
| after (us) | 96 | 100 | 113 | 141 |

Thank you for your time on this PR, and it would be great if you could share your thoughts on this type of optimization :)

Pull Request resolved: #1197

Reviewed By: jasonjk-park

Differential Revision: D37739239

Pulled By: jianyuh

fbshipit-source-id: 23b35d74eed28f977793cb52e311ba0f824ac634
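To make the utilization argument concrete, here is a small illustrative Python calculation (not code from this PR); it assumes each thread loads four fp32 elements per vector load, which is what makes 32 threads cover exactly 128 elements as described above:

```python
# Illustrative arithmetic only, not FBGEMM code: estimates how many threads in a
# 32-thread warp do useful work when one embedding vector is loaded per warp,
# versus how many vectors a warp can cover when threads are split into groups.
WARP_SIZE = 32
ELEMENTS_PER_THREAD = 4  # assumption: one 4-float vectorized load per thread

def active_threads_per_vector(embedding_dim: int) -> int:
    """Threads that actually have elements to load for one embedding vector."""
    return min(WARP_SIZE, -(-embedding_dim // ELEMENTS_PER_THREAD))  # ceil division

for d in (4, 8, 16, 32, 128):
    active = active_threads_per_vector(d)
    groups_per_warp = WARP_SIZE // active
    print(
        f"D={d:>3}: {active:>2}/{WARP_SIZE} threads busy per vector "
        f"-> {groups_per_warp} vectors per warp if threads are grouped"
    )
```

For D=32 this gives groups of 8 threads and four vectors per warp, matching the grouping described in the summary, and it is consistent with the gains in the table shrinking as D approaches 128.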
@colin2328 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks! Just curious: in your use case, do you have both unpooled embedding (none) and pooled embedding (sum), or only unpooled embedding (none)?
```diff
@@ -191,6 +205,7 @@ def device( # noqa C901
         weights_precision=weights_precision,
         stochastic_rounding=stoc,
         output_dtype=output_dtype,
+        pooling_mode=pooling_mode,
```
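For context, this is roughly how a `--pooling` option can map onto fbgemm_gpu's pooling modes; a minimal sketch, assuming the `PoolingMode` enum from the `split_table_batched_embeddings_ops` module of that era, with `parse_pooling` being a hypothetical helper rather than the bench's actual code:

```python
# Hypothetical helper, not the PR's code: maps the bench's --pooling string to a
# PoolingMode value that can be passed through as pooling_mode=pooling_mode above.
from fbgemm_gpu.split_table_batched_embeddings_ops import PoolingMode

def parse_pooling(pooling: str) -> PoolingMode:
    modes = {
        "sum": PoolingMode.SUM,
        "mean": PoolingMode.MEAN,
        "none": PoolingMode.NONE,  # unpooled, i.e. sequence embeddings
    }
    return modes[pooling]
```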
Could you update the BW calculation formula in Line 254 and Line 279? Basically referring to FBGEMM/fbgemm_gpu/bench/split_table_batched_embeddings_benchmark.py, Lines 983 to 991 in b3e129c:

```python
if do_pooling:
    read_write_bytes = (
        output_size_multiplier * B * T * D + param_size_multiplier * B * T * L * D
    )
else:
    read_write_bytes = (
        output_size_multiplier * B * T * L * D
        + param_size_multiplier * B * T * L * D
    )
```
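As a worked example of what that formula change affects, here is an illustrative calculation of the reported bandwidth; a minimal sketch with made-up B, T, L, D and dtype sizes rather than the benchmark's actual defaults (only the 141 us figure is taken from the table above):

```python
# Illustrative only: shows how read_write_bytes feeds an achieved-bandwidth number,
# which is why it must match the pooling mode. B, T, L, D and the byte multipliers
# below are example values, not the benchmark's defaults.
B, T, L, D = 512, 10, 20, 32      # batch size, tables, pooling factor, embedding dim
param_size_multiplier = 4          # fp32 weights -> 4 bytes per element
output_size_multiplier = 4         # fp32 output  -> 4 bytes per element
do_pooling = False                 # --pooling=none

if do_pooling:
    read_write_bytes = (
        output_size_multiplier * B * T * D + param_size_multiplier * B * T * L * D
    )
else:
    # With no pooling the output has one row per lookup, hence the extra L factor.
    read_write_bytes = (
        output_size_multiplier * B * T * L * D
        + param_size_multiplier * B * T * L * D
    )

time_per_iter_s = 141e-6           # e.g. the D=32 "after" number from the table
print(f"achieved BW: {read_write_bytes / time_per_iter_s / 1e9:.1f} GB/s")
```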
fixed.
This would help us benchmark EmbeddingCollection in torchrec.
Thank you for your time in reviewing this PR :)