Add pooling mode to device bench #1194

Closed
zhuzilin wants to merge 2 commits

Conversation

zhuzilin (Contributor) commented Jul 8, 2022

This would help us benchmark EmbeddingCollection in torchrec.

Thank you for your time in reviewing this PR :)

netlify bot commented Jul 8, 2022

Deploy Preview for eclectic-stroopwafel-199537 ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | 741d944 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/eclectic-stroopwafel-199537/deploys/62d0e27ea4dd9e000aa5601a |
| 😎 Deploy Preview | https://deploy-preview-1194--eclectic-stroopwafel-199537.netlify.app |

facebook-github-bot pushed a commit that referenced this pull request Jul 13, 2022
Summary:
This is an attempt to optimize the nobag forward kernel for tables whose embedding dim is smaller than or equal to 32. I was exploring this because some of our production models have embedding_dim = 32. The optimization yields a 10% to 30% speedup for small embedding dims and could be applied to other kernels as well. However, it's worth noting that a 10% speedup on one kernel barely affects the overall training speed. Therefore, I'm totally fine either way on whether this optimization gets accepted; I'm just sharing some ideas we had so that others don't have to repeat the work :)

The main rationale is that the current implementation uses all 32 threads in a warp to load one embedding vector, which means that when the embedding dim is smaller than 128, some threads in the warp do nothing but wait. This PR splits the threads of a warp into groups, e.g. for embedding_dim=32 each group has 8 threads, and lets each group process one embedding vector instead of dedicating a whole warp to it.
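
To make the grouping concrete, here is a minimal CUDA sketch of the idea. This is not the actual FBGEMM kernel: the kernel name, the plain fp32 scalar copy, and the launch configuration are illustrative assumptions, and the real kernel presumably uses vectorized accesses (hence the 128-element figure above) and also handles offsets, per-sample weights, and mixed dims.

```cuda
// Sketch only: gather rows of an embedding table into an unpooled output,
// assigning kGroupSize threads (rather than a full 32-thread warp) to each row.
#include <cstdint>
#include <cuda_runtime.h>

template <int kGroupSize>
__global__ void nobag_forward_sketch(
    const float* __restrict__ weights,    // [num_rows, D] embedding table
    const int64_t* __restrict__ indices,  // [num_indices] rows to gather
    float* __restrict__ output,           // [num_indices, D] unpooled output
    int64_t num_indices,
    int D) {
  // blockDim.x is assumed to be a multiple of kGroupSize. Each group of
  // kGroupSize consecutive threads copies one row, so a 32-thread warp
  // handles 32 / kGroupSize rows at once instead of one.
  const int group_lane = threadIdx.x % kGroupSize;
  const int64_t row =
      static_cast<int64_t>(blockIdx.x) * (blockDim.x / kGroupSize) +
      threadIdx.x / kGroupSize;
  if (row >= num_indices) {
    return;
  }
  const int64_t src = indices[row];
  // Strided copy: thread g within the group writes elements g, g + kGroupSize, ...
  for (int d = group_lane; d < D; d += kGroupSize) {
    output[row * D + d] = weights[src * D + d];
  }
}

// Example launch for D = 32: 8 threads per row, so a 128-thread block covers 16 rows.
// nobag_forward_sketch<8><<<(num_indices + 15) / 16, 128>>>(
//     weights, indices, output, num_indices, /*D=*/32);
```

With kGroupSize = 8 and D = 32, each thread copies four elements, matching the 8-threads-per-vector grouping described above.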

The performance improvement from this trick was benchmarked with the device benchmark added in #1194:
```
python bench/split_table_batched_embeddings_benchmark.py device --pooling=none --iters=1000 --embedding-dim=$EMB_DIM
```
The results are:

|embedding dim|4|8|16|32|
|---|---|---|---|---|
|before (µs)|136|138|142|154|
|after (µs)|96|100|113|141|

Thank you for your time on this PR; it would be great if you could share your thoughts on this type of optimization :)

Pull Request resolved: #1197

Reviewed By: jasonjk-park

Differential Revision: D37739239

Pulled By: jianyuh

fbshipit-source-id: 23b35d74eed28f977793cb52e311ba0f824ac634
facebook-github-bot (Contributor) commented:
@colin2328 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

jianyuh (Member) commented Jul 14, 2022

Thanks! Just curious: in your use case, do you have both unpooled embeddings (none) and pooled embeddings (sum), or only unpooled embeddings (none)?

jianyuh self-requested a review on July 14, 2022 08:00
```diff
@@ -191,6 +205,7 @@ def device(  # noqa C901
     weights_precision=weights_precision,
     stochastic_rounding=stoc,
     output_dtype=output_dtype,
+    pooling_mode=pooling_mode,
```
jianyuh (Member) commented on Jul 14, 2022

Could you update the BW calculation formula in Line 254 and Line 279? Basically, referring to:

```python
if do_pooling:
    # Pooled output: one D-dim vector is written per (sample, table) pair,
    # while B * T * L embedding rows of dim D are read from the table.
    read_write_bytes = (
        output_size_multiplier * B * T * D
        + param_size_multiplier * B * T * L * D
    )
else:
    # Unpooled (sequence) output: every looked-up row is written back,
    # so the output term also scales with L.
    read_write_bytes = (
        output_size_multiplier * B * T * L * D
        + param_size_multiplier * B * T * L * D
    )
```
The number of bytes written for unpooled embeddings increases by a factor of L, since the kernel writes one D-dimensional row per lookup (B * T * L rows) rather than one pooled row per bag (B * T rows).

zhuzilin (Contributor, author) replied:

Fixed.

zhuzilin requested a review from jianyuh on July 15, 2022 03:44
facebook-github-bot (Contributor) commented:
@colin2328 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
