Add pooling mode to device bench #1194
Conversation
Summary: This is an attempt to optimize the nobag forward kernel for tables whose embedding dimension is smaller than or equal to 32. I was exploring this because some of our production models have embedding_dim = 32. The optimization yields a 10%~30% speedup for small embedding dims and could be applied to other kernels. However, it's worth noting that a 10% speedup on one kernel has barely any effect on overall training speed, so I'm fine either way on whether this optimization gets accepted; I'm mainly sharing the idea to save others from repeating the work :)

The main rationale is that the current implementation uses all 32 threads in a warp to load one embedding vector, which means that when the embedding dim is smaller than 128, some threads in the warp do nothing but wait. This PR splits the threads into groups, e.g. for embedding_dim=32 each group has 8 threads, and lets each group (rather than a full warp) process one embedding vector.

The performance gain from this trick is benchmarked with #1194:

```
python bench/split_table_batched_embeddings_benchmark.py device --pooling=none --iters=1000 --embedding-dim=$EMB_DIM
```

The figures are:

| embedding dim | 4 | 8 | 16 | 32 |
|---------------|---|---|----|----|
| before (us) | 136 | 138 | 142 | 154 |
| after (us) | 96 | 100 | 113 | 141 |

Thank you for your time on this PR, and it would be great if you could share your thoughts on this type of optimization :)

Pull Request resolved: #1197

Reviewed By: jasonjk-park

Differential Revision: D37739239

Pulled By: jianyuh

fbshipit-source-id: 23b35d74eed28f977793cb52e311ba0f824ac634
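To make the utilization argument concrete, here is a small illustrative Python calculation (not code from this PR); it assumes each thread loads four fp32 elements per vector load, which is what makes 32 threads cover exactly 128 elements as described above:

```python
# Illustrative arithmetic only, not FBGEMM code: estimates how many threads in a
# 32-thread warp do useful work when one embedding vector is loaded per warp,
# versus how many vectors a warp can cover when threads are split into groups.
WARP_SIZE = 32
ELEMENTS_PER_THREAD = 4  # assumption: one 4-float vectorized load per thread

def active_threads_per_vector(embedding_dim: int) -> int:
    """Threads that actually have elements to load for one embedding vector."""
    return min(WARP_SIZE, -(-embedding_dim // ELEMENTS_PER_THREAD))  # ceil division

for d in (4, 8, 16, 32, 128):
    active = active_threads_per_vector(d)
    groups_per_warp = WARP_SIZE // active
    print(
        f"D={d:>3}: {active:>2}/{WARP_SIZE} threads busy per vector "
        f"-> {groups_per_warp} vectors per warp if threads are grouped"
    )
```

For D=32 this gives groups of 8 threads and four vectors per warp, matching the grouping described in the summary, and it is consistent with the gains in the table shrinking as D approaches 128.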
@colin2328 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks! Just curious: in your use case, do you have both unpooled embedding (none) and pooled embedding (sum), or only unpooled embedding (none)?
```diff
@@ -191,6 +205,7 @@ def device( # noqa C901
         weights_precision=weights_precision,
         stochastic_rounding=stoc,
         output_dtype=output_dtype,
+        pooling_mode=pooling_mode,
```
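For context, this is roughly how a `--pooling` option can map onto fbgemm_gpu's pooling modes; a minimal sketch, assuming the `PoolingMode` enum from the `split_table_batched_embeddings_ops` module of that era, with `parse_pooling` being a hypothetical helper rather than the bench's actual code:

```python
# Hypothetical helper, not the PR's code: maps the bench's --pooling string to a
# PoolingMode value that can be passed through as pooling_mode=pooling_mode above.
from fbgemm_gpu.split_table_batched_embeddings_ops import PoolingMode

def parse_pooling(pooling: str) -> PoolingMode:
    modes = {
        "sum": PoolingMode.SUM,
        "mean": PoolingMode.MEAN,
        "none": PoolingMode.NONE,  # unpooled, i.e. sequence embeddings
    }
    return modes[pooling]
```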
Could you update the BW calculation formula in Line 254 and Line 279? Basically referring to FBGEMM/fbgemm_gpu/bench/split_table_batched_embeddings_benchmark.py, Lines 983 to 991 in b3e129c:

```python
if do_pooling:
    read_write_bytes = (
        output_size_multiplier * B * T * D + param_size_multiplier * B * T * L * D
    )
else:
    read_write_bytes = (
        output_size_multiplier * B * T * L * D
        + param_size_multiplier * B * T * L * D
    )
```
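As a worked example of what that formula change affects, here is an illustrative calculation of the reported bandwidth; a minimal sketch with made-up B, T, L, D and dtype sizes rather than the benchmark's actual defaults (only the 141 us figure is taken from the table above):

```python
# Illustrative only: shows how read_write_bytes feeds an achieved-bandwidth number,
# which is why it must match the pooling mode. B, T, L, D and the byte multipliers
# below are example values, not the benchmark's defaults.
B, T, L, D = 512, 10, 20, 32      # batch size, tables, pooling factor, embedding dim
param_size_multiplier = 4          # fp32 weights -> 4 bytes per element
output_size_multiplier = 4         # fp32 output  -> 4 bytes per element
do_pooling = False                 # --pooling=none

if do_pooling:
    read_write_bytes = (
        output_size_multiplier * B * T * D + param_size_multiplier * B * T * L * D
    )
else:
    # With no pooling the output has one row per lookup, hence the extra L factor.
    read_write_bytes = (
        output_size_multiplier * B * T * L * D
        + param_size_multiplier * B * T * L * D
    )

time_per_iter_s = 141e-6           # e.g. the D=32 "after" number from the table
print(f"achieved BW: {read_write_bytes / time_per_iter_s / 1e9:.1f} GB/s")
```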
fixed.
This would help us benchmark EmbeddingCollection in torchrec.
Thank you for your time in reviewing this PR :)