Add nobag kernel for embedding_dim <= 32 #1197

zhuzilin · 2022-07-09T13:49:50Z

This is an attempt to optimize the nobag forward kernel for tables whose embedding dim is less than or equal to 32. I explored this because some of our production models have embedding_dim = 32. The optimization yields a roughly 10%~30% speedup for small embedding_dim and could be applied to other kernels. That said, a 10% speedup on a single kernel will barely affect the overall training speed, so I'm totally fine whether or not this optimization gets accepted. I'm just sharing some ideas we had, to save others from repeating the work :)

The main rationale is that the current implementation uses all 32 threads in a warp to load 1 embedding vector, which means that when the embedding dim is smaller than 128, some threads in the warp do nothing but wait. This PR splits the threads of a warp into groups (e.g. for embedding_dim=32, each group has 8 threads) and lets each group, rather than each warp, process one embedding vector.
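
The grouping idea can be modelled on the host with a few lines of Python. This is only a rough sketch of the lane-to-row mapping, not the actual kernel code: it assumes each thread loads 4 elements at a time (which would be consistent with the 128 and 8-threads-per-group figures above) and that group sizes are rounded up to a power of two. The names `group_size`, `lane_assignment`, and `active_lanes_warp_per_row` are hypothetical.

```python
# Illustrative model only: mimics how lanes could map to rows, not the FBGEMM kernel.
WARP_SIZE = 32
ELEMS_PER_THREAD = 4  # assumption: one thread loads 4 elements (32 * 4 = 128)


def group_size(embedding_dim):
    """Smallest power-of-two thread count that covers one embedding vector."""
    needed = -(-embedding_dim // ELEMS_PER_THREAD)  # ceiling division
    size = 1
    while size < needed:
        size *= 2
    return min(size, WARP_SIZE)


def lane_assignment(lane, embedding_dim):
    """Return (row index within the warp, first element offset) for a lane."""
    size = group_size(embedding_dim)
    return lane // size, (lane % size) * ELEMS_PER_THREAD


def active_lanes_warp_per_row(embedding_dim):
    """Lanes doing useful work when a whole warp loads a single row."""
    return min(WARP_SIZE, -(-embedding_dim // ELEMS_PER_THREAD))


for dim in (4, 8, 16, 32):
    size = group_size(dim)
    print(
        f"dim={dim:2d}: groups of {size} threads, "
        f"{WARP_SIZE // size} rows per warp, "
        f"{active_lanes_warp_per_row(dim)}/{WARP_SIZE} lanes busy with one row per warp"
    )
```

Under these assumptions, embedding_dim=32 gives groups of 8 threads and 4 rows per warp, whereas the warp-per-row scheme would keep only 8 of the 32 lanes busy.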

The speedup from this trick was benchmarked with #1194:

python bench/split_table_batched_embeddings_benchmark.py device --pooling=none --iters=1000 --embedding-dim=$EMB_DIM

The results are:

| embedding dim | 4 | 8 | 16 | 32 |
|---|---|---|---|---|
| before (us) | 136 | 138 | 142 | 154 |
| after (us) | 96 | 100 | 113 | 141 |
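
For reference, the relative gain per embedding dim can be derived from the table above with a short script. The helper name `time_reduction_pct` is made up for this illustration; the timings are copied from the table.

```python
# Kernel times in microseconds, copied from the table above.
before_us = {4: 136, 8: 138, 16: 142, 32: 154}
after_us = {4: 96, 8: 100, 16: 113, 32: 141}


def time_reduction_pct(before, after):
    """Percentage of the original kernel time that was saved."""
    return (before - after) / before * 100.0


for dim in before_us:
    pct = time_reduction_pct(before_us[dim], after_us[dim])
    print(f"embedding dim {dim:2d}: {pct:.1f}% less time")
```

This works out to roughly 29%, 28%, 20%, and 8% for embedding dims 4, 8, 16, and 32, so the gain shrinks as the embedding dim grows and the dim 32 case lands slightly below the quoted 10%~30% range.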

Thank you for your time on this PR. It would be great if you could share your thoughts on this type of optimization :)

netlify · 2022-07-09T13:49:55Z

✅ Deploy Preview for eclectic-stroopwafel-199537 canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | `dfac00a` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/eclectic-stroopwafel-199537/deploys/62c9877f666134000826256c |

facebook-github-bot · 2022-07-09T20:47:47Z

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

jianyuh · 2022-07-09T21:12:08Z

Thanks for the great optimization! A 10%~30% improvement at the op level for the embedding dim <= 32 case is impressive. Could you also help evaluate how much this PR increases the binary size? Basically, check how much the generated fbgemm_gpu_*.so file grows.

zhuzilin · 2022-07-10T02:02:25Z

@jianyuh The binary size of fbgemm_gpu_py.so is 275873080 bytes before and 276987184 bytes after, an increase of 1114104 bytes (around 1.1 MB).

Add nobag kernel for embedding_dim <= 32

dfac00a

facebook-github-bot added the cla signed label Jul 9, 2022

facebook-github-bot closed this in fbd89e8 Jul 13, 2022
