
Add nobag kernel for embedding_dim <= 32 #1197

Closed
wants to merge 1 commit

Conversation

@zhuzilin (Contributor) commented Jul 9, 2022

This is an attempt to optimize the nobag forward kernel for tables whose embedding dim is less than or equal to 32. I explored this because some of our production models use embedding_dim = 32. The optimization yields a 10%~30% speedup for small embedding dims and could be applied to other kernels as well. However, it's worth noting that a 10% speedup in one kernel barely affects overall training speed, so I'm fine either way on whether this optimization gets accepted; I'm mainly sharing the idea to save others from duplicating the work :)

The main rationale is that the current implementation uses all 32 threads in a warp to load one embedding vector (each thread handling 4 elements, so a warp covers 128), which means that when the embedding dim is smaller than 128, some threads in the warp do nothing but wait. This PR splits the threads into groups, e.g. for embedding_dim=32 each group has 8 threads, and lets each group, rather than the whole warp, process one embedding vector; see the sketch below.
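The sketch below illustrates the grouping scheme. It is a minimal, hypothetical reconstruction, not the actual FBGEMM kernel: the names (`nobag_forward_small_dim`, `kThreadsPerGroup`) are invented, the real code handles multiple dtypes and caching, and it assumes `D` is a multiple of 4 and rows are 16-byte aligned so they can be copied as `float4`.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Minimal sketch of the grouped-load idea (illustrative names, not the
// actual FBGEMM kernel). Assumes D is a multiple of 4 and each row is
// 16-byte aligned, so rows can be read as float4 chunks.
template <int kThreadsPerGroup>
__global__ void nobag_forward_small_dim(
    const float* __restrict__ weights,  // [num_rows, D] embedding table
    const int* __restrict__ indices,    // [num_indices] rows to gather
    float* __restrict__ output,         // [num_indices, D] gathered rows
    int num_indices,
    int D) {
  static_assert(32 % kThreadsPerGroup == 0,
                "group size must divide the warp size");
  // Split each warp into 32 / kThreadsPerGroup groups; every group copies
  // one embedding row, so no lane idles when D < 128.
  constexpr int kGroupsPerWarp = 32 / kThreadsPerGroup;
  const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
  const int lane = threadIdx.x % 32;
  const int group = lane / kThreadsPerGroup;
  const int lane_in_group = lane % kThreadsPerGroup;

  const int row = warp_id * kGroupsPerWarp + group;
  if (row >= num_indices) return;

  const float4* src = reinterpret_cast<const float4*>(
      weights + static_cast<int64_t>(indices[row]) * D);
  float4* dst = reinterpret_cast<float4*>(
      output + static_cast<int64_t>(row) * D);

  // Each lane in the group copies a strided share of the row's float4
  // chunks (for D = 32 and kThreadsPerGroup = 8, exactly one per lane).
  for (int i = lane_in_group; i < D / 4; i += kThreadsPerGroup) {
    dst[i] = src[i];
  }
}
```

With embedding_dim = 32 this would be instantiated as `nobag_forward_small_dim<8>`: each warp then serves 4 rows at once instead of 1, so a 256-thread block covers 32 rows where the warp-per-row scheme covers 8.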

The performance gain from this trick was benchmarked with #1194:

python bench/split_table_batched_embeddings_benchmark.py device --pooling=none --iters=1000 --embedding-dim=$EMB_DIM

The figures are:

| embedding dim | 4 | 8 | 16 | 32 |
| --- | --- | --- | --- | --- |
| before (µs) | 136 | 138 | 142 | 154 |
| after (µs) | 96 | 100 | 113 | 141 |

Thank you for your time on this PR, and it would be great if you could share your thoughts on this type of optimization :)


@facebook-github-bot (Contributor) commented

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jianyuh (Member) commented Jul 9, 2022

Thanks for the great optimization! A 10%~30% improvement for the embedding dim = 32 case at the op level is impressive. Wonder if you could also help evaluate the increase in binary size with this PR? Basically, check how much the generated fbgemm_gpu_*.so file grows.

@zhuzilin (Contributor, Author) commented

@jianyuh The binary size of fbgemm_gpu_py.so is 275873080 bytes before and 276987184 bytes after, an increase of roughly 1 MB.
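As for where that ~1 MB might come from: a plausible cause is one template instantiation per supported group size. Below is a purely hypothetical dispatch sketch, reusing the `nobag_forward_small_dim` template from the sketch above (the real FBGEMM launcher is code-generated and differs): each branch instantiates a distinct kernel, and every instantiation adds compiled code to the .so.

```cuda
// Hypothetical host-side dispatcher (illustrative only). Supporting dims
// 4/8/16/32 instantiates four separate kernels per dtype, which is one
// way the shared object can grow by on the order of a megabyte.
void launch_nobag_forward(const float* weights, const int* indices,
                          float* output, int num_indices, int D,
                          int blocks, int threads_per_block) {
  if (D <= 4) {         // 1 thread per row: 1 float4 chunk
    nobag_forward_small_dim<1><<<blocks, threads_per_block>>>(
        weights, indices, output, num_indices, D);
  } else if (D <= 8) {  // 2 threads per row
    nobag_forward_small_dim<2><<<blocks, threads_per_block>>>(
        weights, indices, output, num_indices, D);
  } else if (D <= 16) { // 4 threads per row
    nobag_forward_small_dim<4><<<blocks, threads_per_block>>>(
        weights, indices, output, num_indices, D);
  } else {              // D <= 32: 8 threads per row
    nobag_forward_small_dim<8><<<blocks, threads_per_block>>>(
        weights, indices, output, num_indices, D);
  }
}
```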
