
Ifu 2022 08 10 #15

Merged
merged 16 commits into from
Aug 11, 2022

Conversation


@liligwu liligwu commented Aug 10, 2022

No conflicts; all tests passed.
20220728_IFU_tests.log.txt

jianyuh and others added 16 commits July 28, 2022 20:00
Summary:
Pull Request resolved: pytorch#1226

As title

Reviewed By: brad-mengchi

Differential Revision: D38248038

fbshipit-source-id: 21e40b5f8d38126b6004a20ddb94637c62177ef0
…nference kernel (pytorch#1224)

Summary:
Pull Request resolved: pytorch#1224

While this is technically a reference implementation, I'm sure it'll show up somewhere and this at least gives a ~2-3x improvement in perf for parameters I looked at.

Reviewed By: jianyuh, yuguo68

Differential Revision: D38186856

fbshipit-source-id: 4c734a06fc7c40d54a3aac5a7a6e422cb8e7e194
Summary:
Pull Request resolved: pytorch#1225

Original commit changeset: ce8b7f9658c9

Original Phabricator Diff: D37988742 (pytorch@308dd51)

As commented in D37988742 (pytorch@308dd51), for alignment=16 + stride=16 case, it might be still efficient to do contiguous() instead of copy.

Long term, we still need to add alignment access support in Vec4T accessor.

Reviewed By: xing-liu, sryap

Differential Revision: D38218502

fbshipit-source-id: 703804846beb04eadbd1af528e990fbce89e30e4
Summary:
Pull Request resolved: pytorch#1229

* support long type for dense_to_jagged cpu op, as mirrored in GPU.

Reviewed By: xing-liu

Differential Revision: D38290666

fbshipit-source-id: c81e071f5fd38126812487305f63900b6dc7d08f
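As a rough illustration of what the op computes, here is a hypothetical pure-Python sketch of the `dense_to_jagged` semantics for integer (long) values; the function name and shapes mirror the op informally, while the real implementation lives in fbgemm_gpu as a registered CPU/GPU operator.

```python
# Hypothetical sketch: convert a padded dense batch into jagged values using
# per-row offsets. Integer (long) inputs pass through unchanged, which is the
# dtype support this commit adds on CPU.

def dense_to_jagged(dense, offsets):
    """dense: B rows, each padded to max_len; offsets: length B+1.
    Row b keeps its first offsets[b+1] - offsets[b] entries."""
    values = []
    for b in range(len(dense)):
        length = offsets[b + 1] - offsets[b]
        values.extend(dense[b][:length])
    return values

dense = [[10, 20, 0], [30, 0, 0]]   # padded to max_len = 3
offsets = [0, 2, 3]                 # row 0 has 2 entries, row 1 has 1
print(dense_to_jagged(dense, offsets))  # → [10, 20, 30]
```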
Summary:
Pull Request resolved: pytorch#1233

Extract asmjit/asmjit.h usages into a separate header file and make sure it is not called from PyTorch directly.

This Diff/PR is to unblock the FBGEMM third-party upgrade in PyTorch (e.g., pytorch/pytorch#82396), which consistently reports an <asmjit/asmjit.h> import error (pytorch#1227):

```
[ 40%] Building C object confu-deps/XNNPACK/CMakeFiles/all_microkernels.dir/src/qs8-gemm/gen/4x4-minmax-fp32-scalar-imagic.c.o
In file included from /var/lib/jenkins/workspace/caffe2/quantization/server/group_norm_dnnlowp_op_avx2.cc:10:
In file included from /var/lib/jenkins/workspace/third_party/fbgemm/include/fbgemm/QuantUtils.h:6:
/var/lib/jenkins/workspace/third_party/fbgemm/include/fbgemm/./Utils.h:18:10: fatal error: 'asmjit/asmjit.h' file not found
#include <asmjit/asmjit.h>
         ^~~~~~~~~~~~~~~~~
1 error generated.
make[2]: *** [caffe2/quantization/server/CMakeFiles/caffe2_dnnlowp_avx2_ops.dir/build.make:104: caffe2/quantization/server/CMakeFiles/caffe2_dnnlowp_avx2_ops.dir/group_norm_dnnlowp_op_avx2.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
[ 40%] Building C object confu-deps/XNNPACK/CMakeFiles/XNNPACK.dir/src/qc8-igemm/gen/4x16c8-minmax-fp32-avx512skx.c.o

```

This is because PyTorch does not have a direct asmjit dependency. The workaround is to split Utils.h into SimdUtils.h, so that the Utils.h included by PyTorch files no longer depends on asmjit.h.

Reviewed By: jasonjk-park, jiyuanzFB

Differential Revision: D38341763

fbshipit-source-id: 77bf92206d350b220dfe13d7a5488713b33e7b47
Summary:
Pull Request resolved: pytorch#1234

This patch adds multiple CUDA versions for OSS CI (build only). Previously on the OSS side, we tested only 11.3, but this patch expands it to 11.3, 11.5, and 11.6.

This CI is useful. For example, it revealed an issue in `cub_namespace_postfix.cuh`: this change was cherry-picked into CUB 1.13.1 (https://github.com/NVIDIA/cub/blob/1.13.X/cub/util_namespace.cuh), so the `CUB_VERSION` condition should be `CUB_VERSION >= 101301`.
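To see where `101301` comes from: CUB encodes its version number as `major * 100000 + minor * 100 + subminor`, so CUB 1.13.1 maps to 101301. A quick sketch of that encoding (the helper function is illustrative, not part of CUB):

```python
# CUB_VERSION packs (major, minor, subminor) into a single integer:
# major * 100000 + minor * 100 + subminor. CUB 1.13.1 → 101301, which is
# why the guard must be CUB_VERSION >= 101301.

def cub_version(major, minor, subminor):
    return major * 100000 + minor * 100 + subminor

assert cub_version(1, 13, 1) == 101301
assert cub_version(1, 13, 0) < 101301   # 1.13.0 predates the cherry-pick
```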

Q&A
 - Why don't we test multiple ROCm versions?
    - PyTorch provides two versions for ROCm at this point: 5.0 and 5.1.1, but FBGEMM does not support ROCm 5.0.
 - Why don't we test CUDA 11.2 or older?
   - I am not sure if FBGEMM supports CUDA <= 11.2. If needed, we can add it.
 - Why don't we test CUDA 11.4?
   - PyTorch does not currently provide a nightly CUDA 11.4 build.
 - Why don't we test CUDA 11.7?
   - It has some compilation issues.

Reviewed By: jianyuh, mjanderson09

Differential Revision: D38359867

fbshipit-source-id: 2a1b4735b0c31e87721ecb77cda1a8d907969afc
Summary:
The old error message could be misleading when the actual failure was not that `fbgemm_gpu_py.so` cannot be found, e.g. when the CPU version of PyTorch is installed and `libc10_cuda.so` cannot be found. The user can still tell when `fbgemm_gpu_py.so` is missing, as the error message is now:
```
/path/to/fbgemm_gpu/fbgemm_gpu_py.so: cannot open shared object file: No such file or directory
```
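A minimal sketch of why surfacing the loader's own error is clearer (the `load_extension` helper is hypothetical, not FBGEMM's actual loading code): `ctypes` reports exactly which path failed and why, instead of a generic "not found" message that could hide a missing dependency such as `libc10_cuda.so`.

```python
# Hypothetical sketch: load a shared object and propagate the dynamic
# loader's own OSError text, which names the failing path and reason.
import ctypes

def load_extension(so_path):
    try:
        return ctypes.CDLL(so_path)
    except OSError as e:
        # e.g. "/path/to/fbgemm_gpu/fbgemm_gpu_py.so: cannot open shared
        # object file: No such file or directory"
        raise ImportError(f"Could not load {so_path}: {e}") from e
```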

Thank you for your time on this PR:)

Pull Request resolved: pytorch#1235

Reviewed By: mjanderson09

Differential Revision: D38389494

Pulled By: brad-mengchi

fbshipit-source-id: 55738a45a4bb5d9d385e8ff082b1deb9c2f5e7dc
Summary:
Pull Request resolved: pytorch#1237

- PyTorch has not updated its CUDA 11.5 .whl files to the latest PyTorch (1.13), while the CUDA 11.3 .whl files have been updated
- PyTorch built with CUDA 11.3 will work with later CUDA versions, as CUDA is backward compatible
- Suggest we set a policy of supporting building/installing FBGEMM with the PyTorch .whl built with CUDA 11.3, as this seems to be their preferred build. This guidance belongs in our README and also the torchrec README.
- This diff updates CI to install the PyTorch .whl built with CUDA 11.3, even when testing FBGEMM with later CUDA versions

Reviewed By: shintaro-iwasaki

Differential Revision: D38416122

fbshipit-source-id: 5dbba43e7b692a87205a2adc9b7b525dffd91569
Summary:
Pull Request resolved: pytorch#1232

Sequence Embedding on CPU
- Enabled FP32/FP16/INT8 versions
- Scalar embedding implementation (performance may be slow)
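A hypothetical scalar sketch of what a sequence embedding lookup does (names and layout are illustrative, not FBGEMM's API): one output row per index, copied element by element with no vectorization, which is why the commit notes performance may be slow. For the INT8 path, each row is dequantized with a per-row scale and bias.

```python
# Scalar (non-SIMD) sequence embedding lookup sketch. FP32: plain row copy.
# INT8: dequantize each element as value = q * scale[row] + bias[row].

def sequence_embedding_fp32(weights, indices):
    # weights: list of rows (list of floats); indices: list of ints
    return [[w for w in weights[i]] for i in indices]

def sequence_embedding_int8(qweights, scales, biases, indices):
    out = []
    for i in indices:
        out.append([q * scales[i] + biases[i] for q in qweights[i]])
    return out

weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(sequence_embedding_fp32(weights, [2, 0]))  # → [[5.0, 6.0], [1.0, 2.0]]
```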

Reviewed By: jianyuh

Differential Revision: D36049936

fbshipit-source-id: c15c0d9b9fa214939b262cf9c7ec543c4848b4c5
Summary: Pull Request resolved: pytorch#1242

Reviewed By: ajtulloch

Differential Revision: D38560260

fbshipit-source-id: 321da1a8894dc336684f9cc945fab658ec776a6c
Summary: Pull Request resolved: pytorch#1243

Reviewed By: jiyuanzFB

Differential Revision: D38561337

fbshipit-source-id: 2d24278014d39318f5e6f76f17cf57760e1866c4
@amathews-amd

LGTM

@liligwu liligwu merged commit 92c5f2e into main Aug 11, 2022
liligwu added a commit that referenced this pull request Feb 8, 2023
* Enable merge_pooled_embeddings op. in ROCm

* Enabling the merge pool ops.

Co-authored-by: liligwu <[email protected]>