ROCm changes #1102

Closed
wants to merge 53 commits into from

Conversation

@liligwu (Contributor) commented May 5, 2022

Enabling FBGEMM on AMD devices.

jithunnair-amd and others added 30 commits January 25, 2022 14:18
* Hipify code

* Add correctness check

* Revert "Add correctness check"

This reverts commit a7f169dcc862e5cc8102a39eb3b7882dfa888f1b.

* Fix setup.py

* Add run_all.sh

* Update Zipf index generation

Update the Zipf index generation to generate unique indices in each bag
and shuffle indices to avoid spatial locality

Code reference: https://github.com/pytorch/FBGEMM/blob/7588d9d804826b428fc0e4fd418e9cc3f7a72e52/fbgemm_gpu/bench/split_table_batched_embeddings_benchmark.py#L98-L117
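For illustration, the sampling scheme can be sketched roughly as follows (a C++ sketch of the same idea; the benchmark itself is Python, the sampler below is only an approximate heavy-tailed draw, and all names are illustrative):

```cpp
// Hedged sketch: draw unique, roughly Zipf-distributed indices for one bag,
// then shuffle them so consecutive lookups do not hit neighboring rows.
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

std::vector<int64_t> sample_bag(int64_t num_rows, int bag_size, double alpha, std::mt19937& rng) {
  assert(bag_size <= num_rows);  // otherwise unique sampling cannot terminate
  std::uniform_real_distribution<double> uni(0.0, 1.0);
  std::unordered_set<int64_t> seen;
  std::vector<int64_t> bag;
  while (static_cast<int>(bag.size()) < bag_size) {
    double u = 1.0 - uni(rng);  // u in (0, 1]
    // Crude inverse-CDF-style heavy-tailed draw, folded into [0, num_rows).
    double v = std::fmod(std::pow(u, -1.0 / alpha), static_cast<double>(num_rows));
    int64_t idx = static_cast<int64_t>(v);
    if (seen.insert(idx).second) {  // keep indices unique within the bag
      bag.push_back(idx);
    }
  }
  std::shuffle(bag.begin(), bag.end(), rng);  // break spatial locality
  return bag;
}
```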

* Fix ROCm version check in fbgemm_gpu's setup.py

* Fix hipification errors

Modify code to fix hipification errors. Some ops/kernels, including merge_pooled_embeddings, quantize_ops, and the embedding_forward_quantized_split ops, are currently disabled. These ops will be enabled in the future.

* Disable AVX512 for AMD CPUs

AMD CPUs do not support AVX512, so it has to be disabled in the ROCm build.

* Update run_all.sh

* Fix __launch_bounds__ with kWarpSize.

* fix missing '#endif' in codegen/embedding_backward_code_generator.py

* fix the dependencies import in setup.py

* debug enum cudaMemoryAdvise

* work around the fact that both cudaMemoryAdvise and cudaMemAdvise are mapped to hipMemAdvise, in cumem_utils.cu

* Build and import succeed, but NaN values are produced.

* NaN values are eliminated by bypassing res.vals[0] = hfma2(...)

* Remove debug lines in include/fbgemm_gpu/fbgemm_cuda_utils.cuh

Note: The tests of fbgemm-gpu do not pass. They will be addressed in future commits.

Co-authored-by: Sarunya Pumma <[email protected]>
Co-authored-by: Li Li <[email protected]>
Co-authored-by: liligwu <[email protected]>
Rocm4.3/develop. Use SHEFL_SYNC_MACRO to replace __shefl() and __shefl_sync()
* Change hipify dependency from torch.utils.torch_hipify to hipify_torch.

* add the third_party/hipify_torch to git repo
* unify function signature of jagged_xD_to_dense (pytorch#813)

Summary:
Pull Request resolved: pytorch#813

As title

Reviewed By: jiaqizhai, jianyuh

Differential Revision: D33066551

fbshipit-source-id: 8e2fd3c21f3bde67c6b20045681c2549e3583bd3

* Daily `arc lint --take CLANGFORMAT`

Reviewed By: zertosh

Differential Revision: D33183467

fbshipit-source-id: d7c37f3522a38e85891524c544eab4fdb01270de

* Assert Tensors allocated on GPU. (pytorch#819)

Summary:
Pull Request resolved: pytorch#819

Check inputs for correctness with respect to GPU allocation and device.

Reviewed By: jspark1105, jianyuh

Differential Revision: D33167469

fbshipit-source-id: 04f638d13bde93373d64cff1428ef743300400a6

* Support batched benchmark execution and fix benchmark stats reporting (pytorch#818)

Summary:
Pull Request resolved: pytorch#818

As title: support multiple executions of benchmark scripts and report aggregated metrics.

Further, require the `--bag-size` argument to conform to the input data file for proper metric accounting.

Reviewed By: jianyuh

Differential Revision: D33182257

fbshipit-source-id: a6eeeb25646c00665b6d29df9389eddab7618d4e

* Direct Convolution JIT assembly for KH=2, KW = 6

Summary:
This diff adds specialized codegen for the convolution case where KH=2 and KW=6.

## Performance results on a local devserver with AVX2 instructions:

1, 16, 16, {2, 126}, 1, {2, 6}, {1, 2}, {0, 0, 0, 0}, {1, 1}, {0, 0}, false
Fbgemm baseline: 3.8 GOPS
This diff: 9.2 GOPS

1, 64, 64, {2, 257}, 1, {2, 6}, {1, 2}, {0, 0, 0, 0}, {1, 1}, {0, 0}, false
Fbgemm baseline: 43.8 GOPS
This diff: 61.2 GOPS

## How to invoke the direct convolution function:
**At offline:**
1. Weights need to be transposed to (oc/8) - (kh) - (kw) - (ic/4) - 8 - 4
2. Create the convolution function based on problem size:
```
       CodeGenBase<uint8_t, int8_t, int32_t, int32_t> codeObj;
       CodeGenBase<uint8_t, int8_t, int32_t, int32_t>::jit_micro_kernel_fp fn;
       fn = codeObj.getOrCreateDirectConv<inst_set_t::avx2>(
        true, conv_p.OUT_DIM[1], conv_p.IN_DIM[1] * conv_p.IC, conv_p.stride[1] * conv_p.IC);
```
3. Compute the *col_offsets* of weight tensor
4. Make sure you have allocated the space for: output tensor (Cint32_fb, Cint8_fb), and some temporary space for input rowsum ( InSum: IN_DIM[0] x IN_DIM[1], rowSum: OUT_DIM[0] x OUT_DIM[1])

**Online:**
Make sure we have: conv_p (the problem info), Aint8 (the input tensor), bBuf_tr (the transposed weight tensor), Cint32_fb (the 32-bit results after accumulation), and Cint8_fb (the final quantized 8-bit output).

```
// compute direct conv row sum
directConvRowSum(conv_p, Aint8.data(), inSum, rowSum, row_offsets.data());

// kernel for direct convolution
for (int oc = 0; oc < conv_p.OC; oc += 8) {
  fn(Aint8.data(),
     bBuf_tr.data() + oc * kernel_dim * conv_p.IC,
     bBuf_tr.data(),
     Cint32_fb.data() + oc,
     conv_p.IC * conv_p.K[1],
     conv_p.OC);
}

requantizationParams_t<> reqObj = {
    Aint8_zero_point, // Aq_zero_point
    Bint8_zero_point.data(),
    C_zero_point,
    C_multiplier.data(),
    rowSum, // row_offsets
    // row_offsets.data(),
    col_offsets.data(), // col_offsets
    nullptr, // bias
    static_cast<std::uint32_t>(conv_p.OC), // ncols
    1, // groups
    nullptr};

requantizeOutputProcessingAvx2<false, false, QuantizationGranularity::TENSOR, false, false>(
    Cint8_fb.data(),
    Cint32_ref.data(),
    {0, conv_p.OUT_DIM[1] * conv_p.OUT_DIM[0], 0, conv_p.OC},
    conv_p.OC,
    conv_p.OC,
    reqObj);
```
For more details please refer to test_asmjit2.cc

Reviewed By: dskhudia

Differential Revision: D31775222

fbshipit-source-id: 294450613b0978277e75d171d6a560124c14ecda

* suppress errors in `deeplearning/fbgemm/fbgemm_gpu`

Differential Revision: D33201593

fbshipit-source-id: 251f338e03dfde1dcc4a83c4ff9df1fe27840bdb

* fix copy right header of batch_benchmark_run.py (pytorch#820)

Summary:
Pull Request resolved: pytorch#820

As title

Reviewed By: jianyuh

Differential Revision: D33213812

fbshipit-source-id: d901e87ff1047ff969c99a330aa05c8d26e1954e

* Assert Tensors allocated on GPU for generated code. (pytorch#821)

Summary:
Pull Request resolved: pytorch#821

Check inputs for correctness with respect to GPU allocation and device.

Reviewed By: jspark1105

Differential Revision: D33189944

fbshipit-source-id: 36fb5eac677466e783ef5a754c28b6d838ea09b7

* Move all fbgemm_gpu provided Python ops to fbgemm namespace from fb namespace. (pytorch#823)

Summary: Pull Request resolved: pytorch#823

Reviewed By: jianyuh

Differential Revision: D33147038

fbshipit-source-id: fdcb667dfb920b4f04b7d0b08082afabe7213cc1

* Implement generic HBC by feature. (pytorch#822)

Summary:
Pull Request resolved: pytorch#822

Implement a generic version of HBC by feature, which takes in bin_boundaries.

Reviewed By: jianyuh

Differential Revision: D33232676

fbshipit-source-id: 99c77f6d081fdc89699948a6c9482b8806f598a3

* Benchmark for newly added generic HBC by feature. (pytorch#826)

Summary:
Pull Request resolved: pytorch#826

More benchmarking for new op, and also add "double" for benchmarking type.

Reviewed By: jianyuh

Differential Revision: D33241845

fbshipit-source-id: 38f08f5453fd8d112ff55c046a6ac091c23bc3de

* Always set dontfork on managed Tensor + new uvm clone (pytorch#824)

Summary:
Pull Request resolved: pytorch#824

Workaround for S256045.
UVM Tensors are unmapped from the process page table on fork (spawn).
The UVM fault handler then slows down the UVM CPU<->CPU copy substantially reestablishing those mappings.
The workaround sets MADV_DONTFORK on the addresses (rounded down to page size) of UVM allocations - this prevents the removal of UVM pages from the original process page table.
Additionally this introduces a single-threaded UVM->CPU tensor copy to
1) avoid 8 trainers on a host concurrently occupying all threads with copy_
2) avoid high concurrency in the fault handler of the UVM kernel driver.
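A minimal sketch of the madvise part of this workaround, assuming `ptr`/`size` describe a cudaMallocManaged (UVM) allocation (names are illustrative; the actual change lives in fbgemm_gpu's UVM utilities):

```cpp
// Hedged sketch: keep UVM pages mapped in the parent across fork (spawn).
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

void mark_uvm_dontfork(void* ptr, size_t size) {
  const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  // madvise works on whole pages: round the start down and the length up.
  const uintptr_t begin = reinterpret_cast<uintptr_t>(ptr) & ~(page - 1);
  const uintptr_t end = (reinterpret_cast<uintptr_t>(ptr) + size + page - 1) & ~(page - 1);
  // MADV_DONTFORK excludes these pages from a forked child's address space,
  // so the parent's UVM mappings are not torn down on fork.
  madvise(reinterpret_cast<void*>(begin), end - begin, MADV_DONTFORK);
}
```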

Reviewed By: jianyuh

Differential Revision: D33192043

fbshipit-source-id: 094f3dcd302d455efbf4e912d58ed28756cb653f

* Use kWarpSize for warp size (pytorch#827)

Summary: Pull Request resolved: pytorch#827

Reviewed By: rweyrauch

Differential Revision: D33271792

fbshipit-source-id: dc66b6950b37e5d92c10406a3891568a7500e26e

* Move fb.embedding_bag_rowwise_prune to fbgemm_gpu OSS. (pytorch#825)

Summary:
Pull Request resolved: pytorch#825

Move the fb.embedding_bag_rowwise_prune op from caffe2/fb/sparsenn to fbgemm_gpu.

Reviewed By: jianyuh

Differential Revision: D33240318

fbshipit-source-id: 4db93a1ecd9666881779eeada1e3e493aa7525e4

* Allow optional Tensor args to be empty or on GPU. (pytorch#828)

Summary: Pull Request resolved: pytorch#828

Reviewed By: jianyuh

Differential Revision: D33267641

fbshipit-source-id: b193ee5b7e9ea946a20672760c320f29b217b998

* Add output_dtype to training TBE op for CPU (pytorch#829)

Summary:
Pull Request resolved: pytorch#829

This Diff adds `output_dtype` to `split_embedding_codegen_lookup_{{ optimizer }}_function_cpu()`. Note that the CUDA version (`split_embedding_codegen_lookup_{{ optimizer }}_function()`) already has this argument (D32399931 (pytorch@7e1183c)).

Reviewed By: jianyuh

Differential Revision: D32969921

fbshipit-source-id: 695e54434dc4f65f9f4c60782c60a550e38d97a7

* fix copyright header of tensor_assert_test.cpp (pytorch#831)

Summary:
Pull Request resolved: pytorch#831

As title

Reviewed By: rweyrauch

Differential Revision: D33310866

fbshipit-source-id: 1cbdee1d7c00f0e900faac570bac330866887b1c

* Add permute_pooled_embedding_modules_test into RE (pytorch#830)

Summary:
Pull Request resolved: pytorch#830

As title

Reviewed By: rweyrauch

Differential Revision: D33303898

fbshipit-source-id: c94a14bc398ecb58b68ca15d7e79204233ac67d1

* Use all to one op to do DtoD between remote and merge (pytorch#817)

Summary:
Pull Request resolved: pytorch#817

Previously we were simply calling `Tensor.to` to launch DtoD copy. Since PyTorch is doing two-way barrier for DtoD copy, all the DtoD copies are serialized even though they are launched from different devices.

See the blue DtoD copies in the graph below.
{F686842812}

At first I went for merge_pooled_embedding directly, but I forgot that MRS models also have sequence embeddings. Covering pooled embeddings is not enough in this case.

This diff introduces a function that takes in a tuple of IValues, moves the underlying tensors to a given target device, and then outputs a vector of IValues with the underlying tensors on that device.

For each source device, we synchronize its current stream and launch all the copies for tensors on that device. Then we synchronize the current stream on the target device to wait on all the copies.

Now the copies from different devices can run in parallel.
{F686843333}
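A rough CUDA-runtime sketch of this pattern (not the actual fbgemm_gpu op, which operates on tuples of IValues; all names below are illustrative): each source device launches its copy on its own stream, and the target stream waits on per-copy events, so copies from different sources can overlap.

```cpp
// Hedged sketch: gather buffers from several source GPUs onto one target GPU
// without serializing the device-to-device copies behind one another.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

struct SrcBuffer {
  int device;       // source GPU
  const void* src;  // source pointer on `device`
  void* dst;        // destination pointer on the target GPU
  size_t bytes;
};

void all_to_one_copy(const std::vector<SrcBuffer>& bufs, int target_device, cudaStream_t target_stream) {
  for (const auto& b : bufs) {
    cudaSetDevice(b.device);
    cudaStream_t src_stream;
    cudaStreamCreateWithFlags(&src_stream, cudaStreamNonBlocking);
    // Launch the copy on the *source* device's stream so different sources run in parallel.
    cudaMemcpyPeerAsync(b.dst, target_device, b.src, b.device, b.bytes, src_stream);
    // Have the target stream wait for this copy instead of blocking the host.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    cudaEventRecord(done, src_stream);
    cudaSetDevice(target_device);
    cudaStreamWaitEvent(target_stream, done, 0);
    // Destruction is deferred by the driver until the pending work completes.
    cudaEventDestroy(done);
    cudaStreamDestroy(src_stream);
  }
}
```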

Reviewed By: yinghai, jianyuh, houseroad

Differential Revision: D33065710

fbshipit-source-id: f479fa2ea20702e14419c8b87024a87d5bbb1a68

* Add MSFP option for ads hpc model numeric emulations (pytorch#832)

Summary:
Pull Request resolved: pytorch#832

Add fake conversions between MSFP and fp32 in both forward and backward pass of the hpc ads model training.

TODO: Add compute kernels that split the FC operator into gemms for column_blocks of activations and row_blocks of weights

Reviewed By: jspark1105

Differential Revision: D30942234

fbshipit-source-id: 601d671fd00622304a50651dedffd0de3ae01ae0

* Remove benchmark CMakeLists.txt (pytorch#835)

Summary:
Pull Request resolved: pytorch#835

As title. This file is no longer needed after we decided to support the setup.py-only OSS build approach.

Reviewed By: jspark1105, rweyrauch

Differential Revision: D33318121

fbshipit-source-id: 4f71b23f6e9e7e78d50fab20af53cdf9f63844ad

* Increase code reuse between FP32, FP16, INT8, INT4 embedding types for infer TBE (pytorch#833)

Summary:
Pull Request resolved: pytorch#833

We merge the implementations for {FP32, FP16, INT8, INT4} weights in inference TBE into one unified template and increase the code reuse between them. This will pave the way for future enhancements (no need to change all 4 implementations for one new feature).
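Schematically, the unification looks like the sketch below: one lookup routine templated on the weight type, with only the per-element dequantization specialized per type (made-up names, not the actual generated kernels):

```cpp
// Hedged sketch of sharing one lookup path across FP32/FP16/INT8/INT4 weights.
#include <cstdint>

enum class WeightType { FP32, FP16, INT8, INT4 };

// Only this helper differs between weight types.
template <WeightType W>
float load_and_dequantize(const uint8_t* row, int col);

template <>
float load_and_dequantize<WeightType::FP32>(const uint8_t* row, int col) {
  return reinterpret_cast<const float*>(row)[col];
}
// ... FP16 / INT8 / INT4 specializations would apply the per-row scale/bias ...

// The traversal, pooling, and output logic is written once for all types.
template <WeightType W>
void accumulate_row(const uint8_t* row, int dim, float* out) {
  for (int d = 0; d < dim; ++d) {
    out[d] += load_and_dequantize<W>(row, d);
  }
}
```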

Reviewed By: rweyrauch

Differential Revision: D33343450

fbshipit-source-id: 24e59c4a2df5ef3da353535eb879a2365293bc1f

* minimize functions defined in headers (pytorch#836)

Summary:
Pull Request resolved: pytorch#836

We had a lot of code in header files that didn't need to be there.
Split long source files.
Put experimental quantization functions into an experimental namespace.

Reviewed By: rweyrauch

Differential Revision: D33358916

fbshipit-source-id: cffcec344cbe565045ee2c564ce1cef529de4cf8

* add missing C10_CUDA_KERNEL_LAUNCH_CHECK (pytorch#837)

Summary:
Pull Request resolved: pytorch#837

As title

Reviewed By: rweyrauch

Differential Revision: D33359025

fbshipit-source-id: 162dd2897a5d56e7ac8ff3ba9ae5c8689961204b

* Add seq embedding kernel for infer TBE (pytorch#834)

Summary:
Pull Request resolved: pytorch#834

- Add sequence embedding support in infer TBE kernel

- TODO: "mask" solution for the duplicated embedding row access. cc jspark1105

Reviewed By: jspark1105

Differential Revision: D33341863

fbshipit-source-id: 47babe921dbaf086e2df92f4693b4718c01bcec1

* add missing new files to CMakeLists.txt (pytorch#838)

Summary:
Pull Request resolved: pytorch#838

This was missed in D33358916 (pytorch@38a6c35)

Reviewed By: colin2328

Differential Revision: D33370387

fbshipit-source-id: 72007f51afd6757690a1898098e8b6207c3c487b

* Support int32_t indices/offsets for caching handling logics (pytorch#811)

Summary:
Pull Request resolved: pytorch#811

In training, we assume the indices / offsets are int64_t for embedding (TBE), but in inference, we assume the indices / offsets are int32_t.

This Diff enables both int32_t and int64_t support for the caching logic so that we can reuse the same functions for both training and inference, while reducing the extra overhead of converting the indices/offsets from int to long or vice versa.
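In other words, the caching routines become templates over the index type, with one dispatch point on the tensor's dtype (an illustrative sketch, not the exact fbgemm_gpu code):

```cpp
// Hedged sketch: one caching routine shared by int32_t (inference) and int64_t (training) indices.
#include <cstdint>

template <typename index_t>
void lru_cache_populate(const index_t* indices, int64_t n /* , cache state ... */) {
  // Written once against index_t, so no int <-> long conversion copies are needed.
  for (int64_t i = 0; i < n; ++i) {
    index_t idx = indices[i];
    (void)idx;  // ... look up / insert idx into the cache ...
  }
}

// Runtime dispatch on the indices dtype, similar in spirit to the
// AT_DISPATCH_INDEX_TYPES-style macros used in PyTorch extensions.
void lru_cache_populate_dispatch(bool indices_are_int32, const void* indices, int64_t n) {
  if (indices_are_int32) {
    lru_cache_populate(static_cast<const int32_t*>(indices), n);
  } else {
    lru_cache_populate(static_cast<const int64_t*>(indices), n);
  }
}
```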

Reviewed By: jspark1105

Differential Revision: D33045589

fbshipit-source-id: 4e508a1095536a629bdab8e5577db74310032b23

* Add seq embedding benchmark

Summary: 5x ~ 10x speedup at the benchmark level.

Reviewed By: jspark1105

Differential Revision: D33355933

fbshipit-source-id: 2c609ae9ec5fd4fda48dbafa13b5eb75900fdf5f

* fix warning count check in test_bounds_check (pytorch#839)

Summary:
Pull Request resolved: pytorch#839

On GPU, multiple threads in a thread block can increase the warning count for the same bounds error in the offset array.

Reviewed By: jianyuh

Differential Revision: D33379301

fbshipit-source-id: b00520cc613bb7e15c9f8cd4bdf0c61bd4dbd83b

* fix typo in CMakeLists.txt (pytorch#840)

Summary:
Pull Request resolved: pytorch#840

Fixing a silly typo

Reviewed By: jianyuh

Differential Revision: D33380967

fbshipit-source-id: 8220cc87a2564107cb124d3f9c31b8d92cb7d1a4

* Slight perf optimization for infer TBE (pytorch#843)

Summary:
Pull Request resolved: pytorch#843

~5% perf improvement for INT4 / INT8 inference TBE on A100 GPUs.

Reviewed By: jspark1105

Differential Revision: D33388153

fbshipit-source-id: 63566e3dccd9ce4775abb3374251f9046512e131

* extract embedding input transpose out of embedding_backward_split_template.cu (pytorch#841)

Summary:
Pull Request resolved: pytorch#841

Refactoring to prepare for D33381126.
Other minor changes:
* Remove the unused sorted_linear_indices_run_lengths parameter from the backward kernels

Reviewed By: jianyuh

Differential Revision: D33380032

fbshipit-source-id: b880cc3745a6f6dd63319109e753a470d6c28c49

* increase parallelism in batched unary embeddings backward (pytorch#842)

Summary:
Pull Request resolved: pytorch#842

Sort the indices and have each thread handle indices with the same value (called a run in the code).
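A host-side sketch of the run detection (illustrative only; the real change is a CUDA kernel where each thread then owns one run):

```cpp
// Hedged sketch: compute [start, end) boundaries of runs of equal values in sorted indices.
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::pair<int64_t, int64_t>> find_runs(const std::vector<int64_t>& sorted_indices) {
  std::vector<std::pair<int64_t, int64_t>> runs;
  const int64_t n = static_cast<int64_t>(sorted_indices.size());
  int64_t start = 0;
  for (int64_t i = 1; i <= n; ++i) {
    if (i == n || sorted_indices[i] != sorted_indices[start]) {
      runs.emplace_back(start, i);  // one run = all occurrences of one embedding row
      start = i;
    }
  }
  // In the backward pass, one thread accumulates the gradient for a whole run,
  // so different threads never write the same embedding row.
  return runs;
}
```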

Reviewed By: jianyuh

Differential Revision: D33381126

fbshipit-source-id: aec1c0be619b9072f5a1f9273b66c03e5106ca02

* use DISPATCH_TO_CUDA macro (pytorch#845)

Summary:
Pull Request resolved: pytorch#845

We should use the macro consistently or just drop it.

Reviewed By: jianyuh

Differential Revision: D33392682

fbshipit-source-id: bd99286f55fe2d6e5bab231ec65dae02f16f35c2

* Follow-up comments (pytorch#844)

Summary: Pull Request resolved: pytorch#844

Reviewed By: jspark1105

Differential Revision: D33393019

fbshipit-source-id: 1df7d8457a950a829f7ff2fe6f47595afdc9cc26

* HIP extension support for FBGEMM_GPU (pytorch#846)

Summary: Pull Request resolved: pytorch#846

Reviewed By: jspark1105

Differential Revision: D33231489

fbshipit-source-id: 6bd46ddee45c767ad25c2d52b6c05030bba94082

* correct the max_shared_bytes evaluation logic in embedding_backward_split_template.cu

* IFU from upstream commit c6df576 to main. fbgemm-gpu is built and imported. Tests do NOT pass.

Co-authored-by: Xing Liu <[email protected]>
Co-authored-by: CodemodService FBSourceClangFormatLinterBot <>
Co-authored-by: Rick Weyrauch <[email protected]>
Co-authored-by: Martin Schatz <[email protected]>
Co-authored-by: Jiyuan Zhang <[email protected]>
Co-authored-by: Jongsoo Park <[email protected]>
Co-authored-by: Jason Park <[email protected]>
Co-authored-by: Stephan Uphoff <[email protected]>
Co-authored-by: Jianyu Huang <[email protected]>
Co-authored-by: Shintaro Iwasaki <[email protected]>
Co-authored-by: Shiyan Deng <[email protected]>
Co-authored-by: Summer Deng <[email protected]>
* Added skipIfRocm and TEST_WITH_ROCM in split_table_batched_embeddings_test. Added __any_sync_fbgemm that replaces __any_sync. 26 tests ran in split_table_batched_embeddings_test, 10 skipped.

* Renamed __any_sync_fbgemm to __any_sync and changed its implementation to a more generic one. Added a 'reason' message to skipIfRocm.

* Enabled use_array_for_index_remapping in test_nbit_forward_int and test_nbit_forward_fp. Enabled test_nbit_forward_pruning.

* Deleted the 'assert(false)' statements that are related to the __any_sync function.
…ove @skipIfRocm for TestFused8BitRowwiseQuantizationConversion and TestFusedNBitRowwiseQuantizationConversion
…h_to_new_commit

Pointing hipify_torch to the newer commit.
* An attempt at matching the upstream setup.py.

* Move hipify() to CMakeList.txt.

* Removing hipify from the python script.

* Matching upstream setup.py

* # Removing the unnecessary functions and statements in Hip.cmake. # Reforming some of the compilation option lists in CMakeLists.txt.

* Updating hipify_torch (CMake API)

* # Adding automatic detection of CUDA and ROCm. # Removing the debug code in embedding_backward_code_generator.py. # Adding 'gfx90a' to FBGEMM_ROCM_ARCH. # Minor changes to messages and indentation.
* Enable merge_pooled_embeddings op. in ROCm

* Enabling the merge pool ops.

Co-authored-by: liligwu <[email protected]>
======================================================================
Two test failures:
======================================================================
ERROR: test_generic_histogram_binning_calibration_by_feature (__main__.SparseOpsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "sparse_ops_test.py", line 1500, in test_generic_histogram_binning_calibration_by_feature
    data_type=st.sampled_from([torch.half, torch.float32]),
  File "/opt/conda/lib/python3.7/site-packages/hypothesis/core.py", line 1220, in wrapped_test
    raise the_error_hypothesis_found
  File "sparse_ops_test.py", line 1543, in test_generic_histogram_binning_calibration_by_feature
    bin_ctr_weight_value=0.9995,
RuntimeError: expected scalar type Long but found Int

----------------------------------------------------------------------

FAIL: test_lxu_cache_lookup (__main__.SplitTableBatchedEmbeddingsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "split_table_batched_embeddings_test.py", line 3994, in test_lxu_cache_lookup
    dtype=torch.int,
AssertionError: False is not true

----------------------------------------------------------------------
Ran 35 tests in 759.368s

FAILED (failures=1)
…CM_ARCH. # Enabling building on Pytorch 1.11.

set(cpp_fbgemm_files ${cpp_fbgemm_files_normal} ${cpp_fbgemm_files_avx2}
${cpp_fbgemm_files_avx512})
if(USE_CUDA)
Contributor

Is AVX512 related to CUDA?

Contributor Author

AMD CPUs have no support for AVX512, and the compiler will complain if the compilation flag is added.

Contributor

In my understanding, HIP is for AMD GPUs, so it is irrelevant. For example, I believe we can use AMD GPUs with an Intel Skylake CPU, which has AVX512.

At least, it is not related to CUDA. Can I ask you to fix this properly (e.g., by detecting the availability of AVX-512 instructions)?

(@jianyuh Please teach me how FBGEMM should handle AVX-512.)

Member

This is a good point. For the current FBGEMM CPU build, we also need AVX512 compiler support (https://github.com/pytorch/FBGEMM/blob/main/CMakeLists.txt#L52). There is a recent issue reported on the AMD CPU build in #1094. Ideally we should fix this and give users the flexibility to choose AVX2 vs. AVX512 independently of CUDA/ROCm.
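As a side note, a runtime CPU-feature check of the kind below (a sketch using GCC/Clang builtins, not FBGEMM's actual dispatch, which goes through fbgemmInstructionSet()) could gate the AVX-512 path independently of whether the GPU backend is CUDA or ROCm:

```cpp
// Hedged sketch: choose the AVX-512 path only when the CPU actually supports it.
#include <cstdio>

bool cpu_has_avx512() {
  __builtin_cpu_init();  // GCC/Clang builtin; safe to call before __builtin_cpu_supports
  return __builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw");
}

int main() {
  if (cpu_has_avx512()) {
    std::printf("dispatching to AVX-512 kernels\n");
  } else {
    std::printf("dispatching to AVX2 kernels\n");
  }
  return 0;
}
```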

this_rs_uvm_weights = rs_uvm[2]
assert this_rs_uvm_weights is not None
this_rs_gpu_weights = rs_gpu[2]
assert this_rs_gpu_weights is not None
Contributor

Could you explain why this change is needed for ROCm?

Contributor Author

Could you explain why this change is needed for ROCm?

ROCm is on Python 3.7.

Contributor

Maybe I just don't understand Python, but:

  • Does ROCm depend on a specific Python version? In my understanding, it is irrelevant.
  • Does the original syntax depend on a specific Python version (e.g., only Python 2, or Python 3.8+, or Python <3.5)?
assert (this_rs_uvm_weights := rs_uvm[2]) is not None

Contributor Author (@liligwu, May 10, 2022)

":=" is a Python 3.8 feature, see https://stackoverflow.com/a/26000366
The PyTorch upstream CI jobs for ROCm are executed with Python 3.7, so all our release dockers are on 3.7. However, ROCm does not depend on a specific Python version.

Contributor

Yes, maybe we should change this if FBGEMM_GPU should work with Python < 3.8, but I am not sure about the minimum Python version for FBGEMM_GPU. @jianyuh, do you know the version requirement?

Member

Thanks! If the current change works for 3.7+, it looks good to me. Ideally we should use the simple syntax and avoid using the specific Python 3.8+ features.

@liligwu (Contributor Author) commented May 10, 2022

@jianyuh, @shintaro-iwasaki: we've addressed your comments. Please see the code changes or the replies. Thank you.

@shintaro-iwasaki (Contributor) left a comment

Thank you very much for your cleanup, @liligwu! I added some comments.

struct fbgemm_gpu_enum_tag_##module_name>::registration_list;
struct fbgemm_gpu_enum_tag_##module_name>::registration_list; \
extern template class enum_registration< \
struct fbgemm_gpu_enum_tag_##module_name>;
Contributor

Would you mind explaining why this change is needed for ROCm?

Contributor Author

When using hipcc, the compiler complains: error: explicit specialization of 'registration_list' after instantiation.
Based on https://stackoverflow.com/questions/54483114/specialization-of-a-template-class-static-member-variable-when-also-using-exter, one has to explicitly declare the static class member of the specialized template before the specialized template is instantiated, as in the sketch below.
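A reduced sketch of the pattern (simplified from the registration macro in the header; the surrounding code is illustrative):

```cpp
// Hedged sketch of the "declare the specialized static member before use" workaround.
#include <vector>

template <typename Tag>
class enum_registration {
 public:
  static std::vector<int> registration_list;
};

struct fbgemm_gpu_enum_tag_uvm {};

// Declare the explicit specialization of the static member *before* the template
// is instantiated anywhere; hipcc/Clang otherwise reports
// "explicit specialization of 'registration_list' after instantiation".
template <>
std::vector<int> enum_registration<fbgemm_gpu_enum_tag_uvm>::registration_list;

// Definition with an initializer (e.g. in a single .cpp/.cu file).
template <>
std::vector<int> enum_registration<fbgemm_gpu_enum_tag_uvm>::registration_list = {1, 2, 3};
```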

Member

The change in this file caused an internal issue with an error message like stderr: ld.lld: error: undefined symbol: fbgemm_gpu::enum_registration<fbgemm_gpu::fbgemm_gpu_enum_tag_uvm>::enum_query[abi:cxx11](), so it wasn't merged into the trunk. Feel free to open another PR to incorporate the changes and we can do more testing.

Contributor Author

The change in this file here caused an internal issue with the error message like stderr: ld.lld: error: undefined symbol: fbgemm_gpu::enum_registration<fbgemm_gpu::fbgemm_gpu_enum_tag_uvm>::enum_query[abi:cxx11](), so it wasn't merged into the trunk. Feel free to add another PR to incorporate the changes and we can do more testings.

Hi @jianyuh. Does this error occur when importing fbgemm_gpu or when compiling the project? We don't see this issue on our side. Could you please provide more info so that we can reproduce the error? Thank you.

Contributor

@liligwu Thanks. We looked at this issue. It seems that the difference comes from NVCC vs. AMD Clang. You can reproduce it by running uvm_test.py with CUDA enabled.

https://godbolt.org/z/johx4fEME is a much simplified version of the current code in FBGEMM (the core code is borrowed from https://stackoverflow.com/questions/54483114/specialization-of-a-template-class-static-member-variable-when-also-using-exter). This code is accepted by GCC (and supposedly the GCC-based NVCC) but not by Clang (and supposedly the Clang-based AMD compiler).

I am not sure which compiler's behavior is correct in terms of C/C++, but I'd appreciate it if you could find a solution that works for both platforms.

Contributor Author

@liligwu Thanks. We looked at this issue. It seems that the difference comes from NVCC vs AMD Clang. You can reproduce it by running uvm_test.py with CUDA enabled.

https://godbolt.org/z/johx4fEME is a very simplified current code in FBGEMM (the core code is borrowed from https://stackoverflow.com/questions/54483114/specialization-of-a-template-class-static-member-variable-when-also-using-exter). This code is acceptable for GCC (and supposedly GCC-based NVCC) while not acceptable for Clang (and supposedly Clang-based AMD compiler).

I am not sure which compiler's behavior is correct in terms of C/C++, but I'd appreciate it if you could find a solution that works for both platforms.

We'll resolve this issue in the next PR.

@facebook-github-bot (Contributor)

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@liligwu (Contributor Author) commented May 11, 2022

Hi @shintaro-iwasaki, I've addressed your new comments. Would you please reply to #1102 (comment)? I appreciate your help in improving the code quality.

@@ -1377,6 +1377,7 @@ void compressed_indices_remap(

const inst_set_t isa = fbgemmInstructionSet();
if (isZmm(isa)) {
#ifndef __HIP_PLATFORM_HCC__
Contributor

I believe this change is also related to AVX-512. I might have overlooked the conclusion of the AVX-512 discussion, but I feel this part is irrelevant to "HIP" or "AMD GPUs".

Contributor Author (@liligwu, May 12, 2022)

We cannot enable this code block unless the other AVX512-related problems are resolved (as @jianyuh suggested in #1102 (comment)). Otherwise, an "undefined symbol" error is thrown when Python imports fbgemm_gpu.

@shintaro-iwasaki (Contributor)

@liligwu Thank you for your updates! Since this PR is big, I needed to (and still need to) leave nitpicking comments. Please take a look at them.

@liligwu (Contributor Author) commented May 12, 2022

@shintaro-iwasaki, I addressed yesterday's comments. Please feel free to keep reviewing the code. Thank you.

@facebook-github-bot (Contributor)

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -871,5 +872,39 @@ def test_quantize_and_dequantize_op_cuda_large_nrows_bf16(
torch.testing.assert_allclose(dequantized_data_gpu.cpu(), dequantized_data)


class TestDenseMLPQuantizationConversion(unittest.TestCase):
Member (@jianyuh, May 13, 2022)

Remove this to sync with the trunk? We have open sourced the related op.

Contributor Author (@liligwu, May 13, 2022)

When enabling the test, an error is thrown: RuntimeError: No such operator fb::FloatToMSFPQuantized. Is there a specific version of PyTorch we need to test with, please? @jianyuh

Member

We need to use torch.ops.fbgemm.FloatToMSFPQuantized

Contributor Author

We need to use torch.ops.fbgemm.FloatToMSFPQuantized

Since this PR (#1102) targets merging into an upstream commit from May 4th, and the change to torch.ops.fbgemm.FloatToMSFPQuantized was made on May 6th, it will be removed in the next PR.

… * Fixing the cudaMemoryAdvise mapping in hipify_torch
@facebook-github-bot (Contributor)

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
