
ROCm changes #1102

Closed · wants to merge 53 commits

Commits on Jan 25, 2022

  1. Hipification of fbgemm for AMD GPUs/CPUs (#4)

    * Hipify code
    
    * Add correctness check
    
    * Revert "Add correctness check"
    
    This reverts commit a7f169dcc862e5cc8102a39eb3b7882dfa888f1b.
    
    * Fix setup.py
    
    * Add run_all.sh
    
    * Update Zipf index generation
    
    Update the Zipf index generation to generate unique indices in each bag
    and shuffle indices to avoid spatial locality
    
    Code reference: https://github.com/pytorch/FBGEMM/blob/7588d9d804826b428fc0e4fd418e9cc3f7a72e52/fbgemm_gpu/bench/split_table_batched_embeddings_benchmark.py#L98-L117
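    The referenced generator is Python, but the idea is language-neutral. A minimal C++ sketch of the same scheme (all names invented for illustration): draw roughly Zipf-distributed rows (density ~ 1/x), keep only the unique values within the bag, then shuffle so consecutive lookups lose their spatial locality.

    ```
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <random>
    #include <set>
    #include <vector>

    // Assumes bag_size <= num_rows so the loop terminates.
    std::vector<int64_t> make_bag_indices(
        int64_t num_rows, int64_t bag_size, std::mt19937_64& rng) {
      std::uniform_real_distribution<double> uni(0.0, 1.0);
      std::set<int64_t> unique;
      while (static_cast<int64_t>(unique.size()) < bag_size) {
        // N^u for uniform u is log-uniform on [1, N], i.e. density ~ 1/x:
        // a crude continuous Zipf(s ~= 1) draw.
        const double u = uni(rng);
        const auto idx =
            static_cast<int64_t>(std::pow(static_cast<double>(num_rows), u)) - 1;
        unique.insert(std::clamp<int64_t>(idx, 0, num_rows - 1));
      }
      std::vector<int64_t> bag(unique.begin(), unique.end());
      std::shuffle(bag.begin(), bag.end(), rng); // break spatial locality
      return bag;
    }
    ```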
    
    * Fix ROCm version check in fbgemm_gpu's setup.py
    
    * Fix hipification errors
    
    Modify code to fix hipification errors.  Some ops/kernels, including
    merge_pooled_embeddings, quantize_ops, and the
    embedding_forward_quantized_split ops, are currently disabled; they
    will be enabled in the future.
    
    * Disable AVX512 for AMD CPUs
    
    AMD CPUs do not support AVX512, so it has to be disabled in ROCm builds.
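    A minimal sketch of what this means at the source level (illustrative only; the actual change drops the AVX512 compiler flags in the build, so `__AVX512F__` is simply never defined on ROCm):

    ```
    #include <cstdio>

    void report_isa_path() {
    #if defined(__AVX512F__)
      std::puts("AVX512 path compiled in");
    #else
      std::puts("AVX2/scalar fallback (ROCm and other non-AVX512 builds)");
    #endif
    }

    int main() { report_isa_path(); }
    ```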
    
    * Update run_all.sh
    
    * Fix __launch_bounds__ with kWarpSize.
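    For context, a hedged sketch of the issue (kernel and constants invented): `__launch_bounds__` requires a compile-time constant, and a kWarpSize hard-coded to 32 is wrong on AMD hardware, where a wavefront is 64 lanes wide.

    ```
    #ifdef __HIP_PLATFORM_AMD__
    constexpr int kWarpSize = 64; // AMD wavefront width
    #else
    constexpr int kWarpSize = 32; // NVIDIA warp width
    #endif
    constexpr int kMaxThreads = kWarpSize * 16; // per-block bound, constant at compile time

    // The caller sizes the grid so every index stays in range.
    __global__ __launch_bounds__(kMaxThreads) void scale_kernel(float* x, float s) {
      x[blockIdx.x * blockDim.x + threadIdx.x] *= s;
    }
    ```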
    
    * fix missing '#endif' in codegen/embedding_backward_code_generator.py
    
    * fix the dependencies import in setup.py
    
    * debug enum cudaMemoryAdvise
    
    * work around both cudaMemoryAdvise and cudaMemAdvise being mapped to hipMemAdvise, in cumem_utils.cu
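    A sketch of the clash and one possible workaround (the PR's actual bypass may differ): hipify maps both the enum type `cudaMemoryAdvise` and the function `cudaMemAdvise` to `hipMemAdvise`, so spelling out the HIP form explicitly avoids the double mapping.

    ```
    #include <cstddef>
    #ifdef __HIP_PLATFORM_AMD__
    #include <hip/hip_runtime.h>
    #else
    #include <cuda_runtime.h>
    #endif

    void advise_read_mostly(const void* ptr, size_t bytes, int device) {
    #ifdef __HIP_PLATFORM_AMD__
      // Written out by hand so the tool does not rewrite both the enum
      // and the function to the same identifier.
      hipMemAdvise(ptr, bytes, hipMemAdviseSetReadMostly, device);
    #else
      cudaMemAdvise(ptr, bytes, cudaMemAdviseSetReadMostly, device);
    #endif
    }
    ```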
    
    * Builds and imports successfully, but produces NaN values.
    
    * NaN values are eliminated by bypassing `res.vals[0] = hfma2(...)`
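    An assumed shape of such a bypass (illustrative, not the PR's exact code): unpack the `half2` operands to `float`, fuse there, and repack, sidestepping the `hfma2` path that produced the NaNs.

    ```
    #include <cuda_fp16.h>

    __device__ __half2 fma2_via_float(__half2 a, __half2 b, __half2 c) {
      const float2 fa = __half22float2(a);
      const float2 fb = __half22float2(b);
      const float2 fc = __half22float2(c);
      // Fused multiply-add per lane in fp32, then round back to half2.
      return __floats2half2_rn(fmaf(fa.x, fb.x, fc.x), fmaf(fa.y, fb.y, fc.y));
    }
    ```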
    
    * Remove debug lines in include/fbgemm_gpu/fbgemm_cuda_utils.cuh
    
    Note: The tests of fbgemm-gpu do not pass. They will be addressed in future commits.
    
    Co-authored-by: Sarunya Pumma <[email protected]>
    Co-authored-by: Li Li <[email protected]>
    Co-authored-by: liligwu <[email protected]>
    4 people authored Jan 25, 2022
    SHA: 59267a9

Commits on Jan 26, 2022

  1. Commit a223936
  2. Merge pull request #6 from ROCmSoftwarePlatform/rocm4.3/develop

    ROCm 4.3/develop: use SHEFL_SYNC_MACRO to replace __shfl() and __shfl_sync()
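    The macro name is from the commit; its body is not shown here, so the following definition is an assumption for illustration. CUDA's `__shfl_sync` takes a lane mask while HIP's `__shfl` does not, and a macro papers over the difference:

    ```
    #ifdef __HIP_PLATFORM_AMD__
    #define SHEFL_SYNC_MACRO(val, srcLane) __shfl((val), (srcLane))
    #else
    #define SHEFL_SYNC_MACRO(val, srcLane) __shfl_sync(0xffffffffu, (val), (srcLane))
    #endif

    __global__ void broadcast_lane0(int* out) {
      int v = threadIdx.x;
      out[threadIdx.x] = SHEFL_SYNC_MACRO(v, 0); // every lane reads lane 0's value
    }
    ```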
    liligwu authored Jan 26, 2022
    SHA: 4610075

Commits on Jan 31, 2022

  1. Change the hipify dependency to hipify_torch (#7)

    * Change hipify dependency from torch.utils.torch_hipify to hipify_torch.
    
    * add third_party/hipify_torch to the git repo
    liligwu authored Jan 31, 2022
    SHA: a506c52

Commits on Feb 14, 2022

  1. IFU, merge from upstream commit c6df576 to main. (#8)

    * unify function signature of jagged_xD_to_dense (pytorch#813)
    
    Summary:
    Pull Request resolved: pytorch#813
    
    As title
    
    Reviewed By: jiaqizhai, jianyuh
    
    Differential Revision: D33066551
    
    fbshipit-source-id: 8e2fd3c21f3bde67c6b20045681c2549e3583bd3
    
    * Daily `arc lint --take CLANGFORMAT`
    
    Reviewed By: zertosh
    
    Differential Revision: D33183467
    
    fbshipit-source-id: d7c37f3522a38e85891524c544eab4fdb01270de
    
    * Assert Tensors allocated on GPU. (pytorch#819)
    
    Summary:
    Pull Request resolved: pytorch#819
    
    Check inputs for correctness with respect to GPU allocation and device.
    
    Reviewed By: jspark1105, jianyuh
    
    Differential Revision: D33167469
    
    fbshipit-source-id: 04f638d13bde93373d64cff1428ef743300400a6
    
    * Support batched benchmark execution and fix benchmark stats reporting (pytorch#818)
    
    Summary:
    Pull Request resolved: pytorch#818
    
    As title, support multiple executions of benchmark scripts and report aggregated metrics.
    
    Further, require the `--bag-size` argument to conform to the input data file for proper metric accounting.
    
    Reviewed By: jianyuh
    
    Differential Revision: D33182257
    
    fbshipit-source-id: a6eeeb25646c00665b6d29df9389eddab7618d4e
    
    * Direct Convolution JIT assembly for KH=2, KW=6
    
    Summary:
    This diff adds specialized codegen for the convolution case where KH=2 and KW=6.
    
    ## Performance results on a local devserver with AVX2 instructions
    
    | Convolution shape | FBGEMM baseline | This diff |
    | --- | --- | --- |
    | 1, 16, 16, {2, 126}, 1, {2, 6}, {1, 2}, {0, 0, 0, 0}, {1, 1}, {0, 0}, false | 3.8 GOPS | 9.2 GOPS |
    | 1, 64, 64, {2, 257}, 1, {2, 6}, {1, 2}, {0, 0, 0, 0}, {1, 1}, {0, 0}, false | 43.8 GOPS | 61.2 GOPS |
    
    ## How to invoke the direct convolution function:
    **At offline:**
    1. Weights need to be transposed to (oc/8) - (kh) - (kw) - (ic/4) - 8 - 4
    2. Create the convolution function based on problem size:
    ```
           CodeGenBase<uint8_t, int8_t, int32_t, int32_t> codeObj;
           CodeGenBase<uint8_t, int8_t, int32_t, int32_t>::jit_micro_kernel_fp fn;
           fn = codeObj.getOrCreateDirectConv<inst_set_t::avx2>(
            true, conv_p.OUT_DIM[1], conv_p.IN_DIM[1] * conv_p.IC, conv_p.stride[1] * conv_p.IC);
    ```
    3. Compute the *col_offsets* of the weight tensor
    4. Make sure you have allocated space for the output tensors (Cint32_fb, Cint8_fb) and some temporary space for the input row sums (inSum: IN_DIM[0] x IN_DIM[1], rowSum: OUT_DIM[0] x OUT_DIM[1])
    
    **Online:**
    Make sure we have: conv_p (the problem info), Aint8 (the input tensor), bBuf_tr (the transposed weight tensor), Cint32_fb (the 32-bit results after accumulation), and Cint8_fb (the final quantized 8-bit output).
    
    ```
           // compute direct conv row sum
           directConvRowSum(conv_p, Aint8.data(), inSum, rowSum, row_offsets.data());
    
           // kernel for direct convolution
           for (int oc = 0; oc < conv_p.OC; oc += 8) {
             fn(Aint8.data(),
                bBuf_tr.data() + oc * kernel_dim * conv_p.IC,
                bBuf_tr.data(),
                Cint32_fb.data() + oc,
                conv_p.IC * conv_p.K[1],
                conv_p.OC);
           }
    
           requantizationParams_t<> reqObj = {
               Aint8_zero_point, // Aq_zero_point
               Bint8_zero_point.data(),
               C_zero_point,
               C_multiplier.data(),
               rowSum, // row_offsets
               // row_offsets.data(),
               col_offsets.data(), // col_offsets
               nullptr, // bias
               static_cast<std::uint32_t>(conv_p.OC), // ncols
               1, // groups
               nullptr};
    
           requantizeOutputProcessingAvx2<false, false, QuantizationGranularity::TENSOR, false, false>(
               Cint8_fb.data(),
               Cint32_ref.data(),
               {0, conv_p.OUT_DIM[1] * conv_p.OUT_DIM[0], 0, conv_p.OC},
               conv_p.OC,
               conv_p.OC,
               reqObj);
    ```
    
    For more details please refer to test_asmjit2.cc
    
    Reviewed By: dskhudia
    
    Differential Revision: D31775222
    
    fbshipit-source-id: 294450613b0978277e75d171d6a560124c14ecda
    
    * suppress errors in `deeplearning/fbgemm/fbgemm_gpu`
    
    Differential Revision: D33201593
    
    fbshipit-source-id: 251f338e03dfde1dcc4a83c4ff9df1fe27840bdb
    
    * fix copyright header of batch_benchmark_run.py (pytorch#820)
    
    Summary:
    Pull Request resolved: pytorch#820
    
    As title
    
    Reviewed By: jianyuh
    
    Differential Revision: D33213812
    
    fbshipit-source-id: d901e87ff1047ff969c99a330aa05c8d26e1954e
    
    * Assert Tensors allocated on GPU for generated code. (pytorch#821)
    
    Summary:
    Pull Request resolved: pytorch#821
    
    Check inputs for correctness with respect to GPU allocation and device.
    
    Reviewed By: jspark1105
    
    Differential Revision: D33189944
    
    fbshipit-source-id: 36fb5eac677466e783ef5a754c28b6d838ea09b7
    
    * Move all fbgemm_gpu provided Python ops to fbgemm namespace from fb namespace. (pytorch#823)
    
    Summary: Pull Request resolved: pytorch#823
    
    Reviewed By: jianyuh
    
    Differential Revision: D33147038
    
    fbshipit-source-id: fdcb667dfb920b4f04b7d0b08082afabe7213cc1
    
    * Implement generic HBC by feature. (pytorch#822)
    
    Summary:
    Pull Request resolved: pytorch#822
    
    Implement a generic version of HBC by feature, which takes in bin_boundaries.
    
    Reviewed By: jianyuh
    
    Differential Revision: D33232676
    
    fbshipit-source-id: 99c77f6d081fdc89699948a6c9482b8806f598a3
    
    * Benchmark for newly added generic HBC by feature. (pytorch#826)
    
    Summary:
    Pull Request resolved: pytorch#826
    
    More benchmarking for the new op; also add "double" as a benchmarking type.
    
    Reviewed By: jianyuh
    
    Differential Revision: D33241845
    
    fbshipit-source-id: 38f08f5453fd8d112ff55c046a6ac091c23bc3de
    
    * Always set dontfork on managed Tensor + new uvm clone (pytorch#824)
    
    Summary:
    Pull Request resolved: pytorch#824
    
    Workaround for S256045.
    UVM Tensors are unmapped from the process page table on fork (spawn).
    The UVM fault handler then slows down UVM CPU<->CPU copies substantially while reestablishing those mappings.
    The workaround sets MADV_DONTFORK on the addresses (rounded down to page size) of UVM allocations - this prevents the removal of UVM pages from the original process page table.
    Additionally, this introduces a single-threaded UVM->CPU tensor copy to
    1) avoid 8 trainers on a host concurrently tying up all threads with copy_, and
    2) avoid high concurrency in the fault handler of the UVM kernel driver.
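    A minimal sketch of the madvise step described above (not FBGEMM's exact code): round the allocation down to a page boundary and mark it MADV_DONTFORK so fork() leaves it mapped in this process.

    ```
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdint>

    int dont_fork_uvm(void* ptr, size_t size) {
      const uintptr_t page = static_cast<uintptr_t>(sysconf(_SC_PAGESIZE));
      const uintptr_t begin = reinterpret_cast<uintptr_t>(ptr) & ~(page - 1);
      const uintptr_t end = reinterpret_cast<uintptr_t>(ptr) + size;
      return madvise(reinterpret_cast<void*>(begin), end - begin, MADV_DONTFORK);
    }
    ```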
    
    Reviewed By: jianyuh
    
    Differential Revision: D33192043
    
    fbshipit-source-id: 094f3dcd302d455efbf4e912d58ed28756cb653f
    
    * Use kWarpSize for warp size (pytorch#827)
    
    Summary: Pull Request resolved: pytorch#827
    
    Reviewed By: rweyrauch
    
    Differential Revision: D33271792
    
    fbshipit-source-id: dc66b6950b37e5d92c10406a3891568a7500e26e
    
    * Move fb.embedding_bag_rowwise_prune to fbgemm_gpu OSS. (pytorch#825)
    
    Summary:
    Pull Request resolved: pytorch#825
    
    Move the fb.embedding_bag_rowwise_prune op from caffe2/fb/sparsenn to fbgemm_gpu.
    
    Reviewed By: jianyuh
    
    Differential Revision: D33240318
    
    fbshipit-source-id: 4db93a1ecd9666881779eeada1e3e493aa7525e4
    
    * Allow optional Tensor args to be empty or on GPU. (pytorch#828)
    
    Summary: Pull Request resolved: pytorch#828
    
    Reviewed By: jianyuh
    
    Differential Revision: D33267641
    
    fbshipit-source-id: b193ee5b7e9ea946a20672760c320f29b217b998
    
    * Add output_dtype to training TBE op for CPU (pytorch#829)
    
    Summary:
    Pull Request resolved: pytorch#829
    
    This Diff adds `output_dtype` to `split_embedding_codegen_lookup_{{ optimizer }}_function_cpu()`. Note that the CUDA version (`split_embedding_codegen_lookup_{{ optimizer }}_function()`) already has this argument (D32399931 (pytorch@7e1183c)).
    
    Reviewed By: jianyuh
    
    Differential Revision: D32969921
    
    fbshipit-source-id: 695e54434dc4f65f9f4c60782c60a550e38d97a7
    
    * fix copyright header of tensor_assert_test.cpp (pytorch#831)
    
    Summary:
    Pull Request resolved: pytorch#831
    
    As title
    
    Reviewed By: rweyrauch
    
    Differential Revision: D33310866
    
    fbshipit-source-id: 1cbdee1d7c00f0e900faac570bac330866887b1c
    
    * Add permute_pooled_embedding_modules_test into RE (pytorch#830)
    
    Summary:
    Pull Request resolved: pytorch#830
    
    As title
    
    Reviewed By: rweyrauch
    
    Differential Revision: D33303898
    
    fbshipit-source-id: c94a14bc398ecb58b68ca15d7e79204233ac67d1
    
    * Use all to one op to do DtoD between remote and merge (pytorch#817)
    
    Summary:
    Pull Request resolved: pytorch#817
    
    Previously we were simply calling `Tensor.to` to launch a DtoD copy. Since PyTorch does a two-way barrier for DtoD copies, all of them are serialized even though they are launched from different devices.
    
    See the blue DtoD copies in the graph below.
    {F686842812}
    
    At first I went for merge_pooled_embedding directly, but I forgot that MRS models also have sequence embeddings; covering pooled embeddings is not enough in this case.
    
    This diff introduces a function that takes a tuple of ivalues, moves the underlying tensors to a given target device, and outputs a vector of ivalues whose underlying tensors are on that device.
    
    For each source device, we synchronize its current stream and launch all the copies for tensors on that device. Then we synchronize the current stream on the target device to wait on all the copies.
    
    Now the copies from different devices can run in parallel.
    {F686843333}
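    A much-simplified sketch of that scheme using public ATen calls (illustrative; the real op works on ivalues and drives the per-source-device streams explicitly):

    ```
    #include <ATen/ATen.h>
    #include <c10/cuda/CUDAStream.h>
    #include <vector>

    std::vector<at::Tensor> all_to_one(
        const std::vector<at::Tensor>& src, at::Device target) {
      std::vector<at::Tensor> dst;
      dst.reserve(src.size());
      for (const auto& t : src) {
        // Non-blocking, so copies from different source devices can overlap.
        dst.push_back(t.to(t.options().device(target), /*non_blocking=*/true));
      }
      // Block on the target's current stream before the outputs are consumed.
      c10::cuda::getCurrentCUDAStream(target.index()).synchronize();
      return dst;
    }
    ```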
    
    Reviewed By: yinghai, jianyuh, houseroad
    
    Differential Revision: D33065710
    
    fbshipit-source-id: f479fa2ea20702e14419c8b87024a87d5bbb1a68
    
    * Add MSFP option for ads hpc model numeric emulations (pytorch#832)
    
    Summary:
    Pull Request resolved: pytorch#832
    
    Add fake conversions between MSFP and fp32 in both forward and backward pass of the hpc ads model training.
    
    TODO: Add compute kernels that split the FC operator into gemms for column_blocks of activations and row_blocks of weights
    
    Reviewed By: jspark1105
    
    Differential Revision: D30942234
    
    fbshipit-source-id: 601d671fd00622304a50651dedffd0de3ae01ae0
    
    * Remove benchmark CMakeLists.txt (pytorch#835)
    
    Summary:
    Pull Request resolved: pytorch#835
    
    As title. This file is no longer needed after we decided to support the setup.py-only OSS build approach.
    
    Reviewed By: jspark1105, rweyrauch
    
    Differential Revision: D33318121
    
    fbshipit-source-id: 4f71b23f6e9e7e78d50fab20af53cdf9f63844ad
    
    * Increase code reuse between FP32, FP16, INT8, INT4 embedding types for infer TBE (pytorch#833)
    
    Summary:
    Pull Request resolved: pytorch#833
    
    We merge the implementations for {FP32, FP16, INT8, INT4} weights in inference TBE into one unified template and increase the code reuse between them. This paves the way for future enhancements (no need to change all 4 implementations for one new feature).
    
    Reviewed By: rweyrauch
    
    Differential Revision: D33343450
    
    fbshipit-source-id: 24e59c4a2df5ef3da353535eb879a2365293bc1f
    
    * minimize functions defined in headers (pytorch#836)
    
    Summary:
    Pull Request resolved: pytorch#836
    
    We had a lot of code in headers that didn't need to be there.
    Split long source files.
    Put experimental quantization functions into an experimental namespace.
    
    Reviewed By: rweyrauch
    
    Differential Revision: D33358916
    
    fbshipit-source-id: cffcec344cbe565045ee2c564ce1cef529de4cf8
    
    * add missing C10_CUDA_KERNEL_LAUNCH_CHECK (pytorch#837)
    
    Summary:
    Pull Request resolved: pytorch#837
    
    As title
    
    Reviewed By: rweyrauch
    
    Differential Revision: D33359025
    
    fbshipit-source-id: 162dd2897a5d56e7ac8ff3ba9ae5c8689961204b
    
    * Add seq embedding kernel for infer TBE (pytorch#834)
    
    Summary:
    Pull Request resolved: pytorch#834
    
    - Add sequence embedding support in infer TBE kernel
    
    - TODO: "mask" solution for the duplicated embedding row access. cc jspark1105
    
    Reviewed By: jspark1105
    
    Differential Revision: D33341863
    
    fbshipit-source-id: 47babe921dbaf086e2df92f4693b4718c01bcec1
    
    * add missing new files to CMakeLists.txt (pytorch#838)
    
    Summary:
    Pull Request resolved: pytorch#838
    
    This was missed in D33358916 (pytorch@38a6c35)
    
    Reviewed By: colin2328
    
    Differential Revision: D33370387
    
    fbshipit-source-id: 72007f51afd6757690a1898098e8b6207c3c487b
    
    * Support int32_t indices/offsets for caching handling logics (pytorch#811)
    
    Summary:
    Pull Request resolved: pytorch#811
    
    In training, we assume the indices / offsets are int64_t for embedding (TBE), but in inference, we assume the indices / offsets are int32_t.
    
    This diff enables both int32_t and int64_t support in the caching logic so that we can reuse the same functions for training and inference, avoiding the extra overhead of converting indices/offsets from int to long or vice versa.
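    An assumed shape of that reuse (not the exact FBGEMM code): one templated body dispatched over the index type, so int32 (inference) and int64 (training) share the same logic.

    ```
    #include <ATen/ATen.h>
    #include <ATen/Dispatch.h>

    void cache_lookup(const at::Tensor& indices) {
      AT_DISPATCH_INDEX_TYPES(indices.scalar_type(), "cache_lookup", [&] {
        // index_t is int32_t or int64_t depending on the input tensor.
        const auto* idx = indices.data_ptr<index_t>();
        (void)idx; // ... shared caching logic over idx ...
      });
    }
    ```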
    
    Reviewed By: jspark1105
    
    Differential Revision: D33045589
    
    fbshipit-source-id: 4e508a1095536a629bdab8e5577db74310032b23
    
    * Add seq embedding benchmark
    
    Summary: 5x ~ 10x speedup at the benchmark level.
    
    Reviewed By: jspark1105
    
    Differential Revision: D33355933
    
    fbshipit-source-id: 2c609ae9ec5fd4fda48dbafa13b5eb75900fdf5f
    
    * fix warning count check in test_bounds_check (pytorch#839)
    
    Summary:
    Pull Request resolved: pytorch#839
    
    On the GPU, multiple threads in a thread block can increment the warning count for the same bounds error in the offset array.
    
    Reviewed By: jianyuh
    
    Differential Revision: D33379301
    
    fbshipit-source-id: b00520cc613bb7e15c9f8cd4bdf0c61bd4dbd83b
    
    * fix typo in CMakeLists.txt (pytorch#840)
    
    Summary:
    Pull Request resolved: pytorch#840
    
    Fixing a silly typo
    
    Reviewed By: jianyuh
    
    Differential Revision: D33380967
    
    fbshipit-source-id: 8220cc87a2564107cb124d3f9c31b8d92cb7d1a4
    
    * Slight perf optimization for infer TBE (pytorch#843)
    
    Summary:
    Pull Request resolved: pytorch#843
    
    ~5% perf improvement for INT4 / INT8 inference TBE on A100 GPUs.
    
    Reviewed By: jspark1105
    
    Differential Revision: D33388153
    
    fbshipit-source-id: 63566e3dccd9ce4775abb3374251f9046512e131
    
    * extract embedding input transpose out of embedding_backward_split_template.cu (pytorch#841)
    
    Summary:
    Pull Request resolved: pytorch#841
    
    Refactoring to prepare for D33381126.
    Other minor changes:
    * Remove unused sorted_linear_indices_run_lengths parameter from bwd kernels
    
    Reviewed By: jianyuh
    
    Differential Revision: D33380032
    
    fbshipit-source-id: b880cc3745a6f6dd63319109e753a470d6c28c49
    
    * increase parallelism in batched unary embeddings backward (pytorch#842)
    
    Summary:
    Pull Request resolved: pytorch#842
    
    Sort indices and have each thread handle indices with the same values (called a run in the code)
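    A small host-side illustration of what a run is (helper invented for exposition; the actual change is in the CUDA backward kernels): after sorting, each maximal block of equal indices is one run, and one worker can own a whole run instead of contending on atomics.

    ```
    #include <cstddef>
    #include <vector>

    // e.g. sorted_indices = {3, 3, 3, 7, 9, 9} -> run starts {0, 3, 4}
    std::vector<size_t> run_starts(const std::vector<int>& sorted_indices) {
      std::vector<size_t> starts;
      for (size_t i = 0; i < sorted_indices.size(); ++i) {
        if (i == 0 || sorted_indices[i] != sorted_indices[i - 1]) {
          starts.push_back(i); // a new run begins here
        }
      }
      return starts;
    }
    ```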
    
    Reviewed By: jianyuh
    
    Differential Revision: D33381126
    
    fbshipit-source-id: aec1c0be619b9072f5a1f9273b66c03e5106ca02
    
    * use DISPATCH_TO_CUDA macro (pytorch#845)
    
    Summary:
    Pull Request resolved: pytorch#845
    
    We should use the macro consistently or just drop it.
    
    Reviewed By: jianyuh
    
    Differential Revision: D33392682
    
    fbshipit-source-id: bd99286f55fe2d6e5bab231ec65dae02f16f35c2
    
    * Follow-up comments (pytorch#844)
    
    Summary: Pull Request resolved: pytorch#844
    
    Reviewed By: jspark1105
    
    Differential Revision: D33393019
    
    fbshipit-source-id: 1df7d8457a950a829f7ff2fe6f47595afdc9cc26
    
    * HIP extension support for FBGEMM_GPU (pytorch#846)
    
    Summary: Pull Request resolved: pytorch#846
    
    Reviewed By: jspark1105
    
    Differential Revision: D33231489
    
    fbshipit-source-id: 6bd46ddee45c767ad25c2d52b6c05030bba94082
    
    * correct the max_shared_bytes evaluation logic in embedding_backward_split_template.cu
    
    * IFU from upstream commit c6df576 to main. fbgemm-gpu builds and imports; tests do NOT pass.
    
    Co-authored-by: Xing Liu <[email protected]>
    Co-authored-by: CodemodService FBSourceClangFormatLinterBot <>
    Co-authored-by: Rick Weyrauch <[email protected]>
    Co-authored-by: Martin Schatz <[email protected]>
    Co-authored-by: Jiyuan Zhang <[email protected]>
    Co-authored-by: Jongsoo Park <[email protected]>
    Co-authored-by: Jason Park <[email protected]>
    Co-authored-by: Stephan Uphoff <[email protected]>
    Co-authored-by: Jianyu Huang <[email protected]>
    Co-authored-by: Shintaro Iwasaki <[email protected]>
    Co-authored-by: Shiyan Deng <[email protected]>
    Co-authored-by: Summer Deng <[email protected]>
    12 people authored Feb 14, 2022
    SHA: f596bde

Commits on Mar 2, 2022

  1. Enable split_table_batched_embeddings_test.py (#10)

    * Added skipIfRocm and TEST_WITH_ROCM in split_table_batched_embeddings_test. Added __any_sync_fbgemm to replace __any_sync. 26 tests ran in split_table_batched_embeddings_test, 10 skipped.
    
    * Renamed __any_sync_fbgemm to __any_sync and changed its implementation to a more generic one. Added a 'reason' message to skipIfRocm.
    
    * Enabled use_array_for_index_remapping in test_nbit_forward_int and test_nbit_forward_fp. Enabled test_nbit_forward_pruning.
    
    * Deleted the 'assert(false)' statements related to the __any_sync function.
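    An assumed shape of the shim (the PR's exact __any_sync definition may differ): CUDA's `__any_sync` takes a lane mask, while ROCm votes wavefront-wide via `__any`.

    ```
    #ifdef __HIP_PLATFORM_AMD__
    template <typename Mask>
    __device__ inline int any_sync(Mask /*mask*/, int predicate) {
      return __any(predicate); // wavefront-wide vote on ROCm
    }
    #else
    __device__ inline int any_sync(unsigned mask, int predicate) {
      return __any_sync(mask, predicate);
    }
    #endif
    ```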
    liligwu authored Mar 2, 2022
    SHA: 0cfb792

Commits on Mar 7, 2022

  1. Enable use_cache. Enable split_embedding_inference_converter_test.py by disabling use_cpu.
    liligwu committed Mar 7, 2022
    SHA: f13af44
  2. Skip use_cpu.

    liligwu committed Mar 7, 2022
    SHA: 25e5b71
  3. Commit dcbe19f
  4. Enable quantize_ops_test.py

    liligwu committed Mar 7, 2022
    SHA: fda048e
  5. Commit 00abba1
  6. Commit cf307b6
  7. Remove the condition on use_cache in split_table_batched_embeddings_test.py. Remove @skipIfRocm for TestFused8BitRowwiseQuantizationConversion and TestFusedNBitRowwiseQuantizationConversion.
    liligwu committed Mar 7, 2022
    SHA: 2d66ea8

Commits on Mar 8, 2022

  1. Commit 958679b
  2. Commit e642a48
  3. Commit d0d294a
  4. Commit 146f2df
  5. Commit eb0cf36

Commits on Mar 11, 2022

  1. Removed post_hipify logic in setup.py. Removed two header files that have been deleted upstream.
    liligwu committed Mar 11, 2022
    SHA: 0c86f2b
  2. Merge pull request #16 from ROCmSoftwarePlatform/remove_post_hipify

    Removed post_hipify logic in setup.py
    amathews-amd authored Mar 11, 2022
    SHA: 6e7f13e

Commits on Mar 14, 2022

  1. Commit edd3306
  2. Merge pull request #17 from ROCmSoftwarePlatform/pointing_hipify_torch_to_new_commit
    
    Pointing hipify_torch to the newer commit.
    amathews-amd authored Mar 14, 2022
    SHA: 9a45f4a

Commits on Mar 16, 2022

  1. Commit 309a3a1
  2. Commit 358eaf5
  3. Commit 3a915a8

Commits on Mar 31, 2022

  1. Match upstream setup (#21)

    * An attempt at matching upstream setup.py.
    
    * Move hipify() to CMakeLists.txt.
    
    * Removing hipify from the python script.
    
    * Matching upstream setup.py.
    
    * Removing the unnecessary functions and statements in Hip.cmake; reforming some of the compilation option lists in CMakeLists.txt.
    
    * Updating hipify_torch (CMake API).
    
    * Adding automatic detection of CUDA and ROCm; removing the debug code in embedding_backward_code_generator.py; adding 'gfx90a' to FBGEMM_ROCM_ARCH; minor changes to messages and indentation.
    liligwu authored Mar 31, 2022
    SHA: 40928ba

Commits on Apr 1, 2022

  1. Enable merge_pooled_embeddings op. in ROCm (#15)

    * Enable merge_pooled_embeddings op. in ROCm
    
    * Enabling the merge pool ops.
    
    Co-authored-by: liligwu <[email protected]>
    reza-amd and liligwu authored Apr 1, 2022
    SHA: 69abf78

Commits on Apr 14, 2022

  1. Merge remote-tracking branch 'upstream/main' into IFU-main-2022-04-07

    ======================================================================
    Two test failures:
    ======================================================================
    ERROR: test_generic_histogram_binning_calibration_by_feature (__main__.SparseOpsTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "sparse_ops_test.py", line 1500, in test_generic_histogram_binning_calibration_by_feature
        data_type=st.sampled_from([torch.half, torch.float32]),
      File "/opt/conda/lib/python3.7/site-packages/hypothesis/core.py", line 1220, in wrapped_test
        raise the_error_hypothesis_found
      File "sparse_ops_test.py", line 1543, in test_generic_histogram_binning_calibration_by_feature
        bin_ctr_weight_value=0.9995,
    RuntimeError: expected scalar type Long but found Int
    
    ----------------------------------------------------------------------
    
    FAIL: test_lxu_cache_lookup (__main__.SplitTableBatchedEmbeddingsTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "split_table_batched_embeddings_test.py", line 3994, in test_lxu_cache_lookup
        dtype=torch.int,
    AssertionError: False is not true
    
    ----------------------------------------------------------------------
    Ran 35 tests in 759.368s
    
    FAILED (failures=1)
    liligwu committed Apr 14, 2022
    SHA: 5c0096e
  2. Commit bfac874

Commits on Apr 15, 2022

  1. Enabling the specification of the HIP architecture by using PYTORCH_ROCM_ARCH. Enabling building on PyTorch 1.11.
    liligwu committed Apr 15, 2022
    SHA: 1cf7e84

Commits on Apr 19, 2022

  1. Fixing the unit tests in sparse_ops_test.py. Fixing the path of Atomic.cuh in embedding_backward_template_helpers.cuh.
    liligwu committed Apr 19, 2022
    SHA: 5b33287
  2. Commit 2c514c5

Commits on Apr 20, 2022

  1. Enable use_cpu in the tests.

    liligwu committed Apr 20, 2022
    SHA: 0d5a012
  2. Commit ae14a47
  3. Taking @skipIfRocm back in test_utils.py. Fixing cublasGemmStridedBatchedEx in sparse_ops.cu.
    liligwu committed Apr 20, 2022
    SHA: 1718605
  4. Cleaning up the code.

    liligwu committed Apr 20, 2022
    SHA: bc902a3

Commits on Apr 21, 2022

  1. Merge pull request #24 from ROCmSoftwarePlatform/IFU-main-2022-04-20

    IFU main 2022-04-20, targeting upstream 19c17b3
    pruthvistony authored Apr 21, 2022
    SHA: 0d95948
  2. Enabling cuda (#25)

    * Aligning merge_pooled_embeddings_test.py with upstream and enabling CUDA.
    
    * Disabling use_cpu in split_table_batched_embeddings_test since it's still unstable.
    
    Co-authored-by: root <[email protected]>
    liligwu and root authored Apr 21, 2022
    SHA: 9a5a33b

Commits on Apr 22, 2022

  1. Enabling cuda (#25)

    * Aligning merge_pooled_embeddings_test.py with upstream and enabling CUDA.
    
    * Disabling use_cpu in split_table_batched_embeddings_test since it's still unstable.
    
    Co-authored-by: root <[email protected]>
    liligwu and root committed Apr 22, 2022
    SHA: 6490dbc
  2. Commit 77627ae

Commits on May 2, 2022

  1. Commit 18b48e9

Commits on May 4, 2022

  1. Merge pull request #2 from ROCmSoftwarePlatform/IFU-main-2022-05-02

    IFU main 2022-05-02
    liligwu authored May 4, 2022
    SHA: 99a70e1
  2. Commit fed56ff

Commits on May 5, 2022

  1. Commit 4b39a70
  2. Commit 785afb8

Commits on May 9, 2022

  1. Addressing the comments in the PR review of ROCm changes pytorch#1102. Reorganizing CMakeLists.txt and minimizing the differences from upstream.
    liligwu committed May 9, 2022
    SHA: bbd0ad1
  2. Commit 9db83d8
  3. Commit eabd0a8

Commits on May 10, 2022

  1. Commit 2038008
  2. Removing build script.

    liligwu committed May 10, 2022
    SHA: 0202078

Commits on May 11, 2022

  1. Commit 9cf8856

Commits on May 12, 2022

  1. Commit b885322

Commits on May 13, 2022

  1. Changing the logic for detecting the GPU vendor, making CUDA the default. Fixing the cudaMemoryAdvise mapping in hipify_torch.
    liligwu committed May 13, 2022
    SHA: 0e3dfdb