
set kIsHeavy member variables #1012

Merged
merged 4 commits into NVIDIA:main on Oct 4, 2023
Conversation

FabianSchuetze
Contributor

In response to #1011, this PR sets the kIsHeavy member variables for all activations.

I've set kIsHeavy to true for the following activations:

  • Tanh
  • Sigmoid
  • SiLu
  • HardSwish
  • GeLU

My motivation for setting them to kIsHeavy was that GELU_taylor already had the same value.

For the following activations kIsHeavy is set to false:

  • ReLU
  • LeakyReLU

I've also set kIsHeavy for the backward pass of GELU, although I'm not sure it is needed there.
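
For context, a minimal sketch of the pattern being applied (simplified functor bodies, purely illustrative and not the exact CUTLASS implementations):

// Illustrative only: each activation functor carries a compile-time kIsHeavy
// flag that epilogue code can query. "Heavy" activations (exp-based math such
// as Sigmoid) are flagged true, cheap ones (ReLU) false.
#include <cmath>

template <typename T>
struct SigmoidSketch {
    static const bool kIsHeavy = true;  // exp-based, marked heavy in this PR

    T operator()(T const &x) const { return T(1) / (T(1) + std::exp(-x)); }
};

template <typename T>
struct ReLuSketch {
    static const bool kIsHeavy = false;  // simple max, left non-heavy

    T operator()(T const &x) const { return x < T(0) ? T(0) : x; }
};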

@mnicely
Collaborator

mnicely commented Aug 18, 2023

Closes #1012

@FabianSchuetze
Contributor Author

Is there anything I have to do before a review can commence?

@@ -419,6 +430,7 @@ struct SiLu<Array<T, N>> {
// Reference: https://pytorch.org/docs/stable/generated/torch.nn.Hardswish.html
template <typename T>
struct HardSwish {
static const bool kIsHeavy=true;
Collaborator

Have you tested kIsHeavy=true with HardSwish? It is simple math with some min/max.
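
(For reference, scalar HardSwish really is just an add, a clamp, and a multiply: hardswish(x) = x * clamp(x + 3, 0, 6) / 6. A standalone sketch, not the CUTLASS vectorized code:)

#include <algorithm>
#include <cstdio>

// Scalar reference for HardSwish: x * relu6(x + 3) / 6, i.e. one add,
// one clamp (min/max), and one multiply per element.
float hardswish(float x) {
    float relu6 = std::min(std::max(x + 3.0f, 0.0f), 6.0f);
    return x * relu6 / 6.0f;
}

int main() {
    for (float x : {-4.0f, -1.0f, 0.0f, 1.0f, 4.0f}) {
        std::printf("hardswish(%g) = %g\n", x, hardswish(x));
    }
    return 0;
}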

Contributor Author

Thanks for the comment! Do you mean whether I profiled two GEMMs with HardSwish?

Collaborator

Yes, preferably on an A100 or H100.

Collaborator

Otherwise, we can set it to false for now, so the behavior is the same as before.

Contributor Author

Thanks for the comment!

I ran some tests on an A5000. Surprisingly, I did not see any difference in binary size or any significant difference in execution time. Here are the results:

Setting kIsHeavy=false

root@c555a0fe3233://workspace/cutlass/build# ./examples/test_hardswish/test_hardswish  --m=2048 --n=2048 --k=2048 --verify --iterations=100000
         Problem: 2048-by-2048-by-2048
         Runtime: 0.180821 ms

          GFLOPs: 95010.2  GFLOPs
Memory bandwidth: 43.2056  GiB/s

Passed


Setting kIsHeavy=true

root@c555a0fe3233://workspace/cutlass/build# ./examples/test_hardswish/test_hardswish  --m=2048 --n=2048 --k=2048 --verify --iterations=100000
         Problem: 2048-by-2048-by-2048
         Runtime: 0.18098 ms

          GFLOPs: 94926.8  GFLOPs
Memory bandwidth: 43.1677  GiB/s

Passed

You can see the file I used to generate the results below. Do you see any possible improvements in it? I can try to reproduce the results on an A100, but one is not always available to me.

The results were computed with the following script:

/***************************************************************************************************
 * Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 * SPDX-License-Identifier: BSD-3-Clause
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * 1. Redistributions of source code must retain the above copyright notice,
 * this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright notice,
 * this list of conditions and the following disclaimer in the documentation
 * and/or other materials provided with the distribution.
 *
 * 3. Neither the name of the copyright holder nor the names of its
 * contributors may be used to endorse or promote products derived from
 * this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 *
 **************************************************************************************************/

/**
  Benchmark: GEMM with a HardSwish epilogue, used to compare runtime and
  binary size with kIsHeavy set to true vs. false.
*/

#include <cmath>
#include <iostream>
#include <limits>

#include "cutlass/arch/memory.h"
#include "cutlass/arch/memory_sm75.h"
#include "cutlass/cutlass.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/epilogue/thread/linear_combination_hardswish.h"
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/host/error_metrics.h"
#include "cutlass/util/reference/host/gemm.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_copy.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/tensor_reduce.h"
#include "cutlass/util/tensor_view_io.h"
/////////////////////////////////////////////////////////////////////////////////////////////////

/////////////////////////////////////////////////////////////////////////////////////////////////

enum class Disposition { kPassed, kIncorrect, kNotVerified };

/////////////////////////////////////////////////////////////////////////////////////////////////

// Command line options parsing
struct Options {
    bool help;
    cutlass::gemm::GemmCoord problem_size;
    int iterations;
    unsigned seed;
    float alpha;
    float beta;
    bool verification_enabled;
    float tolerance;

    Options()
        : help(false),
          problem_size({16, 24, 64}),
          iterations(20),
          seed(2022),
          alpha(1),
          beta(0),
          verification_enabled(true),
          tolerance(1e-5f) {}

    bool valid() { return true; }

    // Parses the command line
    void parse(int argc, char const **args) {
        cutlass::CommandLine cmd(argc, args);

        if (cmd.check_cmd_line_flag("help")) {
            help = true;
        }

        cmd.get_cmd_line_argument("m", problem_size.m());
        cmd.get_cmd_line_argument("n", problem_size.n());
        cmd.get_cmd_line_argument("k", problem_size.k());

        cmd.get_cmd_line_argument("alpha", alpha);
        cmd.get_cmd_line_argument("beta", beta);

        cmd.get_cmd_line_argument("iterations", iterations);
        cmd.get_cmd_line_argument("verify", verification_enabled);
        cmd.get_cmd_line_argument("seed", seed);
        cmd.get_cmd_line_argument("tolerance", tolerance);
    }

    /// Prints the usage statement.
    static std::ostream &print_usage(std::ostream &out) {
        out << "test_hardswish example\n\n"
            << "  This example uses the CUTLASS Library to compute GEMM + "
               "HardSwish for arbitrary problem sizes.\n\n"
            << "Options:\n\n"
            << "  --help                      If specified, displays this "
               "usage statement.\n\n"
            << "  --m=<int>                   GEMM M dimension\n"
            << "  --n=<int>                   GEMM N dimension\n"
            << "  --k=<int>                   GEMM K dimension\n"
            << "  --alpha=<f32>               Epilogue scalar alpha\n"
            << "  --beta=<f32>                Epilogue scalar beta\n\n"
            << "  --seed=<int>                Random number seed (1*)\n\n"
            << "  --iterations=<int>          Number of profiling iterations "
               "to perform (0 to disable profiling).\n\n"
            << "  --verify=<bool>             If true, performs reference "
               "calculation.\n\n"
            << "  --tolerance <float>         Error tolerance\n";

        out << "\n\nExamples:\n\n"
            << "$ ./examples/test_hardswish/test_hardswish --m=1024 --n=512 "
               "\\\n"
            << "     --alpha=2 --beta=0.707 \n\n";

        return out;
    }

    /// Returns true if the environment and Toolkit support this
    bool supported(bool verbose = true) const {
        // Ampere Tensor Core operations exposed with mma.sync and ldmatrix are
        // first available in CUDA 11.0.
        //
        // CUTLASS must be compiled with CUDA 11.0 Toolkit to run these
        // examples.
        if (!(__CUDACC_VER_MAJOR__ >= 11)) {
            if (verbose) {
                std::cerr << "Ampere Tensor Core operations must be compiled "
                             "with CUDA 11.0 Toolkit or later."
                          << std::endl;
            }
            return false;
        }

        cudaDeviceProp props;

        cudaError_t error = cudaGetDeviceProperties(&props, 0);
        if (error != cudaSuccess) {
            if (verbose) {
                std::cerr << "cudaGetDeviceProperties() returned an error: "
                          << cudaGetErrorString(error) << std::endl;
            }
            return false;
        }

        if (!((props.major * 10 + props.minor) >= 80)) {
            if (verbose) {
                std::cerr << "Ampere Tensor Core operations must be run on a "
                             "machine with compute capability at least 80."
                          << std::endl;
            }
            return false;
        }

        return true;
    }
};

/////////////////////////////////////////////////////////////////////////////////////////////////

struct Testbed {
    //
    // Type definitions
    //

    using ElementA = cutlass::half_t;
    using ElementB = cutlass::half_t;
    using ElementOutput = cutlass::half_t;
    using ElementCompute = float;
    using ElementAccumulator = float;

    using LayoutA = cutlass::layout::ColumnMajor;
    using LayoutB = cutlass::layout::ColumnMajor;
    using LayoutOutput = cutlass::layout::ColumnMajor;

    using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>;
    using WarpShape = cutlass::gemm::GemmShape<64, 64, 32>;
    using InstructionShape = cutlass::gemm::GemmShape<16, 8, 16>;

    using OperatorClass = cutlass::arch::OpClassTensorOp;
    using ArchTag = cutlass::arch::Sm80;

    using EpilogueFunctorOp =
        cutlass::epilogue::thread::LinearCombinationHardSwish<
            ElementOutput, 128 / cutlass::sizeof_bits<ElementOutput>::value,
            ElementCompute, ElementCompute>;

    using Gemm = cutlass::gemm::device::Gemm<
        ElementA, LayoutA, ElementB, LayoutB, ElementOutput, LayoutOutput,
        ElementAccumulator, OperatorClass, ArchTag, ThreadblockShape, WarpShape,
        InstructionShape, EpilogueFunctorOp>;
    //
    // Data members
    //

    Options const &options;
    cutlass::gemm::GemmCoord problem_size = options.problem_size;
    cutlass::HostTensor<ElementA, LayoutA> tensor_a{
        {problem_size.m(),
         problem_size.k()}};  // <- Create matrix A with dimensions M x K
    cutlass::HostTensor<ElementB, LayoutB> tensor_b{
        {problem_size.k(),
         problem_size.n()}};  // <- Create matrix B with dimensions K x N

    cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_c{
        {problem_size.m(), 1}};  // <- Create matrix C with
                                 // dimensions M x 1
    cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_d{
        {problem_size.m(),
         problem_size.n()}};  // <- Create matrix D with dimensions M x N used
                              // to store output from the CUTLASS kernel
    cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_ref_d{
        {problem_size.m(),
         problem_size.n()}};  // <- Create matrix D with dimensions M x N used
                              // to store output from the reference kernel
    //
    // Methods
    //

    Testbed(Options const &options_) : options(options_) {}

    /// Run
    Disposition run() {
        Disposition disposition = Disposition::kNotVerified;

        //
        // Initialize the workspace
        //

        initialize();

        //
        // Launch device kernel
        //
        cutlass::Status status = cutlass::Status::kSuccess;

        status = execute_device_kernel();

        if (status != cutlass::Status::kSuccess) {
            std::cerr << "Device execution failed." << std::endl;
            return disposition;
        }

        cudaError_t result = cudaDeviceSynchronize();
        if (result != cudaSuccess) {
            std::cerr << "Device synchronize failed with error "
                      << cudaGetErrorString(result) << std::endl;
            return disposition;
        }

        //
        // Verify
        //

        if (options.verification_enabled) {
            bool passed = verify();

            if (passed) {
                disposition = Disposition::kPassed;
            } else {
                disposition = Disposition::kIncorrect;
            }
        }

        //
        // Profiling
        //
        if (options.iterations) {
            profile();
        }

        return disposition;
    }

    /// Random initialization
    void initialize() {
        cutlass::reference::host::TensorFillRandomUniform(
            tensor_a.host_view(), 1, ElementA(4), ElementA(-4),
            0);  // <- Fill matrix A on host with uniform-distribution random
                 // data
        cutlass::reference::host::TensorFillRandomUniform(
            tensor_b.host_view(), 1, ElementB(4), ElementB(-4),
            0);  // <- Fill matrix B on host with uniform-distribution random
                 // data
        cutlass::reference::host::TensorFillRandomUniform(
            tensor_c.host_view(), 1, ElementOutput(4), ElementOutput(-4),
            0);  // <- Fill matrix C on host with uniform-distribution random
                 // data
        cutlass::reference::host::TensorFill(
            tensor_d.host_view());  // <- fill matrix D on host with zeros
        cutlass::reference::host::TensorFill(
            tensor_ref_d.host_view());  // <- fill matrix D for reference on
                                        // host with zeros

        // Copy data from host to GPU
        tensor_a.sync_device();
        tensor_b.sync_device();
        tensor_c.sync_device();
        tensor_d.sync_device();
        tensor_ref_d.sync_device();
    }

    cutlass::Status execute_device_kernel() {
        cutlass::Status status = cutlass::Status::kSuccess;

        auto alpha = ElementCompute(options.alpha);
        auto beta = ElementCompute(options.beta);
        int split_k_slices = 1;

        //
        // Setup arguments
        //
        Gemm::Arguments args{
            problem_size,           // <- problem size of matrix multiplication
            tensor_a.device_ref(),  // <- reference to matrix A on device
            tensor_b.device_ref(),  // <- reference to matrix B on device
            {tensor_c.device_data(), 0},  // <- the C matrix is treated as bias
            tensor_d.device_ref(),        // <- reference to matrix D on device
            {alpha, beta},                // <- alpha and beta epilogue scalars
            split_k_slices};              // <- k-dimension split factor

        //
        // Launch
        //

        Gemm gemm;

        // Initialize
        status = gemm.initialize(args);
        if (status != cutlass::Status::kSuccess) {
            return status;
        }

        // Run
        status = gemm();

        return status;
    }

    /// Verifies that the reference result matches the CUTLASS output
    bool verify() {
        cutlass::reference::device::Gemm<ElementA, LayoutA, ElementB, LayoutB,
                                         ElementOutput, LayoutOutput,
                                         ElementCompute, ElementCompute>
            gemm_device_reference;

        // Launch device reference to compute strictly the product A * B
        gemm_device_reference(options.problem_size, options.alpha,
                              tensor_a.device_ref(), tensor_b.device_ref(), 0,
                              tensor_ref_d.device_ref());

        // Wait for kernels to finish
        cudaDeviceSynchronize();

        // Copy output data from CUTLASS and reference kernel to host for
        // comparison
        tensor_d.sync_host();
        tensor_ref_d.sync_host();

        // Compute bias + HardSwish in host code
        for (int i = 0; i < problem_size.m(); ++i) {
            for (int j = 0; j < problem_size.n(); ++j) {
                float tmp = ElementOutput(tensor_ref_d.at({i, j}) +
                                          tensor_c.at({i, 0}));
                if (tmp < -3) {
                    tensor_ref_d.at({i, j}) = 0;
                } else if (tmp > 3) {
                    tensor_ref_d.at({i, j}) = tmp;
                } else {
                    tensor_ref_d.at({i, j}) = tmp * (tmp + 3) / 6;
                }
            }
        }

        bool equal = cutlass::reference::host::TensorEquals(
            tensor_d.host_view(), tensor_ref_d.host_view());
        return equal;
    }

    /// Profiles
    bool profile() {
        //
        // Profile
        //

        cutlass::Status status = cutlass::Status::kSuccess;
        cudaError_t result;
        cudaEvent_t events[2];
        int const kIterations = options.iterations;

        for (cudaEvent_t &evt : events) {
            result = cudaEventCreate(&evt);
            if (result != cudaSuccess) {
                std::cerr << "cudaEventCreate failed with error "
                          << cudaGetErrorString(result) << std::endl;
                return false;
            }
        }

        result = cudaEventRecord(events[0]);

        if (result != cudaSuccess) {
            std::cerr << "cudaEventRecord() failed with error "
                      << cudaGetErrorString(result) << std::endl;
            return false;
        }

        for (int iter = 0; iter < kIterations; ++iter) {
            status = execute_device_kernel();

            if (status != cutlass::Status::kSuccess) {
                std::cerr << "Device execution failed." << std::endl;
                return false;
            }
        }

        result = cudaEventRecord(events[1]);

        if (result != cudaSuccess) {
            std::cerr << "cudaEventRecord() failed with error "
                      << cudaGetErrorString(result) << std::endl;
            return false;
        }

        result = cudaDeviceSynchronize();

        if (result != cudaSuccess) {
            std::cerr << "cudaDeviceSynchronize() failed with error "
                      << cudaGetErrorString(result) << std::endl;
            return false;
        }

        float elapsed_ms = 0;
        result = cudaEventElapsedTime(&elapsed_ms, events[0], events[1]);

        if (result != cudaSuccess) {
            std::cerr << "cudaEventElapsedTime() failed with error "
                      << cudaGetErrorString(result) << std::endl;
            return false;
        }

        for (cudaEvent_t &evt : events) {
            result = cudaEventDestroy(evt);
            if (result != cudaSuccess) {
                std::cerr << "cudaEventDestroy() failed with error "
                          << cudaGetErrorString(result) << std::endl;
                return false;
            }
        }

        int64_t flops = int64_t(options.problem_size.m()) *
                        options.problem_size.n() * options.problem_size.k() * 2;
        int64_t bytes = (sizeof(ElementOutput)) * options.problem_size.m() *
                        options.problem_size.n();

        double gflops_per_second = double(flops) * kIterations /
                                   double(elapsed_ms / 1000.0f) / double(1.0e9);
        double gbytes_per_second = double(bytes) * kIterations /
                                   double(elapsed_ms / 1000.0f) /
                                   double(1 << 30);

        double elapsed_ms_per_iter = double(elapsed_ms) / kIterations;

        std::cout << "         Problem: " << options.problem_size.m() << "-by-"
                  << options.problem_size.n() << "-by-"
                  << options.problem_size.k() << std::endl;

        std::cout << "         Alpha: " << options.alpha << ", beta "
                  << options.beta << std::endl;

        std::cout << "         Runtime: " << elapsed_ms_per_iter << " ms\n"
                  << std::endl;

        std::cout << "          GFLOPs: " << gflops_per_second << "  GFLOPs"
                  << std::endl;
        std::cout << "Memory bandwidth: " << gbytes_per_second << "  GiB/s"
                  << std::endl;

        return true;
    }
};

/////////////////////////////////////////////////////////////////////////////////////////////////

int main(int argc, const char **argv) {
    // Options parsing
    Options options;
    options.parse(argc, argv);

    if (options.help) {
        options.print_usage(std::cout) << std::endl;
        return 0;
    }

    if (!options.supported()) {
        return 0;
    }

    // Run
    Testbed testbed(options);

    Disposition disposition = testbed.run();

    std::cout << std::endl;

    switch (disposition) {
        case Disposition::kPassed:
            std::cout << "Passed" << std::endl;
            break;
        case Disposition::kIncorrect:
            std::cout << "Incorrect" << std::endl;
            break;
        case Disposition::kNotVerified:
            std::cout << "Not verified" << std::endl;
            break;
    }

    return (disposition == Disposition::kPassed ? 0 : -1);
}

/////////////////////////////////////////////////////////////////////////////////////////////////

Collaborator

Thanks for the experiment. If you use LinearCombinationHardSwish, it lets the user specify whether it is heavy or not (https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/thread/linear_combination_generic.h#L75), rather than querying the activation functors.

Maybe we just set it to false here for now.
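
(Loosely sketched, the design being described here, with hypothetical names and not the actual CUTLASS template signatures: the generic linear-combination epilogue carries its own IsHeavy template parameter chosen by the user, so the activation functor's kIsHeavy member is not what this path reads.)

// Hypothetical sketch, not CUTLASS's real templates: the epilogue exposes
// IsHeavy as its own template parameter, so the user's choice, not the
// activation functor's kIsHeavy member, is what downstream code sees.
template <typename T>
struct HardSwishSketch {
    static const bool kIsHeavy = false;  // the value this PR ends up keeping

    T operator()(T const &x) const {
        T r = x + T(3);
        r = r < T(0) ? T(0) : (r > T(6) ? T(6) : r);
        return x * r / T(6);
    }
};

template <template <typename> class Activation, typename T,
          bool IsHeavy = false>  // user-specified; defaults to false
struct LinearCombinationGenericSketch {
    static const bool kIsHeavy = IsHeavy;  // queried instead of Activation's flag

    T operator()(T const &alpha, T const &accum, T const &beta,
                 T const &source) const {
        Activation<T> act;
        return act(alpha * accum + beta * source);
    }
};

// Example: opting into "heavy" behavior explicitly, regardless of
// HardSwishSketch<T>::kIsHeavy.
using HeavyHardSwishEpilogue =
    LinearCombinationGenericSketch<HardSwishSketch, float, true>;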

Contributor Author

Thanks for the great comment and the explanation! I set kIsHeavy=false for HardSwish, as suggested. Based on your comment, I also did a few more experiments on the A5000 but couldn't notice any substantial differences.

Contributor Author

@hwu36: Thanks for your review so far. I have updated the code based on your comment. Is there anything else I could do?

@hwu36
Collaborator

hwu36 commented Aug 18, 2023

Sorry, I wrote the comment but forgot to release it, and then I went on vacation. Just one simple comment.

@github-actions

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@hwu36 hwu36 merged commit 5f13dca into NVIDIA:main Oct 4, 2023