
set kIsHeavy member variables #1012

Merged
merged 4 commits into NVIDIA:main on Oct 4, 2023
Conversation

FabianSchuetze
Contributor

In response to #1011, this PR sets the kIsHeavy member variables for all activations.

I've set kIsHeavy to true for the following activations:

  • Tanh
  • Sigmoid
  • SiLu
  • HardSwish
  • GeLU

My motivation for setting them to kIsHeavy was that GELU_taylor already had the same value.

For the following activations kIsHeavy is set to false:

  • ReLU
  • LeakyReLU

I've also set kIsHeavy for the backward pass of GELU, although I'm not sure it is needed there.
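
For context, a minimal sketch of the pattern being applied (simplified functor bodies, purely illustrative and not the exact CUTLASS implementations):

// Illustrative only: each activation functor carries a compile-time kIsHeavy
// flag that epilogue code can query. "Heavy" activations (exp-based math such
// as Sigmoid) are flagged true, cheap ones (ReLU) false.
#include <cmath>

template <typename T>
struct SigmoidSketch {
    static const bool kIsHeavy = true;  // exp-based, marked heavy in this PR

    T operator()(T const &x) const { return T(1) / (T(1) + std::exp(-x)); }
};

template <typename T>
struct ReLuSketch {
    static const bool kIsHeavy = false;  // simple max, left non-heavy

    T operator()(T const &x) const { return x < T(0) ? T(0) : x; }
};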

@mnicely
Collaborator

mnicely commented Aug 18, 2023

Closes #1012

@FabianSchuetze
Contributor Author

Is there anything I have to do before a review can commence?

@@ -419,6 +430,7 @@ struct SiLu<Array<T, N>> {
// Reference: https://pytorch.org/docs/stable/generated/torch.nn.Hardswish.html
template <typename T>
struct HardSwish {
static const bool kIsHeavy=true;
Collaborator

Have you tested kIsHeavy=true with HardSwish? It is simple math with some min/max.
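
(For reference, scalar HardSwish really is just an add, a clamp, and a multiply: hardswish(x) = x * clamp(x + 3, 0, 6) / 6. A standalone sketch, not the CUTLASS vectorized code:)

#include <algorithm>
#include <cstdio>

// Scalar reference for HardSwish: x * relu6(x + 3) / 6, i.e. one add,
// one clamp (min/max), and one multiply per element.
float hardswish(float x) {
    float relu6 = std::min(std::max(x + 3.0f, 0.0f), 6.0f);
    return x * relu6 / 6.0f;
}

int main() {
    for (float x : {-4.0f, -1.0f, 0.0f, 1.0f, 4.0f}) {
        std::printf("hardswish(%g) = %g\n", x, hardswish(x));
    }
    return 0;
}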

Contributor Author

Thanks for the comment! Do you mean whether I profiled two GEMMs with HardSwish?

Collaborator

Yes, preferably on an A100 or H100.

Collaborator

Otherwise, we can set it to false for now, so the behavior is the same as before.

Contributor Author

Thanks for the comment!

I ran some tests on an A5000. Surprisingly, I did not see any difference in binary size or any significant difference in execution time. Here are the results:

Setting kIsHeavy=false

root@c555a0fe3233://workspace/cutlass/build# ./examples/test_hardswish/test_hardswish  --m=2048 --n=2048 --k=2048 --verify --iterations=100000
         Problem: 2048-by-2048-by-2048
         Runtime: 0.180821 ms

          GFLOPs: 95010.2  GFLOPs
Memory bandwidth: 43.2056  GiB/s

Passed


Setting kIsHeavy=true

root@c555a0fe3233://workspace/cutlass/build# ./examples/test_hardswish/test_hardswish  --m=2048 --n=2048 --k=2048 --verify --iterations=100000
         Problem: 2048-by-2048-by-2048
         Runtime: 0.18098 ms

          GFLOPs: 94926.8  GFLOPs
Memory bandwidth: 43.1677  GiB/s

Passed

You can see the file I used to generate the results below. Do you see any possible improvements in it? I can try to reproduce the results on an A100, but one is not always available to me.

The results were computed with the following script:

/***************************************************************************************************
 * Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 * SPDX-License-Identifier: BSD-3-Clause
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * 1. Redistributions of source code must retain the above copyright notice,
 * this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright notice,
 * this list of conditions and the following disclaimer in the documentation
 * and/or other materials provided with the distribution.
 *
 * 3. Neither the name of the copyright holder nor the names of its
 * contributors may be used to endorse or promote products derived from
 * this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 *
 **************************************************************************************************/

/**
  Benchmark: GEMM with a HardSwish epilogue, used to compare runtime and
  binary size with kIsHeavy set to true vs. false.
*/

#include <cmath>
#include <iostream>
#include <limits>

#include "cutlass/arch/memory.h"
#include "cutlass/arch/memory_sm75.h"
#include "cutlass/cutlass.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/epilogue/thread/linear_combination_hardswish.h"
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/host/error_metrics.h"
#include "cutlass/util/reference/host/gemm.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_copy.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/tensor_reduce.h"
#include "cutlass/util/tensor_view_io.h"
/////////////////////////////////////////////////////////////////////////////////////////////////

/////////////////////////////////////////////////////////////////////////////////////////////////

enum class Disposition { kPassed, kIncorrect, kNotVerified };

/////////////////////////////////////////////////////////////////////////////////////////////////

// Command line options parsing
struct Options {
    bool help;
    cutlass::gemm::GemmCoord problem_size;
    int iterations;
    unsigned seed;
    float alpha;
    float beta;
    bool verification_enabled;
    float tolerance;

    Options()
        : help(false),
          problem_size({16, 24, 64}),
          iterations(20),
          seed(2022),
          alpha(1),
          beta(0),
          verification_enabled(true),
          tolerance(1e-5f) {}

    bool valid() { return true; }

    // Parses the command line
    void parse(int argc, char const **args) {
        cutlass::CommandLine cmd(argc, args);

        if (cmd.check_cmd_line_flag("help")) {
            help = true;
        }

        cmd.get_cmd_line_argument("m", problem_size.m());
        cmd.get_cmd_line_argument("n", problem_size.n());
        cmd.get_cmd_line_argument("k", problem_size.k());

        cmd.get_cmd_line_argument("alpha", alpha);
        cmd.get_cmd_line_argument("beta", beta);

        cmd.get_cmd_line_argument("iterations", iterations);
        cmd.get_cmd_line_argument("verify", verification_enabled);
        cmd.get_cmd_line_argument("seed", seed);
        cmd.get_cmd_line_argument("tolerance", tolerance);
    }

    /// Prints the usage statement.
    static std::ostream &print_usage(std::ostream &out) {
        out << "test_hardswish example\n\n"
            << "  This example uses the CUTLASS Library to compute GEMM + "
               "HardSwish for arbitrary problem sizes.\n\n"
            << "Options:\n\n"
            << "  --help                      If specified, displays this "
               "usage statement.\n\n"
            << "  --m=<int>                   GEMM M dimension\n"
            << "  --n=<int>                   GEMM N dimension\n"
            << "  --k=<int>                   GEMM K dimension\n"
            << "  --alpha=<f32>               Epilogue scalar alpha\n"
            << "  --beta=<f32>                Epilogue scalar beta\n\n"
            << "  --seed=<int>                Random number seed (1*)\n\n"
            << "  --iterations=<int>          Number of profiling iterations "
               "to perform (0 to disable profiling).\n\n"
            << "  --verify=<bool>             If true, performs reference "
               "calculation.\n\n"
            << "  --tolerance <float>         Error tolerance\n";

        out << "\n\nExamples:\n\n"
            << "$ ./examples/test_hardswish/test_hardswish --m=1024 --n=512 "
               "\\\n"
            << "     --alpha=2 --beta=0.707 \n\n";

        return out;
    }

    /// Returns true if the environment and Toolkit support this
    bool supported(bool verbose = true) const {
        // Ampere Tensor Core operations exposed with mma.sync and ldmatrix are
        // first available in CUDA 11.0.
        //
        // CUTLASS must be compiled with CUDA 11.0 Toolkit to run these
        // examples.
        if (!(__CUDACC_VER_MAJOR__ >= 11)) {
            if (verbose) {
                std::cerr << "Ampere Tensor Core operations must be compiled "
                             "with CUDA 11.0 Toolkit or later."
                          << std::endl;
            }
            return false;
        }

        cudaDeviceProp props;

        cudaError_t error = cudaGetDeviceProperties(&props, 0);
        if (error != cudaSuccess) {
            if (verbose) {
                std::cerr << "cudaGetDeviceProperties() returned an error: "
                          << cudaGetErrorString(error) << std::endl;
            }
            return false;
        }

        if (!((props.major * 10 + props.minor) >= 80)) {
            if (verbose) {
                std::cerr << "Ampere Tensor Core operations must be run on a "
                             "machine with compute capability at least 80."
                          << std::endl;
            }
            return false;
        }

        return true;
    }
};

/////////////////////////////////////////////////////////////////////////////////////////////////

struct Testbed {
    //
    // Type definitions
    //

    using ElementA = cutlass::half_t;
    using ElementB = cutlass::half_t;
    using ElementOutput = cutlass::half_t;
    using ElementCompute = float;
    using ElementAccumulator = float;

    using LayoutA = cutlass::layout::ColumnMajor;
    using LayoutB = cutlass::layout::ColumnMajor;
    using LayoutOutput = cutlass::layout::ColumnMajor;

    using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>;
    using WarpShape = cutlass::gemm::GemmShape<64, 64, 32>;
    using InstructionShape = cutlass::gemm::GemmShape<16, 8, 16>;

    using OperatorClass = cutlass::arch::OpClassTensorOp;
    using ArchTag = cutlass::arch::Sm80;

    using EpilogueFunctorOp =
        cutlass::epilogue::thread::LinearCombinationHardSwish<
            ElementOutput, 128 / cutlass::sizeof_bits<ElementOutput>::value,
            ElementCompute, ElementCompute>;

    using Gemm = cutlass::gemm::device::Gemm<
        ElementA, LayoutA, ElementB, LayoutB, ElementOutput, LayoutOutput,
        ElementAccumulator, OperatorClass, ArchTag, ThreadblockShape, WarpShape,
        InstructionShape, EpilogueFunctorOp>;
    //
    // Data members
    //

    Options const &options;
    cutlass::gemm::GemmCoord problem_size = options.problem_size;
    cutlass::HostTensor<ElementA, LayoutA> tensor_a{
        {problem_size.m(),
         problem_size.k()}};  // <- Create matrix A with dimensions M x K
    cutlass::HostTensor<ElementB, LayoutB> tensor_b{
        {problem_size.k(),
         problem_size.n()}};  // <- Create matrix B with dimensions K x N

    cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_c{
        {problem_size.m(), 1}};  // <- Create matrix C with
                                 // dimensions M x 1
    cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_d{
        {problem_size.m(),
         problem_size.n()}};  // <- Create matrix D with dimensions M x N used
                              // to store output from the CUTLASS kernel
    cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_ref_d{
        {problem_size.m(),
         problem_size.n()}};  // <- Create matrix D with dimensions M x N used
                              // to store output from the reference kernel
    //
    // Methods
    //

    Testbed(Options const &options_) : options(options_) {}

    /// Run
    Disposition run() {
        Disposition disposition = Disposition::kNotVerified;

        //
        // Initialize the workspace
        //

        initialize();

        //
        // Launch device kernel
        //
        cutlass::Status status = cutlass::Status::kSuccess;

        status = execute_device_kernel();

        if (status != cutlass::Status::kSuccess) {
            std::cerr << "Device execution failed." << std::endl;
            return disposition;
        }

        cudaError_t result = cudaDeviceSynchronize();
        if (result != cudaSuccess) {
            std::cerr << "Device synchronize failed with error "
                      << cudaGetErrorString(result) << std::endl;
            return disposition;
        }

        //
        // Verify
        //

        if (options.verification_enabled) {
            bool passed = verify();

            if (passed) {
                disposition = Disposition::kPassed;
            } else {
                disposition = Disposition::kIncorrect;
            }
        }

        //
        // Profiling
        //
        if (options.iterations) {
            profile();
        }

        return disposition;
    }

    /// Random initialization
    void initialize() {
        cutlass::reference::host::TensorFillRandomUniform(
            tensor_a.host_view(), 1, ElementA(4), ElementA(-4),
            0);  // <- Fill matrix A on host with uniform-distribution random
                 // data
        cutlass::reference::host::TensorFillRandomUniform(
            tensor_b.host_view(), 1, ElementB(4), ElementB(-4),
            0);  // <- Fill matrix B on host with uniform-distribution random
                 // data
        cutlass::reference::host::TensorFillRandomUniform(
            tensor_c.host_view(), 1, ElementOutput(4), ElementOutput(-4),
            0);  // <- Fill matrix C on host with uniform-distribution random
                 // data
        cutlass::reference::host::TensorFill(
            tensor_d.host_view());  // <- fill matrix D on host with zeros
        cutlass::reference::host::TensorFill(
            tensor_ref_d.host_view());  // <- fill matrix D for reference on
                                        // host with zeros

        // Copy data from host to GPU
        tensor_a.sync_device();
        tensor_b.sync_device();
        tensor_c.sync_device();
        tensor_d.sync_device();
        tensor_ref_d.sync_device();
    }

    cutlass::Status execute_device_kernel() {
        cutlass::Status status = cutlass::Status::kSuccess;

        auto alpha = ElementCompute(options.alpha);
        auto beta = ElementCompute(options.beta);
        int split_k_slices = 1;

        //
        // Setup arguments
        //
        Gemm::Arguments args{
            problem_size,           // <- problem size of matrix multiplication
            tensor_a.device_ref(),  // <- reference to matrix A on device
            tensor_b.device_ref(),  // <- reference to matrix B on device
            {tensor_c.device_data(), 0},  // <- the C matrix is treated as bias
            tensor_d.device_ref(),        // <- reference to matrix D on device
            {alpha, beta},                // <- alpha and beta epilogue scalars
            split_k_slices};              // <- k-dimension split factor

        //
        // Launch
        //

        Gemm gemm;

        // Initialize
        status = gemm.initialize(args);
        if (status != cutlass::Status::kSuccess) {
            return status;
        }

        // Run
        status = gemm();

        return status;
    }

    /// Verifies that the reference result matches the CUTLASS output
    bool verify() {
        cutlass::reference::device::Gemm<ElementA, LayoutA, ElementB, LayoutB,
                                         ElementOutput, LayoutOutput,
                                         ElementCompute, ElementCompute>
            gemm_device_reference;

        // Launch device reference to compute strictly the product A * B
        gemm_device_reference(options.problem_size, options.alpha,
                              tensor_a.device_ref(), tensor_b.device_ref(), 0,
                              tensor_ref_d.device_ref());

        // Wait for kernels to finish
        cudaDeviceSynchronize();

        // Copy output data from CUTLASS and reference kernel to host for
        // comparison
        tensor_d.sync_host();
        tensor_ref_d.sync_host();

        // Compute bias + HardSwish in host code
        for (int i = 0; i < problem_size.m(); ++i) {
            for (int j = 0; j < problem_size.n(); ++j) {
                float tmp = ElementOutput(tensor_ref_d.at({i, j}) +
                                          tensor_c.at({i, 0}));
                if (tmp < -3) {
                    tensor_ref_d.at({i, j}) = 0;
                } else if (tmp > 3) {
                    tensor_ref_d.at({i, j}) = tmp;
                } else {
                    tensor_ref_d.at({i, j}) = tmp * (tmp + 3) / 6;
                }
            }
        }

        bool equal = cutlass::reference::host::TensorEquals(
            tensor_d.host_view(), tensor_ref_d.host_view());
        return equal;
    }

    /// Profiles
    bool profile() {
        //
        // Profile
        //

        cutlass::Status status = cutlass::Status::kSuccess;
        cudaError_t result;
        cudaEvent_t events[2];
        int const kIterations = options.iterations;

        for (cudaEvent_t &evt : events) {
            result = cudaEventCreate(&evt);
            if (result != cudaSuccess) {
                std::cerr << "cudaEventCreate failed with error "
                          << cudaGetErrorString(result) << std::endl;
                return false;
            }
        }

        result = cudaEventRecord(events[0]);

        if (result != cudaSuccess) {
            std::cerr << "cudaEventRecord() failed with error "
                      << cudaGetErrorString(result) << std::endl;
            return false;
        }

        for (int iter = 0; iter < kIterations; ++iter) {
            status = execute_device_kernel();

            if (status != cutlass::Status::kSuccess) {
                std::cerr << "Device execution failed." << std::endl;
                return false;
            }
        }

        result = cudaEventRecord(events[1]);

        if (result != cudaSuccess) {
            std::cerr << "cudaEventRecord() failed with error "
                      << cudaGetErrorString(result) << std::endl;
            return false;
        }

        result = cudaDeviceSynchronize();

        if (result != cudaSuccess) {
            std::cerr << "cudaDeviceSynchronize() failed with error "
                      << cudaGetErrorString(result) << std::endl;
            return false;
        }

        float elapsed_ms = 0;
        result = cudaEventElapsedTime(&elapsed_ms, events[0], events[1]);

        if (result != cudaSuccess) {
            std::cerr << "cudaEventElapsedTime() failed with error "
                      << cudaGetErrorString(result) << std::endl;
            return false;
        }

        for (cudaEvent_t &evt : events) {
            result = cudaEventDestroy(evt);
            if (result != cudaSuccess) {
                std::cerr << "cudaEventDestroy() failed with error "
                          << cudaGetErrorString(result) << std::endl;
                return false;
            }
        }

        int64_t flops = int64_t(options.problem_size.m()) *
                        options.problem_size.n() * options.problem_size.k() * 2;
        int64_t bytes = (sizeof(ElementOutput)) * options.problem_size.m() *
                        options.problem_size.n();

        double gflops_per_second = double(flops) * kIterations /
                                   double(elapsed_ms / 1000.0f) / double(1.0e9);
        double gbytes_per_second = double(bytes) * kIterations /
                                   double(elapsed_ms / 1000.0f) /
                                   double(1 << 30);

        double elapsed_ms_per_iter = double(elapsed_ms) / kIterations;

        std::cout << "         Problem: " << options.problem_size.m() << "-by-"
                  << options.problem_size.n() << "-by-"
                  << options.problem_size.k() << std::endl;

        std::cout << "         Alpha: " << options.alpha << ", beta "
                  << options.beta << std::endl;

        std::cout << "         Runtime: " << elapsed_ms_per_iter << " ms\n"
                  << std::endl;

        std::cout << "          GFLOPs: " << gflops_per_second << "  GFLOPs"
                  << std::endl;
        std::cout << "Memory bandwidth: " << gbytes_per_second << "  GiB/s"
                  << std::endl;

        return true;
    }
};

/////////////////////////////////////////////////////////////////////////////////////////////////

int main(int argc, const char **argv) {
    // Options parsing
    Options options;
    options.parse(argc, argv);

    if (options.help) {
        options.print_usage(std::cout) << std::endl;
        return 0;
    }

    if (!options.supported()) {
        return 0;
    }

    // Run
    Testbed testbed(options);

    Disposition disposition = testbed.run();

    std::cout << std::endl;

    switch (disposition) {
        case Disposition::kPassed:
            std::cout << "Passed" << std::endl;
            break;
        case Disposition::kIncorrect:
            std::cout << "Incorrect" << std::endl;
            break;
        case Disposition::kNotVerified:
            std::cout << "Not verified" << std::endl;
            break;
    }

    return (disposition == Disposition::kPassed ? 0 : -1);
}

/////////////////////////////////////////////////////////////////////////////////////////////////

Collaborator

Thanks for the experiment. If you use LinearCombinationHardSwish, it lets the user specify whether it is heavy or not (https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/thread/linear_combination_generic.h#L75), rather than querying the activation functors.

Maybe we just set it to false here for now.
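
(Loosely sketched, the design being described here, with hypothetical names and not the actual CUTLASS template signatures: the generic linear-combination epilogue carries its own IsHeavy template parameter chosen by the user, so the activation functor's kIsHeavy member is not what this path reads.)

// Hypothetical sketch, not CUTLASS's real templates: the epilogue exposes
// IsHeavy as its own template parameter, so the user's choice, not the
// activation functor's kIsHeavy member, is what downstream code sees.
template <typename T>
struct HardSwishSketch {
    static const bool kIsHeavy = false;  // the value this PR ends up keeping

    T operator()(T const &x) const {
        T r = x + T(3);
        r = r < T(0) ? T(0) : (r > T(6) ? T(6) : r);
        return x * r / T(6);
    }
};

template <template <typename> class Activation, typename T,
          bool IsHeavy = false>  // user-specified; defaults to false
struct LinearCombinationGenericSketch {
    static const bool kIsHeavy = IsHeavy;  // queried instead of Activation's flag

    T operator()(T const &alpha, T const &accum, T const &beta,
                 T const &source) const {
        Activation<T> act;
        return act(alpha * accum + beta * source);
    }
};

// Example: opting into "heavy" behavior explicitly, regardless of
// HardSwishSketch<T>::kIsHeavy.
using HeavyHardSwishEpilogue =
    LinearCombinationGenericSketch<HardSwishSketch, float, true>;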

Contributor Author

Thanks for the great comment and the explanation! I set kIsHeavy=false for HardSwish, as suggested. Based on your comment, I also did a few more experiments on the A5000 but couldn't notice any substantial differences.

Contributor Author

@hwu36: Thanks for your review so far. I have updated the code based on your comment. Is there anything else I could do?

@hwu36
Collaborator

hwu36 commented Aug 18, 2023

Sorry, I wrote the comment but forgot to release it, and then I went on vacation. Just one simple comment.

@github-actions

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@hwu36 hwu36 merged commit 5f13dca into NVIDIA:main Oct 4, 2023