set kIsHeavy member variables #1012
Conversation
Closes #1012
Is there anything I have to do before a review can commence?
@@ -419,6 +430,7 @@ struct SiLu<Array<T, N>> {
// Reference: https://pytorch.org/docs/stable/generated/torch.nn.Hardswish.html
template <typename T>
struct HardSwish {
static const bool kIsHeavy=true;
Have you tested kIsHeavy=true with HardSwish? It is simple math with some min/max.
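For reference, the math in question is hard_swish(x) = x * relu6(x + 3) / 6. A minimal standalone sketch of it (illustration only, not the CUTLASS functor, which lives in include/cutlass/epilogue/thread/activation.h):

#include <algorithm>

// Reference hard-swish: x * relu6(x + 3) / 6, i.e. a shift, a clamp,
// a multiply and a divide per element.
template <typename T>
T hard_swish_reference(T x) {
  T clamped = std::min(std::max(x + T(3), T(0)), T(6));  // relu6(x + 3)
  return x * clamped / T(6);
}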
Thanks for the comment! Do you mean whether I profiled two GEMMs with HardSwish?
Yes, preferably on an A100 or H100.
Otherwise, we can set it to false first, so the behavior is the same as before.
Thanks for the comment!
I ran some tests on an A5000. Surprisingly, I did not see any difference in binary size or any significant difference in execution time. Here are the results:
Setting kIsHeavy=false:
root@c555a0fe3233://workspace/cutlass/build# ./examples/test_hardswish/test_hardswish --m=2048 --n=2048 --k=2048 --verify --iterations=100000
Problem: 2048-by-2048-by-2048
Runtime: 0.180821 ms
GFLOPs: 95010.2 GFLOPs
Memory bandwidth: 43.2056 GiB/s
Passed
Setting kIsHeavy=true:
root@c555a0fe3233://workspace/cutlass/build# ./examples/test_hardswish/test_hardswish --m=2048 --n=2048 --k=2048 --verify --iterations=100000
Problem: 2048-by-2048-by-2048
Runtime: 0.18098 ms
GFLOPs: 94926.8 GFLOPs
Memory bandwidth: 43.1677 GiB/s
Passed
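As a sanity check on those numbers: a 2048 x 2048 x 2048 GEMM is 2 * 2048^3 ≈ 17.18 GFLOP of work, and 17.18 GFLOP / 0.1808 ms ≈ 95.0 TFLOP/s, which matches the ~95,000 GFLOPs reported above, so the two runs really are within noise of each other.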
You can see the file with which I generated the results below. Do you see any possible improvements in it? I can try to reproduce the results on an A100, but one is not always available to me.
The results were computed with the following script:
/***************************************************************************************************
* Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice,
* this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/**
* Standalone test that profiles a GEMM with a HardSwish epilogue, used for the
* kIsHeavy experiments discussed in this PR.
*/
#include <cmath>
#include <iostream>
#include <limits>
#include "cutlass/arch/memory.h"
#include "cutlass/arch/memory_sm75.h"
#include "cutlass/cutlass.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/epilogue/thread/linear_combination_hardswish.h"
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/gemm/gemm.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/util/command_line.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_fill.h"
#include "cutlass/util/reference/host/error_metrics.h"
#include "cutlass/util/reference/host/gemm.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_copy.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/tensor_reduce.h"
#include "cutlass/util/tensor_view_io.h"
/////////////////////////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////////////////////////
enum class Disposition { kPassed, kIncorrect, kNotVerified };
/////////////////////////////////////////////////////////////////////////////////////////////////
// Command line options parsing
struct Options {
bool help;
cutlass::gemm::GemmCoord problem_size;
int iterations;
unsigned seed;
float alpha;
float beta;
bool verification_enabled;
float tolerance;
Options()
: help(false),
problem_size({16, 24, 64}),
iterations(20),
seed(2022),
alpha(1),
beta(0),
verification_enabled(true),
tolerance(1e-5f) {}
bool valid() { return true; }
// Parses the command line
void parse(int argc, char const **args) {
cutlass::CommandLine cmd(argc, args);
if (cmd.check_cmd_line_flag("help")) {
help = true;
}
cmd.get_cmd_line_argument("m", problem_size.m());
cmd.get_cmd_line_argument("n", problem_size.n());
cmd.get_cmd_line_argument("k", problem_size.k());
cmd.get_cmd_line_argument("beta", beta);
cmd.get_cmd_line_argument("iterations", iterations);
cmd.get_cmd_line_argument("verify", verification_enabled);
cmd.get_cmd_line_argument("seed", seed);
cmd.get_cmd_line_argument("tolerance", tolerance);
}
/// Prints the usage statement.
static std::ostream &print_usage(std::ostream &out) {
out << "35_gemm_softmax example\n\n"
<< " This example uses the CUTLASS Library to compute GEMM + "
"Softmax for arbitrary problem sizes.\n\n"
<< "Options:\n\n"
<< " --help If specified, displays this "
"usage statement.\n\n"
<< " --m=<int> GEMM M dimension\n"
<< " --n=<int> GEMM N dimension\n"
<< " --k=<int> GEMM K dimension\n"
<< " --alpha=<f32> Epilogue scalar alpha\n"
<< " --beta=<f32> Epilogue scalar beta\n\n"
<< " --seed=<int> Random number seed (1*)\n\n"
<< " --iterations=<int> Number of profiling iterations "
"to perform (0 to disable profiling).\n\n"
<< " --verify=<bool> If true, performs reference "
"calculation.\n\n"
<< " --tolerance <float> Error tolerance\n";
out << "\n\nExamples:\n\n"
<< "$ ./examples/35_gemm_softmax/35_gemm_softmax --m=1024 --n=512 "
"\\\n"
<< " --alpha=2 --beta=0.707 \n\n";
return out;
}
/// Returns true if the environment and Toolkit support this
bool supported(bool verbose = true) const {
// Ampere Tensor Core operations exposed with mma.sync and ldmatrix are
// first available in CUDA 11.0.
//
// CUTLASS must be compiled with CUDA 11.0 Toolkit to run these
// examples.
if (!(__CUDACC_VER_MAJOR__ >= 11)) {
if (verbose) {
std::cerr << "Ampere Tensor Core operations must be compiled "
"with CUDA 11.0 Toolkit or later."
<< std::endl;
}
return false;
}
cudaDeviceProp props;
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (error != cudaSuccess) {
if (verbose) {
std::cerr << "cudaGetDeviceProperties() returned an error: "
<< cudaGetErrorString(error) << std::endl;
}
return false;
}
if (!((props.major * 10 + props.minor) >= 80)) {
if (verbose) {
std::cerr << "Ampere Tensor Core operations must be run on a "
"machine with compute capability at least 80."
<< std::endl;
}
return false;
}
return true;
}
};
/////////////////////////////////////////////////////////////////////////////////////////////////
struct Testbed {
//
// Type definitions
//
using ElementA = cutlass::half_t;
using ElementB = cutlass::half_t;
using ElementOutput = cutlass::half_t;
using ElementCompute = float;
using ElementAccumulator = float;
using LayoutA = cutlass::layout::ColumnMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutOutput = cutlass::layout::ColumnMajor;
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>;
using WarpShape = cutlass::gemm::GemmShape<64, 64, 32>;
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 16>;
using OperatorClass = cutlass::arch::OpClassTensorOp;
using ArchTag = cutlass::arch::Sm80;
using EpilogueFunctorOp =
cutlass::epilogue::thread::LinearCombinationHardSwish<
ElementOutput, 128 / cutlass::sizeof_bits<ElementOutput>::value,
ElementCompute, ElementCompute>;
using Gemm = cutlass::gemm::device::Gemm<
ElementA, LayoutA, ElementB, LayoutB, ElementOutput, LayoutOutput,
ElementAccumulator, OperatorClass, ArchTag, ThreadblockShape, WarpShape,
InstructionShape, EpilogueFunctorOp>;
//
// Data members
//
Options const &options;
cutlass::gemm::GemmCoord problem_size = options.problem_size;
cutlass::HostTensor<ElementA, LayoutA> tensor_a{
{problem_size.m(),
problem_size.k()}}; // <- Create matrix A with dimensions M x K
cutlass::HostTensor<ElementB, LayoutB> tensor_b{
{problem_size.k(),
problem_size.n()}}; // <- Create matrix B with dimensions K x N
cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_c{
{problem_size.m(), 1}}; // <- Create matrix C with
// dimensions M x 1
cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_d{
{problem_size.m(),
problem_size.n()}}; // <- Create matrix D with dimensions M x N, used to
// store the output of the CUTLASS kernel
cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_ref_d{
{problem_size.m(),
problem_size.n()}}; // <- Create matrix D with dimensions M x N, used to
// store the output of the reference kernel
//
// Methods
//
Testbed(Options const &options_) : options(options_) {}
/// Run
Disposition run() {
Disposition disposition = Disposition::kNotVerified;
//
// Initialize the workspace
//
initialize();
//
// Launch device kernel
//
cutlass::Status status = cutlass::Status::kSuccess;
status = execute_device_kernel();
if (status != cutlass::Status::kSuccess) {
std::cerr << "Device execution failed." << std::endl;
return disposition;
}
cudaError_t result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "Device synchronize failed with error "
<< cudaGetErrorString(result) << std::endl;
return disposition;
}
//
// Verify
//
if (options.verification_enabled) {
bool passed = verify();
if (passed) {
disposition = Disposition::kPassed;
} else {
disposition = Disposition::kIncorrect;
}
}
//
// Profiling
//
if (options.iterations) {
profile();
}
return disposition;
}
/// Random initialization
void initialize() {
cutlass::reference::host::TensorFillRandomUniform(
tensor_a.host_view(), 1, ElementA(4), ElementA(-4),
0); // <- Fill matrix A on host with uniform-distribution random
// data
cutlass::reference::host::TensorFillRandomUniform(
tensor_b.host_view(), 1, ElementB(4), ElementB(-4),
0); // <- Fill matrix B on host with uniform-distribution random
// data
cutlass::reference::host::TensorFillRandomUniform(
tensor_c.host_view(), 1, ElementOutput(4), ElementOutput(-4),
0); // <- Fill matrix C on host with uniform-distribution random
// data
cutlass::reference::host::TensorFill(
tensor_d.host_view()); // <- fill matrix D on host with zeros
cutlass::reference::host::TensorFill(
tensor_ref_d.host_view()); // <- fill matrix D for reference on
// host with zeros
// Copy data from host to GPU
tensor_a.sync_device();
tensor_b.sync_device();
tensor_c.sync_device();
tensor_d.sync_device();
tensor_ref_d.sync_device();
}
cutlass::Status execute_device_kernel() {
cutlass::Status status = cutlass::Status::kSuccess;
auto alpha = ElementCompute(options.alpha);
auto beta = ElementCompute(options.beta);
int split_k_slices = 1;
//
// Setup arguments
//
Gemm::Arguments args{
problem_size, // <- problem size of matrix multiplication
tensor_a.device_ref(), // <- reference to matrix A on device
tensor_b.device_ref(), // <- reference to matrix B on device
{tensor_c.device_data(), 0}, // <- the C matrix is treated as bias
tensor_d.device_ref(), // <- reference to matrix D on device
{alpha, beta}, // <- alpha and beta epilogue scaling factors
split_k_slices}; // <- k-dimension split factor
//
// Launch
//
Gemm gemm;
// Initialize
status = gemm.initialize(args);
if (status != cutlass::Status::kSuccess) {
return status;
}
// Run
status = gemm();
return status;
}
/// Verifies the CUTLASS result matches the host-computed reference
bool verify() {
cutlass::reference::device::Gemm<ElementA, LayoutA, ElementB, LayoutB,
ElementOutput, LayoutOutput,
ElementCompute, ElementCompute>
gemm_device_reference;
// Launch device reference to compute strictly the product A * B
gemm_device_reference(options.problem_size, options.alpha,
tensor_a.device_ref(), tensor_b.device_ref(), 0,
tensor_ref_d.device_ref());
// Wait for kernels to finish
cudaDeviceSynchronize();
// Copy output data from CUTLASS and reference kernel to host for
// comparison
tensor_d.sync_host();
tensor_ref_d.sync_host();
// Compute bias + HardSwish in host code
for (int i = 0; i < problem_size.m(); ++i) {
for (int j = 0; j < problem_size.n(); ++j) {
float tmp = ElementOutput(tensor_ref_d.at({i, j}) +
tensor_c.at({i, 0}));
if (tmp < -3) {
tensor_ref_d.at({i, j}) = 0;
} else if (tmp > 3) {
tensor_ref_d.at({i, j}) = tmp;
} else {
tensor_ref_d.at({i, j}) = tmp * (tmp + 3) / 6;
}
}
}
bool equal = cutlass::reference::host::TensorEquals(
tensor_d.host_view(), tensor_ref_d.host_view());
return equal;
}
/// Profiles
bool profile() {
//
// Profile
//
cutlass::Status status = cutlass::Status::kSuccess;
cudaError_t result;
cudaEvent_t events[2];
int const kIterations = options.iterations;
for (cudaEvent_t &evt : events) {
result = cudaEventCreate(&evt);
if (result != cudaSuccess) {
std::cerr << "cudaEventCreate failed with error "
<< cudaGetErrorString(result) << std::endl;
return false;
}
}
result = cudaEventRecord(events[0]);
if (result != cudaSuccess) {
std::cerr << "cudaEventRecord() failed with error "
<< cudaGetErrorString(result) << std::endl;
return false;
}
for (int iter = 0; iter < kIterations; ++iter) {
status = execute_device_kernel();
if (status != cutlass::Status::kSuccess) {
std::cerr << "Device execution failed." << std::endl;
return false;
}
}
result = cudaEventRecord(events[1]);
if (result != cudaSuccess) {
std::cerr << "cudaEventRecord() failed with error "
<< cudaGetErrorString(result) << std::endl;
return false;
}
result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
std::cerr << "cudaDeviceSynchronize() failed with error "
<< cudaGetErrorString(result) << std::endl;
return false;
}
float elapsed_ms = 0;
result = cudaEventElapsedTime(&elapsed_ms, events[0], events[1]);
if (result != cudaSuccess) {
std::cerr << "cudaEventElapsedTime() failed with error "
<< cudaGetErrorString(result) << std::endl;
return false;
}
for (cudaEvent_t &evt : events) {
result = cudaEventDestroy(evt);
if (result != cudaSuccess) {
std::cerr << "cudaEventDestroy() failed with error "
<< cudaGetErrorString(result) << std::endl;
return false;
}
}
int64_t flops = int64_t(options.problem_size.m()) *
options.problem_size.n() * options.problem_size.k() * 2;
int64_t bytes = (sizeof(ElementOutput)) * options.problem_size.m() *
options.problem_size.n();
double gflops_per_second = double(flops) * kIterations /
double(elapsed_ms / 1000.0f) / double(1.0e9);
double gbytes_per_second = double(bytes) * kIterations /
double(elapsed_ms / 1000.0f) /
double(1 << 30);
double elapsed_ms_per_iter = double(elapsed_ms) / kIterations;
std::cout << " Problem: " << options.problem_size.m() << "-by-"
<< options.problem_size.n() << "-by-"
<< options.problem_size.k() << std::endl;
std::cout << " Alpha: " << options.alpha << ", beta "
<< options.beta << std::endl;
std::cout << " Runtime: " << elapsed_ms_per_iter << " ms\n"
<< std::endl;
std::cout << " GFLOPs: " << gflops_per_second << " GFLOPs"
<< std::endl;
std::cout << "Memory bandwidth: " << gbytes_per_second << " GiB/s"
<< std::endl;
return true;
}
};
/////////////////////////////////////////////////////////////////////////////////////////////////
int main(int argc, const char **argv) {
// Options parsing
Options options;
options.parse(argc, argv);
if (options.help) {
options.print_usage(std::cout) << std::endl;
return 0;
}
if (!options.supported()) {
return 0;
}
// Run
Testbed testbed(options);
Disposition disposition = testbed.run();
std::cout << std::endl;
switch (disposition) {
case Disposition::kPassed:
std::cout << "Passed" << std::endl;
break;
case Disposition::kIncorrect:
std::cout << "Incorrect" << std::endl;
break;
case Disposition::kNotVerified:
std::cout << "Not verified" << std::endl;
break;
}
return (disposition == Disposition::kPassed ? 0 : -1);
}
/////////////////////////////////////////////////////////////////////////////////////////////////
Thanks for the experiment. If you use LinearCombinationHardSwish, it lets the user specify whether it is heavy or not (https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/thread/linear_combination_generic.h#L75), rather than querying the activation functors.
Maybe we just set it to false here for now.
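For illustration, this is roughly how the epilogue in the test above could request the light (or heavy) path explicitly through the generic epilogue, instead of relying on the functor's kIsHeavy member. This is a hedged sketch: the exact template parameter list and ordering are defined in the linked linear_combination_generic.h and may differ between CUTLASS versions.

// Sketch only: LinearCombinationGeneric takes the activation as a
// template-template parameter and exposes a trailing IsHeavy flag.
// (Requires #include "cutlass/epilogue/thread/linear_combination_generic.h".)
using EpilogueFunctorOp = cutlass::epilogue::thread::LinearCombinationGeneric<
    cutlass::epilogue::thread::HardSwish,              // activation functor
    ElementOutput,                                     // output element type
    128 / cutlass::sizeof_bits<ElementOutput>::value,  // elements per vectorized access
    ElementCompute,                                    // accumulator element type
    ElementCompute,                                    // compute element type
    cutlass::epilogue::thread::ScaleType::Default,     // scaling mode
    cutlass::FloatRoundStyle::round_to_nearest,        // rounding style
    false>;                                            // IsHeavy: keep the lightweight path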
Thanks for the great comment and the explanation! I set kIsHeavy=false for HardSwish, as suggested. Based on your comment, I also did a few more experiments on the A5000 but couldn't notice any substantial differences.
@hwu36: Thanks for your review so far. I have updated the code based on your comment. Is there anything else I could do?
Sorry, I wrote the comment but forgot to release it, and then I went on vacation. Just one simple comment.
This PR has been labeled
In response to #1011, this PR sets the kIsHeavy member variables for all activations.
I've set kIsHeavy to true for the following activations:
My motivation for setting them to true was that GELU_taylor had the same value.
For the following activations, kIsHeavy is set to false:
I've also set kIsHeavy for the backward pass of GELU, although I'm not sure this is needed here.
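A simplified sketch of the resulting pattern (the real functors are in include/cutlass/epilogue/thread/activation.h and the full per-activation values are in the merged diff; the is_heavy_activation helper below is hypothetical, shown only to illustrate how the flag can be queried at compile time):

// Bodies elided; only the kIsHeavy members are shown.
template <typename T>
struct GELU_taylor {
  static const bool kIsHeavy = true;   // pre-existing value the new flags were modeled on
  T operator()(T const &x) const;      // elided
};

template <typename T>
struct HardSwish {
  static const bool kIsHeavy = false;  // lightweight min/max math, per the review above
  T operator()(T const &x) const;      // elided
};

// Hypothetical compile-time query a kernel or epilogue could use:
template <template <typename> class Activation, typename T>
constexpr bool is_heavy_activation() {
  return Activation<T>::kIsHeavy;
}

static_assert(!is_heavy_activation<HardSwish, float>(),
              "HardSwish takes the lightweight epilogue path");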