Support FP8 scale calculation with scalar and cleanup #2593

Summary:
Follow-up to D57263833: support FP8 scale calculation with a scalar, and merge the two FP8 tensorwise GEMMs into one.

Note that besides `Sm90ScalarBroadcast` in CUTLASS, the AMD CK f8f8bf16 GEMM also requires scales to be passed as a scalar rather than as a scalar tensor, so this support is required on both the NV and AMD sides.

Differential Revision: D57367680
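For illustration only, here is a minimal pure-Python sketch of what a tensorwise scale passed as a scalar could look like. It is not the code in this diff: the helper names (`tensorwise_scale`, `scale_to_fp8_range`, `scaled_dot`) are hypothetical, the scale convention (`amax / FP8_MAX`) is an assumption, and no real FP8 rounding is performed. The only point is that the combined scale reaches the GEMM epilogue as a plain float rather than as a one-element tensor.

```python
# Hypothetical sketch: tensorwise FP8 scaling with scalar scales.
# None of these names are FBGEMM, CUTLASS, or CK APIs.

FP8_E4M3_MAX = 448.0  # largest finite value of the e4m3fn FP8 format


def tensorwise_scale(values, eps=1e-12):
    """Return one scale for the whole tensor as a plain Python float."""
    amax = max((abs(v) for v in values), default=0.0)
    return max(amax, eps) / FP8_E4M3_MAX


def scale_to_fp8_range(values, scale):
    """Divide by the scale and clamp to the FP8 range (no rounding modeled)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def scaled_dot(xq, wq, alpha):
    """Stand-in for a GEMM whose epilogue multiplies by a scalar alpha."""
    return alpha * sum(a * b for a, b in zip(xq, wq))


x = [0.5, -2.0, 1.25]
w = [4.0, 0.25, -8.0]

x_scale = tensorwise_scale(x)  # scalar, not a tensor
w_scale = tensorwise_scale(w)  # scalar, not a tensor

xq = scale_to_fp8_range(x, x_scale)
wq = scale_to_fp8_range(w, w_scale)

# The product of the two scalars is handed to the epilogue as one float.
out = scaled_dot(xq, wq, x_scale * w_scale)
```

Because the scales are ordinary floats, the same value can be forwarded to either backend's scalar epilogue argument without first wrapping it in a tensor.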