Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Update cmake to only build for present GPU #2955

Merged
merged 9 commits into from
Oct 22, 2019

Conversation

trxcllnt
Copy link
Contributor

@trxcllnt trxcllnt commented Oct 3, 2019

This PR partially addresses #2902, adding the same gpu-architecture detection logic as cuML.

I expect CI scripts will need to include -DGPU_ARCHS=ALL in the cmake args, but will need help to find them all cc: @kkraus14.

@trxcllnt trxcllnt requested a review from a team as a code owner October 3, 2019 18:40
@trxcllnt trxcllnt requested review from jrhemstad and mtjrider October 3, 2019 18:41
}
")
execute_process(
COMMAND ${CMAKE_CUDA_COMPILER}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only change from the cuML version -- CUDA_NVCC_EXECUTABLE doesn't seem to be defined in the cudf scripts, but CMAKE_CUDA_COMPILER is. Any ideas why that is @mt-jones?

@OlivierNV
Copy link
Contributor

This is not directly related, but for a develop build, it might be useful to also include the verbose ptxas output option, so we can at least have an idea of register/smem usage of various kernels.

@harrism
Copy link
Member

harrism commented Oct 14, 2019

This is not directly related, but for a develop build, it might be useful to also include the verbose ptxas output option, so we can at least have an idea of register/smem usage of various kernels.

I think that should be a separate option. Otherwise when debugging build warnings / failures there's more text to scroll through.

@trxcllnt
Copy link
Contributor Author

AFAIK the last thing to do for this (aside from changelog) is update any release scripts that build the conda packages to ensure we build all the GPU architectures in the public releases -- @kkraus14 or @mike-wendt do either of you have a moment to point me to the files that'd need the new -DGPU_ARCHS=ALL flag?

@harrism
Copy link
Member

harrism commented Oct 15, 2019

@trxcllnt if you make the default to build all architectures as it always has been then I think you will risk less surprise for non-developer users who have to build from source, and you won't have to change scripts for CI.

Make this an opt-in flag for developers.

@trxcllnt
Copy link
Contributor Author

@harrism good point, that seems reasonable. I'll make the flag default to "ALL", so the cmake options are still consistent with cuML's cmake script.

@trxcllnt
Copy link
Contributor Author

I should mention local builds using the new architecture detection logic show warnings like this:

ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00003017_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored

ptxas warning : Value of threads per SM for entry gpuInitDictionaryIndices is out of range. .minnctapersm will be ignored

Is this an issue that needs to be addressed before merging, or is it safe to ignore these?

@harrism
Copy link
Member

harrism commented Oct 15, 2019

Yeah I am not comfortable without understanding the cause of those warnings. Can you tell me more? What code is causing the warnings?

@trxcllnt
Copy link
Contributor Author

Looks like the kernel launch parameters are out of bounds for my local arch/gpu?

dim3 dim_block(512, 1); // 512 threads per chunk
dim3 dim_grid(num_columns, num_rowgroups);
gpuInitDictionaryIndices <<< dim_grid, dim_block, 0, stream >>>(chunks, num_columns);

@OlivierNV
Copy link
Contributor

OlivierNV commented Oct 15, 2019

Looks like the kernel launch parameters are out of bounds for my local arch/gpu?

Hmm The intent here is to limit register count to 40, and shared mem usage is 28KB. If compiling for a GPU with only 64K of shared mem (which I'm assuming is the case here), it could result in this warning. If the result is minctapersm being ignored, it sounds like it's safe to ignore it. If on the other hand we end up with ignoring maxregcount and compiling the kernel with more than 40 registers, that would result in a near-2x perf drop, so I'd like to avoid that, but I'm not aware of another way to limit register count without also creating a minctapersm.

@OlivierNV
Copy link
Contributor

Actually, using {512, 2} also has the same intended effect of limiting regcount to 40 (at least on TU102), while just specifying {512} results in 43 registers (bad). It just looks like ptxas works a little bit harder if there is a maxregcount limit, regardless of the limit.

@trxcllnt trxcllnt requested a review from a team as a code owner October 15, 2019 18:34
@trxcllnt trxcllnt requested a review from a team October 15, 2019 18:34
@trxcllnt trxcllnt requested a review from a team as a code owner October 15, 2019 18:34
@codecov
Copy link

codecov bot commented Oct 15, 2019

Codecov Report

Merging #2955 into branch-0.11 will increase coverage by 0.04%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.11    #2955      +/-   ##
===============================================
+ Coverage         86.9%   86.95%   +0.04%     
===============================================
  Files               49       49              
  Lines             9187     9211      +24     
===============================================
+ Hits              7984     8009      +25     
+ Misses            1203     1202       -1
Impacted Files Coverage Δ
python/cudf/cudf/core/dataframe.py 93.8% <0%> (+0.02%) ⬆️
python/cudf/cudf/core/multiindex.py 88.2% <0%> (+0.06%) ⬆️
python/cudf/cudf/core/series.py 93.83% <0%> (+0.18%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae1cc16...8513cfc. Read the comment docs.

@trxcllnt trxcllnt force-pushed the cmake-detect-gpu-arch/2902 branch from d75361e to 3cc2315 Compare October 15, 2019 23:44
@trxcllnt trxcllnt changed the title [WIP] Update cmake to only build for present GPU [REVIEW] Update cmake to only build for present GPU Oct 16, 2019
@harrism
Copy link
Member

harrism commented Oct 16, 2019

@OlivierNV are you going to make that launch_bounds change?

@OlivierNV
Copy link
Contributor

@OlivierNV are you going to make that launch_bounds change?

Is there a way to add a file to this PR from github ? (I tried to, but it weirdly created a new branch)

@harrism
Copy link
Member

harrism commented Oct 16, 2019

Not that I know of. I still have to figure out how to fix mine anyway.

@harrism
Copy link
Member

harrism commented Oct 16, 2019

@trxcllnt what GPU arch are you compiling for that is exposing this?

@OlivierNV
Copy link
Contributor

OlivierNV commented Oct 16, 2019

@trxcllnt Not sure about 'ordered_scatter_to_tables', but for gpuInitDictionaryIndices, the {512, 3} can simply be changed to {512, 2} in cpp/src/io/orc/dict_enc.cu (though it might resurface on Tegra, I hear there are SKUs with only 32K shared mem, but as long as nobody exceeds the threads/block limit, it sounds like it's not fatal)

@cwharris
Copy link
Contributor

cwharris commented Oct 16, 2019

@OlivierNV The only way to modify a PR is to modify it's branch, which is under trxcllnt's fork. That would require adding his repo as a remote, fetching that branch, making changes, a commit, pushing those changes to your own fork, and then creating a separate PR against trxcllnt's branch (assuming you don't have push access, which you won't, since it's not your fork).

That's just for informational purposes. You might ask trxcllnt what he'd prefer.

@trxcllnt
Copy link
Contributor Author

@harrism I saw these compiling on the RTX 8000s in my workstation. There are more warnings than those I listed, but elided them for brevity. I'll recompile again here in a bit and paste the full list.

@OlivierNV the easiest way is to send a PR to this branch on my fork.

@trxcllnt
Copy link
Contributor Author

trxcllnt commented Oct 16, 2019

@harrism @OlivierNV ok, here's the full set of warnings I'm seeing on the RTX 8000 (after pulling the latest changes from @OlivierNV). I think these are all caused by this same line in src/stream_compaction/copy_if.cuh:

constexpr int block_size = 256;

Click to expand warnings list
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEENS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEENS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype11EEENS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype11EEENS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype10EEENS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype10EEENS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype9EEENS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype9EEENS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype8EEENS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype8EEENS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIaL9gdf_dtype7EEENS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIN4cudf6detail7wrapperIaL9gdf_dtype7EEENS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS7_PKjSA_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIdNS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIdNS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIfNS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIfNS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIlNS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIlNS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIiNS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIiNS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIsNS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIsNS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIaNS_18valid_table_filterELi256ELb0EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN69_GLOBAL__N__45_tmpxft_0000296a_00000000_6_drop_nulls_cpp1_ii_9cbced0314scatter_kernelIaNS_18valid_table_filterELi256ELb1EEEvPT_PjPiPKS2_PKjS5_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype11EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype11EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype10EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype10EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype9EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype9EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype8EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype8EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIaL9gdf_dtype7EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIaL9gdf_dtype7EEENS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIdNS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIdNS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIfNS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIfNS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIlNS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIlNS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIiNS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIiNS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIsNS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIsNS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIaNS_19boolean_mask_filterILb1ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIaNS_19boolean_mask_filterILb1ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype11EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype11EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype10EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype10EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype9EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype9EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype8EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype8EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIaL9gdf_dtype7EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIaL9gdf_dtype7EEENS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIdNS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIdNS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIfNS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIfNS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIlNS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIlNS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIiNS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIiNS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIsNS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIsNS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIaNS_19boolean_mask_filterILb1ELb0EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIaNS_19boolean_mask_filterILb1ELb0EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype11EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype11EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype10EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype10EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype9EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype9EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype8EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype8EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIaL9gdf_dtype7EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIN4cudf6detail7wrapperIaL9gdf_dtype7EEENS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS8_PKjSB_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIdNS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIdNS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIfNS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIfNS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIlNS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIlNS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIiNS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIiNS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIsNS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIsNS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIaNS_19boolean_mask_filterILb0ELb1EEELi256ELb0EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN77_GLOBAL__N__53_tmpxft_0000295d_00000000_6_apply_boolean_mask_cpp1_ii_a5002f4814scatter_kernelIaNS_19boolean_mask_filterILb0ELb1EEELi256ELb1EEEvPT_PjPiPKS3_PKjS6_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype13EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype11EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype11EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype10EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype10EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype9EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIlL9gdf_dtype9EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype8EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIiL9gdf_dtype8EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIaL9gdf_dtype7EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIN4cudf6detail7wrapperIaL9gdf_dtype7EEEZNS2_25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKSC_PKjSF_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIdZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIdZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIfZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIfZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIlZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIlZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIiZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIiZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIsZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIsZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIaZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb0EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored
ptxas warning : Value of threads per SM for entry _ZN66_GLOBAL__N__42_tmpxft_00002fbf_00000000_6_scatter_cpp1_ii_d72f380914scatter_kernelIaZN4cudf6detail25ordered_scatter_to_tablesERKNS1_5tableEPKiiEUliE_Li256ELb1EEEvPT_PjPiPKS9_PKjSC_iiT0_ is out of range. .minnctapersm will be ignored

@OlivierNV
Copy link
Contributor

OlivierNV commented Oct 16, 2019

I think these are all caused by this same line in src/stream_compaction/copy_if.cuh:

I think it's really the assumption in this line:

__launch_bounds__(block_size, 2048/block_size)

The 2048 in (2048/block_size) assumes that 2048/block_size is the max number of threads per SM that can be resident. However, if there is any shared memory use (which can presumably be hidden in cub or thrust templates), it might be less than that. Or the arch itself could somehow be limited to 32 warps/SM ~ ideally we could query this in the pre-processor like the SM arch, but not sure there is anything like that. Changing this to 1024 / block_size would eliminate the warning, but could also negatively impact perf ~ on the other hand, blindly attempting to coerce ptxas into reaching max occupancy can be detrimental to perf, in which case removing the 2nd part of the __launch_bounds might very well improve perf (minctapersm is generally a case-by-case thing in my experience, and nobody is even looking at occupancy/register usage in cudf, so the ",2048/block_size" seems sketchy to me).

@trxcllnt
Copy link
Contributor Author

trxcllnt commented Oct 16, 2019

@OlivierNV Changing this to 1024 / block_size would eliminate the warning, but could also negatively impact perf ~ on the other hand, blindly attempting to coerce ptxas into reaching max occupancy can be detrimental to perf, in which case removing the 2nd part of the __launch_bounds might very well improve perf

I'll run the scatter benchmark with 2048 and 1024 numerators for block count and see what the diff is.

If it's trivial or better perf in 1024, is that likely to translate to better perf on other architectures too? Alternatively, if 2048 is better, are we OK living with the .minnctapersm warning?

@jrhemstad
Copy link
Contributor

I think these are all caused by this same line in src/stream_compaction/copy_if.cuh:

I think it's really the assumption in this line:

__launch_bounds__(block_size, 2048/block_size)

The 2048 in (2048/block_size) assumes that 2048/block_size is the max number of threads per SM that can be resident. However, if there is any shared memory use (which can presumably be hidden in cub or thrust templates), it might be less than that. Or the arch itself could somehow be limited to 32 warps/SM ~ ideally we could query this in the pre-processor like the SM arch, but not sure there is anything like that. Changing this to 1024 / block_size would eliminate the warning, but could also negatively impact perf ~ on the other hand, blindly attempting to coerce ptxas into reaching max occupancy can be detrimental to perf, in which case removing the 2nd part of the __launch_bounds might very well improve perf (minctapersm is generally a case-by-case thing in my experience, and nobody is even looking at occupancy/register usage in cudf, so the ",2048/block_size" seems sketchy to me).

@harrism this is your code, so you may want to take a look.

@trxcllnt
Copy link
Contributor Author

Here are the results I'm seeing. I'm not familiar enough with the expected perf to know whether the variance in the numbers are conclusively better or worse:

gbenchmarks/SCATTER_BENCH with block_count=2048/block_size
2019-10-17 01:05:27
Running gbenchmarks/SCATTER_BENCH
Run on (24 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x12)
  L1 Instruction 32K (x12)
  L2 Unified 1024K (x12)
  L3 Unified 19712K (x2)
Load Average: 1.10, 1.01, 1.04
-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
benchmark/double_coalesce_x/1024/1/manual_time          49987 ns        70968 ns        13066 bytes_per_second=312.582M/s
benchmark/double_coalesce_x/2048/1/manual_time          50374 ns        71290 ns        12867 bytes_per_second=620.364M/s
benchmark/double_coalesce_x/4096/1/manual_time          50208 ns        71018 ns        12861 bytes_per_second=1.21564G/s
benchmark/double_coalesce_x/8192/1/manual_time          50237 ns        70939 ns        12883 bytes_per_second=2.42989G/s
benchmark/double_coalesce_x/16384/1/manual_time         50795 ns        71165 ns        12894 bytes_per_second=4.80642G/s
benchmark/double_coalesce_x/32768/1/manual_time         51119 ns        71126 ns        12768 bytes_per_second=9.55178G/s
benchmark/double_coalesce_x/65536/1/manual_time         52797 ns        71297 ns        11372 bytes_per_second=18.4966G/s
benchmark/double_coalesce_x/131072/1/manual_time        53584 ns        71587 ns        12213 bytes_per_second=36.4498G/s
benchmark/double_coalesce_x/262144/1/manual_time        64171 ns        82393 ns        10831 bytes_per_second=60.8721G/s
benchmark/double_coalesce_x/524288/1/manual_time       102433 ns       218136 ns         7973 bytes_per_second=76.2694G/s
benchmark/double_coalesce_x/1048576/1/manual_time      294898 ns       305420 ns         4965 bytes_per_second=52.9843G/s
benchmark/double_coalesce_x/2097152/1/manual_time      412884 ns       415107 ns         2814 bytes_per_second=75.6872G/s
benchmark/double_coalesce_x/4194304/1/manual_time      531340 ns       535555 ns         1000 bytes_per_second=117.627G/s
benchmark/double_coalesce_x/8388608/1/manual_time     1191458 ns      1205765 ns          668 bytes_per_second=104.913G/s
benchmark/double_coalesce_x/16777216/1/manual_time    2191371 ns      2181062 ns          349 bytes_per_second=114.084G/s
Segmentation fault (core dumped)
gbenchmarks/SCATTER_BENCH with block_count=1024/block_size
2019-10-17 01:08:17
Running gbenchmarks/SCATTER_BENCH
Run on (24 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x12)
  L1 Instruction 32K (x12)
  L2 Unified 1024K (x12)
  L3 Unified 19712K (x2)
Load Average: 0.59, 0.89, 1.01
-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
benchmark/double_coalesce_x/1024/1/manual_time          46689 ns        66974 ns        14927 bytes_per_second=334.664M/s
benchmark/double_coalesce_x/2048/1/manual_time          46796 ns        67194 ns        14810 bytes_per_second=667.799M/s
benchmark/double_coalesce_x/4096/1/manual_time          46482 ns        66652 ns        13968 bytes_per_second=1.31309G/s
benchmark/double_coalesce_x/8192/1/manual_time          47821 ns        67955 ns        13820 bytes_per_second=2.55267G/s
benchmark/double_coalesce_x/16384/1/manual_time         47626 ns        67416 ns        13630 bytes_per_second=5.12617G/s
benchmark/double_coalesce_x/32768/1/manual_time         48125 ns        67645 ns        13653 bytes_per_second=10.1461G/s
benchmark/double_coalesce_x/65536/1/manual_time         49762 ns        67667 ns        13100 bytes_per_second=19.6245G/s
benchmark/double_coalesce_x/131072/1/manual_time        50832 ns        68280 ns        13037 bytes_per_second=38.4231G/s
benchmark/double_coalesce_x/262144/1/manual_time        64527 ns        82484 ns        10903 bytes_per_second=60.5371G/s
benchmark/double_coalesce_x/524288/1/manual_time       100105 ns       204639 ns         7657 bytes_per_second=78.0434G/s
benchmark/double_coalesce_x/1048576/1/manual_time      296899 ns       308353 ns         4881 bytes_per_second=52.6274G/s
benchmark/double_coalesce_x/2097152/1/manual_time      414626 ns       418521 ns         2793 bytes_per_second=75.3691G/s
benchmark/double_coalesce_x/4194304/1/manual_time      540594 ns       543664 ns         1000 bytes_per_second=115.613G/s
benchmark/double_coalesce_x/8388608/1/manual_time     1169875 ns      1186101 ns          632 bytes_per_second=106.849G/s
benchmark/double_coalesce_x/16777216/1/manual_time    2199527 ns      2193205 ns          350 bytes_per_second=113.661G/s
Segmentation fault (core dumped)

The benchmark eventually segfaults attempting to allocate all the memory on the device. It starts by allocating 1/2 up front, then grows in spikes from there.

@OlivierNV
Copy link
Contributor

If it's trivial or better perf in 1024, is that likely to translate to better perf on other architectures too? Alternatively, if 2048 is better, are we OK living with the .minnctapersm warning?

I would also run the benchmark removing the ",X/block_size", eg: launch_bounds(block_size) only, and let the compiler decide register allocation.

@trxcllnt
Copy link
Contributor Author

gbenchmarks/SCATTER_BENCH no block_count
2019-10-17 17:17:15
Running gbenchmarks/SCATTER_BENCH
Run on (24 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x12)
  L1 Instruction 32K (x12)
  L2 Unified 1024K (x12)
  L3 Unified 19712K (x2)
Load Average: 1.42, 0.55, 0.26
-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
benchmark/double_coalesce_x/1024/1/manual_time          47063 ns        67617 ns        13886 bytes_per_second=332.003M/s
benchmark/double_coalesce_x/2048/1/manual_time          47516 ns        67979 ns        13808 bytes_per_second=657.672M/s
benchmark/double_coalesce_x/4096/1/manual_time          53595 ns        74825 ns        12226 bytes_per_second=1.13882G/s
benchmark/double_coalesce_x/8192/1/manual_time          47971 ns        68178 ns        13553 bytes_per_second=2.54469G/s
benchmark/double_coalesce_x/16384/1/manual_time         48286 ns        68214 ns        13719 bytes_per_second=5.05618G/s
benchmark/double_coalesce_x/32768/1/manual_time         48143 ns        67683 ns        13616 bytes_per_second=10.1422G/s
benchmark/double_coalesce_x/65536/1/manual_time         56178 ns        74915 ns        13031 bytes_per_second=17.3833G/s
benchmark/double_coalesce_x/131072/1/manual_time        51389 ns        68919 ns        12943 bytes_per_second=38.007G/s
benchmark/double_coalesce_x/262144/1/manual_time        64576 ns        82285 ns        10955 bytes_per_second=60.4906G/s
benchmark/double_coalesce_x/524288/1/manual_time       102052 ns       224054 ns         7945 bytes_per_second=76.5539G/s
benchmark/double_coalesce_x/1048576/1/manual_time      296321 ns       317989 ns         4894 bytes_per_second=52.73G/s
benchmark/double_coalesce_x/2097152/1/manual_time      395626 ns       409061 ns         2811 bytes_per_second=78.9888G/s
benchmark/double_coalesce_x/4194304/1/manual_time      515526 ns       529331 ns         1000 bytes_per_second=121.235G/s
benchmark/double_coalesce_x/8388608/1/manual_time     1227143 ns      1243755 ns          610 bytes_per_second=101.863G/s
benchmark/double_coalesce_x/16777216/1/manual_time    2277550 ns      2295213 ns          344 bytes_per_second=109.767G/s
Segmentation fault (core dumped)

@trxcllnt
Copy link
Contributor Author

trxcllnt commented Oct 17, 2019

@OlivierNV maybe I'm not getting the full picture since the benchmark doesn't finish, but it doesn't seem like lowering or removing the second launch parameter significantly affects performance. If you agree, I'm happy to remove the parameter so we can move forward on merging this PR.

@OlivierNV
Copy link
Contributor

@OlivierNV If you agree, I'm happy to remove the parameter so we can move forward on merging this PR.

Yup, I vote for taking out the parameter and merge.

@cwharris
Copy link
Contributor

cwharris commented Oct 17, 2019

Out of the loop here, but I'm assuming we don't have to worry about the fact that those tests are segfault-ing due to memory constraints on Paul's machine even though he's running 48/192 GB of device/host ram, respectively?

@trxcllnt
Copy link
Contributor Author

I'm not worried about it since they're just benchmarks.

@harrism harrism merged commit 6c47bc6 into rapidsai:branch-0.11 Oct 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants