Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dropout: make seed and states_num kernel arguments #2277

Merged
merged 4 commits into from
Aug 18, 2023

Conversation

tbennun
Copy link
Contributor

@tbennun tbennun commented Jul 26, 2023

Every time a dropout descriptor is created (even if it's with a different seed), the initialization kernel is recompiled. This PR turns the number of states to initialize and random seed into runtime parameters. There seems to be no effect on the runtime of the initialization kernel (which I'm assuming is not performance critical), but fewer JIT compilation calls.

Note: I tried setting prng_seed to be an unsigned long long parameter, but that passed a wrong value onto the kernel itself.

@tbennun tbennun changed the title Dropout: make seed and states_num arguments Dropout: make seed and states_num runtime arguments Jul 26, 2023
@tbennun tbennun changed the title Dropout: make seed and states_num runtime arguments Dropout: make seed and states_num kernel arguments Jul 26, 2023
@junliume
Copy link
Collaborator

@atamazov could you help to take a look?

JehandadKhan
JehandadKhan previously approved these changes Jul 31, 2023
@tbennun
Copy link
Contributor Author

tbennun commented Jul 31, 2023

@JehandadKhan sorry, called clang-format on the code now.

@JehandadKhan
Copy link
Contributor

@tbennun The CL file needs to formatted as well.

@tbennun
Copy link
Contributor Author

tbennun commented Jul 31, 2023

@JehandadKhan Done now. I thought I had formatted both.

std::to_string(sizeof(prngStates)) + "x" +
std::to_string(rng_mode) + "x" + std::to_string(prng_seed) + "x" +
std::to_string(wk_grp_num);
std::string network_config = "initprngs-" + std::to_string(sizeof(prngStates)) + "x" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use sdt::ostringstream instead of concatenations if there are more than two strings. It's both faster and easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I would do this, but the rest of the file (as well as other files such as softmaxocl.cpp) use this style of code. Would it be ok to not match the style of the rest of the code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, otherwise it would never really change.


auto&& kernels = handle.GetKernels(kernel_name, network_config);
if(!kernels.empty())
{
kernels.front()(prng_states);
kernels.front()(prng_states,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe prng_seed and states_num should be declared as unsigned? That would make calls single lines and remove the casts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed (through a different way). Changing prng_seed to be unsigned would change the MIOpen C API (in which miopenSetDropoutDescriptor uses unsigned long long). I'd rather pass in an unsigned long long value, which I now do.

Copy link
Contributor

@DrizztDoUrden DrizztDoUrden Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have missed that. That's good now.

@junliume
Copy link
Collaborator

@tbennun could you resolve the review opinions above? We will try to get this merged.

@tbennun
Copy link
Contributor Author

tbennun commented Aug 17, 2023

@junliume I now addressed the two comments, see above and new commit.

@junliume junliume requested a review from DrizztDoUrden August 17, 2023 23:10
@junliume junliume merged commit 5ea8aa6 into ROCm:develop Aug 18, 2023
junliume added a commit that referenced this pull request Aug 20, 2023
junliume added a commit that referenced this pull request Aug 21, 2023
cderb added a commit that referenced this pull request Sep 18, 2023
* 3D group forward convolution solver (#2286)

* [HotFix] Fix build issue after #2286 (#2328)

* Fix ConvCkIgemmFwdV6r1DlopsNchw solver to reflect that it's not dynamic (#2325)

* Remove target IDs from kdb entries (#2309)

* Dropout: make seed and states_num kernel arguments (#2277)

* [MI250] Adding missing kernel objects (#2329)

* Post-merge fixups: Replace environment variable check with problem config check and reduce lambda capture for Invoker obj (#2305)

* [HotFix][CI] fix HIP tidy issue from #2277 (#2335)

* [HotFix] Update requirements.txt MLIR ignore PATH for Python conda LLD (#2324)

* [NFC] Replace long integers with int64_t and size_t for better compatibility with Windows (#2323)

* Windows compatibility: replace long integers with int64_t and size_t, replace uint with unsigned int, replace long with long long for numbers, proper casting

* Fix formatting

* Fix 3d group forward convolution

* Resolve review comments

* Fix formatting

---------

Co-authored-by: Daming Feng <[email protected]>
Co-authored-by: Evgenii Averin <[email protected]>
Co-authored-by: JD <[email protected]>
Co-authored-by: Tal Ben-Nun <[email protected]>
Co-authored-by: amberhassaan <[email protected]>
Co-authored-by: Jun Liu <[email protected]>
@junliume
Copy link
Collaborator

@atamazov could you comment on the user's question?

The private_seg_size of 3204 seems to be caused by the precomputed XORWOW matrices, which are placed in OpenCL constant memory and, upon compilation, may be stored in every thread's memory. It might be useful to store one or both matrices in global memory to avoid the private_seg_size pollution.

CC: @zjing14

@atamazov
Copy link
Contributor

@junliume I need some clue to the code as I am not confident with dropout. What are the symbolic names of the "the precomputed XORWOW matrices"?

/cc @zjing14

@tbennun
Copy link
Contributor Author

tbennun commented Sep 21, 2023

@junliume I need some clue to the code as I am not confident with dropout. What are the symbolic names of the "the precomputed XORWOW matrices"?

/cc @zjing14

@atamazov This refers to the two included files in the Dropout kernel, which initialize large arrays in constant memory:
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/develop/src/kernels/precalc_xorwow_skipahead_sequence_matrices_kernel.h
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/develop/src/kernels/precalc_xorwow_skipahead_matrices_kernel.h

@atamazov
Copy link
Contributor

@tbennun @junliume AFAICS the parts of these matrices are copied to the private memory for a purpose, i.e. each workitem modifies its local copy. If we want to put those into global memory, then we need to allocate some workspace (enough to hold data from matrices) and then manually manage it, which doesn't seem easy to me.

But why we may need to save space in private segment? Are we hitting the limit?

/cc @zjing14

@junliume
Copy link
Collaborator

@atamazov yes we are hitting a limit on constant memory, with this fix it "gives a reduction in initialization time of a Transformer model from 10:22 minutes to 31 seconds." We should take a look at the OCL code to see if we can work on heap memory instead.

@CAHEK7
Copy link
Contributor

CAHEK7 commented Nov 2, 2023

PRNG state requires some workspace in any case, but it supposed to be tiny, but it should be double-checked.
The tables, as far as I know, are required to speedup PRNG initialization and warmup (it may require huge numbers of PRNG invocation) and if that's the case - we can hold the tables on CPU side, choose particular entry for the seed and offset, upload it to the state and save a lot of constant\workspace memory.

But to be sure we need to re-investigate our GPU PRNG implementation.

@atamazov
Copy link
Contributor

atamazov commented Nov 2, 2023

@CAHEK7 Thanks for feedback. Now we at least better understand what to begin with.

/cc @JehandadKhan @junliume

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants