[BUG] tSNE Lock up #2565
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here.
The device buffers definitely clean this PR up quite a bit! I really have just two little things. One is very minor.
cpp/src/tsne/barnes_hut.cuh (Outdated)
random_vector(YY, -0.0001f, 0.0001f, (nnodes + 1) * 2, stream, random_state);
ASSERT(YY != NULL && rep_forces != NULL, "[ERROR] Possibly no more memory");
device_buffer<float> YY(d_alloc, stream, (nnodes + 1) * 2);
device_buffer<float> YY_prev(d_alloc, stream, (nnodes + 1) * 2);
Given this problem is so dataset dependent and doesn't seem to occur on most real-world datasets, it would be unfortunate to have to double the amount of embedding memory by default. While it's not nearly the same as duplicating the input data, training 50M vertices still requires 400 MB of extra memory just for this feature.
What do you think about making this feature optional and maybe mentioning the option in the warning? If the option is disabled, we just return the embedding the way it is w/ the NaN values. The option can be enabled and use a little extra memory if users still want the embeddings knowing that training wasn't able to complete successfully.
Sounds reasonable. Are you thinking the option would be exposed at the Python level?
Yeah, just a simple flag exposed through the TSNE constructor would be fine.
cpp/src/tsne/barnes_hut.cuh (Outdated)
"Non-finite result detected during Barnes Hut iteration: %d, returning last " | ||
"known good positions.", | ||
iter); | ||
MLCommon::copy(YY.data(), YY_prev.data(), (nnodes + 1) * 2, stream); |
Have you been able to visually inspect the output embeddings of any of the datasets that are causing this failure? It would be nice to know if they have been reasonably embedded by the time the rollback occurs or if they are just garbage.
I don't know if there are multiple ways to cause lockups, but I've been investigating one of them, which is due to rounding error in the equation below when using 32-bit precision. That's the (or a?) source:

cuml/cpp/src/tsne/bh_kernels.cuh, lines 648 to 650 at 3aacbc9
Example inputs from a lockup:

const float PQ = __fdividef(
  /* doesn't matter */,
  10689151 + 10689150 - 2.0f * (500.782989501953125 * 500.782989501953125 +
                                -3230.845947265625 * -3230.845947265625));

The correct denominator is ~2.72493, but it computes to 0 with 32-bit precision. Rather than storing+copying ...
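A minimal, standalone host-side sketch of the cancellation described here, using the example inputs quoted above (not part of the PR; the variable names are invented for illustration):

#include <cstdio>

int main() {
  // Values from the lockup example above: the two (already +1-adjusted) norms
  // and the coordinates whose products feed the dot-product term.
  const float norm_i = 10689151.0f;
  const float norm_j = 10689150.0f;
  const float y1i = 500.782989501953125f,  y1j = 500.782989501953125f;
  const float y2i = -3230.845947265625f,   y2j = -3230.845947265625f;

  // fp32: the two ~2.1e7 terms cancel and the small true difference (~2.72
  // per the comment above) is lost to rounding; the comment reports this
  // expression evaluating to exactly 0 in the kernel.
  const float denom32 = norm_i + norm_j - 2.0f * (y1i * y1j + y2i * y2j);

  // fp64: 53 mantissa bits are enough to preserve the difference.
  const double denom64 = (double)norm_i + (double)norm_j
                       - 2.0 * ((double)y1i * y1j + (double)y2i * y2j);

  printf("fp32 denominator: %.6f\n", denom32);
  printf("fp64 denominator: %.6f\n", denom64);
  return 0;
}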
@zbjornson I'd thought the division error was because of NaNs being propagated down from a higher level, but this makes more sense now that I see your description. Allowing NaNs into the tree-builder code breaks the minimum radius check for the quad tree and causes an infinite loop. I like #2 as an intermediate solution. If we're willing to use extra memory, we could also save the values, do the computation, and roll back / repeat at double precision if NaNs are detected.

@cjnolet The other lock up was caused by a race condition in the code that decrements the bottom offset when allocating new cells after a collision.

Also looks like I messed up my last merge from upstream. Fixing now.
@drobison00 thanks for the fast response!
I was thinking of doing this in the kernel on only the affected inputs, like:

float denominator = ...
if (denominator == 0) {
  double dbl_denominator = /* repeat with double precision */;
  result = __fdividef(x, static_cast<float>(dbl_denominator));
} else {
  result = __fdividef(x, denominator);
}

so we wouldn't need any extra device memory, just regs. Do you think we would have to use double precision uniformly if any results are invalid, though?

But... this rearrangement might be pretty good, too:

float V = fma(-2.0f, C * D, A) + fma(-2.0f, E * F, B);
// i.e.
float V = fma(-2.0f, Y1[i] * Y1[j], norm_add1[i]) + fma(-2.0f, Y2[i] * Y2[j], norm[j]);

That computes to 2.0 in this particular case, which might be close enough.
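A small host-side sketch comparing the naive grouping with the fma rearrangement above, again using the example inputs from earlier in the thread (illustrative only, not part of the PR):

#include <cmath>
#include <cstdio>

int main() {
  const float norm_add1_i = 10689151.0f, norm_j = 10689150.0f;
  const float y1i_y1j = 500.782989501953125f * 500.782989501953125f;
  const float y2i_y2j = -3230.845947265625f * -3230.845947265625f;

  // Naive grouping: both large terms are summed before the subtraction, so the
  // intermediates sit near 2.1e7 where the fp32 ulp is 2.
  const float naive = norm_add1_i + norm_j - 2.0f * (y1i_y1j + y2i_y2j);

  // FMA grouping from the comment above: the terms cancel pairwise, keeping the
  // intermediates near 1.0e7 (ulp 1) before the final add.
  const float fused = std::fmaf(-2.0f, y1i_y1j, norm_add1_i) +
                      std::fmaf(-2.0f, y2i_y2j, norm_j);

  // The comment above reports ~2.0 for the fused form here; exact is ~2.72.
  printf("naive: %f  fma: %f\n", naive, fused);
  return 0;
}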
@zbjornson Good point; yes, I definitely like your idea more: only redoing affected inputs / (re)storing with registers. There might be some corner case where it makes sense to redo all the calculations, but without some motivating example it's probably not worth the performance hit. Also, I have no problem with the fma formulation if it's more stable; we're already working with an approximate solution. I'd still probably do the 0-denominator check, though.
@drobison00 sounds good. Are you planning to make that change in this PR, or should I open a PR?
@zbjornson I can add it to this one if that works. I need to remove my other changes anyway.
@zbjornson Looks like we can still run into a problem with double precision. Using doubles resolves quite a few cases, but not all. Some examples:

We can do something like what's below to always avoid locking up, but we likely risk situations where we generate garbage data. What if we do this, and emit some kind of warning/error message in cases where double precision still doesn't work?

float denominator = __fmaf_rn(-2.0f, (Y1[i] * Y1[j]), norm_add1[i]) + __fmaf_rn(-2.0f, (Y2[i] * Y2[j]), norm[j]);
if (denominator == 0) {
  /* repeat with double precision */
  double dbl_denominator = __fma_rn(-2.0f, Y1[i] * Y1[j], norm_add1[i]) + __fma_rn(-2.0f, Y2[i] * Y2[j], norm[j]);
  denominator = (dbl_denominator != 0) ? static_cast<float>(dbl_denominator) : FLT_EPSILON;
}
PQ = __fdividef(VAL[index], denominator);
@drobison00 I was just working through more cases and slowly drafting another comment. There are definitely potential issues with:

Hope to have more thoughts on this later today. (PS: If ...)
@drobison00 both of those examples work (reasonable-looking nonzero values) if you do the intermediate multiplications in double precision:

double dbl_denominator = __fma_rn(-2.0, static_cast<double>(Y1[i]) * Y1[j], norm_add1[i]) + __fma_rn(-2.0, static_cast<double>(Y2[i]) * Y2[j], norm[j]);
@zbjornson I'd originally tried pre-computing Y1[i] * Y1[j] and Y2[i] * Y2[j], but was still seeing situations where that didn't work. I'll try to reproduce (will update shortly). Using either method, I don't think we can trust that dbl_denominator never comes out as 0. As you say, clamping PQ could be tricky. Perhaps we could, instead, establish some simple order-of-magnitude bounds on the denominator in the double-precision case, and clamp to that, instead of FLT_EPSILON, if the direct computation fails? That seems fairly straightforward. I'll take a look this evening.
> I don't think we can trust that dbl_denominator never comes out as 0.

Agreed, and it's essentially free to include that safeguard. nvcc reorders branches based on __builtin_expect, too.

I ran ~90,000 stress tests with the code below. The double-precision block kicked in 422 times (0.468%). None of those double-precision calcs resulted in zeros, so whatever we decide to do when dbl_denom is 0, it doesn't seem likely to matter too often.

I also did ~10k tests with the original equation. While the rearrangement doesn't seem to reduce the number of tSNE invocations that encounter zeros, it does substantially decrease the number of attraction kernel invocations that encounter zeros (by like 3x).

(Caveat: my input values were all in the [0, 1] range, so this might not be representative. I logged the random seeds for every run, so we can go back to all of those cases if needed.)

> Perhaps we could, instead, establish some simple order of magnitude bounds on the denominator in the double precision case, and clamp to that, instead of FLT_EPSILON, if the direct computation fails?

I was hoping to be able to measure whether there was a correlation between FP error and the true denominator, but per the above I couldn't collect any datapoints for it. Short of that, I'd pick something sorta close to 1 (maybe 0.1 or 0.01) so that the change in this iteration isn't too big, and hope that subsequent iterations resolve it.

One remaining question is how we decide the value is wrong: testing if it's zero might not catch all instances of significantly wrong denominators. (I didn't get to do this tonight, and I'm not sure it will be an important difference either. I basically want to do the 32-bit and 64-bit calculation for every point for a bunch of inputs and measure the RMS or something.)

Code I used for the stress test:
// Try single precision compute first
float denominator = __fmaf_rn(-2.0f, Y1[i] * Y1[j], norm_add1[i]) + __fmaf_rn(-2.0f, Y2[i] * Y2[j], norm[j]);
if (__builtin_expect(denominator == 0, false)) {
// repeat with double precision
double y1i_y1j = static_cast<double>(Y1[i]) * Y1[j];
double y2i_y2j = static_cast<double>(Y2[i]) * Y2[j];
double dbl_denominator = __fma_rn(-2.0, y1i_y1j, norm_add1[i]) + __fma_rn(-2.0, y2i_y2j, norm[j]);
denominator = dbl_denominator == 0 ? FLT_EPSILON : static_cast<float>(dbl_denominator);
}
PQ = __fdividef(VAL[index], denominator);
// Apply forces
atomicAdd(&attract1[i], PQ * (Y1[i] - Y1[j]));
atomicAdd(&attract2[i], PQ * (Y2[i] - Y2[j]));
Added a few tiny comments inline.
Alright, this is mostly clear to me now...
Here's a case that encountered "bad" denominators (y axis is the double-precision result; chart omitted). Each of those plateaus is a tSNE iteration, I think. Notice:
I'm also pretty confident that there are cases where the f32 result is 1.0 or 2.0 (exactly) and is wrong (example omitted). I think these cases will be addressed by finding the upstream FP errors later. #2605 fixes one such error (or "the" error).
@zbjornson Since the upstream source of the instability isn't entirely clear, I'm going to add the changes discussed so far, which avoid the lock-up behavior, notify users when we've encountered a problem, and return early. We can also open another bug to track down the source of the problems.
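A rough sketch of the detect-and-bail pattern being described, assuming a thrust-based scan of the embedding; the helper name and structure here are illustrative, not the PR's actual code:

#include <cfloat>
#include <cmath>
#include <cstddef>
#include <cuda_runtime.h>
#include <thrust/logic.h>
#include <thrust/system/cuda/execution_policy.h>

// Predicate: true for NaN (the comparison is unordered) and +/-Inf.
struct is_not_finite {
  __host__ __device__ bool operator()(float v) const { return !(fabsf(v) <= FLT_MAX); }
};

// Scan a device array for non-finite values.
inline bool has_nonfinite(const float* data, std::size_t n, cudaStream_t stream) {
  return thrust::any_of(thrust::cuda::par.on(stream), data, data + n, is_not_finite{});
}

If the check fires inside the iteration loop, the code can log the warning shown in the diff above, MLCommon::copy YY_prev back into YY, and break out early.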
The code changes look great and this is almost ready to go. What remains is very minor.
double _Y1 = static_cast<double>(Y1[i] * Y1[j]);
double _Y2 = static_cast<double>(Y2[i] * Y2[j]);
double dbl_denominator =
  __fma_rn(-2.0f, _Y1, norm[i] + 1.0f) + __fma_rn(-2.0f, _Y2, norm[j]);
It might be helpful to open a GitHub issue for further investigating this detail and reference it here for future eyes.
@zbjornson Can you link the PR you mention in your other comment?
I've got most of a PR ready that eliminates the upstream source of error. So long as that's accepted, that would mean we can remove the logging and early bail, etc.
BTW Martin Burtscher's BH kernel is now on version 4.5 (this code is based on 3.1). Unfortunately there's no version control on it so it's hard to see what changed and why, but it might be worth looking at his changes for remedying the tree-building lockups too. I noticed he doesn't have a radius error factor anymore (this), so it looks like a source of error was eliminated.
https://userweb.cs.txstate.edu/~burtscher/research/ECL-BH/ECL-BH_45.cu
The license on v4.5 is also more permissive.
cpp/src/tsne/bh_kernels.cuh (Outdated)
if (__builtin_expect(denominator == 0, false)) {
  double _Y1 = static_cast<double>(Y1[i] * Y1[j]);
  double _Y2 = static_cast<double>(Y2[i] * Y2[j]);
This still does the calculation in single precision. At least one of the operands needs to be cast to double before the multiplication.
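Concretely, the cast-before-multiply form would look roughly like the fragment below (a sketch mirroring the snippet above; the PR's final literals and formatting may differ):

double _Y1 = static_cast<double>(Y1[i]) * static_cast<double>(Y1[j]);
double _Y2 = static_cast<double>(Y2[i]) * static_cast<double>(Y2[j]);
double dbl_denominator =
  __fma_rn(-2.0, _Y1, static_cast<double>(norm[i]) + 1.0) + __fma_rn(-2.0, _Y2, norm[j]);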
Will update to cast both variables before the multiply.
The second code link simply fails the entire process when the cell race condition is hit.
cell = atomicSub((int*)&bottomd, 1) - 1;
if (cell <= nbodiesd) {printf("ERROR: out of cell memory\n"); asm("trap;");}
This PR leverages the fallback path for closely packed points that exceed the minimum radius.
It also looks like they've moved to an oct-tree formulation instead of the current quad tree. Definitely worth looking at, but it will need to be a later update.
LGTM
The denominator equation is just the squared Euclidean distance. Instead of computing the norms in one kernel and storing them, we can (a) save nRow * sizeof(float) bytes of memory, (b) save global loads/stores, and (c) eliminate a source of FP error that's causing lockups (see linked issues). Per code comment, this still includes a guard in case there are other sources of NaNs upstream. This compiles to just one `setp.ltu` instruction so is essentially free.
Ref rapidsai#2358
Ref rapidsai#2565
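As a sketch of what that looks like in kernel code (illustrative only; the +1 mirrors the norm_add1 term in the snippets above, the guard is the one the message refers to, and the function name is invented here; see the linked PR for the actual change):

#include <cfloat>

// Attraction-kernel denominator computed directly from coordinate differences
// instead of precomputed norms: 1 + ||y_i - y_j||^2.
__device__ float attraction_denominator(float y1i, float y1j, float y2i, float y2j) {
  const float d1 = y1i - y1j;
  const float d2 = y2i - y2j;
  float denom = 1.0f + d1 * d1 + d2 * d2;
  // Guard against NaNs from other upstream sources; the unordered compare
  // also catches NaN, not just non-positive values.
  if (!(denom > 0.0f)) denom = FLT_EPSILON;
  return denom;
}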
Prevents race behavior related to cell index selection in the Barnes-Hut algorithm. Failures in this case could allow for loop injection within the tree structure, resulting in infinite looping / lockup behavior.
In certain cases, randomly generated test data can result in divide-by-zero / NaN errors when the attractive force is calculated. Check for them and exit early if detected.
Updated BH code to use device_buffer instead of direct allocation.
Closes #2358