Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix segfault on infinite recursion in some cases #9617

Merged
merged 1 commit into from
Dec 30, 2023

Conversation

9999years
Copy link
Contributor

@9999years 9999years commented Dec 15, 2023

Context

Closes #9616.

This fixes a segfault on infinite function call recursion (rather than infinite thunk recursion) by tracking the function call depth in EvalState.

Additionally, to avoid printing extremely long stack traces, stack frames are now deduplicated, with a (19997 duplicate traces omitted) message. This should only really be triggered in infinite recursion scenarios.

Before:

$ nix-instantiate --eval --expr '(x: x x) (x: x x)'
Segmentation fault: 11

After:

$ nix-instantiate --eval --expr '(x: x x) (x: x x)'
error: stack overflow

       at «string»:1:14:

            1| (x: x x) (x: x x)
             |              ^

$ nix-instantiate --eval --expr '(x: x x) (x: x x)' --show-trace
error:
       … from call site

         at «string»:1:1:

            1| (x: x x) (x: x x)
             | ^

       … while calling anonymous lambda

         at «string»:1:2:

            1| (x: x x) (x: x x)
             |  ^

       … from call site

         at «string»:1:5:

            1| (x: x x) (x: x x)
             |     ^

       … while calling anonymous lambda

         at «string»:1:11:

            1| (x: x x) (x: x x)
             |           ^

       … from call site

         at «string»:1:14:

            1| (x: x x) (x: x x)
             |              ^

       (19997 duplicate traces omitted)

       error: stack overflow

       at «string»:1:14:

            1| (x: x x) (x: x x)
             |              ^

Future work

  • It would be nice if the maximum function call depth could be customized like --option function-call-depth-limit 500.
  • In some cases, this may print a message like (1 duplicate traces omitted), which is kind of unhelpful. It would be nice to only print that message if more than a couple traces are omitted.

Priorities

Add 👍 to pull requests you find important.

@9999years 9999years requested a review from edolstra as a code owner December 15, 2023 21:29
@github-actions github-actions bot added the with-tests Issues related to testing. PRs with tests have some priority label Dec 15, 2023
@@ -1508,6 +1508,10 @@ void ExprLambda::eval(EvalState & state, Env & env, Value & v)

void EvalState::callFunction(Value & fun, size_t nrArgs, Value * * args, Value & vRes, const PosIdx pos)
{
if (depth > 10000)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chosen arbitrarily.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a strong tradition 😆

size_t minIndent = 1000000;

I don't think 10k stack levels is ever reasonable to do, so it seems fair. Magic number should perhaps be extracted to a constexpr though!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we stack overflow before that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status quo: segmentation fault on macOS because our segv handler is Linux x86 only, or so I've heard, and a less than helpful error message if our segv handler does actually work.

Fixing the handler is beyond the scope of this specific work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently I'm doing something unreasonable(?), because I'm hitting 12k call depth in two expressions (both doing import-from-derivation). Luckily I can use max-call-depth = 20000 to fix it, but I have to distribute this setting to not leak this detail to users.

What does 10k mean when translated into memory usage? Or, what's the call depth at which the Nix evaluator would segfault at before? (I guess it varies depending on platform?) Although 10k sounds a lot, I have no idea where the segfault would typically hit. 100k? 1M?

src/libexpr/eval.cc Outdated Show resolved Hide resolved
@@ -1508,6 +1508,10 @@ void ExprLambda::eval(EvalState & state, Env & env, Value & v)

void EvalState::callFunction(Value & fun, size_t nrArgs, Value * * args, Value & vRes, const PosIdx pos)
{
if (depth > 10000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a strong tradition 😆

size_t minIndent = 1000000;

I don't think 10k stack levels is ever reasonable to do, so it seems fair. Magic number should perhaps be extracted to a constexpr though!

src/libexpr/eval.cc Outdated Show resolved Hide resolved
src/libutil/error.cc Outdated Show resolved Hide resolved
src/libutil/error.cc Show resolved Hide resolved
@9999years 9999years force-pushed the stack-overflow-segfault branch 3 times, most recently from fe7c5de to 9bb7992 Compare December 16, 2023 00:54
Copy link
Member

@thufschmitt thufschmitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks,

I'm not the right person to judge on the overall mechanism (it feels a bit brittle to me because of the arbitrary magic value, but it's also incredibly nice to have a stack trace – esp on macOS). I'll let others (@edolstra or @roberth in particular) have a more informed opinion here.

Other than that, I let a couple comments. The implementation is great, my main concern is that it doesn't have an impact on the overall performance.

Comment on lines 53 to 77
inline bool operator<(const AbstractPos& lhs, const AbstractPos& rhs)
{
if (lhs.line != rhs.line)
return lhs.line < rhs.line;
if (lhs.column != rhs.column)
return lhs.column < rhs.column;
return false;
}
inline bool operator> (const AbstractPos& lhs, const AbstractPos& rhs) { return rhs < lhs; }
inline bool operator<=(const AbstractPos& lhs, const AbstractPos& rhs) { return !(lhs > rhs); }
inline bool operator>=(const AbstractPos& lhs, const AbstractPos& rhs) { return !(lhs < rhs); }

inline bool operator<(const Trace& lhs, const Trace& rhs)
{
if (lhs.pos != rhs.pos) {
if (!lhs.pos)
return true;
if (!rhs.pos)
return false;
return *lhs.pos < *rhs.pos;
}

if (lhs.hint.str() != rhs.hint.str())
return lhs.hint.str() < rhs.hint.str();

if (lhs.frame != rhs.frame)
return lhs.frame < rhs.frame;

return false;
}
inline bool operator> (const Trace& lhs, const Trace& rhs) { return rhs < lhs; }
inline bool operator<=(const Trace& lhs, const Trace& rhs) { return !(lhs > rhs); }
inline bool operator>=(const Trace& lhs, const Trace& rhs) { return !(lhs < rhs); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you care about the exact semantics of the ordering, I think this could be replaced by a bit of nasty cpp:

Suggested change
inline bool operator<(const AbstractPos& lhs, const AbstractPos& rhs)
{
if (lhs.line != rhs.line)
return lhs.line < rhs.line;
if (lhs.column != rhs.column)
return lhs.column < rhs.column;
return false;
}
inline bool operator> (const AbstractPos& lhs, const AbstractPos& rhs) { return rhs < lhs; }
inline bool operator<=(const AbstractPos& lhs, const AbstractPos& rhs) { return !(lhs > rhs); }
inline bool operator>=(const AbstractPos& lhs, const AbstractPos& rhs) { return !(lhs < rhs); }
inline bool operator<(const Trace& lhs, const Trace& rhs)
{
if (lhs.pos != rhs.pos) {
if (!lhs.pos)
return true;
if (!rhs.pos)
return false;
return *lhs.pos < *rhs.pos;
}
if (lhs.hint.str() != rhs.hint.str())
return lhs.hint.str() < rhs.hint.str();
if (lhs.frame != rhs.frame)
return lhs.frame < rhs.frame;
return false;
}
inline bool operator> (const Trace& lhs, const Trace& rhs) { return rhs < lhs; }
inline bool operator<=(const Trace& lhs, const Trace& rhs) { return !(lhs > rhs); }
inline bool operator>=(const Trace& lhs, const Trace& rhs) { return !(lhs < rhs); }
GENERATE_CMP_EXT(inline, AbstraxtPos, me->line, me->column);
GENERATE_CMP_EXT(inline, Trace, me->pos, me->hint.str(), me->frame);

(where GENERATE_CMP_EXT comes from libutil/comparator.hh)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::tie seems like a great middle ground of having specified semantics, but less code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the C++20 future, one might want to consider using the three-way comparison operator <=>. Its implementation allows the compiler to deduce the rest.
This will have multiple advantages:

  • we can get rid of preprocessor macros
  • strange semantics that are already implemented individually in the different operator implementation might surface in the light of a unified implementation of the <=> op

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Do you know what's blocking this? We do use c++2a. Should we track it somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if anything blocks this at all. Maybe it's just that no one thought about using it, yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one of the fields here is shared_ptr which by default has very undesirable comparison semantics (pointer address).

Well, if the only reason we need that comparison operator is putting stuff in a Set, pointer comparisons is quite reasonable (and as fast as it can get)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pointer comparisons are not reasonable, because each Trace is newly constructed. There's no interning. So it's quite common to have position values that are semantically equal (same line and column number) with different pointer values. I've tried this change and it breaks the tests.

That said, I can simplify these comparisons somewhat and I can replace the AbstractPos one with the default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how costly it would be to intern positions. We quite possibly should do that anyhow, but at a future date.

Copy link
Contributor Author

@9999years 9999years Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like LLVM only got <=> for tuples implemented last year: llvm/llvm-project#50396

EDIT: And std::string. Looks like we'll need clang 16 for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how costly it would be to intern positions. We quite possibly should do that anyhow, but at a future date.

We do have PosTable.

This PR's use case is not in the hot path, so we don't need to explore this now I think.

@@ -1508,6 +1508,10 @@ void ExprLambda::eval(EvalState & state, Env & env, Value & v)

void EvalState::callFunction(Value & fun, size_t nrArgs, Value * * args, Value & vRes, const PosIdx pos)
{
if (depth > 10000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we stack overflow before that?

Comment on lines 1525 to 1527
if (depth > 10000)
error("stack overflow").atPos(pos).template debugThrow<EvalError>();
CallLevel _level(depth);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll want to benchmark this to make sure that it doesn't bear a noticeable evaluation performance impact.
My usual benchmark is a simple nix search nixpkgs blah --option eval-cache false, but maybe @tfc will have something more meaningful to suggest since he's been working heavily on the evaluator performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nix 2.18.1:

Executed in   13.19 secs    fish           external
   usr time   11.33 secs    0.12 millis   11.33 secs
   sys time    1.76 secs    1.38 millis    1.76 secs

This PR:

Executed in   13.57 secs    fish           external
   usr time   12.36 secs    0.15 millis   12.35 secs
   sys time    1.10 secs    2.19 millis    1.10 secs

3% slowdown. Seems OK for preventing a segfault.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3% slowdown. Seems OK for preventing a segfault.

3% isn't really trivial. But I reliably couldn't reproduce it:

Summary
  'nixMaster' ran
    1.00 ± 0.01 times faster than 'nixFromPR'

So I think it's fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I was probably running an unoptimized build. Thanks for checking it on your end!

}
} else {
oss << "\n" << ANSI_WARNING "(" << skippedTraces.size() << " duplicate traces omitted)" ANSI_NORMAL << "\n";
tracesSeen.clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit): Why do we care about that (and likely clearing skippedTraces below)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment to explain this, but I'll copy it here:

Consider a mutually recursive stack trace with:

  • 10 entries of A
  • 10 entries of B
  • 10 entries of A

If we don't clear tracesSeen here, we would print output like this:

  • 1 entry of A
  • (9 duplicate traces omitted)
  • 1 entry of B
  • (19 duplicate traces omitted)

This obscures the control flow, which went from A, to B, and back to A again.

In contrast, if we do clear tracesSeen, the output looks like this:

  • 1 entry of A
  • (9 duplicate traces omitted)
  • 1 entry of B
  • (9 duplicate traces omitted)
  • 1 entry of A
  • (9 duplicate traces omitted)

See: tests/functional/lang/eval-fail-mutual-recursion.nix for a test case exercising this property.

@roberth
Copy link
Member

roberth commented Dec 18, 2023

For a bigger picture please also read #9627

src/libexpr/eval.hh Outdated Show resolved Hide resolved
return lhs.line < rhs.line;
if (lhs.column != rhs.column)
return lhs.column < rhs.column;
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other cases can occur here other than line and column comparison? This makes the code and semantics harder to understand if the code leaves open what else can lead to the false return.

Comment on lines 53 to 77
inline bool operator<(const AbstractPos& lhs, const AbstractPos& rhs)
{
if (lhs.line != rhs.line)
return lhs.line < rhs.line;
if (lhs.column != rhs.column)
return lhs.column < rhs.column;
return false;
}
inline bool operator> (const AbstractPos& lhs, const AbstractPos& rhs) { return rhs < lhs; }
inline bool operator<=(const AbstractPos& lhs, const AbstractPos& rhs) { return !(lhs > rhs); }
inline bool operator>=(const AbstractPos& lhs, const AbstractPos& rhs) { return !(lhs < rhs); }

inline bool operator<(const Trace& lhs, const Trace& rhs)
{
if (lhs.pos != rhs.pos) {
if (!lhs.pos)
return true;
if (!rhs.pos)
return false;
return *lhs.pos < *rhs.pos;
}

if (lhs.hint.str() != rhs.hint.str())
return lhs.hint.str() < rhs.hint.str();

if (lhs.frame != rhs.frame)
return lhs.frame < rhs.frame;

return false;
}
inline bool operator> (const Trace& lhs, const Trace& rhs) { return rhs < lhs; }
inline bool operator<=(const Trace& lhs, const Trace& rhs) { return !(lhs > rhs); }
inline bool operator>=(const Trace& lhs, const Trace& rhs) { return !(lhs < rhs); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the C++20 future, one might want to consider using the three-way comparison operator <=>. Its implementation allows the compiler to deduce the rest.
This will have multiple advantages:

  • we can get rid of preprocessor macros
  • strange semantics that are already implemented individually in the different operator implementation might surface in the light of a unified implementation of the <=> op

@roberth roberth added error-messages Confusing messages and better diagnostics language The Nix expression language; parser, interpreter, primops, evaluation, etc bug labels Dec 18, 2023
@9999years 9999years force-pushed the stack-overflow-segfault branch 2 times, most recently from a359356 to 4c114a2 Compare December 18, 2023 18:28
@9999years 9999years requested a review from roberth December 18, 2023 21:38
@9999years 9999years force-pushed the stack-overflow-segfault branch from 4c114a2 to 852bd1d Compare December 18, 2023 21:49
Copy link
Member

@roberth roberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A stack limit is a good thing to have, regardless of the other steps we can take (ie #9627 and related issues).
We do need a setting for the limit to make sure we don't regress. See comment.

Scoped out: a page about stack behavior in the manual. This is very needed, but it would increase the scope too much. It would be appreciated very much (and improve and shorten the release note fwiw).

doc/manual/rl-next/stack-overflow-segfaults.md Outdated Show resolved Hide resolved
doc/manual/rl-next/stack-overflow-segfaults.md Outdated Show resolved Hide resolved
doc/manual/rl-next/stack-overflow-segfaults.md Outdated Show resolved Hide resolved
src/libutil/error.cc Outdated Show resolved Hide resolved
src/libutil/error.cc Outdated Show resolved Hide resolved
tests/functional/lang/eval-fail-mutual-recursion.nix Outdated Show resolved Hide resolved
}

printSkippedTracesMaybe();
oss << "\n" << prefix;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: If the algorithm could be factored out, it'd be easier to understand and improve.

thought: One possible improvement is to make it understand that clusters of repetition can themselves repeat. This happens when you're doing some iterative thing for each node in a tree for example. Probably there's an easier improvement that I'm not thinking of.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Factored it out, although it should maybe be a class to avoid passing around so much context.

One possible improvement is to make it understand that clusters of repetition can themselves repeat.

Yeah, I considered this but also wasn't sure of an elegant implementation. I'll keep turning it over.

@9999years 9999years force-pushed the stack-overflow-segfault branch 2 times, most recently from 12fcf78 to be44f2e Compare December 19, 2023 20:57
@9999years 9999years requested a review from roberth December 20, 2023 17:09
Copy link
Member

@roberth roberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could rename EvalState::depth to EvalState::callDepth and review my suggestions for comments, I think this is good to go.

src/libexpr/eval.hh Outdated Show resolved Hide resolved
src/libutil/error.cc Show resolved Hide resolved
src/libutil/error.cc Outdated Show resolved Hide resolved
@9999years 9999years force-pushed the stack-overflow-segfault branch from 587417c to 8d39b17 Compare December 24, 2023 23:01
Copy link
Member

@roberth roberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final breadcrumb

EDIT: looks like something was lost. This comment was only supposed to summarize a review with this comment.

src/libexpr/eval.cc Outdated Show resolved Hide resolved
This fixes a segfault on infinite function call recursion (rather than
infinite thunk recursion) by tracking the function call depth in
`EvalState`.

Additionally, to avoid printing extremely long stack traces, stack
frames are now deduplicated, with a `(19997 duplicate traces omitted)`
message. This should only really be triggered in infinite recursion
scenarios.

Before:

    $ nix-instantiate --eval --expr '(x: x x) (x: x x)'
    Segmentation fault: 11

After:

    $ nix-instantiate --eval --expr '(x: x x) (x: x x)'
    error: stack overflow

           at «string»:1:14:
                1| (x: x x) (x: x x)
                 |              ^

    $ nix-instantiate --eval --expr '(x: x x) (x: x x)' --show-trace
    error:
           … from call site
             at «string»:1:1:
                1| (x: x x) (x: x x)
                 | ^

           … while calling anonymous lambda
             at «string»:1:2:
                1| (x: x x) (x: x x)
                 |  ^

           … from call site
             at «string»:1:5:
                1| (x: x x) (x: x x)
                 |     ^

           … while calling anonymous lambda
             at «string»:1:11:
                1| (x: x x) (x: x x)
                 |           ^

           … from call site
             at «string»:1:14:
                1| (x: x x) (x: x x)
                 |              ^

           (19997 duplicate traces omitted)

           error: stack overflow
           at «string»:1:14:
                1| (x: x x) (x: x x)
                 |              ^
@9999years 9999years force-pushed the stack-overflow-segfault branch from 6cd3513 to 7434cac Compare December 30, 2023 06:16
@9999years 9999years requested a review from roberth December 30, 2023 06:17
@roberth roberth merged commit bf1b294 into NixOS:master Dec 30, 2023
8 checks passed
@roberth
Copy link
Member

roberth commented Dec 30, 2023

Thank you @9999years!

tebowy pushed a commit to tebowy/nix that referenced this pull request Jul 11, 2024
Fix segfault on infinite recursion in some cases

(cherry picked from commit bf1b294)
Change-Id: Id137541426ec8536567835953fccf986a3aebf16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug error-messages Confusing messages and better diagnostics language The Nix expression language; parser, interpreter, primops, evaluation, etc with-tests Issues related to testing. PRs with tests have some priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segfault on infinite recursion
6 participants