Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCRE2 JIT compilation causes invalid memory access #13013

Closed
straight-shoota opened this issue Jan 26, 2023 · 16 comments · Fixed by #13056
Closed

PCRE2 JIT compilation causes invalid memory access #13013

straight-shoota opened this issue Jan 26, 2023 · 16 comments · Fixed by #13056
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:runtime topic:stdlib:text

Comments

@straight-shoota
Copy link
Member

straight-shoota commented Jan 26, 2023

JIT compilation with PCRE2 can causes an invalid memory access. This was discovered while running stdlib specs with PCRE2 (https://github.com/crystal-lang/crystal/actions/runs/4008539255/jobs/6883750327).

I'm not entirely sure what exactly triggers the error condition. So far I have only been able to reproduce it with running the entire std_spec suite. The error happens pretty much immediately after starting the spec run, before any output is printed (even with --verbose).

$ .build\std_spec --verbose
Invalid memory access (C0000005) at address 0xffffffffffffffff
[0x2003e98c255] ???
[0x2003f6a7330] ???
[0x8e] ???
[0x20041836b80] ???
[0x6] ???

The invalid memory access happens somewhere inside pcre2_match.

I have established that there are indeed a couple of regular expression matches happening before it errors.
The first match that failed was the equivalent of Regex.new("^(-?)0x([0-9A-Fa-f]+)(?:\\.([0-9A-Fa-f]+))?p([+-]?)([0-9]+)$").match("0x1.FFFFFEp+62"). But when skipping that one, a completely different regex fails. So I assume the error condition is not directly related to the specific match environment, and probably more systematic. I suppose there could be something wrong with how we're setting up the JIT compilation option (#12866).

I have not been able to reproduce the error with smaller sets of specs yet.

On all other platforms, JIT seems to be working good (or we're failing to hit some error conditions 🙈).

@straight-shoota straight-shoota added kind:bug A bug in the code. Does not apply to documentation, specs, etc. platform:windows Windows support based on the MSVC toolchain / Win32 API topic:stdlib:text labels Jan 26, 2023
@HertzDevil HertzDevil removed the platform:windows Windows support based on the MSVC toolchain / Win32 API label Feb 6, 2023
@HertzDevil
Copy link
Contributor

HertzDevil commented Feb 6, 2023

I could reproduce this by simply matching many regexes:

5000.times { "".match Regex.new("a") }

The same snippet produces SIGSEGV even on Linux and macOS. The crash goes away if the GC is disabled, or if the @[ThreadLocal] for Regex::PCRE2.jit_stack is removed.

@HertzDevil
Copy link
Contributor

HertzDevil commented Feb 9, 2023

@[ThreadLocal] variables are not tracked by the GC:

module A
  @[ThreadLocal]
  @@x = Bytes.new(256, &.to_u8!)

  # this should trigger a GC cycle (explicit `GC.collect` doesn't work)
  10000.times do
    Bytes.new(64)
  end

  puts @@x.hexdump
end
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00000010  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f  ................
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00000060  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f  `abcdefghijklmno
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
000000b0  b0 b1 b2 b3 b4 b5 b6 b7  b8 b9 ba bb bc bd be bf  ................
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
000000d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
000000e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................

The only other place that uses @[ThreadLocal] is Thread.@@current, which most likely is always reachable somewhere because of its interaction with the event loop. The workaround suggested there is to add the variable as a GC root: (we can't allocate an uncollectable object because these class variables are not references)

module A
  @[ThreadLocal]
  @@x = Bytes.new(256, &.to_u8!)
  # for each thread???
  LibGC.add_roots(pointerof(@@x), pointerof(@@x) + 1)
end

module Regex::PCRE2
  @[ThreadLocal]
  class_getter jit_stack : LibPCRE2::JITStack do
    jit_stack = LibPCRE2.jit_stack_create(32_768, 1_048_576, Regex::PCRE2.general_context)
    if jit_stack.null?
      raise "Error allocating JIT stack"
    end
    jit_stack
  end
  # for each thread???
  LibGC.add_roots(pointerof(@@jit_stack), pointerof(@@jit_stack) + 1)
end

This indeed fixes the crash, but apparently this is not the only way; Nim had this issue too and they turned on some kind of TLS emulation.

@HertzDevil HertzDevil changed the title PCRE2 JIT compilation on Windows causes invalid memory access PCRE2 JIT compilation causes invalid memory access Feb 9, 2023
@asterite
Copy link
Member

asterite commented Feb 9, 2023

Where are we using ThreadLocal? I think a bunch of us knew it doesn't work well and we avoid it. There's another mechanism that to have thread local stuff, I think Ctystal::ThreadLocalVar or something.

We only use ThreadLocal in a single place on Crystal because it's generally broken.

@HertzDevil
Copy link
Contributor

Crystal::ThreadLocalValue calls Thread.current, and Thread.current uses a thread-local class variable, so obviously that variable cannot be a Crystal::ThreadLocalValue or an infinite recursion occurs. Also OpenBSD uses pthread's equivalent functionality instead of @[ThreadLocal].

@asterite
Copy link
Member

asterite commented Feb 9, 2023

Right, that's what I meant. We use @[ThreadLocal] only for Thread. For everything else we use Crystal::ThreadLocalValue.

My question is: are we using @[ThreadLocal] somewhere else? If so, we should try using Crystal::ThreadLocalValue instead. I'm almost sure it will work fine then.

@asterite
Copy link
Member

asterite commented Feb 9, 2023

There's this: af251b5

@asterite
Copy link
Member

asterite commented Feb 9, 2023

If you change jit_stack to use Crystal::ThreadLocalValue, 99% chances of fixing this bug 👍

@asterite
Copy link
Member

asterite commented Feb 9, 2023

That said, I can't remember why using @[ThreadLocal] only for Thread was fine. Maybe threads are always referenced somewhere else so it doesn't matter for the GC.

@HertzDevil
Copy link
Contributor

I think we can remove the recursion too if we use pthread_self / GetCurrentThreadId directly as the key without having to construct a Crystal Thread for every system thread. (Thread.current does it anyway, so the threads shouldn't be garbage-collected)

@HertzDevil
Copy link
Contributor

HertzDevil commented Feb 9, 2023

...or maybe not, or at least we have to use a plain array rather than a Hash, for maximum compatibility:

POSIX.1 allows an implementation wide freedom in choosing the type used to represent a thread ID; for example, representation using either an arithmetic type or a structure is permitted. Therefore, variables of type pthread_t can't portably be compared using the C equality operator (==); use pthread_equal(3) instead.

@asterite
Copy link
Member

asterite commented Feb 9, 2023

But I don't think there's an issue with using @[ThreadLocal] for that particular Thread.current case.

Could you try it?

@HertzDevil
Copy link
Contributor

HertzDevil commented Feb 9, 2023

I tried it, and indeed Thread.@@current must be written the way it is now, because when -Dpreview_mt is set, __crystal_once ultimately relies on it being nil before, well, all the constant initializers are run. If it were a Crystal::ThreadLocalValue then __crystal_once would try to use it before it were initialized, through a Mutex.

I don't think there is an easy way out for that @[ThreadLocal]. However we should make the threads API public as soon as possible in case library authors stumble across this annotation only to realize it doesn't work as expected.

@asterite
Copy link
Member

asterite commented Feb 9, 2023

Maybe we should also move that annotation inside the Crystal namespace, and stop documenting it.

@straight-shoota
Copy link
Member Author

straight-shoota commented Feb 9, 2023

Alternatively, clearly document what it does and what not.
Removing it from the public API should be good though.

@straight-shoota
Copy link
Member Author

Actually, it's defined by the compiler and part of the interface for stdlib. It's not in the stdlib API.
So I think that's fine as it is. We just need to improve the documentation in the language reference.

@asterite
Copy link
Member

asterite commented Feb 9, 2023

I think we remove it from the language reference:

https://crystal-lang.org/reference/1.7/syntax_and_semantics/annotations/built_in_annotations.html#threadlocal

Its only use is in Thread for that specific variable. Nobody should use it anywhere else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:runtime topic:stdlib:text
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants