SIGSEGV while compressing on 1.5.0-4 version #327

Open
aidar-stripe opened this issue Oct 2, 2024 · 7 comments

aidar-stripe commented Oct 2, 2024

👋 @luben!

We've occasionally been seeing segmentation faults while compressing Parquet files with zstd-jni. In our environment, zstd-jni 1.5.0-4 is loaded via Spark's ZStdCompressionCodec, and libzstd 1.4.4 is loaded via Hadoop's ZStandardCodec.

The segfault happens very rarely, and a Spark task retry usually resolves it, but I'm still curious about why it happens.

It's also interesting that execution jumps from zstd-jni into the system zstd library:

(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007f25dff75280  0x00007f25dff85f9b  Yes (*)     /lib/x86_64-linux-gnu/libz.so.1
0x00007f25dff56ae0  0x00007f25dff66535  Yes (*)     /lib/x86_64-linux-gnu/libpthread.so.0
0x00007f25dff40a20  0x00007f25dff49981  Yes         /usr/lib/jvm/java-11-openjdk-amd64/bin/../lib/jli/libjli.so
0x00007f25dff38220  0x00007f25dff39179  Yes (*)     /lib/x86_64-linux-gnu/libdl.so.2
0x00007f25dfd67630  0x00007f25dfedc4bd  Yes (*)     /lib/x86_64-linux-gnu/libc.so.6
0x00007f25dff9b100  0x00007f25dffbd684  Yes (*)     /lib64/ld-linux-x86-64.so.2
0x00007f25decfdfd0  0x00007f25df98a0a2  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/server/libjvm.so
0x00007f25de936120  0x00007f25dea1e332  Yes (*)     /lib/x86_64-linux-gnu/libstdc++.so.6
0x00007f25de7563c0  0x00007f25de7fcfa8  Yes (*)     /lib/x86_64-linux-gnu/libm.so.6
0x00007f25de7315e0  0x00007f25de742055  Yes (*)     /lib/x86_64-linux-gnu/libgcc_s.so.1
0x00007f25de625720  0x00007f25de628d70  Yes (*)     /lib/x86_64-linux-gnu/librt.so.1
0x00007f25de616620  0x00007f25de61cdca  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libverify.so
0x00007f25de5f2860  0x00007f25de606e11  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libjava.so
0x00007f25dff91360  0x00007f25dff9347b  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libjimage.so
0x00007f25d5da4320  0x00007f25d5daa09d  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libinstrument.so
0x00007f25d5d885c0  0x00007f25d5d8ea1c  Yes (*)     /lib/x86_64-linux-gnu/libnss_files.so.2
0x00007f25d5d7c620  0x00007f25d5d800b6  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libzip.so
0x00007f259a8aff20  0x00007f259a8b656a  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libnio.so
0x00007f259a8949a0  0x00007f259a8a1edb  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libnet.so
0x00007f259a187120  0x00007f259a187e70  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libmanagement.so
0x00007f259a17e400  0x00007f259a180590  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libmanagement_ext.so
                                        No          /pay/hadoop/yarn/local/usercache/root/appcache/application_1727194033885_13706/container_e219_1727194033885_13706_01_000966/tmp/libio_grpc_netty_shaded_netty_transport_native_epoll_x86_645600534466742408469.so
0x00007f2594921120  0x00007f25949217cf  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libextnet.so
0x00007f2592becd00  0x00007f2592bffcd4  Yes         /pay/hadoop-3.2.1/lib/native/libhadoop.so.1.0.0
0x00007f25948fa0e0  0x00007f25948fa49f  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libjaas.so
0x00007f2593bdf320  0x00007f2593be2998  Yes (*)     /lib/x86_64-linux-gnu/libnss_dns.so.2
0x00007f2593bc6720  0x00007f2593bd511c  Yes (*)     /lib/x86_64-linux-gnu/libresolv.so.2
0x00007f2593b98220  0x00007f2593bb0f54  Yes         /usr/lib/jvm/java-11-openjdk-amd64/lib/libsunec.so
                                        No          /pay/hadoop/yarn/local/usercache/root/appcache/application_1727194033885_13706/container_e219_1727194033885_13706_01_000966/tmp/liblz4-java-5498418542349824384.so
0x00007f2589c89400  0x00007f2589d6f8d4  Yes (*)     /pay/hadoop/yarn/local/usercache/root/appcache/application_1727194033885_13706/container_e219_1727194033885_13706_01_000966/tmp/libzstd-jni-1.5.0-417438354713579205691.so
0x00007f2593154240  0x00007f25931e5d0a  Yes (*)     /lib/x86_64-linux-gnu/libzstd.so.1
(*): Shared library is missing debugging information.
aidar@XXXXXXX:~$ ll /lib/x86_64-linux-gnu/libzstd.so.1
lrwxrwxrwx 1 root root 16 Mar  3  2021 /lib/x86_64-linux-gnu/libzstd.so.1 -> libzstd.so.1.4.4
(gdb) bt
#0  0x00007f25dfd8800b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f25dfd67859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f25decfe435 in os::abort (dump_core=<optimized out>, siginfo=<optimized out>, context=<optimized out>) at ./src/hotspot/os/linux/os_linux.cpp:1667
#3  0x00007f25df91735d in VMError::report_and_die (id=<optimized out>, message=message@entry=0x0, detail_fmt=<optimized out>, detail_args=detail_args@entry=0x7f258c8bd0e8, thread=thread@entry=0x556ea2ab1800, pc=pc@entry=0x7f259315cd9e <ZSTD_compressBlock_internal+1006> "\203\272\330\021", siginfo=0x7f258c8bd470, context=0x7f258c8bd340, filename=<optimized out>, lineno=0, size=0) at ./src/hotspot/share/utilities/vmError.cpp:1630
#4  0x00007f25df917ecf in VMError::report_and_die (thread=thread@entry=0x556ea2ab1800, sig=sig@entry=11, pc=pc@entry=0x7f259315cd9e <ZSTD_compressBlock_internal+1006> "\203\272\330\021", siginfo=siginfo@entry=0x7f258c8bd470, context=context@entry=0x7f258c8bd340, detail_fmt=detail_fmt@entry=0x7f25df9d462e "%s") at ./src/hotspot/share/utilities/vmError.cpp:1277
#5  0x00007f25df917f02 in VMError::report_and_die (thread=thread@entry=0x556ea2ab1800, sig=sig@entry=11, pc=pc@entry=0x7f259315cd9e <ZSTD_compressBlock_internal+1006> "\203\272\330\021", siginfo=siginfo@entry=0x7f258c8bd470, context=context@entry=0x7f258c8bd340) at ./src/hotspot/share/utilities/vmError.cpp:1283
#6  0x00007f25df66c836 in JVM_handle_linux_signal (sig=sig@entry=11, info=info@entry=0x7f258c8bd470, ucVoid=ucVoid@entry=0x7f258c8bd340, abort_if_unrecognized=abort_if_unrecognized@entry=1) at ./src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp:617
#7  0x00007f25df65f69c in signalHandler (sig=11, info=0x7f258c8bd470, uc=0x7f258c8bd340) at ./src/hotspot/os/linux/os_linux.cpp:5027
#8  <signal handler called>
#9  ZSTD_compressBlock_internal (zc=zc@entry=0x556ea7610b50, dst=dst@entry=0x45cb80759, dstCapacity=dstCapacity@entry=131582, src=src@entry=0x556eaa263149, srcSize=srcSize@entry=0, frame=frame@entry=1) at compress/zstd_compress.c:2440
#10 0x00007f259315d850 in ZSTD_compress_frameChunk (lastFrameChunk=0, srcSize=131072, src=<optimized out>, dstCapacity=131585, dst=0x45cb80756, cctx=0x556ea7610b50) at compress/zstd_compress.c:2511
#11 ZSTD_compressContinue_internal (cctx=0x556ea7610b50, dst=0x45cb80756, dstCapacity=<optimized out>, src=<optimized out>, srcSize=131072, frame=frame@entry=1, lastFrameChunk=0) at compress/zstd_compress.c:2654
#12 0x00007f259315dbb5 in ZSTD_compressContinue (cctx=<optimized out>, dst=<optimized out>, dstCapacity=<optimized out>, src=<optimized out>, srcSize=<optimized out>) at compress/zstd_compress.c:2678
#13 0x00007f2589d48667 in ZSTD_compressStream2 () from /pay/hadoop/yarn/local/usercache/root/appcache/application_1727194033885_13706/container_e219_1727194033885_13706_01_000966/tmp/libzstd-jni-1.5.0-417438354713579205691.so
#14 0x00007f2589d6de1e in Java_com_github_luben_zstd_ZstdOutputStreamNoFinalizer_compressStream () from /pay/hadoop/yarn/local/usercache/root/appcache/application_1727194033885_13706/container_e219_1727194033885_13706_01_000966/tmp/libzstd-jni-1.5.0-417438354713579205691.so
#15 0x00007f25cf03ecb4 in ?? ()
#16 0x0000000000010000 in ?? ()
#17 0x00000004c67dff68 in ?? ()
#18 0x0000000098cfbfed in ?? ()
#19 0x000000045cb80740 in ?? ()
#20 0x00000004c67dfef8 in ?? ()
#21 0x00000004c67afc78 in ?? ()
#22 0x000000045cba19e8 in ?? ()
#23 0x0000000000000000 in ?? ()
(gdb) frame 9
#9  ZSTD_compressBlock_internal (zc=zc@entry=0x556ea7610b50, dst=dst@entry=0x45cb80759, dstCapacity=dstCapacity@entry=131582, src=src@entry=0x556eaa263149, srcSize=srcSize@entry=0, frame=frame@entry=1) at compress/zstd_compress.c:2440
2440	compress/zstd_compress.c: No such file or directory.

(gdb) print zc
$1 = (ZSTD_CCtx *) 0x556ea7610b50

(gdb) print zc->blockState
$2 = {prevCBlock = 0x0, nextCBlock = 0x0, matchState = {window = {nextSrc = 0x556eaa283149 "", base = 0x556eaa263149 "(\002", dictBase = 0x0, dictLimit = 0, lowLimit = 0}, loadedDictEnd = 0, nextToUpdate = 0, hashLog3 = 0, hashTable = 0x100000001, hashTable3 = 0x556eaa1eaf40, chainTable = 0x0, opt = {litFreq = 0x556eaa483149, litLengthFreq = 0x0, matchLengthFreq = 0x556eaa23af48, offCodeFreq = 0x556eaa232f48, matchTable = 0x556eaa22af48, priceTable = 0x8000, litSum = 131072, litLengthSum = 0, matchLengthSum = 0, offCodeSum = 0, litSumBasePrice = 0, litLengthSumBasePrice = 0, matchLengthSumBasePrice = 0, offCodeSumBasePrice = 0, priceType = zop_dynamic, symbolCosts = 0x0, literalCompressionMode = ZSTD_lcm_auto}, dictMatchState = 0x0, cParams = {windowLog = 0, chainLog = 0, hashLog = 0, searchLog = 0, minMatch = 0, targetLength = 0, strategy = 0}}}

(gdb) print zc->blockState.prevCBlock
$3 = (ZSTD_compressedBlockState_t *) 0x0

(gdb) print zc->blockState.prevCBlock->entropy
Cannot access memory at address 0x0
(gdb)

Do you know if this is expected if we have different versions of zstd loaded? Should we keep them aligned?
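
For anyone hitting something similar, one quick way to confirm that two zstd builds are mapped into the same process, without attaching gdb, is to scan `/proc/<pid>/maps`. The following is only a minimal sketch: the helper name `zstd_mappings` is made up for illustration, and it assumes the usual Linux `maps` line layout (range, permissions, offset, device, inode, path).

```python
def zstd_mappings(maps_text):
    """Return the sorted, de-duplicated file paths in /proc/<pid>/maps
    text whose file name contains 'zstd'."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: range, perms, offset, dev, inode, [path].
        # Anonymous mappings have no sixth column and are skipped.
        fields = line.split(None, 5)
        if len(fields) != 6:
            continue
        path = fields[5].strip()
        if "zstd" in path.rsplit("/", 1)[-1]:
            paths.add(path)
    return sorted(paths)
```

Feeding it the contents of `/proc/<pid>/maps` for the executor JVM should list both the extracted libzstd-jni temp file and the system libzstd if both are loaded.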

Owner

luben commented Oct 3, 2024

No idea, I don't think this is mixing up libzstd and libzstd-jni. Can you try replacing Spark's zstd-jni with a more recent version? It is API/ABI compatible.

Author

aidar-stripe commented Oct 7, 2024

I was under the impression that it jumps between the two libraries, considering the address ranges for

0x00007f2589c89400  0x00007f2589d6f8d4  Yes (*)     .../libzstd-jni-1.5.0-417438354713579205691.so
0x00007f2593154240  0x00007f25931e5d0a  Yes (*)      .../libzstd.so.1

and the call stack goes

...
#12 0x00007f259315dbb5  <---- is between [0x00007f2593154240  0x00007f25931e5d0a] libzstd.so.1
#13 0x00007f2589d48667 <---- is between [0x00007f2589c89400  0x00007f2589d6f8d4] libzstd-jni-1.5.0-417438354713579205691.so 
...

A similar core dump happens with libzstd 1.5.5 and libzstd-jni 1.5.0-4. But I'm not sure whether this ^ has anything to do with the segfault, or whether it's just a core zstd issue. Would you recommend moving the issue to https://github.com/facebook/zstd?
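
The range check above can be reproduced mechanically. This is just a small sketch using the address ranges and frame addresses quoted in this thread; the function name `owning_library` is made up for illustration.

```python
# Address ranges copied from the `info sharedlibrary` output in this thread.
LIBRARIES = {
    "libzstd-jni": (0x00007F2589C89400, 0x00007F2589D6F8D4),
    "libzstd.so.1": (0x00007F2593154240, 0x00007F25931E5D0A),
}

# Frame addresses copied from the backtrace (#10, #12, #13, #14).
FRAMES = {
    10: 0x00007F259315D850,
    12: 0x00007F259315DBB5,
    13: 0x00007F2589D48667,
    14: 0x00007F2589D6DE1E,
}


def owning_library(address, libraries=LIBRARIES):
    """Return the name of the library whose range contains address, else None."""
    for name, (start, end) in libraries.items():
        if start <= address <= end:
            return name
    return None


for frame, address in sorted(FRAMES.items()):
    print(f"#{frame} {address:#018x} -> {owning_library(address)}")
```

Frames #10 and #12 fall inside libzstd.so.1, while #13 and #14 fall inside libzstd-jni, so the call really does cross from one library into the other.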

Owner

luben commented Oct 8, 2024

Hmm, that's interesting. I don't think https://github.com/facebook/zstd will be able to help if that's the case. We will need some linker magic to fix that. What happens if you match the zstd and zstd-jni versions?

Owner

luben commented Oct 8, 2024

Just pushed 0c2051b - it may help here. And you should not have problems in the other direction as libzstd-jni does not export all symbols as global.
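
To see which `ZSTD_*` symbols a given build exposes in its dynamic symbol table, the output of `nm -D --defined-only` can be filtered. This is a rough sketch, not a description of what the commit above changes: the helper names are made up, it assumes GNU `nm` is on the PATH, and it assumes the usual three-column `address type name` output.

```python
import subprocess


def exported_zstd_symbols(nm_output):
    """Names of defined dynamic symbols starting with 'ZSTD_' found in
    the text output of `nm -D --defined-only`."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Expected shape: <address> <type> <name>; other lines are ignored.
        if len(fields) != 3:
            continue
        # Drop an optional symbol-version suffix such as "@@VERS_1".
        name = fields[2].split("@", 1)[0]
        if name.startswith("ZSTD_"):
            names.add(name)
    return sorted(names)


def exported_from(path):
    """Run nm on a shared object and return its exported ZSTD_* symbols."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", path],
        check=True, capture_output=True, text=True,
    )
    return exported_zstd_symbols(result.stdout)
```

Comparing the result for the extracted libzstd-jni `.so` with the result for `/lib/x86_64-linux-gnu/libzstd.so.1` gives the names that both libraries export, which are the candidates for being resolved across the library boundary.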

Owner

luben commented Oct 8, 2024

I will push it to maven later this week.

Author

> Hmm, that's interesting. I don't think https://github.com/facebook/zstd will be able to help if that's the case. We will need some linker magic to fix that. What happens if you match the zstd and zstd-jni versions?

It's quite hard to reproduce the fault: the probability of a task failing is very low, and I'm not sure which conditions lead to it, so I can't create a reproducible test. I tried simple compression/decompression with both libraries loaded but could not trigger the crash.

However, we will most likely align the zstd-jni version with the libzstd version on Ubuntu Noble (1.5.5) and report back on whether it helped with the issue.

Owner

luben commented Oct 28, 2024

Hi, sorry it took me so long. I have just pushed 1.5.6-7 to Maven.
