Endless AGW crashloop, mme exiting on start with Sanitizer CHECK failed #15279
Comments
Hello @rechia-venko I'm having the same issue, did you find what causes it? |
We had this error with our AGW in 1.6.1, but even with a brand new AGW in 1.8.0 (orc8r also in 1.8.0) the issue continues. |
Related issue. https://www.openwall.com/lists/musl/2015/07/02/13 One more ref - rust-lang/rust#111073 |
We are facing this issue in recent local Docker image builds of the AGW. A recent update to the gcc-10-base package in the Ubuntu sources (https://launchpad.net/ubuntu/+source/gcc-10/+changelog) upgraded liblsan0 from 10.3.0 to 10.5.0. |
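A quick way to tell whether a host carries the liblsan0 version mentioned above is to compare the installed version string against 10.5.0. The following is a minimal sketch: the function name `lsan_version_affected` is hypothetical, and it only flags the 10.5.0 series named in this thread. Whether other versions show the same behavior is not established here.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the given liblsan0 version string
# belongs to the 10.5.0 series reported as problematic in this thread.
lsan_version_affected() {
    case "$1" in
        10.5.0*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example on a Debian/Ubuntu host (dpkg-query prints the installed version):
#   lsan_version_affected "$(dpkg-query -W -f='${Version}' liblsan0)" \
#       && echo "liblsan0 is in the 10.5.0 series"
```

The check is deliberately a plain prefix match, so it does not depend on the distribution-specific suffix of the package version.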
It seems that gcc-10-base is no longer available in version 10.3.0. Did you manage to downgrade it? |
Yes, 10.3.0 is not available now, but version 10.2.1 of the deb packages for gcc-10-base and liblsan0 is available at http://ftp.debian.org/debian/pool/main/g/gcc-10/. However, we have not been able to find the root cause of why the mme service exits with the updated 10.5.0 liblsan0 package. |
Great tip, thanks a lot @harsharao87! We managed to bring our agw back up with the following:
wget https://ftp.debian.org/debian/pool/main/g/gcc-10/liblsan0_10.2.1-6_amd64.deb
wget https://ftp.debian.org/debian/pool/main/g/gcc-10/gcc-10-base_10.2.1-6_amd64.deb
sudo dpkg -i gcc-10-base_10.2.1-6_amd64.deb liblsan0_10.2.1-6_amd64.deb
We'll likely have to do this in production environments. Better to have something working than nothing. 😂 |
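Since a later comment reports that unattended-upgrades re-upgraded these packages after a reboot, one possible follow-up to the downgrade is to put the two packages on hold with `apt-mark hold`. This is a sketch, not something the thread participants confirmed doing: the function name and the `DRY_RUN` switch are hypothetical, and a hold does not address the broken-dependency problem reported further down.

```shell
#!/bin/sh
# Packages downgraded by the workaround above.
PKGS="gcc-10-base liblsan0"

# Hypothetical helper: mark the packages as held so apt does not upgrade them.
# With DRY_RUN=1 the command is only printed, not executed.
hold_sanitizer_pkgs() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "apt-mark hold $PKGS"
    else
        # $PKGS is intentionally unquoted so it splits into two arguments.
        sudo apt-mark hold $PKGS
    fi
}

# Usage:
#   hold_sanitizer_pkgs          # apply the hold
#   apt-mark showhold            # list held packages
#   sudo apt-mark unhold gcc-10-base liblsan0   # undo later
```

A hold keeps the pinned versions in place until it is lifted, so it should be removed once a fixed package is available.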
We saw the issue in v1.6.1 (service-based architecture). Here the AGW rebooted, and after the reboot unattended-upgrades.service upgraded these packages. (You might want to check /var/log/dpkg.log, where you can see the package upgrade happening.) I think we are good in terms of workaround and root cause analysis, but what is the long-term fix? :) |
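The dpkg.log check suggested above can be sketched as a small filter. The function name is hypothetical; the pattern assumes the usual dpkg log line shape, in which the action word `upgrade` is followed by `package:arch` and the old and new versions.

```shell
#!/bin/sh
# Hypothetical helper: print dpkg log entries that record an upgrade of
# gcc-10-base or liblsan0. Defaults to the standard dpkg log location.
find_sanitizer_upgrades() {
    log="${1:-/var/log/dpkg.log}"
    grep -E ' upgrade (gcc-10-base|liblsan0):' "$log"
}

# Usage:
#   find_sanitizer_upgrades                      # current log
#   find_sanitizer_upgrades /var/log/dpkg.log.1  # rotated log
```

If the upgrade happened some time ago, the entry may be in a rotated (possibly compressed) log file rather than the current one.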
Docker containers for x86_64 are also impacted. Workaround is to add the following lines in
before the line
|
The workaround works, but afterwards it prevents us from further upgrading the system due to broken dependencies. For instance, that OVS upgrade script doesn't work anymore for us. |
Forgive me for possibly going off topic, but has anyone else experienced problems with UE connections after applying this workaround, with the AGW in non-NAT mode? |
@Darlanewe, we don't see this behavior. 4 AGW in bridge mode. Workaround from rechia-venko |
I just found out that a fresh agw docker installation works. Interestingly, the magmad container has the older version of liblsan 10.3.0 running: I don't know where the docker takes its libraries from, but it is different from the host in this case. Can we be sure that the containers will always stay with the old version? |
Could this case be due to a real leak detected by the leak sanitizer? Or
some other kind of segmentation fault that is happening, and caught by the
leak sanitizer just before exiting?
On Thu, Sep 14, 2023 at 15:44, Darlan Ewerling ***@***.***>
wrote:
… hi there,
we experienced the same mme malfunction today in one of our agw's
the error message is different but it has to do with sanitizer also
follow the error log
[image: image]
<https://user-images.githubusercontent.com/96587737/268077905-f4d5b22b-4c17-4051-9ee4-e29cc4d2395e.png>
the gcc-10-base and liblsan0 downgrades are still applied, per the
rechia-venko fix
the most strange thing about it is that the services were not restarted
(at least not on purpose)
it was working perfectly and then the s1 envoy message came up and the mme
entered the restart loop
|
I can't say I have all the variables in this scenario, but |
May I ask what is the status of this ? |
Update: TL;DR:
Detail: I launched the MME with LSAN_OPTIONS=abort_on_error=1, which produced a core dump.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f1080c9f859 in __GI_abort () at abort.c:79
#2 0x00007f108209dc22 in __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:155
#3 0x00007f10820a92fc in __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
#4 0x00007f10820a9381 in __sanitizer::CheckFailed (file=file@entry=0x7f10820c1a90 "../../../../src/libsanitizer/lsan/lsan_interceptors.cpp", line=line@entry=53,
cond=cond@entry=0x7f10820c1040 "((!lsan_init_is_running)) != (0)", v1=v1@entry=0, v2=v2@entry=0) at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:82
#5 0x00007f108208da55 in __interceptor_malloc (size=size@entry=79) at ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53
#6 0x00007f108299feb9 in __GI__dl_exception_create_format (exception=exception@entry=0x7fffc7eb2780, objname=0x7f10829b6ea0 "/lib/x86_64-linux-gnu/liblsan.so.0", fmt=fmt@entry=0x7f10829ac6f5 "undefined symbol: %s%s%s")
at dl-exception.c:146
#7 0x00007f10829933bd in _dl_lookup_symbol_x (undef_name=0x7f10820c5abb "_thread_db_sizeof_pthread", undef_map=0x7f1082970000, ref=0x7fffc7eb2808, symbol_scope=0x7f1082970368, version=0x0, type_class=0, flags=3,
skip_map=0x0) at dl-lookup.c:878
#8 0x00007f1080ddd44d in do_sym (flags=<optimized out>, vers=0x0, who=0x7f108209b088 <__sanitizer::ThreadDescriptorSize()+40>, name=0x7f10820c5abb "_thread_db_sizeof_pthread", handle=<optimized out>) at dl-sym.c:117
#9 _dl_sym (handle=<optimized out>, name=0x7f10820c5abb "_thread_db_sizeof_pthread", who=0x7f108209b088 <__sanitizer::ThreadDescriptorSize()+40>) at dl-sym.c:274
#10 0x00007f1080c764a8 in dlsym_doit (a=a@entry=0x7fffc7eb2a50) at dlsym.c:50
#11 0x00007f1080ddd928 in __GI__dl_catch_exception (exception=exception@entry=0x7fffc7eb29e0, operate=operate@entry=0x7f1080c76490 <dlsym_doit>, args=args@entry=0x7fffc7eb2a50) at dl-error-skeleton.c:208
#12 0x00007f1080ddd9f3 in __GI__dl_catch_error (objname=objname@entry=0x7f10820dc210 <__interceptor_calloc::calloc_memory_for_dlsym+16>,
errstring=errstring@entry=0x7f10820dc218 <__interceptor_calloc::calloc_memory_for_dlsym+24>, mallocedp=mallocedp@entry=0x7f10820dc208 <__interceptor_calloc::calloc_memory_for_dlsym+8>,
operate=operate@entry=0x7f1080c76490 <dlsym_doit>, args=args@entry=0x7fffc7eb2a50) at dl-error-skeleton.c:227
#13 0x00007f1080c76b59 in _dlerror_run (operate=operate@entry=0x7f1080c76490 <dlsym_doit>, args=args@entry=0x7fffc7eb2a50) at dlerror.c:170
#14 0x00007f1080c76525 in __dlsym (handle=handle@entry=0x0, name=name@entry=0x7f10820c5abb "_thread_db_sizeof_pthread") at dlsym.c:70
#15 0x00007f108209b088 in __sanitizer::ThreadDescriptorSize () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:320
#16 0x00007f108209bbbe in __sanitizer::ThreadDescriptorSize () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:316
#17 __sanitizer::GetTls (size=0x7fffc7eb2b88, addr=0x7fffc7eb2bb0) at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:447
#18 __sanitizer::GetThreadStackAndTls (main=main@entry=true, stk_addr=stk_addr@entry=0x7fffc7eb2b90, stk_size=stk_size@entry=0x7fffc7eb2b80, tls_addr=tls_addr@entry=0x7fffc7eb2bb0, tls_size=tls_size@entry=0x7fffc7eb2b88)
at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:520
#19 0x00007f108208f7db in __lsan::ThreadStart (tid=tid@entry=0, os_id=1996211, thread_type=thread_type@entry=__sanitizer::ThreadType::Regular) at ../../../../src/libsanitizer/lsan/lsan_thread.cpp:83
#20 0x00007f1082089072 in __lsan_init () at ../../../../src/libsanitizer/lsan/lsan.cpp:119
#21 0x00007f1082998cf6 in _dl_init (main_map=0x7f10829b6190, argc=5, argv=0x7fffc7eb2c58, env=0x7fffc7eb2c88) at dl-init.c:104
#22 0x00007f108298813a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#23 0x0000000000000005 in ?? ()
#24 0x00007fffc7eb47da in ?? ()
#25 0x00007fffc7eb47ed in ?? ()
#26 0x00007fffc7eb47f0 in ?? ()
#27 0x00007fffc7eb480c in ?? ()
#28 0x00007fffc7eb480f in ?? ()
#29 0x0000000000000000 in ?? ()
There we see that
Funnily, the "interceptor" code contains specific steps for
As we can see in the above stack trace,
Later versions of gcc fix that by using a dedicated allocator for dlsym.
End of the story?
Okay, no problem, we have radical options: just get rid of the bugger.
void __lsan_init(void){}
And replace the (dysfunctional) system-installed liblsan0. MME happy and running without any downgrade :) |
hi @ferrieux |
Just compile with |
Note, I have reported the gcc bug to Ubuntu: |
Your Environment
Describe the Issue
The MME service no longer starts; it stays in an endless crashloop together with mobilityd, pipelined, and sessiond.
We have observed that the mme service keeps restarting with this error:
Aug 02 14:17:41 magma mme[4004418]: ==4004418==Sanitizer CHECK failed: ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53 ((!lsan_init_is_running)) != (0) (0, 0)
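To confirm that a restart loop is this particular failure, the log can be scanned for the CHECK-failed signature shown above. This is a minimal sketch: the function name is hypothetical, and the `magma@mme` unit name in the usage comment is an assumption based on the `magma@magmad` unit mentioned in this report.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print how many lines carry
# the LeakSanitizer CHECK failure raised from lsan_interceptors.cpp.
count_lsan_check_failures() {
    grep -c 'Sanitizer CHECK failed: .*lsan_interceptors\.cpp'
}

# Usage (unit name is an assumption):
#   journalctl -u magma@mme --since today | count_lsan_check_failures
#   count_lsan_check_failures < /var/log/syslog
```

A count that keeps growing while the service restarts points to the sanitizer initialization failure rather than an unrelated crash.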
To Reproduce
We don't know why this started happening. We were having issues with the user data plane (no traffic was being forwarded from the UE to the Internet). At some point during troubleshooting we decided to:
systemctl stop magma@*
systemctl start magma@magmad
Then the mme service never came back up.
Expected behavior
the mme service should come up
Screenshots
Here's the output of health_cli.py:
Additional context
Here's a syslog that shows part of the problem.
Aug 1 11:04 - at around this time we tried to systemctl stop magma@*. A bunch of errors show up in the syslog when the services are going down.
Aug 1 11:07:15 - we attempt to restart the service with systemctl start magma@magmad. Many errors show up, mainly related to redis connections.
Aug 1 11:07:24 - we see the first entry of this error, which now occurs constantly: magma mme[4019572]: ==4019572==Sanitizer CHECK failed: ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53 ((!lsan_init_is_running)) != (0) (0, 0)
here's the full syslog of when the error started happening. syslog.1.log.gz
The test scenario that we were trying to bring up was the S1 handover using srsran and zmq radio https://docs.srsran.com/projects/4g/en/latest/app_notes/source/handover/source/index.html#s1-handover.
temporary workaround
try at your own risk
Downgrade liblsan0 and gcc-10-base of your agw from 10.5.0 to a lower version: