Endless AGW crashloop, mme exiting on start with Sanitizer CHECK failed #15279

rechia-venko · 2023-08-02T17:38:17Z

Your Environment

Version: 1.8
Affected Component: Access Gateway
Affected Subcomponent: AGW MME, sessiond, mobilityd, pipelined
Deployment Environment: e.g. bare-metal AGW, VM orc8r deployed with magma-galaxy (https://github.com/ShubhamTatvamasi/magma-galaxy)

Describe the Issue

The MME service no longer starts, it stays in an endless crashloop together with mobilityd, pipelined, sessiond.

We have observed that the mme service keeps restarting with this error:

Aug 02 14:17:41 magma mme[4004418]: ==4004418==Sanitizer CHECK failed: ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53 ((!lsan_init_is_running)) != (0) (0, 0)

Aug 02 14:17:30 magma systemd[1]: magma@mme.service: Main process exited, code=exited, status=23/n/a
Aug 02 14:17:31 magma systemd[1]: magma@mme.service: Failed with result 'exit-code'.
Aug 02 14:17:36 magma systemd[1]: magma@mme.service: Scheduled restart job, restart counter is at 1451.
Aug 02 14:17:36 magma systemd[1]: Stopped Magma OAI MME service.
Aug 02 14:17:40 magma systemd[1]: Starting Magma OAI MME service...
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: enable_nat
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: congestion_control_enabled
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: enable5g_features
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_default_slice_service_type
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_default_slice_differentiator
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_name
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_region_id
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_set_id
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: amf_pointer
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: default_dnn
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: auth_retry_max_count
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for mme, key not found: auth_retry_interval
Aug 02 14:17:40 magma mme[4004398]: ERROR:root:Error retrieving config for spgw, key not found: ovs_internal_sampling_fwd_tbl
Aug 02 14:17:40 magma mme[4004398]: WARNING:root:Service (spgw) missing in mconfig
Aug 02 14:17:41 magma mme[4004416]: STATELESS	mme -> use_stateless
Aug 02 14:17:41 magma mme[4004416]: STATELESS	pipelined -> clean_restart
Aug 02 14:17:41 magma mme[4004416]: STATELESS	pipelined -> redis_enabled
Aug 02 14:17:41 magma mme[4004416]: STATELESS	sessiond -> support_stateless
Aug 02 14:17:41 magma mme[4004416]: Check returning return_codes.STATELESS
Aug 02 14:17:41 magma mme[4004416]: AGW is stateless, no need to restart Sctpd
Aug 02 14:17:41 magma systemd[1]: Started Magma OAI MME service.
Aug 02 14:17:41 magma mme[4004418]: ==4004418==Sanitizer CHECK failed: ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53 ((!lsan_init_is_running)) != (0) (0, 0)
Aug 02 14:17:41 magma systemd[1]: magma@mme.service: Main process exited, code=exited, status=23/n/a
Aug 02 14:17:42 magma systemd[1]: magma@mme.service: Failed with result 'exit-code'.

To Reproduce

We don't know why this is started happening. We were having issues with user data plane (no traffic was being forwarded from UE to the Internet). At some point during troubleshooting we decided to:

systemctl stop magma@*
systemctl start magma@magmad

Then the mme service never came back up.

Expected behavior

the mme service should come up

Screenshots

Here's the output of health_cli.py:

Additional context

Here's a syslog that shows part of the problem.

Aug 1 11:04 - at around this time we tried to systemctl stop magma@*. A bunch of errors show up in the syslog when the services are going down.
Aug 1 11:07:15 - we attempt to restart the service with systemctl start magma@magmad. Many errors show up, mainly related to redis connections.
Aug 1 11:07:24 - we see the first entry of this error, that now always occurs constantly:

magma mme[4019572]: ==4019572==Sanitizer CHECK failed: ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53 ((!lsan_init_is_running)) != (0) (0, 0)

here's the full syslog of when the error started happening. syslog.1.log.gz

The test scenario that we were trying to bring up was the S1 handover using srsran and zmq radio https://docs.srsran.com/projects/4g/en/latest/app_notes/source/handover/source/index.html#s1-handover.

temporary workaround

try at your own risk

Downgrade liblsan0 anb gcc-10-base of your agw from 10.5.0 to a lower version:

wget https://ftp.debian.org/debian/pool/main/g/gcc-10/liblsan0_10.2.1-6_amd64.deb
wget https://ftp.debian.org/debian/pool/main/g/gcc-10/gcc-10-base_10.2.1-6_amd64.deb
sudo dpkg -i gcc-10-base_10.2.1-6_amd64.deb liblsan0_10.2.1-6_amd64.deb

The text was updated successfully, but these errors were encountered:

javieraubert · 2023-08-02T22:46:11Z

Hello @rechia-venko I'm having the same issue, did you find what causes it?

gabrielrbeiro · 2023-08-03T01:18:00Z

We had this error with our AGW in 1.6.1, but even with a brand new AGW in 1.8.0 (orc8r also in 1.8.0) the issue continue.

vinothinfinity · 2023-08-03T04:47:52Z

Related issue. https://www.openwall.com/lists/musl/2015/07/02/13
As suggested in link, upgradinggcc/glibc might help !!

One more ref - rust-lang/rust#111073

harsharao87 · 2023-08-03T08:08:35Z

We are facing this issue in the recent docker image builds of agw locally

There is a recent update in ubuntu sources for gcc-10-base package

https://launchpad.net/ubuntu/+source/gcc-10/+changelog

due to which the liblsan0 got upgraded from 10.3.0 to 10.5.0.

gabrielrbeiro · 2023-08-03T10:57:08Z

It seems that gcc-10-base isn't available anymore in version 10.3.0 did you manage to downgrade it ?

harsharao87 · 2023-08-03T13:08:35Z

yes 10.3.0 is not available now, but a 10.2.1 version of deb images for gcc-10-base and liblsan0 is available - http://ftp.debian.org/debian/pool/main/g/gcc-10/.
This will help you unblock for now, please try this out in development environments only, as the impact is unknown yet.

But, we are unable to root cause why mme service is exiting with updated 10.5.0 liblsan0 package.

rechia-venko · 2023-08-03T13:27:58Z

Great tip, thanks a lot @harsharao87! We managed to bring our agw back up with the following:

wget https://ftp.debian.org/debian/pool/main/g/gcc-10/liblsan0_10.2.1-6_amd64.deb
wget https://ftp.debian.org/debian/pool/main/g/gcc-10/gcc-10-base_10.2.1-6_amd64.deb
sudo dpkg -i gcc-10-base_10.2.1-6_amd64.deb liblsan0_10.2.1-6_amd64.deb

We'll lilkely have to do this in production environments. Better to have something working than nothing. 😂

vinothinfinity · 2023-08-03T13:32:16Z

We saw the issue in v1.6.1 (service based arch). Here AGW went for a reboot and post reboot unattended-upgrades.service upgraded these packages. (You might want to check /var/log/dpkg.log You can see upgrade for pkg happening)

I think WA wise and RCA wise we are good, but what is the long term fix ? :)

edaspb · 2023-08-04T17:04:07Z

As far as I see automatic packet update is active by default on AGW. Which can be potentially dangerous.

jordanvrtanoski · 2023-08-23T01:30:39Z

Docker containers for x86_64 are also impacted. Workaround is to add the following lines in magma/lte/gateway/docker/services/c/Dockerfile:

RUN wget https://ftp.debian.org/debian/pool/main/g/gcc-10/liblsan0_10.2.1-6_amd64.deb && \
    wget https://ftp.debian.org/debian/pool/main/g/gcc-10/gcc-10-base_10.2.1-6_amd64.deb && \
    dpkg -i gcc-10-base_10.2.1-6_amd64.deb liblsan0_10.2.1-6_amd64.deb

before the line

RUN ldconfig 2> /dev/null

rechia-venko · 2023-08-23T11:50:40Z

The workaround works, but afterwards it prevents us from further upgrading the system due to broken dependencies. For instance, that OVS upgrade script doesn't work anymore for us.

Darlanewe · 2023-08-24T16:43:26Z

forgive-me for speaking maybe out of topic, but anyone else after these workaround experienced problems with UE connections with the agw in non nat mode
because after the workaround all of the services up and running aparently normal in non nat mode the ue get the ip from the router as expected but then theres no communication of any kind, no ping, nothing and the only change was the downgrade of the packages to normalize mme services, if any hel is offered will be much appreciated, and again sorry if this is off topic

edaspb · 2023-08-24T21:08:17Z

@Darlanewe, we don't see this behavior. 4 AGW in bridge mode. Workaround from rechia-venko

rechia-venko · 2023-08-31T13:00:27Z

I just found out that a fresh agw docker installation works. Interestingly, the magmad container has the older version of liblsan 10.3.0 running:

I don't know where the docker takes its libraries from, but it is different from the host in this case. Can we be sure that the containers will always stay with the old version?

Darlanewe · 2023-09-14T18:44:20Z

hi there,
we experienced the same mme malfunction today in one of our agw's
the error message is different but it has to do with sanitizer also
follow the error log

the gcc-10-base and liblsan0 are still applied accordingly to the rechia venko fix

the most strange thing about it is that the services were not restarted (at least not on purpose)
it was working perfectly and then the s1 envoy message came up and the mme entered the restart loop

rechia-venko · 2023-09-14T19:21:19Z

Could this case be due to a real leak detected by the leak sanitizer? Or some other kind of segmentation fault that is happening, and caught by the leak sanitizer just before exiting? Em qui., 14 de set. de 2023 15:44, Darlan Ewerling ***@***.***> escreveu:

…

hi there, we experienced the same mme malfunction today in one of our agw's the error message is different but it has to do with sanitizer also follow the error log [image: image] <https://user-images.githubusercontent.com/96587737/268077905-f4d5b22b-4c17-4051-9ee4-e29cc4d2395e.png> the gcc-10-base and liblsan0 are still applied accordingly to thw rechia venko fix the most strange thing about it is that the services were not restarted (at least not on purpose) it was working perfectly and then the s1 envoy message came up and the mme entered the restart loop — Reply to this email directly, view it on GitHub <#15279 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/BAK7R4JFFCCYGTEVQLVOLQ3X2NGBDANCNFSM6AAAAAA3BTPIBI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Darlanewe · 2023-09-14T20:13:59Z

i can't say i have all the variables in this cenario but,
the files shown in the error with /home/vagrant (this directory does not exist on my agw) i have no idea how that happened
then i decided to renew the agw/orchestrator registration
and as soon as i deleted the /var/opt/magma/configs/gateway.mconfig
the mme came back to life

ferrieux · 2023-10-01T20:59:47Z

May I ask what is the status of this ?
Magma is supposed to be "production grade", yet has a fatal crash loop with an unexplained connection with the latest gcc. Then what ?
(a) gcc has a bug, Magma should be proud to have found it, be sure to advertise it here
(b) Magma has a bug, earlier gcc/sanitizers overlooked it, but now thankfully we can fix it. Issue number ?
(c) Nobody knows.
Which of the above is true ?

ferrieux · 2023-10-02T00:04:52Z

Update:

TL;DR:

(1) it is a gcc/sanitizer bug
(2) it is here to stay in gcc10 which is now end-of-support
(3) it is fixed in gcc11+ through reimplementation of dlsym allocator.
(4) since Ubuntu 20.04 is stuck to gcc10, a good workaround is to neutralize liblsan0 with a fake lib => NO NEED TO DOWNGRADE GCC ANY MORE

Detail:

I launched the MME with LSAN_OPTION=abort_on_error=1, which produced a core dump.
There I got a stack trace, shown below

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f1080c9f859 in __GI_abort () at abort.c:79
#2  0x00007f108209dc22 in __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:155
#3  0x00007f10820a92fc in __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
#4  0x00007f10820a9381 in __sanitizer::CheckFailed (file=file@entry=0x7f10820c1a90 "../../../../src/libsanitizer/lsan/lsan_interceptors.cpp", line=line@entry=53, 
    cond=cond@entry=0x7f10820c1040 "((!lsan_init_is_running)) != (0)", v1=v1@entry=0, v2=v2@entry=0) at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:82
#5  0x00007f108208da55 in __interceptor_malloc (size=size@entry=79) at ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:53
#6  0x00007f108299feb9 in __GI__dl_exception_create_format (exception=exception@entry=0x7fffc7eb2780, objname=0x7f10829b6ea0 "/lib/x86_64-linux-gnu/liblsan.so.0", fmt=fmt@entry=0x7f10829ac6f5 "undefined symbol: %s%s%s")
    at dl-exception.c:146
#7  0x00007f10829933bd in _dl_lookup_symbol_x (undef_name=0x7f10820c5abb "_thread_db_sizeof_pthread", undef_map=0x7f1082970000, ref=0x7fffc7eb2808, symbol_scope=0x7f1082970368, version=0x0, type_class=0, flags=3, 
    skip_map=0x0) at dl-lookup.c:878
#8  0x00007f1080ddd44d in do_sym (flags=<optimized out>, vers=0x0, who=0x7f108209b088 <__sanitizer::ThreadDescriptorSize()+40>, name=0x7f10820c5abb "_thread_db_sizeof_pthread", handle=<optimized out>) at dl-sym.c:117
#9  _dl_sym (handle=<optimized out>, name=0x7f10820c5abb "_thread_db_sizeof_pthread", who=0x7f108209b088 <__sanitizer::ThreadDescriptorSize()+40>) at dl-sym.c:274
#10 0x00007f1080c764a8 in dlsym_doit (a=a@entry=0x7fffc7eb2a50) at dlsym.c:50
#11 0x00007f1080ddd928 in __GI__dl_catch_exception (exception=exception@entry=0x7fffc7eb29e0, operate=operate@entry=0x7f1080c76490 <dlsym_doit>, args=args@entry=0x7fffc7eb2a50) at dl-error-skeleton.c:208
#12 0x00007f1080ddd9f3 in __GI__dl_catch_error (objname=objname@entry=0x7f10820dc210 <__interceptor_calloc::calloc_memory_for_dlsym+16>, 
    errstring=errstring@entry=0x7f10820dc218 <__interceptor_calloc::calloc_memory_for_dlsym+24>, mallocedp=mallocedp@entry=0x7f10820dc208 <__interceptor_calloc::calloc_memory_for_dlsym+8>, 
    operate=operate@entry=0x7f1080c76490 <dlsym_doit>, args=args@entry=0x7fffc7eb2a50) at dl-error-skeleton.c:227
#13 0x00007f1080c76b59 in _dlerror_run (operate=operate@entry=0x7f1080c76490 <dlsym_doit>, args=args@entry=0x7fffc7eb2a50) at dlerror.c:170
#14 0x00007f1080c76525 in __dlsym (handle=handle@entry=0x0, name=name@entry=0x7f10820c5abb "_thread_db_sizeof_pthread") at dlsym.c:70
#15 0x00007f108209b088 in __sanitizer::ThreadDescriptorSize () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:320
#16 0x00007f108209bbbe in __sanitizer::ThreadDescriptorSize () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:316
#17 __sanitizer::GetTls (size=0x7fffc7eb2b88, addr=0x7fffc7eb2bb0) at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:447
#18 __sanitizer::GetThreadStackAndTls (main=main@entry=true, stk_addr=stk_addr@entry=0x7fffc7eb2b90, stk_size=stk_size@entry=0x7fffc7eb2b80, tls_addr=tls_addr@entry=0x7fffc7eb2bb0, tls_size=tls_size@entry=0x7fffc7eb2b88)
    at ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:520
#19 0x00007f108208f7db in __lsan::ThreadStart (tid=tid@entry=0, os_id=1996211, thread_type=thread_type@entry=__sanitizer::ThreadType::Regular) at ../../../../src/libsanitizer/lsan/lsan_thread.cpp:83
#20 0x00007f1082089072 in __lsan_init () at ../../../../src/libsanitizer/lsan/lsan.cpp:119
#21 0x00007f1082998cf6 in _dl_init (main_map=0x7f10829b6190, argc=5, argv=0x7fffc7eb2c58, env=0x7fffc7eb2c88) at dl-init.c:104
#22 0x00007f108298813a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#23 0x0000000000000005 in ?? ()
#24 0x00007fffc7eb47da in ?? ()
#25 0x00007fffc7eb47ed in ?? ()
#26 0x00007fffc7eb47f0 in ?? ()
#27 0x00007fffc7eb480c in ?? ()
#28 0x00007fffc7eb480f in ?? ()
#29 0x0000000000000000 in ?? ()

There we see that __lsan_init() calls dlsym() which calls (the intercepted form of) malloc() which barfs since it doesn't expect to be called during initialization.

Funnily, the "interceptor" code contains specific steps for calloc() but none for malloc(). This is done initially to solve a chicken-and-egg problem for the instrumentation of calloc itself by dlsym (which calls calloc ...). But this also allows to call dlsym repeatedly within the init. This has worked for years I guess, but is fragile, as it makes the assumption that dlsym will always call calloc and not malloc... Too bad nobody cared to notify the dlsym() developers about this :)

As we can see in the above stack trace, dlsym in some situations ends up calling malloc, hence our crash.

Later versions of gcc fix that by using a dedicated allocator for dlsym. End of the story ?
Nope, for people like us who are stuck with a version "beyond backports", as 10.x is end-of-support !!!

Okay, no problem, we have radical options: just get rid of the bugger.
Write a one-line replacement for liblsan0.so that just does:

void __lsan_init(void){}

And replace the (dysfunctional) system-installed libsan0.

MME happy and running without any downgrade :)

Darlanewe · 2023-10-03T00:09:26Z

hi @ferrieux
could you be more specific on how to do the whole replacement process of this lib?

ferrieux · 2023-10-03T06:24:19Z

Just compile with gcc -shared -fPIC, then either overwrite the existing liblsan0.so.0 wherever your distro puts it, or put the new one earlier in the LD_LIBRARY_PATH.

ferrieux · 2023-10-03T22:33:35Z

Note, I have reported the gcc bug to Ubuntu:
https://bugs.launchpad.net/ubuntu/+source/gcc-defaults/+bug/2038386

rechia-venko added the type: bug Something isn't working label Aug 2, 2023

shashidhar-patil mentioned this issue Oct 30, 2023

CI Test Failing: build_dockerfile_bazel_cache_prod #15332

Closed

sT331h0rs3 mentioned this issue Dec 11, 2023

The program compiled and run with lsan alone reports an error. google/sanitizers#1710

Closed

amarjargal mentioned this issue May 15, 2024

MME crashing due to libprotobuf FATAL /usr/include/google/protobuf/repeated_field.h:1514] CHECK failed: (index) < (current_size_) error #15429

Open

asantos-vk mentioned this issue Sep 24, 2024

mme and pipelined services fail to start on v1.8 bare metal AGW installation #15531

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Endless AGW crashloop, mme exiting on start with Sanitizer CHECK failed #15279

Endless AGW crashloop, mme exiting on start with Sanitizer CHECK failed #15279

rechia-venko commented Aug 2, 2023 •

edited

Loading

javieraubert commented Aug 2, 2023

gabrielrbeiro commented Aug 3, 2023 •

edited

Loading

vinothinfinity commented Aug 3, 2023 •

edited

Loading

harsharao87 commented Aug 3, 2023 •

edited

Loading

gabrielrbeiro commented Aug 3, 2023

harsharao87 commented Aug 3, 2023

rechia-venko commented Aug 3, 2023

vinothinfinity commented Aug 3, 2023 •

edited

Loading

edaspb commented Aug 4, 2023

jordanvrtanoski commented Aug 23, 2023

rechia-venko commented Aug 23, 2023

Darlanewe commented Aug 24, 2023 •

edited

Loading

edaspb commented Aug 24, 2023

rechia-venko commented Aug 31, 2023

Darlanewe commented Sep 14, 2023 •

edited

Loading

rechia-venko commented Sep 14, 2023 via email

Darlanewe commented Sep 14, 2023

ferrieux commented Oct 1, 2023

ferrieux commented Oct 2, 2023

Darlanewe commented Oct 3, 2023

ferrieux commented Oct 3, 2023

ferrieux commented Oct 3, 2023

Endless AGW crashloop, mme exiting on start with Sanitizer CHECK failed #15279

Endless AGW crashloop, mme exiting on start with Sanitizer CHECK failed #15279

Comments

rechia-venko commented Aug 2, 2023 • edited Loading

Your Environment

Describe the Issue

To Reproduce

Expected behavior

Screenshots

Additional context

temporary workaround

javieraubert commented Aug 2, 2023

gabrielrbeiro commented Aug 3, 2023 • edited Loading

vinothinfinity commented Aug 3, 2023 • edited Loading

harsharao87 commented Aug 3, 2023 • edited Loading

gabrielrbeiro commented Aug 3, 2023

harsharao87 commented Aug 3, 2023

rechia-venko commented Aug 3, 2023

vinothinfinity commented Aug 3, 2023 • edited Loading

edaspb commented Aug 4, 2023

jordanvrtanoski commented Aug 23, 2023

rechia-venko commented Aug 23, 2023

Darlanewe commented Aug 24, 2023 • edited Loading

edaspb commented Aug 24, 2023

rechia-venko commented Aug 31, 2023

Darlanewe commented Sep 14, 2023 • edited Loading

rechia-venko commented Sep 14, 2023 via email

Darlanewe commented Sep 14, 2023

ferrieux commented Oct 1, 2023

ferrieux commented Oct 2, 2023

TL;DR:

Detail:

Darlanewe commented Oct 3, 2023

ferrieux commented Oct 3, 2023

ferrieux commented Oct 3, 2023

rechia-venko commented Aug 2, 2023 •

edited

Loading

gabrielrbeiro commented Aug 3, 2023 •

edited

Loading

vinothinfinity commented Aug 3, 2023 •

edited

Loading

harsharao87 commented Aug 3, 2023 •

edited

Loading

vinothinfinity commented Aug 3, 2023 •

edited

Loading

Darlanewe commented Aug 24, 2023 •

edited

Loading

Darlanewe commented Sep 14, 2023 •

edited

Loading