
Crashing bug in proxysql 2.4.1 on Amazon Linux 2 #3889

Closed
minuteman3 opened this issue May 27, 2022 · 6 comments



minuteman3 commented May 27, 2022

We're in the middle of trialing and rolling out proxysql for read replica connections to our AWS Aurora clusters. We're running proxysql 2.4.1 on Amazon Linux 2 (cpe:2.3:o:amazon:amazon_linux:2), installed from the proxysql-2.4.1-1-centos7.x86_64.rpm package distributed on GitHub.

We're currently blocked as a result of a crashing bug killing the proxy process:

Error: signal 11:
/usr/bin/proxysql(_Z13crash_handleri+0x2a)[0x5c004a]
/lib64/libc.so.6(+0x33d10)[0x7fce83414d10]
/usr/bin/proxysql(mysql_query_digest_and_first_comment_2+0x171e)[0x70610e]
/usr/bin/proxysql(_ZN15Query_Processor17query_parser_initEP20__SQP_query_parser_tPcii+0x60)[0x650de0]
/usr/bin/proxysql(_ZN10Query_Info5beginEPhib+0xa5)[0x628255]
/usr/bin/proxysql(_ZN13MySQL_Session20get_pkts_from_clientERbR10_PtrSize_t+0xa83)[0x640313]
/usr/bin/proxysql(_ZN13MySQL_Session7handlerEv+0xa7)[0x640757]
/usr/bin/proxysql(_ZN12MySQL_Thread20process_all_sessionsEv+0x40c)[0x62206c]
/usr/bin/proxysql(_ZN12MySQL_Thread3runEv+0x70a)[0x6232ea]
/usr/bin/proxysql(_Z24mysql_worker_thread_funcPv+0x6c)[0x5b865c]
/lib64/libpthread.so.0(+0x744b)[0x7fce8489144b]
/lib64/libc.so.6(clone+0x3f)[0x7fce834ce40f]
 ---- /usr/bin/proxysql(_Z13crash_handleri+0x2a) [0x5c004a] : crash_handler(int)
 ---- /usr/bin/proxysql(_ZN15Query_Processor17query_parser_initEP20__SQP_query_parser_tPcii+0x60) [0x650de0] : Query_Processor::query_parser_init(__SQP_query_parser_t*, char*, int, int)
 ---- /usr/bin/proxysql(_ZN10Query_Info5beginEPhib+0xa5) [0x628255] : Query_Info::begin(unsigned char*, int, bool)
 ---- /usr/bin/proxysql(_ZN13MySQL_Session20get_pkts_from_clientERbR10_PtrSize_t+0xa83) [0x640313] : MySQL_Session::get_pkts_from_client(bool&, _PtrSize_t&)
 ---- /usr/bin/proxysql(_ZN13MySQL_Session7handlerEv+0xa7) [0x640757] : MySQL_Session::handler()
 ---- /usr/bin/proxysql(_ZN12MySQL_Thread20process_all_sessionsEv+0x40c) [0x62206c] : MySQL_Thread::process_all_sessions()
 ---- /usr/bin/proxysql(_ZN12MySQL_Thread3runEv+0x70a) [0x6232ea] : MySQL_Thread::run()
 ---- /usr/bin/proxysql(_Z24mysql_worker_thread_funcPv+0x6c) [0x5b865c] : mysql_worker_thread_func(void*)
To report a crashing bug visit: https://github.com/sysown/proxysql/issues
For support visit: https://proxysql.com/services/support/
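
The `----` lines above are the same frames with their C++ symbol names demangled. As a minimal sketch, assuming GCC/Itanium name mangling, the same resolution can be done with `abi::__cxa_demangle` from `<cxxabi.h>` (a hypothetical standalone helper, not ProxySQL's actual implementation):

```cpp
#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
    // One of the mangled frames from the backtrace above.
    const char* mangled =
        "_ZN15Query_Processor17query_parser_initEP20__SQP_query_parser_tPcii";
    int status = 0;
    // __cxa_demangle returns a malloc()'d string; status 0 means success.
    char* demangled = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status == 0 && demangled != nullptr) {
        // Prints: Query_Processor::query_parser_init(__SQP_query_parser_t*, char*, int, int)
        std::printf("%s\n", demangled);
        std::free(demangled);
    } else {
        std::printf("demangling failed, status %d\n", status);
    }
    return 0;
}
```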

I've not been able to isolate a specific query that causes the crash. I've recovered core dumps, but I'm not sure I can share them as they may contain sensitive data; I'll need to verify on my side and will be in touch via https://proxysql.com/contact-us/

@renecannao
Contributor

Although we can't reproduce a crash yet, we are able to reproduce an invalid memory read using this query:

SELECT 1 FROM DUAL WHERE 1 IN (10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009, 10010, 10011, 10012, 10013, 10014, 10015, 10016, 10017, 10018, 10019, 10020, 10021, 10022, 10023, 10024, 10025, 10026, 10027, 10028, 10029, 10030, 10031, 10032, 10033, 10034, 10035, 10036, 10037, 10038, 10039, 10040, 10041, 10042, 10043, 10044, 10045, 10046, 10047, 10048, 10049, 10050, 10051, 10052, 10053, 10054, 10055, 10056, 10057, 10058, 10059, 10060, 10061, 10062, 10063, 10064, 10065, 10066, 10067, 10068, 10069, 10070, 10071, 10072, 10073, 10074, 10075, 10076, 10077, 10078, 10079, 10080, 10081, 10082, 10083, 10084, 10085, 10086, 10087, 10088, 10089, 10090, 10091, 10092, 10093, 10094, 10095, 10096, 10097, 10098, 10099, 10100, 10101, 10102, 10103, 10104, 10105, 10106, 10107, 10108, 10109, 10110, 10111, 10112, 10113, 10114, 10115, 10116, 10117, 10118, 10119, 10120, 10121, 10122, 10123, 10124, 10125, 10126, 10127, 10128, 10129, 10130, 10131, 10132, 10133, 10134, 10135, 10136, 10137, 10138, 10139, 10140, 10141, 10142, 10143, 10144, 10145, 10146, 10147, 10148, 10149, 10150, 10151, 10152, 10153, 10154, 10155, 10156, 10157, 10158, 10159, 10160, 10161, 10162, 10163, 10164, 10165, 10166, 10167, 10168, 10169, 10170, 10171, 10172, 10173, 10174, 10175, 10176, 10177, 10178, 10179, 10180, 10181, 10182, 10183, 10184, 10185, 10186, 10187, 10188, 10189, 10190, 10191, 10192, 10193, 10194, 10195, 10196, 10197, 10198, 10199) /* my comment at the end of a long list of ids */;
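
For reference, the query above can be regenerated with a short helper, e.g. to vary the number of ids and probe other query lengths (a minimal sketch; the id range and trailing comment mirror the query above):

```cpp
#include <iostream>
#include <string>

// Rebuilds the reproduction query; adjust the id range to vary the length.
int main() {
    std::string q = "SELECT 1 FROM DUAL WHERE 1 IN (";
    for (int id = 10000; id <= 10199; ++id) {
        if (id != 10000) q += ", ";
        q += std::to_string(id);
    }
    q += ") /* my comment at the end of a long list of ids */;";
    std::cout << q << '\n';
    return 0;
}
```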

According to valgrind:

==444949== 1 errors in context 1 of 1:
==444949== Invalid read of size 1
==444949==    at 0x89B5EF: process_cmnt_type_1 (c_tokenizer.cpp:1127)
==444949==    by 0x89B5EF: stage_1_parsing (c_tokenizer.cpp:1772)
==444949==    by 0x89B5EF: mysql_query_digest_and_first_comment_2 (c_tokenizer.cpp:2552)
==444949==    by 0x76B8BE: Query_Processor::query_parser_init(__SQP_query_parser_t*, char*, int, int) (Query_Processor.cpp:1899)
==444949==    by 0x71EA8E: Query_Info::query_parser_init() (MySQL_Session.cpp:350)
==444949==    by 0x71E745: Query_Info::begin(unsigned char*, int, bool) (MySQL_Session.cpp:299)
==444949==    by 0x730FC2: MySQL_Session::get_pkts_from_client(bool&, _PtrSize_t&) (MySQL_Session.cpp:3625)
==444949==    by 0x73422C: MySQL_Session::handler() (MySQL_Session.cpp:4391)
==444949==    by 0x6FAB3A: MySQL_Thread::process_all_sessions() (MySQL_Thread.cpp:3822)
==444949==    by 0x6F8904: MySQL_Thread::run() (MySQL_Thread.cpp:3324)
==444949==    by 0x624583: mysql_worker_thread_func(void*) (main.cpp:418)
==444949==    by 0x4A39608: start_thread (pthread_create.c:477)
==444949==    by 0x4ED0162: clone (clone.S:95)
==444949==  Address 0xa371ebd is 0 bytes after a block of size 1,485 alloc'd
==444949==    at 0x483C855: malloc (vg_replace_malloc.c:381)
==444949==    by 0x6E4406: MySQL_Data_Stream::buffer2array() (mysql_data_stream.cpp:988)
==444949==    by 0x6E3A89: MySQL_Data_Stream::read_pkts() (mysql_data_stream.cpp:902)
==444949==    by 0x6F96CF: MySQL_Thread::process_data_on_data_stream(MySQL_Data_Stream*, unsigned int) (MySQL_Thread.cpp:3550)
==444949==    by 0x6F7C08: MySQL_Thread::ProcessAllMyDS_AfterPoll() (MySQL_Thread.cpp:3050)
==444949==    by 0x6F8855: MySQL_Thread::run() (MySQL_Thread.cpp:3308)
==444949==    by 0x624583: mysql_worker_thread_func(void*) (main.cpp:418)
==444949==    by 0x4A39608: start_thread (pthread_create.c:477)
==444949==    by 0x4ED0162: clone (clone.S:95)
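
The report points at a one-byte over-read in the comment tokenizer (`process_cmnt_type_1`), immediately past a 1,485-byte heap buffer, while parsing the trailing `/* ... */` comment of a length-delimited (not NUL-terminated) packet buffer. A minimal sketch of one plausible pattern for such an over-read, and a bounds-checked fix, follows; this is a hypothetical illustration, not the actual c_tokenizer.cpp code:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Plausible over-read pattern: scanning for the closing "*/" of a comment in
// a buffer that is not NUL-terminated. When i == len - 1, the lookahead
// s[i + 1] reads one byte past the end of the buffer.
size_t skip_comment_buggy(const char* s, size_t len, size_t i) {
    while (i < len) {
        if (s[i] == '*' && s[i + 1] == '/')   // s[i + 1] can overrun
            return i + 2;
        ++i;
    }
    return len;
}

// Bounds-checked variant: only dereference s[i + 1] when it is in the buffer.
size_t skip_comment_fixed(const char* s, size_t len, size_t i) {
    while (i + 1 < len) {
        if (s[i] == '*' && s[i + 1] == '/')
            return i + 2;
        ++i;
    }
    return len;
}

int main() {
    // Comment body starts at offset 11, right after the opening "/*".
    const char q[] = "SELECT 1 /* trailing comment */";
    size_t end = skip_comment_fixed(q, std::strlen(q), 11);
    std::printf("comment ends at offset %zu\n", end);
    return 0;
}
```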


axl89 commented Jun 15, 2022

We've had the same issue when upgrading to 2.4.1 (we came from the latest 2.3.x) on an Ubuntu box:

ubuntu@psql:/home/ubuntu# uname -a
Linux ip-x.x.x.x 5.8.0-1038-aws #40~20.04.1-Ubuntu SMP Thu Jun 17 13:25:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@psql:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"

"2022-06-14T14:56:31.704Z","""main-eu-proxysql-1""","""v4""","Error: signal 11:"
"2022-06-14T14:56:31.704Z","""main-eu-proxysql-1""","""v4""","/usr/bin/proxysql(_Z13crash_handleri+0x38)[0x558deeca2280]"
"2022-06-14T14:56:31.704Z","""main-eu-proxysql-1""","""v4""","/lib/x86_64-linux-gnu/libc.so.6(+0x430c0)[0x7fbad198a0c0]"
"2022-06-14T14:56:31.704Z","""main-eu-proxysql-1""","""v4""","/usr/bin/proxysql(_ZN12MySQL_Logger11log_requestEP13MySQL_SessionP17MySQL_Data_Stream+0x6b)[0x558deee993a7]"
"2022-06-14T14:56:31.704Z","""main-eu-proxysql-1""","""v4""","/usr/bin/proxysql(_ZN13MySQL_Session8LogQueryEP17MySQL_Data_Stream+0xba)[0x558deedb7ab8]"
"2022-06-14T14:56:31.704Z","""main-eu-proxysql-1""","""v4""","/usr/bin/proxysql(_ZN13MySQL_Session10RequestEndEP17MySQL_Data_Stream+0x85)[0x558deedb7b41]"
"2022-06-14T14:56:31.704Z","""main-eu-proxysql-1""","""v4""","/usr/bin/proxysql(_ZN13MySQL_Session43handler_again___status_RESETTING_CONNECTIONEv+0x636)[0x558deed9c840]"
"2022-06-14T14:56:31.704Z","""main-eu-proxysql-1""","""v4""","/usr/bin/proxysql(_ZN13MySQL_Session7handlerEv+0x500)[0x558deeda8964]"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","/usr/bin/proxysql(_ZN12MySQL_Thread20process_all_sessionsEv+0x655)[0x558deed6e9f1]"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","/usr/bin/proxysql(_ZN12MySQL_Thread3runEv+0xbe5)[0x558deed6c7c1]"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","/usr/bin/proxysql(_Z24mysql_worker_thread_funcPv+0xce)[0x558deec944cf]"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fbad1e9e609]"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fbad1a66163]"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","'---- /usr/bin/proxysql(_Z13crash_handleri+0x38) [0x558deeca2280] : crash_handler(int)"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","'---- /usr/bin/proxysql(_ZN12MySQL_Logger11log_requestEP13MySQL_SessionP17MySQL_Data_Stream+0x6b) [0x558deee993a7] : MySQL_Logger::log_request(MySQL_Session*, MySQL_Data_Stream*)"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","'---- /usr/bin/proxysql(_ZN13MySQL_Session8LogQueryEP17MySQL_Data_Stream+0xba) [0x558deedb7ab8] : MySQL_Session::LogQuery(MySQL_Data_Stream*)"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","'---- /usr/bin/proxysql(_ZN13MySQL_Session10RequestEndEP17MySQL_Data_Stream+0x85) [0x558deedb7b41] : MySQL_Session::RequestEnd(MySQL_Data_Stream*)"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","'---- /usr/bin/proxysql(_ZN13MySQL_Session43handler_again___status_RESETTING_CONNECTIONEv+0x636) [0x558deed9c840] : MySQL_Session::handler_again___status_RESETTING_CONNECTION()"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","'---- /usr/bin/proxysql(_ZN13MySQL_Session7handlerEv+0x500) [0x558deeda8964] : MySQL_Session::handler()"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","'---- /usr/bin/proxysql(_ZN12MySQL_Thread20process_all_sessionsEv+0x655) [0x558deed6e9f1] : MySQL_Thread::process_all_sessions()"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","'---- /usr/bin/proxysql(_ZN12MySQL_Thread3runEv+0xbe5) [0x558deed6c7c1] : MySQL_Thread::run()"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","'---- /usr/bin/proxysql(_Z24mysql_worker_thread_funcPv+0xce) [0x558deec944cf] : mysql_worker_thread_func(void*)"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","To report a crashing bug visit: https://github.com/sysown/proxysql/issues"
"2022-06-14T14:56:31.705Z","""main-eu-proxysql-1""","""v4""","For support visit: https://proxysql.com/services/support/"
"2022-06-14T15:01:43.000Z","""main-eu-proxysql-1""","""v4""","ProxySQL crashed. Restarting!"

Additionally, we got a HUGE load average during this time frame:

[two screenshots: load average graphs during the incident]

I've uploaded here an excerpt of the 5 minutes before the incident.

@renecannao
Contributor

Hi @axl89

We've had the same issue ...

No, you didn't have the same issue: the backtrace is completely different. It is not the same issue at all.

ProxySQL cannot cause a load of 400.

From the manpage:

System load averages is the average number of processes that are either in a runnable or uninterruptable state.

So it is improbable that the load was caused by proxysql.
It is more likely that something else was causing the huge load on your system, and proxysql was a victim of it.

Do you have a core dump?

I've uploaded here an excerpt of the 5 minutes before the incident.

Nothing relevant in it. Also, the excerpt covers the 30 seconds before the incident, not 5 minutes.

Also, is there any reason why you are using a DEBUG build?


axl89 commented Jun 15, 2022

No, you didn't have the same issue: the backtrace is completely different. It is not the same issue at all.

My bad, you're totally right. Let me know if you want me to open a new issue.

ProxySQL cannot cause a load of 400.

It's not the raw load average; it's the load average normalized per number of CPU cores, so the actual load average was far higher than 400. We will dig deeper into our observability tools to confirm what could have caused this.
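
To make the normalization explicit: the dashboards divide the load average by the core count, so a plotted value of 400 on an N-core box corresponds to a raw 1-minute load average of roughly 400 * N. A minimal, Linux-specific sketch of that calculation (hypothetical helper; reads /proc/loadavg):

```cpp
#include <fstream>
#include <iostream>
#include <thread>

// Prints the raw 1-minute load average and the per-core normalized value
// that dashboards like the ones above typically plot.
int main() {
    double load1 = 0.0;
    std::ifstream loadavg("/proc/loadavg");  // Linux-specific
    loadavg >> load1;                        // first field: 1-minute load average
    unsigned cores = std::thread::hardware_concurrency();
    std::cout << "raw load1 = " << load1
              << ", normalized per core = "
              << (cores ? load1 / cores : load1) << '\n';
    return 0;
}
```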

Do you have a core dump?

I don't think so, but I will review the instance and come back to you if I have more details.

Also, is there any reason why you are using a DEBUG build?

Well, that's actually very surprising and embarrassing. We made a big mistake in our upgrade process... thank you for pointing this out.

@renecannao
Contributor

Let me know if you want me to open a new issue.

Only if it is confirmed to be a bug, with a core dump.


axl89 commented Jun 15, 2022

  1. Is there a way to send the core dump file securely, @renecannao? We found what looks like a core dump in /var/crash/_usr_bin_proxysql.997.crash.
  2. From our telemetry we can see that ProxySQL had a spike in memory usage from 16:51:55 to 16:55:35, and the process was then restarted at 17:01:55 (new PID):

[screenshot: ProxySQL memory usage spike from 16:51:55 to 16:55:35, followed by the restart at 17:01:55]
