Blobfuse may crash due to DNS query thread timeout. #279

Closed
xiangwuxw opened this issue May 31, 2019 · 5 comments

@xiangwuxw

I troubleshot a blobfuse crash last year and confirmed that a bug in libcurl can crash blobfuse. In that case, adding a mapping to the hosts file prevented the DNS timeout and resolved the issue. A few weeks ago I hit another random crash, and this time the hosts entry did not help. While looking for a better way to debug it (the admin can't keep an SSH session running all the time), I asked the admin to try a private build that sets the NOSIGNAL flag. Interestingly, once the private build was in use, the crashes stopped. It looks like there are other situations where, even with a hosts file entry, the name resolution thread may still time out and cause the crash.

Working around the libcurl bug requires calling curl_easy_setopt(m_curl, CURLOPT_NOSIGNAL, 1L); before curl_easy_perform.

The crash stack looks like this:

#0 0x0000003f334df393 in __poll (fds=, nfds=, timeout=) at ../sysdeps/unix/sysv/linux/poll.c:89
#1 0x0000003f3540c086 in send_dg (statp=0x7ff68e50ddb8, buf=0x7ff68e509d20 "\202\364\001", buflen=46, buf2=0x7ff68e509d50 "\214F\001", buflen2=46, ans=0x7ff68e50aa00 "\214F\201\200", anssiz=2048, ansp=0x7ff68e50b260, ansp2=0x7ff68e50b258, nansp2=0x7ff68e50b27c,
resplen2=0x7ff68e50b278, ansp2_malloced=0x7ff68e50b274) at res_send.c:1197
#2 __libc_res_nsend (statp=0x7ff68e50ddb8, buf=0x7ff68e509d20 "\202\364\001", buflen=46, buf2=0x7ff68e509d50 "\214F\001", buflen2=46, ans=0x7ff68e50aa00 "\214F\201\200", anssiz=2048, ansp=0x7ff68e50b260, ansp2=0x7ff68e50b258, nansp2=0x7ff68e50b27c,
resplen2=0x7ff68e50b278, ansp2_malloced=0x7ff68e50b274) at res_send.c:576
#3 0x0000003f35408811 in __libc_res_nquery (statp=0x7ff68e50ddb8, name=0x7ff680002230 "xwtest.blob.core.windows.net", class=1, type=62321, answer=0x7ff68e50aa00 "\214F\201\200", anslen=2048, answerp=0x7ff68e50b260, answerp2=0x7ff68e50b258, nanswerp2=0x7ff68e50b27c,
resplen2=0x7ff68e50b278, answerp2_malloced=0x7ff68e50b274) at res_query.c:227
#4 0x0000003f35408dd0 in __libc_res_nquerydomain (statp=0x7ff68e50ddb8, name=0x7ff680002230 "xwtest.blob.core.windows.net", domain=, class=1, type=62321, answer=0x7ff68e50aa00 "\214F\201\200", anslen=2048, answerp=0x7ff68e50b260,
answerp2=0x7ff68e50b258, nanswerp2=0x7ff68e50b27c, resplen2=0x7ff68e50b278, answerp2_malloced=0x7ff68e50b274) at res_query.c:589
#5 0x0000003f35409a91 in __libc_res_nsearch (statp=0x7ff68e50ddb8, name=0x7ff680002230 "xwtest.blob.core.windows.net", class=1, type=62321, answer=0x7ff68e50aa00 "\214F\201\200", anslen=2048, answerp=0x7ff68e50b260, answerp2=0x7ff68e50b258, nanswerp2=0x7ff68e50b27c,
resplen2=0x7ff68e50b278, answerp2_malloced=0x7ff68e50b274) at res_query.c:380
#6 0x00007ff68ef327d7 in _nss_dns_gethostbyname4_r (name=0x7ff680002230 "xwtest.blob.core.windows.net", pat=0x7ff68e50b8b8, buffer=0x7ff68e50b2e0 "\n\031\024=", buflen=1064, errnop=0x7ff68e50b8cc, herrnop=0x7ff68e50b8c8, ttlp=0x0) at nss_dns/dns-host.c:311
#7 0x0000003f334cff8b in gaih_inet (name=0x7ff680002230 "xwtest.blob.core.windows.net", service=, req=0x7ff68e50bb40, pai=, naddrs=0x7ff68e50ba98) at ../sysdeps/posix/getaddrinfo.c:882
#8 0x0000003f334d2fcf in getaddrinfo (name=, service=, hints=0x7ff68e50bb40, pai=0x7ff68e50baf8) at ../sysdeps/posix/getaddrinfo.c:2406
#9 0x0000003d28042b82 in Curl_getaddrinfo_ex (nodename=, servname=, hints=, result=0x7ff68e50bb78) at curl_addrinfo.c:131
#10 0x0000003d28037cc2 in Curl_getaddrinfo (conn=, hostname=0x7ff680002230 "xwtest.blob.core.windows.net", port=80, waitp=) at hostip6.c:245
#11 0x0000003d2800f77a in Curl_resolv (conn=0x7ff680001b20, hostname=0x7ff680002230 "xwtest.blob.core.windows.net", port=80, entry=0x7ff68e50bee0) at hostip.c:453
#12 0x0000003d2800fa48 in Curl_resolv_timeout (conn=0x7ff680001b20, hostname=0x7ff680002230 "xwtest.blob.core.windows.net", port=80, entry=0x7ff68e50bee0, timeoutms=) at hostip.c:627
#13 0x0000003d280244af in resolve_server (data=0x7ff688001580, in_connect=, async=0x7ff68e50c200) at url.c:4227
#14 create_conn (data=0x7ff688001580, in_connect=, async=0x7ff68e50c200) at url.c:4727
#15 0x0000003d28024623 in Curl_connect (data=, in_connect=0x7ff68e50c3f0, asyncp=0x7ff68e50c3ff, protocol_done=0x7ff68e50c3fe) at url.c:4853
#16 0x0000003d2802c930 in connect_host (data=0x7ff688001580) at transfer.c:2524
#17 Curl_perform (data=0x7ff688001580) at transfer.c:2660
#18 0x0000000000489c15 in microsoft_azure::storage::CurlEasyRequest::perform (this=0x7ff680000c98) at /root/azure-storage-fuse/azure-storage-cpp-lite/src/http/libcurl_http_client.cpp:58
#19 0x000000000048bd7f in microsoft_azure::storage::CurlEasyRequest::submit(std::function<void(int, microsoft_azure::storage::storage_istream, CURLcode)>, std::chrono::seconds) (this=0x7ff680000c98, cb=..., interval=)
at /root/azure-storage-fuse/azure-storage-cpp-lite/include/http/libcurl_http_client.h:83
#20 0x00000000004943db in microsoft_azure::storage::async_executor::submit_helper (promise=std::shared_ptr (count 3) 0x7ff680000e78, outcome=std::shared_ptr (count 3) 0x7ff680000e38, account=std::shared_ptr (count 4) 0x7ff688000d08,
request=std::shared_ptr (count 4) 0x7ff680000dc8, http=std::shared_ptr (count 4) 0x7ff680000c98, context=std::shared_ptr (count 4) 0x7ff688000f48, retry=std::shared_ptr (count 3) 0x7ff680000e08)
at /root/azure-storage-fuse/azure-storage-cpp-lite/include/executor.h:182
#21 0x0000000000497201 in microsoft_azure::storage::async_executor::submit (account=std::shared_ptr (count 4) 0x7ff688000d08, request=std::shared_ptr (count 4) 0x7ff680000dc8, http=std::shared_ptr (count 4) 0x7ff680000c98,
context=std::shared_ptr (count 4) 0x7ff688000f48) at /root/azure-storage-fuse/azure-storage-cpp-lite/include/executor.h:199
#22 0x000000000048ed2d in microsoft_azure::storage::blob_client::get_blob_property (this=0x7ff688001108, container=Unhandled dwarf expression opcode 0xf3
) at /root/azure-storage-fuse/azure-storage-cpp-lite/src/blob/blob_client.cpp:226
#23 0x000000000049e06a in microsoft_azure::storage::blob_client_wrapper::get_blob_property (this=Unhandled dwarf expression opcode 0xf3
) at /root/azure-storage-fuse/azure-storage-cpp-lite/src/blob/blob_client_wrapper.cpp:774
#24 0x0000000000449f6c in azs_getattr (path=0x7ff680000990 "/tls", stbuf=0x7ff68e50cc20) at /root/azure-storage-fuse/blobfuse/utilities.cpp:400
#25 0x0000003f3500b353 in lookup_path (f=0x12aad60, nodeid=1, name=0x7ff68e50e038 "tls", path=, e=0x7ff68e50cc10, fi=) at fuse.c:1824
#26 0x0000003f3500d865 in fuse_lib_lookup (req=0x7ff6800008c0, parent=1, name=0x7ff68e50e038 "tls") at fuse.c:2017
#27 0x0000003f350120ef in fuse_do_work (data=0x7ff6880008c0) at fuse_loop_mt.c:107
#28 0x0000003f33c07aa1 in start_thread (arg=0x7ff68e50d700) at pthread_create.c:301
#29 0x0000003f334e8bdd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:122

@amnguye
Member

amnguye commented Jul 11, 2019

Hi Xiangwu,

Thank you for reporting this, troubleshooting it, and finding a workaround. We will look into the root cause and see whether the workaround is the best fit for now.

Thank you for also providing the crash stack.

@xiangwuxw
Author

Thanks for looking into this issue. This post explains in detail why it crashes: https://stackoverflow.com/questions/21887264/why-libcurl-needs-curlopt-nosignal-option-and-what-are-side-effects-when-it-is . The DNS resolver is not thread-safe.

My guess is that even with the hosts entry in place, the DNS resolver thread may still time out because of high CPU load or thread scheduling delays. Alternatively, the option may affect other operations that have a similar DNS signaling issue.

@xiangwuxw
Author

If anyone hits a weird crash, download the private builds for Red Hat 7 or Red Hat 6 for testing purposes. If the crashes stop, you have hit the libcurl multithreading bug. Basically, the way blobfuse uses libcurl does not support multithreading.

https://xwfileping.blob.core.windows.net/files/blobfuse.6.ipv4.zip
https://xwfileping.blob.core.windows.net/files/blobfuse.7.ipv4.zip

@NaraVen NaraVen self-assigned this Sep 28, 2020
@NaraVen
Collaborator

NaraVen commented Sep 29, 2020

@xiangwu-ms, thank you for reporting this. Blobfuse does support multithreading, as it uses only thread-safe functions. It does not use CURLOPT_DNS_USE_GLOBAL_CACHE, and moreover all global initialization is done with global constants. However, from your bug report it looks like the DNS timeout is not honored when CURLOPT_NOSIGNAL is not set to 1L in curl. We will make this change with proper SIGPIPE handling.
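
For context on the SIGPIPE handling mentioned above: libcurl's documentation notes that once CURLOPT_NOSIGNAL is set, libcurl no longer asks the system to ignore SIGPIPE, so the application has to handle that signal itself. Below is a minimal sketch of one common approach, assuming a POSIX system. The helper names `ignore_sigpipe` and `write_to_closed_pipe` are hypothetical and are not taken from the blobfuse source; the actual change in blobfuse may look different.

```cpp
#include <cerrno>
#include <signal.h>
#include <unistd.h>

// Hypothetical helper (not from blobfuse): ignore SIGPIPE process-wide so
// that writing to a closed socket or pipe fails with EPIPE instead of
// terminating the process. Returns true if the disposition was installed.
bool ignore_sigpipe() {
    struct sigaction sa {};
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, nullptr) == 0;
}

// Hypothetical self-check: write to a pipe whose read end is already closed.
// Returns the errno of the failed write, 0 if the write unexpectedly
// succeeded, or -1 if the pipe could not be created.
int write_to_closed_pipe() {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    close(fds[0]);  // no reader left, so a write would normally raise SIGPIPE
    errno = 0;
    ssize_t n = write(fds[1], "x", 1);
    int err = (n < 0) ? errno : 0;
    close(fds[1]);
    return err;
}
```

Calling `ignore_sigpipe()` once at startup, before any transfer threads are created, keeps the disposition consistent across threads. After that, `write_to_closed_pipe()` should report EPIPE rather than the process being killed.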

@NaraVen
Collaborator

NaraVen commented Sep 30, 2020

@xiangwu-ms, we have set CURLOPT_NOSIGNAL = 1L and verified that our latest release, 1.3.4, is compiled with the threaded resolver. The crash should no longer occur if you use 1.3.4.

@NaraVen NaraVen closed this as completed Sep 30, 2020