Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP::Client hangs #4650

Closed
nobilik opened this issue Jul 1, 2017 · 40 comments
Closed

HTTP::Client hangs #4650

nobilik opened this issue Jul 1, 2017 · 40 comments
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib
Milestone

Comments

@nobilik
Copy link

nobilik commented Jul 1, 2017

I don't have big experience as issue writer, but I'll try

When I use HTTP::Client with concurrency >= 5 (for sure, maybe less) my app completely hangs with no output. From time to time it can pass but it happens seldom.

TOP shows:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
13360 user   20   0  357836  22860   7580 R 100,0  0,1   1:26.57 insta_app 

There is no any output with strace also.

Crystal 0.23.0 [3c3d3e2] (2017-06-28) LLVM 3.8.1
Ubuntu 16.04.2 LTS

require "http/client"
API_URI = URI.parse("https://www.instagram.com")
loop do
   get_comments
   sleep 10
end
   def get_comments
    concurrency = 30
    request_qtt = 3000
    channel = Channel(Nil).new(concurrency)
    concurrency.times do
      spawn make_request(request_qtt/concurrency, channel)
    end
    concurrency.times { channel.receive }
  end

  private def make_request(instances, channel)
    client = HTTP::Client.new(API_URI)
    instances.times do
      result = client.get("/cocacola/media/")
      puts result.status_code
    end
  ensure
    client.close if client
    channel.send(nil)
  end
@kostya
Copy link
Contributor

kostya commented Jul 1, 2017

can you give full example to reproduce? this is not compiled

@nobilik
Copy link
Author

nobilik commented Jul 1, 2017

Sorry, I fix it. Try now

@nobilik
Copy link
Author

nobilik commented Jul 3, 2017

Hi. There is one suggestion, I think the problem appear when I upgrade Crystal from 0.21 to 0.22. And in 0.22 have been added OpenSSL 1.1.0 support. I'm not sure, but maybe it's here. Because before I saw match more errors with SSL in my app.
By the way, is there any way to downgrade Crystal version from 0.23 to 0.21? Thanks for help.

@kostya
Copy link
Contributor

kostya commented Jul 3, 2017

if your try site without https, it hangs?

@nobilik
Copy link
Author

nobilik commented Jul 3, 2017

I've tried to make it with http and there was no issues for 50 loops. With https it happens every 5 loops or less @ysbaddaden Can you please take a part of discussion as an author of support for OpenSSL 1.1.0 (See #4215, #4230, thanks @ysbaddaden)?

@oprypin
Copy link
Member

oprypin commented Jul 3, 2017

http probably just serves a redirect to https so it may not be a good indicator

@nobilik
Copy link
Author

nobilik commented Jul 3, 2017

@oprypin I've tested it on site without redirects, so I think the problem is somewhere between http client and openssl

@ysbaddaden
Copy link
Contributor

Which OpenSSL version is being linked and used?

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Jul 3, 2017

Using your sample, I can't reproduce on Ubuntu 14.04 (trusty) with Crystal 0.23 (compiled from source) and OpenSSL 1.0.1f (official package), even after running numerous times.

With OpenSSL 1.0.2h and 1.1.0e (both compiled from source), I get the following SSL error:

Unhandled exception in spawn:
SSL_connect: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed (OpenSSL::SSL::Error)
0x51506f: initialize at crystal-0.23.0/src/openssl/ssl/socket.cr 36:7
0x514ec2: *OpenSSL::SSL::Socket::Client::new:context:sync_close:hostname<TCPSocket, OpenSSL::SSL::Context::Client, Bool, String>:OpenSSL::SSL::Socket::Client at ??
0x50ac57: socket at crystal-0.23.0/src/http/client.cr 657:5
0x50a97d: exec_internal_single at crystal-0.23.0/src/http/client.cr 500:5
0x507c9c: exec_internal at crystal-0.23.0/src/http/client.cr 486:5
0x507b82: exec at crystal-0.23.0/src/http/client.cr 483:5
0x5078c6: exec at crystal-0.23.0/src/http/client.cr 586:5
0x5078a7: get at crystal-0.23.0/src/http/client.cr 329:3
0x5072ae: make_request at /bug.cr 21:14
0x46cb3e: ~procProc(Nil) at crystal-0.23.0/bug.cr 16:3
0x48238e: run at crystal-0.23.0/src/fiber.cr 255:3
0x46a0b6: ~proc2Proc(Fiber, (IO::FileDescriptor | Nil)) at crystal-0.23.0/src/concurrent.cr 61:3
0x0: ??? at ??

I'm not sure where to start looking at. Trying to connect to any https just fails.

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Jul 3, 2017

Reduced to:

require "openssl"
tcp_socket = TCPSocket.new("www.google.com", 443)        
ssl_socket = OpenSSL::SSL::Socket::Client.new(tcp_socket)

@nobilik
Copy link
Author

nobilik commented Jul 3, 2017

OpenSSL 1.0.2g 1 Mar 2016
@ysbaddaden Should I try to upgrade it? Do you think that the problem can be here?

@straight-shoota
Copy link
Member

Since newer versions 1.0.2h and 1.1.0e both fail, there is probably no point in upgrading.

@ysbaddaden The reduced sample does not raise an SSL error with 1.0.2g and Crystal 0.23.0

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Jul 3, 2017

My issue was unrelated: I compiled from source and openssl couldn't find the system certs... with proper links it now works correctly. Sorry for the noise.

I can't reproduce your issue with OpenSSL versions 1.0.1f, 1.0.2h or 1.1.0e.

@nobilik
Copy link
Author

nobilik commented Jul 4, 2017

@ysbaddaden Can you send me your compiled file. I'll run it on my PC? [email protected]

@ysbaddaden
Copy link
Contributor

I guess you'd just hit issues with shared libs.

@nobilik
Copy link
Author

nobilik commented Jul 4, 2017

@ysbaddaden can you advise me what to do? how to find and fix... I'm trying to catch it for 3 days already(

@akzhan
Copy link
Contributor

akzhan commented Jul 4, 2017

@straight-shoota
Copy link
Member

I don't think this is going to work because the ubuntu repo only contains the latest version IIRC. You'll have to install prior releases manually from https://github.com/crystal-lang/crystal/releases

@nobilik
Copy link
Author

nobilik commented Jul 5, 2017

I've just installed Ubuntu 14.04 and run my example code on it. The result is absolutely same, my app hangs. Ok, it was 97% CPU, not 100% but this is the only difference. I can't understand why this issue is reproducible by me only(

@straight-shoota
Copy link
Member

Yeah, I've tried you're code sample with the same setup Ubuntu 16.04, Crystal 0.23.0, LLVM 3.8.1, OpenSSL 1.0.2.g and it works :/

Could you add some puts debugging to see where it hangs? Just insert puts at significant places like method and block entry/exit. It might not be that helpful because of concurrency, but it's worth a try.

@nobilik
Copy link
Author

nobilik commented Jul 5, 2017

@straight-shoota My "puts" debugging brings me here:

def consume_body_io
    if io = @body_io
      puts "consume_body_io start"
      @body = io.gets_to_end
      @body_io = nil
      puts "consume_body_io finish"
    end
  end

It starts but never ends. And with little improvement of code @body = "" #io.gets_to_end the app never hangs. But there is also no data for me)))

And it's not about concurrency.. I've tried it with only main fiber and result is same. I can't understand what's wrong.. Did you try code with Instagram? I work with different social media and it happens only with Insta.

@nobilik
Copy link
Author

nobilik commented Jul 6, 2017

Hi everyone! Maybe I should write i somewhere else, if so tell me please.
I run my app under lldb and when it hangs I saw this with thread backtrace command:

* thread #1: tid = 11963, 0x00007ffff652d02e libc.so.6`__memcpy_avx_unaligned + 462 at memcpy-avx-unaligned.S:166, name = 'insta_app', stop reason = signal SIGSTOP
  * frame #0: 0x00007ffff652d02e libc.so.6`__memcpy_avx_unaligned + 462 at memcpy-avx-unaligned.S:166
    frame #1: 0x00007ffff79a83ee libz.so.1`inflate + 1742
    frame #2: 0x00000000005777ea insta_app`*Flate::Reader#read<Slice(UInt8)>:Int32 + 3418 at reader.cr:88
    frame #3: 0x000000000057cdec insta_app`*Gzip::Reader#read<Slice(UInt8)>:Int32 + 236 at reader.cr:87
    frame #4: 0x000000000057fd7f insta_app`*Gzip::Reader@IO#gets_to_end:String + 255 at io.cr:560
    frame #5: 0x0000000000653d0a insta_app`*HTTP::Client::Response#consume_body_io:Nil + 234 at response.cr:73
    frame #6: 0x0000000000653bff insta_app`*HTTP::Client::Response::from_io?<(OpenSSL::SSL::Socket+ | TCPSocket+), Bool, Bool>:(HTTP::Client::Response | Nil) + 1807 at response.cr:97
    frame #7: 0x000000000064aa2e insta_app`*HTTP::Client#exec_internal_single<HTTP::Request>:(HTTP::Client::Response | Nil) + 174 at client.cr:502
    frame #8: 0x0000000000647ccc insta_app`*HTTP::Client#exec_internal<HTTP::Request>:HTTP::Client::Response + 44 at client.cr:486
    frame #9: 0x0000000000647bac insta_app`*HTTP::Client#exec<HTTP::Request>:HTTP::Client::Response + 44 at client.cr:482
    frame #10: 0x00000000006478e1 insta_app`*HTTP::Client#exec<String, String, Nil, Nil>:HTTP::Client::Response + 49 at client.cr:585
    frame #11: 0x00000000006478a7 insta_app`*HTTP::Client#get<String>:HTTP::Client::Response + 39 at client.cr:329
    frame #12: 0x000000000064db2a insta_app`*InstagramApi::AccountClient#get_account:(InstagramApi::Account | Nil) + 186 at instagram_api.cr:17
    frame #13: 0x00000000006462ec insta_app`*InstaAccount#get_response<DbMap::Group, HTTP::Client>:NamedTuple(items: Array(InstagramApi::Media), fail_flag: Bool) + 732 at insta_updater.cr:166
    frame #14: 0x0000000000645f6a insta_app`*InstaAccount#make_request<Array(DbMap::Group), Channel::Buffered(Nil), ConnectionPool(HTTP::Client)>:Nil + 186 at insta_updater.cr:81
    frame #15: 0x00000000004e057b insta_app`~procProc(Nil)@src/insta_app/insta_updater.cr:72 + 27 at raise.cr:189
    frame #16: 0x00000000004fc53e insta_app`*Fiber#run:(IO::FileDescriptor | Nil) + 174 at fiber.cr:255
    frame #17: 0x00000000004d5586 insta_app`~proc2Proc(Fiber, (IO::FileDescriptor | Nil))@/opt/crystal/src/fiber.cr:29 + 6 at concurrent.cr:61

As far as I understand the problem is in LibC. But what to do next? Help needed!
Update I make backtrace few times and saw that results are different after this point: 0x00000000005777ea insta_app*Flate::Reader#read<Slice(UInt8)>:Int32 + 3418 at reader.cr:88`
It could be LibSSL, LibCrypto, etc...

@rdp
Copy link
Contributor

rdp commented Jul 6, 2017

If you continue with lldb and get trace again is it always in the same place? is reader.cr yours? Also you may be able to see what bytes its hanging on by using tcpdump or what not (is the port still open? etc.)

@nobilik
Copy link
Author

nobilik commented Jul 6, 2017

@rdp Yes, the place is always same. And reader.cr is Crystal source file.
Anyway, I can't understand what it is. But I install new Ubuntu 16.04 on VirtualBox and here my test code works without hanging. I try to run compiled file on host machine (my computer) but have got same issue. I compiled it again on VM but with '-static' this time. Guess what? Nothing change, it still hangs on host machine. Should I close the issue in this case?

@rdp
Copy link
Contributor

rdp commented Jul 6, 2017 via email

@nobilik
Copy link
Author

nobilik commented Jul 7, 2017

@rdp Yes, "step" is possible. Actually I start with lldb from putting breakpoint in io.cr gets_to_end method.

@RX14
Copy link
Contributor

RX14 commented Jul 7, 2017

It's clearly a problem in libz if all the backtraces have Flate::Reader and libz.so in them...

@rdp
Copy link
Contributor

rdp commented Jul 7, 2017 via email

@nobilik
Copy link
Author

nobilik commented Jul 8, 2017

@rdp can't catch the issue with breakpoint. Execution is too slow I think

@rdp
Copy link
Contributor

rdp commented Jul 8, 2017 via email

@nobilik
Copy link
Author

nobilik commented Jul 13, 2017

Hi, I succeed to 'step in' with lldb after hanging, but I have no enough experience to understand. So I can step till this:

Process 24133 stopped
* thread #1: tid = 24133, 0x000000000050e067 test_app`*OpenSSL::SSL::Socket+@OpenSSL::SSL::Socket#unbuffered_read<Slice(UInt8)>:Int32 + 103 at socket.cr:100, name = 'test_app', stop reason = step in
    frame #0: 0x000000000050e067 test_app`*OpenSSL::SSL::Socket+@OpenSSL::SSL::Socket#unbuffered_read<Slice(UInt8)>:Int32 + 103 at socket.cr:100
   97  	    count = slice.size
   98  	    return 0 if count == 0
   99  	
-> 100 	    LibSSL.ssl_read(@ssl, slice.pointer(count), count).tap do |bytes|
   101 	      if bytes <= 0 && !LibSSL.ssl_get_error(@ssl, bytes).zero_return?
   102 	        raise OpenSSL::SSL::Error.new(@ssl, bytes, "SSL_read")
   103 	      end

after that stepping gives me only empty space and while stepping I have 100% CPU load. If I do ctrl + C and step again CPU becomes 100% loaded again and output is same. There is more output in attachment. Thanks for help.
dbg.txt

@rdp
Copy link
Contributor

rdp commented Jul 13, 2017

OK there are definitely some questions in that output, like this block:

   123 	  # Capture fault signals (SEGV, BUS) and finish the process printing a backtrace first
   124 	  LibC.dprintf 2, "Invalid memory access (signal %d) at address 0x%lx\n", sig, addr
   125 	  CallStack.print_backtrace
-> 126 	  LibC._exit sig
   127 	end

But I guess that's just exiting the signal handler so expected, since SIGPIPE is being generated but handled in other ways.

I was able to repro the behavior using the given snippet on on 16.04 64-bit Crystal 0.23.0 [3c3d3e2] (2017-06-28) LLVM 3.8.1 and 16.10 32-bit Crystal 0.23.0 [3c3d3e2] (2017-06-29) LLVM 3.8.1 and 17.04 64-bit Crystal 0.23.0 [3c3d3e2] (2017-06-28) LLVM 3.8.1 (all digitalocean SFO2 droplets) and even on OS X Sierra Crystal 0.23.0 (2017-06-30) LLVM 4.0.1 (though it sometimes takes awhile to repro in OS X, maybe 1min, on the digitalocean boxes it takes like 10s).

If I add this puts line to flate/reader.cr

      read_bytes = slice.size - @stream.avail_out
      ::puts "ret=#{ret} read_bytes=#{read_bytes}" # add this line

I eventually get this output repeated "forever" to the console (i.e. it seems to be hanging looping with this):

...
ret=BUF_ERROR read_bytes=0
ret=BUF_ERROR read_bytes=0
ret=BUF_ERROR read_bytes=0
ret=BUF_ERROR read_bytes=0
...

If I add BUF_ERROR to the throw block:

      when LibZ::Error::DATA_ERROR,
           LibZ::Error::MEM_ERROR, LibZ::Error::BUF_ERROR
        raise Flate::Error.new(ret, @stream)

it works "better" (well it doesn't busy loop anymore for me --no idea if this is the "best" way to handle it, are there other conditions that should go here as well?). It does seem to mess with your channel receive logic though. You might need something like:

  ensure
    client.close rescue nil if client
    channel.send(nil)
  end

Wonder what the "approved" way of closing like this is, or if there is some better way here as well? Maybe #close should swallow failures? Not sure, but definitely with the above suggested BUF_ERROR change the #close method can also throw Broken pipe exceptions FWIW.

@nobilik
Copy link
Author

nobilik commented Jul 14, 2017

Cool! I can confirm that adding LibZ::Error::BUF_ERROR solved the issue! Should I close it now? @rdp thanks a lot! Will you make pull request for it?

 ensure
    client.close rescue nil if client
    channel.send(nil)
  end

This block was only for snippet code. Ordinary I use pool of HTTP.clients.

@sdogruyol
Copy link
Member

Can we close this?

@asterite
Copy link
Member

No, nothing was fixed, it's still a bug

@RX14
Copy link
Contributor

RX14 commented Sep 15, 2017

inflate may return Z_BUF_ERROR if no progress was possible. Note that Z_BUF_ERROR is not fatal, and inflate() can be called again with more input and more output space to continue decompressing.

So it seems probable to me that what's happening is that the slice passed to read is too small (probably only 1 byte), so libz can't make any progress.

If someone could provide me with a file where this happens, we can make a spec for it and debug it. I think we might have to introduce being able to write into a larger buffer than the one given to read though.

@rdp
Copy link
Contributor

rdp commented Sep 15, 2017

I haven't looked closely but wonder if it is being passed only 0 bytes even? Hmm...or even "1 byte" (which can't be deflated) but that's the "last byte" (b/c of truncated input) so it's looping forever...hmm...

@Sija
Copy link
Contributor

Sija commented Aug 26, 2018

@nobilik Could you please check does #6610 (merged into the current master) solves your issue?

@nobilik
Copy link
Author

nobilik commented Aug 27, 2018

@Sija Hi! My test app works well now. Sorry I can't try it on production for now, but I think it's fixed. Thanks!

@asterite
Copy link
Member

Let's close this. If it turns out #6610 didn't fix this, please re-open.

@asterite asterite added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib labels Aug 27, 2018
@asterite asterite added this to the 0.26.1 milestone Aug 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib
Projects
None yet
Development

No branches or pull requests