"Bad file descriptor" 5 seconds after specified connection duration #753

Closed
esqt opened this issue Jun 15, 2018 · 9 comments

Comments

@esqt

esqt commented Jun 15, 2018

Context

  • Version of iperf3:
    iperf 3.5 (cJSON 1.5.2)

  • Hardware:
    VirtualBox VM

  • Operating system (and distribution, if any):
    Ubuntu 18.04

Bug Report

The problem occurs when data keeps arriving at the receiver for more than 5 seconds after the specified duration, even though the sender has stopped emitting. This can be the result of large queues in network equipment combined with low bandwidth.

The bug is related to issues #645, #648 and #653.

  • Expected Behavior
    There could be two expected behaviors:

    • either the connection is stopped by the receiver when the specified duration is exceeded
    • or the connection continues while data is coming (there could be a timeout to cut the connection if no data has been received in the last 5s)
  • Actual Behavior
    As soon as the connection has lasted for the duration specified by the sender plus 5 seconds, the message "iperf3: error - select failed: Bad file descriptor" is displayed.

In the example below, the link capacity is 5 Mb/s and the client is sending UDP data at 10 Mb/s

Client:

iperf3 -c 192.168.2.3 -u -b 10000000

Connecting to host 192.168.2.3, port 5201
[  6] local 192.168.0.4 port 41567 connected to 192.168.2.3 port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  6]   0.00-1.00   sec  1.19 MBytes  9.99 Mbits/sec  863
[  6]   1.00-2.00   sec  1.19 MBytes  10.0 Mbits/sec  864
[  6]   2.00-3.00   sec  1.19 MBytes  10.0 Mbits/sec  863
[  6]   3.00-4.00   sec  1.19 MBytes  10.0 Mbits/sec  863
[  6]   4.00-5.00   sec  1.19 MBytes  10.0 Mbits/sec  863
[  6]   5.00-6.00   sec  1.19 MBytes  10.0 Mbits/sec  864
[  6]   6.00-7.00   sec  1.19 MBytes  10.0 Mbits/sec  863
[  6]   7.00-8.00   sec  1.19 MBytes  10.0 Mbits/sec  863
[  6]   8.00-9.00   sec  1.19 MBytes  10.0 Mbits/sec  863
iperf3: error - control socket has closed unexpectedly

Server:

iperf3 -s

Server listening on 5201

Accepted connection from 192.168.0.4, port 42186
[  6] local 192.168.2.3 port 5201 connected to 192.168.0.4 port 41567
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  6]   0.00-1.00   sec   252 KBytes  2.06 Mbits/sec  2.033 ms  0/178 (0%)
[  6]   1.00-2.00   sec   625 KBytes  5.12 Mbits/sec  1.898 ms  0/442 (0%)
[  6]   2.00-3.00   sec   617 KBytes  5.05 Mbits/sec  1.833 ms  0/436 (0%)
[  6]   3.00-4.00   sec   617 KBytes  5.05 Mbits/sec  1.818 ms  0/436 (0%)
[  6]   4.00-5.00   sec   615 KBytes  5.04 Mbits/sec  2.090 ms  0/435 (0%)
[  6]   5.00-6.00   sec   617 KBytes  5.05 Mbits/sec  1.703 ms  0/436 (0%)
[  6]   6.00-7.00   sec   622 KBytes  5.10 Mbits/sec  2.342 ms  0/440 (0%)
[  6]   7.00-8.00   sec   624 KBytes  5.11 Mbits/sec  2.083 ms  0/441 (0%)
[  6]   8.00-9.00   sec   530 KBytes  4.34 Mbits/sec  4.119 ms  0/375 (0%)
[  6]   9.00-10.00  sec   372 KBytes  3.05 Mbits/sec  1.938 ms  0/263 (0%)
[  6]  10.00-11.00  sec   615 KBytes  5.04 Mbits/sec  2.128 ms  0/435 (0%)
[  6]  11.00-12.00  sec   628 KBytes  5.14 Mbits/sec  2.037 ms  0/444 (0%)
[  6]  12.00-13.00  sec   617 KBytes  5.05 Mbits/sec  2.243 ms  0/436 (0%)
[  6]  13.00-14.00  sec   619 KBytes  5.08 Mbits/sec  2.095 ms  0/438 (0%)
iperf3: error - select failed: Bad file descriptor

  • Steps to Reproduce
    The bug is a bit tricky to reproduce: you need a link that buffers packets in a queue (I believe many routers do that) and that has a low capacity, so that the packets stored in the queue cannot all be transferred within 5 seconds.
    Then, just send UDP data at a bitrate higher than the link capacity. This way the last packet sent by the sender arrives at the receiver more than 5 seconds after the sender has finished, and that triggers the bug.
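
A rough back-of-the-envelope check of the numbers in this report, as a minimal sketch: it assumes an unbounded FIFO queue in front of the bottleneck and no loss, which is a simplification. The helper name `drain_seconds` is made up for illustration and is not part of iperf3.

```python
def drain_seconds(send_bps, link_bps, duration_s):
    """Seconds the bottleneck queue needs to drain after the sender stops.

    Assumption: unbounded FIFO queue in front of the bottleneck, no loss.
    """
    backlog_bits = max(send_bps - link_bps, 0) * duration_s
    return backlog_bits / link_bps


# Numbers from the report: 10 Mb/s offered load, 5 Mb/s link,
# and iperf3's default test duration of 10 seconds.
tail = drain_seconds(10_000_000, 5_000_000, 10)
print(tail)      # 10.0
print(tail > 5)  # True
```

Under these assumptions the receiver would still see data roughly 10 seconds after the sender finishes, well past the 5-second window, which matches the server log still reporting traffic in the 13.00-14.00 interval.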
@bmah888
Contributor

bmah888 commented Jun 22, 2018

You're right that this Bad file descriptor message can be unexpected. I think that when iperf3 was being implemented, nobody thought that we'd get into a situation where there are more than 5 seconds of packets in flight. I mean, in order to do this, you basically need to be overrunning the path (or a bottleneck link on the path). What's the use case for running iperf3 like this, where it's deliberately sending (via a non-congestion-aware protocol) more data than you know the path can handle?

I feel like Steve Jobs saying "You're holding it wrong." :-)

In theory we could change the 5 seconds to some other value, or even make it a parameter, but to me that's kind of a hacky workaround.

@ckleu

ckleu commented Oct 26, 2018

I am getting this issue in a back-to-back setup where two servers are connected directly to each other (25G link).

I am trying to run the same test case several times in a row to check the reproducibility of the test. However, the second iteration always fails with this error:

{"error": "unable to send cookie to server: Bad file descriptor"}

I tried running the server in normal CLI mode and starting it via the Python wrapper.

Configuration used

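import iperf3  # python-iperf3 wrapper

def new_client_configure():
  # Build a fresh client object for every test iteration.
  client = iperf3.Client()
  client.omit = 5                      # seconds omitted at the start of the test
  client.duration = 15                 # test duration in seconds
  client.num_streams = 8               # parallel streams
  client.server_hostname = '10.0.0.2'
  client.zerocopy = True
  client.port = 5201
  return client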

@ckleu

ckleu commented Oct 26, 2018

Looks like it might be a problem in the python wrapper, not the iperf3 source code.

I commented out some of the cleanup calls in __del__ and now it seems to execute fine:

    def __del__(self):
        """Cleanup the test after the :class:`IPerf3` class is terminated"""
        # ... (intermediate lines not shown in the original comment)
            #self.lib.iperf_client_end(self._test)
            #self.lib.iperf_free_test(self._test)

@ckleu

ckleu commented Oct 26, 2018

I think I found a better workaround (without modifying the python-iperf3 wrapper):

import time

for i in range(20):
  print("Test iteration: {}".format(i))
  test = new_client_configure()
  r = test.run()
  print("r.received_Mbps: {}".format(r.received_Mbps))
  del test        # drop the client object before the next iteration
  time.sleep(2)   # wait a moment before starting the next test

@bmah888
Contributor

bmah888 commented Feb 1, 2019

It sounds like this isn't really an iperf3 issue, so closing for now.

@bmah888 bmah888 closed this as completed Feb 1, 2019
@Karry

Karry commented May 28, 2019

Hi, I am able to reproduce this issue with pure iperf3 binaries, compiled from master. Just a few conditions have to be met:

  • use specific router
  • run test longer than 10 minutes

I am running the server in the debugger with the arguments -s -p 666 --interval 60, and the client with -c <iperf-server-address> -t 900 -p 666 -l60 -b 0.5M --interval 60 --version4 --bidir

I put three breakpoints on the server side:

  • iperf_server_api.c:216 - termination message from the client
  • iperf_server_api.c:243 - end timer
  • iperf_server_api.c:440 - negative result from select is handled

After a long test (15 minutes) this happened:

  • no termination message from the client had arrived (yet)
  • the control socket is closed in the server_timer_proc timer handler at iperf_server_api.c:243, while test->state is still TEST_RUNNING
  • after handling timers and returning to the main server loop, select fails with errno EBADF (9)

As a result the client gets stuck completely...
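
The sequence above can be imitated outside iperf3. This is only a sketch of the general mechanism on a POSIX system, not iperf3's code: a pipe descriptor stands in for the control socket, and the function name `select_after_close` is made up for illustration.

```python
import errno
import os
import select


def select_after_close():
    """Close a descriptor that is still in the watch set, then call select()."""
    read_fd, write_fd = os.pipe()
    watched = [read_fd]   # stands in for the server's read set
    os.close(read_fd)     # stands in for the timer closing the control socket
    try:
        select.select(watched, [], [], 0)
    except OSError as exc:
        return exc.errno  # error number reported by select()
    finally:
        os.close(write_fd)
    return None


print(select_after_close() == errno.EBADF)
```

On Linux the call is expected to fail with EBADF, the same errno as in the server's "select failed: Bad file descriptor" message, because the watch set still contains a descriptor that has already been closed.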

@Karry

Karry commented May 29, 2019

I found the cause: the control socket timed out on the router. When the client tried to send the "test ends" control message, it never arrived at the server, and this specific router doesn't reply with an RST packet either. The client gets stuck at this point...

I tried installing libkeepalive on the client (Ubuntu Linux) and using it with the iperf client: LD_PRELOAD=libkeepalive.so KEEPCNT=20 KEEPIDLE=180 KEEPINTVL=60

This solves the problem: the control socket remains established and the test ends properly...

So my suggestion is to use TCP keepalive on the control socket by default (or some kind of application heartbeat) and to improve error handling on the server. "Bad file descriptor" is a really bad error message for the case where the control connection has timed out...
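
For illustration, enabling keepalive directly on a socket could look roughly like the sketch below. This is not how iperf3 handles its control socket; `enable_keepalive` is a hypothetical helper, its default values simply mirror the libkeepalive settings above, and the per-socket tuning options are platform specific (they exist on Linux), so they are applied only when available.

```python
import socket


def enable_keepalive(sock, idle=180, interval=60, count=20):
    """Turn on TCP keepalive for sock (hypothetical helper, not iperf3 code)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Tuning knobs are platform specific; skip any that this platform lacks.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the connection is dropped
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
```

With probes every 60 seconds after 180 seconds of idle time, the router should keep seeing traffic on the control connection during a long test, assuming its idle timeout is longer than that.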

@wetglass

wetglass commented Jun 8, 2019

@bmah888 I have a perfectly legit use case. I'm performance testing IoT equipment meant for low-bandwidth, lossy, best-effort wireless connections. The devices use a buffer, on a lossy link, where throughput could easily be pegged. The OP described my exact setup, which reflects real-world solutions for large utilities, etc. I've been stopped in my tracks by this bug. Going to try @Karry's workaround.

@acooks
Contributor

acooks commented Jun 12, 2019

I think this is related to issue 751 and PR 859 #859

There is a race condition between the termination message from the client and server_timer_proc().

6 participants