Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data forwarding repeatedly stopping #5

Open
marcus-aa opened this issue Sep 3, 2020 · 11 comments
Open

Data forwarding repeatedly stopping #5

marcus-aa opened this issue Sep 3, 2020 · 11 comments

Comments

@marcus-aa
Copy link

marcus-aa commented Sep 3, 2020

I have a problem on 2 different installations where BS just stops forwarding the data after 24-36 hours after a re-start.

This only happens where BS connects over LAN/TCP to the source, never when it uses a /dev/usb port.

INPUT_OPTIONS="--net 192.168.10.158:10003"

The TCP connection stays up, BS answers new local connections to the output ports, but no data is coming through.
if I stop and start the service, everything works again for a day or two.

TCP keepalive issue? In my experience they cause this kind of problems, better just check if data is coming in or re-connect.
The source are Radarcapes.
EDIT: Also with Dump1090 on the other end it shows the same problem.

● beast-splitter.service - Mode S Beast data forwarder
Loaded: loaded (/lib/systemd/system/beast-splitter.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2020-08-27 09:17:02 CEST; 1 weeks 0 days ago
Docs: https://github.com/flightaware/beast-splitter
Main PID: 391 (beast-splitter)
CGroup: /system.slice/beast-splitter.service
└─391 /usr/bin/beast-splitter --net 192.168.10.158:10003 --listen localhost:35005:RCdfGi --listen localhost:45005:Bcdfgij

@mutability
Copy link
Contributor

mutability commented Sep 3, 2020

When it gets stuck, what is the state of the TCP connection? Is new data actually arriving from the remote host? Is there data waiting in the receive buffer that beast-splitter isn't reading? What does the Radarcape think the state of the connection is?

@marcus-aa
Copy link
Author

marcus-aa commented Sep 3, 2020

I have another one here that is stuck right now if you need further details. Here the other side is actually dump1090-fa on another Rpi.

INPUT_OPTIONS="--net 192.168.1.8:30005"
OUTPUT_OPTIONS="--listen localhost:30005:BCdfGi --connect localhost:30004:BCdfGi"

All connections show "Established".

netstat -tpn | grep beast
tcp        0      0 127.0.0.1:54598         127.0.0.1:30004         ESTABLISHED 16357/beast-splitte 
tcp        0      0 127.0.0.1:30005         127.0.0.1:54054         ESTABLISHED 16357/beast-splitte 
tcp        0      0 192.168.1.77:44636      192.168.1.8:30005       ESTABLISHED 16357/beast-splitte 
tcp        0      0 127.0.0.1:30005         127.0.0.1:52514         ESTABLISHED 16357/beast-splitte 

There is plenty of data if I make a parallell connection to the same port, but none on the BS output ports.

root@P-ESSE01:~# nc localhost 30005
^C
root@P-ESSE01:~# 
root@P-ESSE01:~# nc 192.168.1.8 30005 
2??R?K?]Fv?r2??R?x?=P?3??R?켠0?????s?3??R??-w???`0???Z?2??SK?]Fv?r3??S?w????????B??3??Sа????*A4?
                                                                                                }k?2??S?? ?Œ?2??SLλ]Fv?r3??S-?;?FX??qH??2??S?f]D	?^T?3??S?'?0?????s?2??LO)]BB??vQ3????'???g: ?է!2??S	^?<]F??e$3??S	ѻ'?0?9?0?>?)2??S
ҫ& 0K?)2??S
           ??? 0K?+3??S
                       ??Р??i??7??@?2??S
??+?F!/?!??3??S??;???A ???GE?2??S?(     ^

The service is active and there are no attempts to re-connect to 1.8

root@P-ESSE01:~# systemctl status beast-splitter.service
● beast-splitter.service - Mode S Beast data forwarder
   Loaded: loaded (/lib/systemd/system/beast-splitter.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-08-26 17:18:05 CEST; 1 weeks 1 days ago
     Docs: https://github.com/flightaware/beast-splitter
 Main PID: 16357 (beast-splitter)
   CGroup: /system.slice/beast-splitter.service
           └─16357 /usr/bin/beast-splitter --net 192.168.1.8:30005 --listen localhost:30005:BCdfGi --connect localhost:30004:BCdfGi --status-file /run/beast-splitter/status.json

Sep 03 21:31:54 P-ESSE01 beast-splitter[16357]: 127.0.0.1:52758: settings changed to BCdfgij
Sep 03 21:34:24 P-ESSE01 beast-splitter[16357]: 127.0.0.1:52758: connection closed
Sep 03 21:34:24 P-ESSE01 beast-splitter[16357]: 127.0.0.1:30005: accepted a connection from 127.0.0.1:52778 with settings BCdfgi
Sep 03 21:34:24 P-ESSE01 beast-splitter[16357]: 127.0.0.1:52778: settings changed to BCdfgij
Sep 03 21:36:54 P-ESSE01 beast-splitter[16357]: 127.0.0.1:52778: connection closed
Sep 03 21:36:54 P-ESSE01 beast-splitter[16357]: 127.0.0.1:30005: accepted a connection from 127.0.0.1:52798 with settings BCdfgi
Sep 03 21:36:54 P-ESSE01 beast-splitter[16357]: 127.0.0.1:52798: settings changed to BCdfgij
Sep 03 21:39:25 P-ESSE01 beast-splitter[16357]: 127.0.0.1:52798: connection closed
Sep 03 21:39:25 P-ESSE01 beast-splitter[16357]: 127.0.0.1:30005: accepted a connection from 127.0.0.1:52812 with settings BCdfgi
Sep 03 21:39:25 P-ESSE01 beast-splitter[16357]: 127.0.0.1:52812: settings changed to BCdfgij

@mutability
Copy link
Contributor

What does netstat look like on the dump1090 side?

Usually the cause of this sort of behavior is that the sending side no longer believes the connection is alive, but the receiving side never heard the connection teardown messages. This can be e.g. because of a sufficiently long network outage, or intermediate firewalls losing connection state.

dump1090-fa and, I think, Radarcapes will ensure they're always sending regular data, so treating the connection as failed after a timeout with no data is one way to solve it. But this behaviour is not universal so it'd need to be optional. Sending periodic beast commands upstream is another way to detect a failed connection. Enabling TCP keepalives is third way but the typical keepalive timeouts tend to be very large (hours).

@marcus-aa
Copy link
Author

Revisiting.

In my experience with TCP data in NMEA & AIS networks, where data is similar and often unidirectional, there is a need to have a data timeout on the receiving side somewhere between 60 and 200 seconds. Otherwise this condition will appear sooner or later. In this particular case once per day on average.

I now have access to both sides of this connection, here is the Beast-splitter side (IP 192.68.1.10 connecting to .1.8:30005)

image

And here is the dump1090-fa side,

image

Just like you say there is no connection there any more.

After re-starting beast-splitter:

image

it is working normally again, for a while.

It would be very good to add an optional data timeout, otherwise the network input is not so useable.

mlat-client handles this already as default I think? It does not have this problem if I connect it directly to a network port without using beast-splitter in between. But then I don't get the ADSB-data to the rest of the system.

@marcus-aa
Copy link
Author

Re-visiting. Same problem on another host. It has sat there for a week now thinking the connection is up. A data timeout is needed.

9404 /usr/bin/beast-splitter --net 192.168.0.102:10002 --listen localhost:30005:RCdfGi --listen localhost:40005:BCDfgij --connect localhost:30004:BCdfGi --status-file /run/beast-splitter/status.json

root@P-ESPA01:~# netstat -tpn | grep beast
tcp 0 0 127.0.0.1:30005 127.0.0.1:39208 ESTABLISHED 9404/beast-splitter
tcp 0 0 127.0.0.1:30005 127.0.0.1:51620 ESTABLISHED 9404/beast-splitter
tcp 0 0 192.168.0.101:46606 192.168.0.102:10002 ESTABLISHED 9404/beast-splitter

root@P-ESPA01:~# cat /run/beast-splitter/status.json
{
"radio" : {
"status" : "red",
"message" : "Not connected to receiver"

root@P-ESPA01:~# nc 192.168.0.102 10002
24Xħ??=�??34Xħ?)V?=?80????34X?QM? ??J%-??2ԧ34X?

No problem to re-connect.

@mutability
Copy link
Contributor

I doubt I'll get to this any time soon - can you look at doing a PR for this?

@mutability
Copy link
Contributor

mutability commented Sep 18, 2022

That "not connected" is a little odd though - I wonder if something else is going on there. What's the last log output?

("not connected" implies that the connection is dead -- but the OS disagrees -- or has never seen a valid message, or it got garbled data and couldn't re-establish sync)

@marcus-aa
Copy link
Author

My understanding is that Beast-splitter "knows" on a data level that it is not receiving status messages (it is a Radarcape), therefor the "Not connected..." messages , but the TCP connection has not been re-established correctly. So BS doesn't "do" anything with the information that the status messages are lost. Different OSI layers...

I only have log entries back to the 16th, lost connection on the 12th. It just repeats like this and are the incoming connections.

Sep 18 18:47:23 P-ESPA01 beast-splitter[9404]: 127.0.0.1:51810: settings changed to RCdfGij
Sep 18 18:49:53 P-ESPA01 beast-splitter[9404]: 127.0.0.1:51810: connection closed
Sep 18 18:49:53 P-ESPA01 beast-splitter[9404]: 127.0.0.1:30005: accepted a connection from 127.0.0.1:51816 with settings RCdfGi
Sep 18 18:49:53 P-ESPA01 beast-splitter[9404]: 127.0.0.1:51816: settings changed to RCdfGij

mlat-client seems to have a 150s timeout and re-establishes connections repeatedly.

Sep 18 18:59:54 P-ESPA01 vrss-mlat-client[255]: Sun Sep 18 18:59:54 2022 Disconnecting from localhost:30005: No data (not even keepalives) received for 150 seconds
Sep 18 18:59:54 P-ESPA01 vrss-mlat-client[255]: Sun Sep 18 18:59:54 2022 Input connected to localhost:30005
Sep 18 18:59:54 P-ESPA01 vrss-mlat-client[255]: Sun Sep 18 18:59:54 2022 Input format changed to RADARCAPE, 1000MHz clock

@marcus-aa
Copy link
Author

I doubt I'll get to this any time soon - can you look at doing a PR for this?

I will look, but my programming skills are unfortunately more in the line of making a script that reads status.json and forces a re-start of the beast-splitter service if there are too many consecutive "Not connected"....

@mutability
Copy link
Contributor

When beast-splitter has seen at least one status message, it expects at least one status message every 15 seconds. If 15 seconds elapses with no status messages, it will disconnect / reconnect, which is basically exactly what you're asking for here:

liveness_timer.expires_from_now(radarcape_liveness_interval);

The "not connected to receiver", however, means that is_connected() returns false:

std::string radio_message = (input->is_connected() ? "Connected to receiver" : "Not connected to receiver");

bool is_connected(void) const { return (good_sync && receiver_type != ReceiverType::UNKNOWN); }

If is_connected() is false, that means that either

a) we have no connection at all; or
b) we have a connection, but autodetection of Beast vs Radarcape mode is still in progress; or
c) we have a connection, but we don't have "sync" (where "sync" means "discovered the start of a correctly formatted message" -- this is more relevant for serial connections, where you might start receiving data mid-message).

For (a) the OS, at least, thinks we have a connection;
for (b) that condition should only last for a few seconds, not hours;
which leaves (c)

So my current guess (in the absence of a bug somewhere in the above code) is that beast-splitter established the most recent network connection, and then absolutely nothing was ever received - no status messages, no data, nothing. Since no status message was received, the must-see-regular-status-messages timeout never gets started and the connection just sits there forever.

mutability added a commit that referenced this issue Oct 6, 2022
* If the input connection is forced to Radarcape mode, then Radarcape
  status messages are expected immediately; beast-splitter does not
  wait to see a status message before starting the timer that expects
  at least one status message per 15 seconds.

* New option, --beast-input-timeout, sets a timeout for non-Radarcape
  connections (in seconds). If no valid message is received within
  that timeout, beast-splitter disconnects/reconnects.

+ some internal cleanups
@mutability
Copy link
Contributor

mutability commented Oct 6, 2022

Can you try out PR #6 ? (Compiles, minimal testing)

You'll want to pass a new command-line option --beast-input-timeout (e.g. --beast-input-timeout 60.0 for a 60-second timeout). This applies only to non-Radarcape connections, Radarcape-mode connections expect a status message every 15 seconds even if --beast-input-timeout is not set

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants