Retry attempts on communication between parent and child #89

Open
mjdoris wants to merge 1 commit into master

Conversation

@mjdoris commented Jun 17, 2021

Added try blocks around the get()/put() calls between a worker and its parent tab. A 30 second timeout is applied, and a TimeoutError exception is caught in order to retry the message.

Reason: When using secure communication, zmq messages get lost/dropped somewhere between the parent tab and its worker, seemingly at random. The most common scenario is when sending a new job to a worker: the parent issues a job to the worker and proceeds to wait for an acknowledgement, but the worker never receives the message and continues to wait for a job from the parent. Now both the parent and the worker are waiting for messages that will never come. This causes the blacs tab to enter an infinite wait state, and the tab errors out with 'Device has not responded for xx:xx time'. The tab will then sit in this error state counting up for hours, days, or weeks until it is restarted, bringing any experiment to a halt. This seems to occur infrequently, but devices which send requests repeatedly in a short amount of time tend to encounter the error more often (such as the Pulseblaster with its 'check_status' job). However, I've logged this happening to a Pulseblaster and multiple NI devices, so it's definitely not device specific.

This error is not computer specific either, and seems to plague all computers on the NIST network, even those simply communicating over their localhost loopback interface. Adding these retry blocks seems to get around the issue if turning off secure communication is not an option. So far it appears that only one retry is needed to restore communication between the tab and its worker.

I've spent quite a bit of time troubleshooting the origin of this issue. From debugging, it seems to happen during the ZMQ PUSH/PULL operations when secure communication is turned on. I have also monitored network traffic during test shots and do see a lot of packet retransmissions when the issue occurs. Physically disconnecting the computer from any network also seems to stop the communication problem. I suspect something is happening with the encryption due to some network interference, and labscript is blind to ZMQ discarding its message queue or losing messages. However, even computers strictly on the localhost interface suffer from this, so I'm not completely confident it's just the external network at fault. Regardless, it's probably not good that labscript doesn't appear to be able to recover from ZMQ hiccups during the tab/worker exchange.
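
In rough outline, the added retry looks like the following (a sketch only; the helper name here is illustrative, and in the actual change the retry is written inline around the existing get()/put() calls):

    # Sketch of the retry pattern added in this PR (names are illustrative).
    # zprocess's queue get()/put() raise TimeoutError when the timeout expires.
    import logging

    logger = logging.getLogger(__name__)

    def put_with_retry(queue, message, timeout=30):
        """Keep resending until the put() succeeds within the timeout."""
        while True:
            try:
                queue.put(message, timeout=timeout)
                return
            except TimeoutError:
                # The message may have been lost in transit; log it and resend
                logger.warning('put() timed out after %s s, retrying', timeout)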

@chrisjbillington (Member) commented Jun 19, 2021

Hi Michael,

This was an infuriating issue that I was never able to get to the bottom of. It's good that retries fix the issue, though unsatisfying. It makes one wonder what else in the codebase is vulnerable to messages being lost in transit.

It's particularly frustrating since connections on localhost should never fail. The problem is NIST-specific as you've seen; I never saw it anywhere else. I had theories that it was due to antivirus or other nosy software doing essentially a TCP reset attack and closing the connections. Though this still should not have resulted in message loss, since zmq should reconnect and unsent messages should be queued and retried - we are not supposed to have to worry about retries ourselves.

But now I have another theory. @pacosalces will remember that we had a strange issue at NIST when not using encryption. When leaving lyse running overnight on an office computer, some other computer at NIST would connect to the zmq PULL socket for printing text to the output box. It would send what looked like the text for an HTTP GET request with a firefox user agent. Very strange! The format of the message didn't match what the code running in lyse expected though, and lyse just crashed.

Anyway, we solved this problem by turning on encryption. Then the mysterious firefox requests would not make it to application code anyway and the problem went away.

BLACS uses zmq PUSH and PULL sockets for sending data between the parent and worker processes. PUSH and PULL are technically one-to-many and many-to-one sockets, with PUSH sockets fanning out messages - taking turns which connected client they send messages to. That means if an external computer (such as our firefox friend) were to connect to a PUSH socket, it could "steal" messages - they would fail to be delivered to the other client. We have no use for this fanning-out mechanism; it's just how these sockets work when multiple clients connect.
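
As a minimal illustration of that fan-out behaviour (plain pyzmq rather than the zprocess wrappers; ports and names here are arbitrary):

    import time
    import zmq

    ctx = zmq.Context.instance()

    # The "parent" side: a bound PUSH socket, analogous to BLACS' to_child socket
    push = ctx.socket(zmq.PUSH)
    port = push.bind_to_random_port('tcp://127.0.0.1')

    # Two PULL clients: the intended worker plus an uninvited extra peer
    worker = ctx.socket(zmq.PULL)
    worker.connect('tcp://127.0.0.1:%d' % port)
    intruder = ctx.socket(zmq.PULL)
    intruder.connect('tcp://127.0.0.1:%d' % port)

    time.sleep(0.5)  # give both connections time to complete

    for i in range(4):
        push.send_string('job %d' % i)

    time.sleep(0.5)
    # PUSH round-robins over connected peers, so each client only sees half the jobs
    for name, sock in [('worker', worker), ('intruder', intruder)]:
        while sock.poll(100):
            print(name, sock.recv_string())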

Evidence against this theory: It occurs when you are using encryption, and doesn't occur when you're not. One would expect the other way around! Our firefox friend can only butt in if it knows our encryption key, no? Despite that, I can see how it possibly still could occur with encryption on. Reading the docs of ZMQ_IMMEDIATE here, it seems that PUSH sockets can still queue messages for peers that haven't finished connecting yet. Perhaps a client without encryption attempting to connect still has messages enqueued for it, which then sit in that queue forever since the connection is not successful. As for why it doesn't occur when encryption is off, well perhaps there's a race condition such that our firefox friend's connection attempts fail faster (after all, crypto is slower) and it is not even in a partially connected state long enough to have a message enqueued for it.

Evidence for this theory: The problem ceases when you disconnect from the external network. Also, other high-speed and frequent zmq communication does not suffer from this issue - such as the camera worker class sending frames to the parent, it gets a response for every frame. Only PUSH sockets that call bind() would be vulnerable to this issue, and I think the sockets for sending data to child processes in BLACS and lyse are the only ones using this.

Could you perhaps try out the following two modifications (separately) for zprocess/process_tree.py? If this theory is right then either of them should solve the problem, though one is a bit more aggressive than the other.

Less aggressive option: keep using push/pull, but set zmq.IMMEDIATE to hopefully prevent "losing" messages by queueing them for sending on partially connected clients:

diff --git a/zprocess/process_tree.py b/zprocess/process_tree.py
index 2105469..8f8d3bc 100644
--- a/zprocess/process_tree.py
+++ b/zprocess/process_tree.py
@@ -1294,6 +1294,7 @@ class ProcessTree(object):
         the child process. TODO finish this and other docstrings."""
         context = SecureContext.instance(shared_secret=self.shared_secret)
         to_child = context.socket(zmq.PUSH, allow_insecure=self.allow_insecure)
+        to_child.setsockopt(zmq.IMMEDIATE, 1)
         from_child = context.socket(zmq.PULL, allow_insecure=self.allow_insecure)

         from_child_port = from_child.bind_to_random_port('tcp://*')
@@ -1424,6 +1425,7 @@ class ProcessTree(object):

         context = SecureContext.instance(shared_secret=self.shared_secret)
         to_parent = context.socket(zmq.PUSH, allow_insecure=self.allow_insecure)
+        to_parent.setsockopt(zmq.IMMEDIATE, 1)
         from_parent = context.socket(zmq.PULL, allow_insecure=self.allow_insecure)

         from_parent.connect(

More aggressive option: don't use push/pull, use exclusive pair sockets instead - prevents other peers from connecting (partially or otherwise) at all. This is what I wanted to do when first making zprocess actually, but the documentation warned that they were buggy or something so we went for push/pull. Silly, since the communication is supposed to be exclusive.

diff --git a/zprocess/process_tree.py b/zprocess/process_tree.py
index 2105469..b21ffb3 100644
--- a/zprocess/process_tree.py
+++ b/zprocess/process_tree.py
@@ -419,7 +419,7 @@ class WriteQueue(object):
         self.sock = sock
         self.lock = threading.Lock()
         self.poller = zmq.Poller()
-        self.poller.register(self.sock)
+        self.poller.register(self.sock, zmq.POLLOUT)
         self.interruptor = Interruptor()
 
     def put(self, obj, timeout=None, interruptor=None):
@@ -432,7 +432,7 @@ class WriteQueue(object):
         with self.lock:
             try:
                 interruption_sock = interruptor.subscribe()
-                self.poller.register(interruption_sock)
+                self.poller.register(interruption_sock, zmq.POLLIN)
                 while True:
                     if timeout is not None:
                         timeout = max(0, (deadline - monotonic()) * 1000)
@@ -478,18 +478,18 @@ class ReadQueue(object):
 
     def __init__(self, sock):
         self.sock = sock
-        self.to_self = sock.context.socket(zmq.PUSH)
-        self.from_self = sock.context.socket(zmq.PULL)
+        self.to_self = sock.context.socket(zmq.PAIR)
+        self.from_self = sock.context.socket(zmq.PAIR)
         self_endpoint = 'inproc://zpself' + hexlify(os.urandom(8)).decode()
         self.from_self.bind(self_endpoint)
         self.to_self.connect(self_endpoint)
         self.lock = threading.Lock()
         self.to_self_lock = threading.Lock()
         self.in_poller = zmq.Poller()
-        self.in_poller.register(self.sock)
-        self.in_poller.register(self.from_self)
+        self.in_poller.register(self.sock, zmq.POLLIN)
+        self.in_poller.register(self.from_self, zmq.POLLIN)
         self.out_poller = zmq.Poller()
-        self.out_poller.register(self.to_self)
+        self.out_poller.register(self.to_self, zmq.POLLOUT)
         self.interruptor = Interruptor()
 
     def get(self, timeout=None, interruptor=None):
@@ -505,7 +505,7 @@ class ReadQueue(object):
         with self.lock:
             try:
                 interruption_sock = interruptor.subscribe()
-                self.in_poller.register(interruption_sock)
+                self.in_poller.register(interruption_sock, zmq.POLLIN)
                 events = dict(self.in_poller.poll(timeout))
                 if not events:
                     raise TimeoutError('get() timed out')
@@ -1293,8 +1293,8 @@ class ProcessTree(object):
         Process.interrupt_startup() (such as Process.terminate()) may wish to terminate
         the child process. TODO finish this and other docstrings."""
         context = SecureContext.instance(shared_secret=self.shared_secret)
-        to_child = context.socket(zmq.PUSH, allow_insecure=self.allow_insecure)
-        from_child = context.socket(zmq.PULL, allow_insecure=self.allow_insecure)
+        to_child = context.socket(zmq.PAIR, allow_insecure=self.allow_insecure)
+        from_child = context.socket(zmq.PAIR, allow_insecure=self.allow_insecure)
 
         from_child_port = from_child.bind_to_random_port('tcp://*')
         to_child_port = to_child.bind_to_random_port('tcp://*')
@@ -1423,8 +1423,8 @@ class ProcessTree(object):
             self.zlock_client.set_process_name(name)
 
         context = SecureContext.instance(shared_secret=self.shared_secret)
-        to_parent = context.socket(zmq.PUSH, allow_insecure=self.allow_insecure)
-        from_parent = context.socket(zmq.PULL, allow_insecure=self.allow_insecure)
+        to_parent = context.socket(zmq.PAIR, allow_insecure=self.allow_insecure)
+        from_parent = context.socket(zmq.PAIR, allow_insecure=self.allow_insecure)
 
         from_parent.connect(
             'tcp://%s:%d' % (self.parent_host, parentinfo['from_parent_port'])

@chrisjbillington (Member) commented

Here were the kinds of messages we were getting before we turned on encryption:

[b'no-cache\r\nUser-Agent: Mozilla/4.0 (compatible; MSIE 8.0; ']
[b'no-cache\r\nUser-Agent: Mozilla/4.0 (compatible; MSIE 8.0; ']
[b'859-1,utf-8;q=0.9,*;q=0.1\r\nAccept-Language: ']
[b'859-1,utf-8;q=0.9,*;q=0.1\r\nAccept-Language: ']
[b'no-cache\r\nUser-Agent: Mozilla/4.0 (compatible; MSIE 8.0; ']
[b'no-cache\r\nUser-Agent: Mozilla/4.0 (compatible; MSIE 8.0; ']
[b'no-cache\r\nUser-Agent: Mozilla/4.0 (compatible; MSIE 8.0; ']
[b'no-cache\r\nUser-Agent: Mozilla/4.0 (compatible; MSIE 8.0; ']

If you can figure out what kind of software is sending messages like this over zmq at NIST, I'd be interested to know!

@chrisjbillington (Member) commented Jun 19, 2021

Haven't ruled out the TCP reset theory though, and others have experienced it:

zeromq/libzmq#3392

If it's TCP resets causing the problem, then the above two patches won't fix it. Retries are the only way.

But it would be nice to have the retries at a lower level rather than in the BLACS code, so that they are shared by all code using zprocess/zeromq.

Pretty annoying that we've got TCP, a protocol that is reliable within a single connection, with zeromq automatically reconnecting on top of it to make things more reliable still - and yet messages can still be dropped, such that we would have to implement our own message sequencing and acknowledgements, defeating the purpose of using TCP in the first place. Might as well just use raw UDP if that's how it's going to be.
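
For what it's worth, application-level sequencing and acknowledgements would amount to something like this rough sketch (not proposed zprocess code; the queue objects are assumed to behave like zprocess's WriteQueue/ReadQueue, and a real receiver would ack each sequence number and drop duplicates):

    # Rough sketch of sender-side sequencing + acknowledgement (illustrative only).
    def send_reliably(to_child, from_child, obj, seq, timeout=5, max_retries=3):
        for attempt in range(max_retries):
            to_child.put(('msg', seq, obj), timeout=timeout)
            try:
                kind, acked_seq = from_child.get(timeout=timeout)
            except TimeoutError:
                continue  # lost somewhere - resend with the same sequence number
            if kind == 'ack' and acked_seq == seq:
                return
        raise TimeoutError('no acknowledgement after %d attempts' % max_retries)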

@chrisjbillington
Copy link
Member

If it's TCP resets causing the problem, this might protect at least the non-remote BLACS workers, by not exposing their connections outside localhost:

diff --git a/zprocess/process_tree.py b/zprocess/process_tree.py
index 2105469..2c1cf47 100644
--- a/zprocess/process_tree.py
+++ b/zprocess/process_tree.py
@@ -1296,8 +1296,12 @@ class ProcessTree(object):
         to_child = context.socket(zmq.PUSH, allow_insecure=self.allow_insecure)
         from_child = context.socket(zmq.PULL, allow_insecure=self.allow_insecure)

-        from_child_port = from_child.bind_to_random_port('tcp://*')
-        to_child_port = to_child.bind_to_random_port('tcp://*')
+        if remote_process_client is None:
+            bind_ip = '127.0.0.1'
+        else:
+            bind_ip = '*'
+        from_child_port = from_child.bind_to_random_port(f'tcp://{bind_ip}')
+        to_child_port = to_child.bind_to_random_port(f'tcp://{bind_ip}')
         self.check_broker()
         if self.heartbeat_server is None:
             # First child process, we need a heartbeat server:

@mjdoris (Author) commented Jun 23, 2021

Hey Chris,

Thanks for all of this help and input. I will try the modifications and see what happens. It may take a while due to the nature of the issue. I'll have to run a shot overnight or through multiple days to see if it happens again with the changes. It's quite annoying to reproduce consistently.

It's strange that lyse was picking up the request packets from an HTTP browser. Those user-agent headers usually come with a request, not a response, and the requests are targeted toward a specific IP on port 80 iirc. I don't recall lyse being set up to use an HTTP port, right?

I haven't seen such traffic in my port sniffing, but it could be that whatever/whoever was the source is not at NIST currently due to the pandemic. I will say that we have noticed a TON of discovery packets from the NI GigE software when these issues occur. I thought it might be related, but when I blocked this traffic I was still getting the dropped messages. So unless these discovery packets are doing something else to NIST's network that causes the drop, I am doubtful it's related now.

Generally I'll see a burst of discovery packets, a large number of spurious retransmissions, unseen ACK segments, and even some RST packets. On their own they're not usually much of a concern, but seeing a sudden influx of retransmits, resets, and unseen segments together raises my eyebrow, especially when my test setup and then the chip lab's PB die seconds later. There have been times when I have checked the sniffer's log against the times of the chip lab's missing-message failures and they do line up, but without anything more substantial I haven't been able to link anything together.

@mjdoris (Author) commented Jun 24, 2021

Hey Chris, just to give you an update: I tried the less aggressive option (using zmq.IMMEDIATE) and messages were still lost between the Pulseblaster tab and its worker today. It sat waiting for about 3 hours before I noticed; the shot had been running for about 24 hours. I am now trying the more aggressive option of replacing push/pull with pair sockets and will update you if anything happens.

@mjdoris (Author) commented Jun 28, 2021

So another update:

I replaced the push/pull sockets with pair sockets as you suggested. I have left a shot running and it has gone without issue for four days so far. This setup usually encounters the communication issue at least once over a weekend, but it has not happened yet.

I will keep it running through the week to see if anything happens, but so far it appears this did the trick.

@chrisjbillington (Member) commented

Thanks Michael, to be honest that is kind of surprising to me! But good news. If it turns out this fixes it, I do like it as a solution since indeed these communication lines are supposed to be exclusive pairs, and the PULL/PUSH model was not the right fit anyway.

But I will have to figure out how to approach the backward-compatibility issue in zprocess, either raising an error telling you to update or providing some way for zprocess running on different computers to still work if they're different versions. Presumably you have the cameras running on a separate computer in the chip lab still, which would have broken unless you patched zprocess on both computers, right?

@chrisjbillington (Member) commented

It's strange that lyse was picking up the request packets from an HTTP browser. Those user-agent headers usually come with a request, not a response, and the requests are targeted toward a specific IP on port 80 iirc. I don't recall lyse being set up to use an HTTP port, right?

Yes, completely bizarre. Not port 80, and of course HTTP does not normally run over zeromq. As a test I just pointed a web browser at a port with a zmq.PULL socket bound to it, and surprisingly, data came through. I suppose (at least some) HTTP headers just happen to contain the correct bytes to be considered a valid zeromq handshake. That explains that I guess! Just a web-browser pointed at a non-standard port, perhaps as part of an automated security scanner or something.
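
That test is easy to reproduce with plain pyzmq (unencrypted; the port is random):

    # Bind an unencrypted PULL socket and print whatever arrives, then point a web
    # browser at http://127.0.0.1:<port> and watch for fragments of the HTTP request.
    import zmq

    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PULL)
    port = sock.bind_to_random_port('tcp://127.0.0.1')
    print('point a web browser at http://127.0.0.1:%d' % port)
    while True:
        print(sock.recv_multipart())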

The packet-sniffing info is interesting, though without being in the mindset of testing a more specific hypothesis it's hard to make a whole lot of sense of it! Much of that traffic could be the response to breakage rather than the cause, which makes it hard to disentangle.

@mjdoris (Author) commented Jul 1, 2021

Thanks Michael, to be honest that is kind of surprising to me! But good news. If it turns out this fixes it, I do like it as a solution since indeed these communication lines are supposed to be exclusive pairs, and the PULL/PUSH model was not the right fit anyway.

But I will have to figure out how to approach the backward-compatibility issue in zprocess, either raising an error telling you to update or providing some way for zprocess running on different computers to still work if they're different versions. Presumably you have the cameras running on a separate computer in the chip lab still, which would have broken unless you patched zprocess on both computers, right?

Since the chip lab is taking data I am unable to touch the setup there, so all of this has been happening on a test setup. But I would expect the camera, or anything remote, to break if its zprocess library is still using the old code. Other than a version check and warning, I'm not sure how to make it backwards compatible (unless there was some sort of exception for cameras). But it could just be one of those things where you're forced to update everything, as annoying as that can be for some labs.
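
A version check could be as simple as something like the following sketch, where the 'zprocess_protocol' key and version numbers are made up purely for illustration:

    # Hypothetical sketch of a protocol-version check; the 'zprocess_protocol' key
    # and the version numbers are invented here purely for illustration.
    PROTOCOL_VERSION = 2  # e.g. bumped when PUSH/PULL sockets become PAIR sockets

    def check_protocol(parentinfo):
        """Fail with a clear message instead of hanging if parent and child disagree."""
        remote = parentinfo.get('zprocess_protocol', 1)
        if remote != PROTOCOL_VERSION:
            raise RuntimeError(
                'zprocess protocol mismatch (parent: %d, child: %d); '
                'please update zprocess on both computers' % (remote, PROTOCOL_VERSION)
            )

    # The child process would call this on the info dict it receives from the parent:
    check_protocol({'zprocess_protocol': 2})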

Also, that same shot I spoke of earlier is still running without issues. It has been going for a week now.
