Add error handling to Cryomech PTC drivers and agent acq process #767

BrianJKoopman · 2024-10-02T19:34:06Z

Description

This PR implements error handling for the Cryomech PTC class and most of the agent, with the main goal of making the acq() process robust to network interruptions.

In doing this I moved the "drivers" out to their own module. That's perhaps a bit noisy overall in this PR, but the interesting diffs are separated to:

Driver error handling implementation: 41f5f8f
Agent error handling in acq process: 1470e0a

Motivation and Context

This implementation is the one outlined in "Reconnecting in the Drivers" in #538, heavily inspired by #325.

Resolves #766.

How Has This Been Tested?

Tested by running the agent locally with an actual Cryomech compressor. Tested the agent remains online while disconnecting the network cable from the compressor and re-establishes connection when the cable is plugged back in. Example logs:

2024-10-02T15:18:47-0400 start called for acq
2024-10-02T15:18:47-0400 acq:1 Status is now "starting".
2024-10-02T15:18:47-0400 Response from acq.start(): Started process "acq".
2024-10-02T15:18:47-0400 init:0 PTC agent initialized
2024-10-02T15:18:47-0400 init:0 Status is now "done".
2024-10-02T15:18:47-0400 acq:1 Status is now "running".
2024-10-02T15:19:12-0400 Failed to get data from compressor. Check network connection.
2024-10-02T15:19:23-0400 Failed to get data from compressor. Check network connection.
2024-10-02T15:19:32-0400 Connection re-established.

Also tested shutting down the compressor and bringing it back up, which causes the reset() to trigger and produces a slightly different set of errors. It still reconnects once the compressor is back online. Example logs:

2024-10-02T15:19:51-0400 Failed to get data from compressor. Check network connection.
2024-10-02T15:20:02-0400 Failed to get data from compressor. Check network connection.
2024-10-02T15:20:12-0400 Failed to get data from compressor. Check network connection.
2024-10-02T15:20:13-0400 Connection error: [Errno 32] Broken pipe
2024-10-02T15:20:13-0400 Resetting the connection to the compressor.
2024-10-02T15:20:14-0400 Unable to connect. [Errno 111] Connection refused
2024-10-02T15:20:14-0400 Caught unexpected AttributeError during write:
2024-10-02T15:20:14-0400   'NoneType' object has no attribute 'sendall'
2024-10-02T15:20:14-0400 Failed to get data from compressor. Check network connection.
2024-10-02T15:20:15-0400 Caught unexpected AttributeError during write:
2024-10-02T15:20:15-0400   'NoneType' object has no attribute 'sendall'
2024-10-02T15:20:15-0400 Resetting the connection to the compressor.
2024-10-02T15:20:15-0400 Unable to connect. [Errno 111] Connection refused
2024-10-02T15:20:15-0400 Caught unexpected AttributeError during write:
2024-10-02T15:20:15-0400   'NoneType' object has no attribute 'sendall
2024-10-02T15:20:47-0400 Resetting the connection to the compressor.
2024-10-02T15:20:47-0400 Connection re-established.

(There will be a lot of the AttributeError logs, as the acq process loops.)

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

BrianJKoopman · 2024-10-04T14:58:44Z

Added error handling for when the socket comes back as None to avoid those AttributeError exceptions. Retested, and I'm fairly happy with the performance and logging at the moment. Will likely merge when checks are done.

This happens when we try to reset the connection, but can't establish anything due to the device not accepting connections.

BrianJKoopman · 2024-10-07T17:02:27Z

Messed up splitting out #771, had to push a rebase...will merge when checks complete.

BrianJKoopman added the agent: cryomech label Oct 2, 2024

BrianJKoopman mentioned this pull request Oct 7, 2024

cryomech: Move PTC class to drivers module #771

Merged

6 tasks

BrianJKoopman added 4 commits October 7, 2024 13:00

Implement error handling in Cryomech drivers

32331b0

Handle ConnectionError in main acq process

48948f8

Handle connection error in init task

555d274

Handle condition where socket returns None

2bddc4f

This happens when we try to reset the connection, but can't establish anything due to the device not accepting connections.

BrianJKoopman force-pushed the koopman/cryomech-robustness branch from 81929b9 to 2bddc4f Compare October 7, 2024 17:01

BrianJKoopman merged commit cc1efa3 into main Oct 7, 2024
5 checks passed

BrianJKoopman deleted the koopman/cryomech-robustness branch October 7, 2024 17:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add error handling to Cryomech PTC drivers and agent acq process #767

Add error handling to Cryomech PTC drivers and agent acq process #767

BrianJKoopman commented Oct 2, 2024

BrianJKoopman commented Oct 4, 2024

BrianJKoopman commented Oct 7, 2024

Add error handling to Cryomech PTC drivers and agent acq process #767

Add error handling to Cryomech PTC drivers and agent acq process #767

Conversation

BrianJKoopman commented Oct 2, 2024

Description

Motivation and Context

How Has This Been Tested?

Types of changes

Checklist:

BrianJKoopman commented Oct 4, 2024

BrianJKoopman commented Oct 7, 2024