feat(hardware): Hardware error codes #13009

Merged
merged 20 commits into from
Jul 10, 2023

Member

@sfoster1 sfoster1 commented Jun 28, 2023

Now that we can express error codes in all our client interfaces, let's start raising them.

This PR makes the opentrons_hardware module use the new exceptions that carry the new error codes. There are a couple of new error codes, but mostly all this does is raise different exceptions in the same places.

The two exceptions are in the CAN messenger and the MoveGroupRunner.

CAN Messenger

The CAN messenger has an asyncio task that reads messages from the bus and passes them to registered listeners. In addition, if a message was an error message, the task would raise an exception. But because it's a background task, that exception doesn't go anywhere. #12963 changed the exception to be logged and swallowed, but then why does it exist at all? So we removed it.
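
To illustrate why an exception raised inside a read task "doesn't go anywhere", here is a minimal sketch in plain asyncio. The names FirmwareError, read_task, and demo are hypothetical and not taken from opentrons_hardware; the only point is that an exception raised in a background task is stored on the task object and never reaches the code that started it.

```python
import asyncio
from typing import Callable, List


class FirmwareError(Exception):
    """Hypothetical stand-in for an error reported over the bus."""


async def read_task(messages: List[str], listeners: List[Callable[[str], None]]) -> None:
    # Hypothetical read loop: forward each message, raise on an error message.
    for message in messages:
        if message == "error":
            raise FirmwareError("error message seen on bus")
        for listener in listeners:
            listener(message)


async def demo():
    received: List[str] = []
    task = asyncio.create_task(read_task(["ack", "error", "ack"], [received.append]))
    caller_saw_exception = False
    try:
        # asyncio.wait waits for the task to finish without re-raising its exception.
        await asyncio.wait([task])
    except FirmwareError:
        caller_saw_exception = True
    # The exception is only visible if someone asks the task for it.
    stored = task.exception()
    return received, caller_saw_exception, stored


received, caller_saw_exception, stored = asyncio.run(demo())
```

The task dies at the error message, the caller is never interrupted, and the messages after the error are never delivered.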

MoveGroupRunner

A similar issue happens in MoveGroupRunner, where a background task accumulates responses and then hands them over to the caller. This works the same now, but we can also move the stop-condition checking for moves inside that loop. We can also now properly handle having multiple errors in a move group!
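
As a rough sketch of the "multiple errors in a move group" idea: errors are collected while responses come in, and a single exception is raised once at the end. SketchError and ErrorCollector are hypothetical names; only the `wrapping=` keyword and the message text mirror the diff quoted in the review below. Re-raising a lone error directly is an assumption, not something this PR confirms.

```python
from typing import List, Optional, Sequence


class SketchError(Exception):
    """Hypothetical error type that can wrap several underlying errors."""

    def __init__(self, message: str, wrapping: Optional[Sequence[Exception]] = None) -> None:
        super().__init__(message)
        self.wrapping = list(wrapping or [])


class ErrorCollector:
    """Hypothetical accumulator standing in for the runner's error list."""

    def __init__(self) -> None:
        self._errors: List[Exception] = []

    def record(self, error: Exception) -> None:
        self._errors.append(error)

    def raise_if_failed(self) -> None:
        if not self._errors:
            return
        if len(self._errors) > 1:
            raise SketchError("Motion failed with multiple errors", wrapping=self._errors)
        # Assumption: a single error is raised as-is.
        raise self._errors[0]


def run_and_capture(errors: Sequence[Exception]) -> Optional[Exception]:
    """Record the given errors, then return whatever gets raised (or None)."""
    collector = ErrorCollector()
    for error in errors:
        collector.record(error)
    try:
        collector.raise_if_failed()
    except Exception as raised:
        return raised
    return None
```

With this shape the caller always sees at most one exception, and the individual failures stay reachable through the `wrapping` attribute.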

Note: You need to run make setup-py again, since hardware now depends on shared-data.

Review Requests

  • Give a skim over most stuff
  • Look more in-depth at MoveGroupRunner and at CANMessenger

Testing

  • Cause some errors and make sure it doesn't break in new ways.

@sfoster1 sfoster1 requested review from a team as code owners June 28, 2023 20:58

codecov bot commented Jun 28, 2023

Codecov Report

Merging #13009 (38741fb) into edge (ce24501) will increase coverage by 0.14%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##             edge   #13009      +/-   ##
==========================================
+ Coverage   72.51%   72.66%   +0.14%     
==========================================
  Files        2377     1526     -851     
  Lines       65495    49826   -15669     
  Branches     7269     3099    -4170     
==========================================
- Hits        47495    36205   -11290     
+ Misses      16269    13120    -3149     
+ Partials     1731      501    -1230     
Flag               Coverage Δ
app                43.55% <ø> (-27.75%) ⬇️
protocol-designer  45.84% <ø> (ø)
shared-data        77.34% <75.00%> (+0.93%) ⬆️

Impacted Files                                         Coverage Δ
api/src/opentrons/hardware_control/ot3api.py           80.10% <ø> (ø)
...pentrons_hardware/drivers/binary_usb/bin_serial.py  63.52% <ø> (-1.31%) ⬇️
...ns_hardware/drivers/binary_usb/binary_messenger.py  82.40% <ø> (+6.87%) ⬆️
...pentrons_hardware/drivers/can_bus/can_messenger.py  88.29% <ø> (+3.36%) ⬆️
...dware/opentrons_hardware/drivers/can_bus/driver.py  80.43% <ø> (-0.82%) ⬇️
...are/opentrons_hardware/drivers/can_bus/settings.py  100.00% <ø> (+4.28%) ⬆️
...pentrons_hardware/drivers/can_bus/socket_driver.py  48.57% <ø> (+1.20%) ⬆️
...ns_hardware/firmware_bindings/messages/payloads.py  96.09% <ø> (+0.66%) ⬆️
...are/firmware_bindings/utils/binary_serializable.py  98.64% <ø> (+2.26%) ⬆️
...e/opentrons_hardware/firmware_update/downloader.py  100.00% <ø> (ø)
... and 18 more

... and 863 files with indirect coverage changes

@sfoster1 sfoster1 force-pushed the hardware-error-codes branch from bee9789 to 0fb2e36 Compare June 28, 2023 21:02
@sfoster1 sfoster1 requested a review from a team June 29, 2023 15:30
Contributor

@ahiuchingau ahiuchingau left a comment

Some general questions, and I think we need to raise an extra TimeoutError in the MoveScheduler

@@ -174,7 +176,14 @@ async def capacitive_probe(
         messenger,
     )
     if not threshold:
-        raise RuntimeError("Could not set threshold for probe")
+        raise CanbusCommunicationError(
Contributor

Just a general question: We are raising the CommandTimedOutError in the send threshold listener if the sensor scheduler didn't get a response back. Are we expecting to see both the CommandTimedOutError and CanbusCommunicationError if the listener does time out?

Member Author

No, you'd only see the timeout error since we're not catching and ignoring it. I think the only time you'd see this getting raised is if the microcontroller responded with threshold=0
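
The control flow described here can be sketched as follows. Everything in it is hypothetical: asyncio.TimeoutError stands in for CommandTimedOutError, ThresholdNotSetError stands in for CanbusCommunicationError, and send_threshold only simulates a device response. The point is that a timeout raised while waiting propagates straight out of the caller, so the later `if not threshold` check never runs for that call.

```python
import asyncio
from typing import List, Union


class ThresholdNotSetError(Exception):
    """Hypothetical stand-in for the error raised when the reported threshold is 0."""


async def send_threshold(response_delay: float, reported: int, timeout: float) -> int:
    # Simulated listener: the "device" answers after response_delay seconds.
    async def wait_for_response() -> int:
        await asyncio.sleep(response_delay)
        return reported

    # Raises asyncio.TimeoutError if no answer arrives in time.
    return await asyncio.wait_for(wait_for_response(), timeout)


async def probe(response_delay: float, reported: int) -> int:
    threshold = await send_threshold(response_delay, reported, timeout=0.05)
    # Only reached when a response actually arrived.
    if not threshold:
        raise ThresholdNotSetError("Could not set threshold for probe")
    return threshold


async def outcome(response_delay: float, reported: int) -> Union[int, str]:
    try:
        return await probe(response_delay, reported)
    except asyncio.TimeoutError:
        return "timeout"
    except ThresholdNotSetError:
        return "threshold-not-set"


async def main() -> List[Union[int, str]]:
    return [
        await outcome(0.2, 5),  # no response in time
        await outcome(0.0, 0),  # response arrives, but threshold is 0
        await outcome(0.0, 5),  # normal case
    ]


results = asyncio.run(main())
```

Each call ends in exactly one of the two errors or in a value, never in both errors at once.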

raise MoveConditionNotMet
elif self._error is not None:
log.warning(f"Recoverable firmware error during {group_id}: {self._error}")
raise MoveConditionNotMetError(detail={"group-id": str(group_id)})
Contributor

I don't think we'll need to raise MoveConditionNotMetError there, as we have already appended it to self._errors in _handle_move_completed()

Member Author

Then this code won't run, right? Because the if self._errors branch will run? I think this clause might not necessarily have to exist, or at least it should be a weird generic catch-all, so maybe it should just be a general error

@@ -528,9 +548,6 @@ async def run(self, can_messenger: CanMessenger) -> _Completions:
log.warning(
f"Expected nodes in group {str(group_id)}: {str(self._get_nodes_in_move_group(group_id))}"
)
Contributor

This TimeoutError should be raised now (I got it in internal-release_0.13.0 but not edge)

Member Author

I think that I don't want to add that in this PR - I want your commit in its entirety. So I think that either we can merge this now as is and re-add it after yours merges (and, we're going to have a lot of fixups for errors added since 0.13.0 diverged from edge, so it won't be unique) or wait on this PR until 0.13.0 gets merged back.

if self._errors:
if len(self._errors) > 1:
raise MotionFailedError(
"Motion failed with multiple errors", wrapping=self._errors
Contributor

AMAZING

== MessageId.error_message
):
await self._handle_error(build)
handled = True
Contributor

Is it possible that we could get into a state where a listener accepts an error message but, for whatever reason, doesn't have an error handling method and hence doesn't raise it?

In that case, would this error message just be thrown away since the can messenger thinks it was handled by a listener?

Member Author

Yup. All I can really say is, like, "don't do that"
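
A toy model of the failure mode being discussed, under the assumption that the messenger only deals with an error message itself when no listener accepted it. The dispatch function, the (kind, payload) message tuples, and the listener pairs are all hypothetical and much simpler than the real CanMessenger.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (kind, payload), a hypothetical message shape
Listener = Tuple[Callable[[Message], bool], Callable[[Message], None]]  # (filter, callback)


def dispatch(message: Message, listeners: List[Listener], unhandled_errors: List[Message]) -> bool:
    """Deliver a message; record error messages that no listener accepted."""
    handled = False
    for accepts, callback in listeners:
        if accepts(message):
            callback(message)
            handled = True
    if message[0] == "error" and not handled:
        unhandled_errors.append(message)
    return handled


def accept_everything(message: Message) -> bool:
    return True


def ignore_errors(message: Message) -> None:
    # A careless listener: it accepts error messages but does nothing with them.
    if message[0] == "error":
        return


careless_dropped: List[Message] = []
careless_handled = dispatch(("error", "stall"), [(accept_everything, ignore_errors)], careless_dropped)

fallback_dropped: List[Message] = []
fallback_handled = dispatch(("error", "stall"), [], fallback_dropped)
```

In the first call the error counts as handled and vanishes; in the second the fallback path still sees it. That is the "don't do that" case: a listener whose filter lets error messages through has to do something with them.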

# Log this separately if it's some unknown error
log.exception(f"Unexpected error in CAN read task: {e}")
continue
try:
Contributor

What restarts this task now if we have an exception?

Member Author

Huh, good catch. I guess this PR was old enough that it missed those changes

Member Author

Should be fixed @vegano1
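
One way to keep a read loop alive without anything restarting it is the log-and-continue pattern visible in the diff fragment above. The sketch below is a synchronous simplification with hypothetical names (read_loop, decode, deliver); it is not the actual fix, just an illustration of the pattern.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("read_loop_sketch")


def read_loop(
    frames: Iterable[bytes],
    decode: Callable[[bytes], str],
    deliver: Callable[[str], None],
) -> int:
    """Process every frame; log and skip frames that fail to decode."""
    failures = 0
    for frame in frames:
        try:
            message = decode(frame)
        except Exception as e:
            # Log the unknown error and keep reading instead of letting the loop die.
            log.exception(f"Unexpected error in read loop: {e}")
            failures += 1
            continue
        deliver(message)
    return failures


delivered: List[str] = []
failures = read_loop(
    [b"ok", b"\xff", b"also ok"],
    lambda frame: frame.decode("utf-8"),  # b"\xff" is not valid UTF-8 and raises
    delivered.append,
)
```

The bad frame is logged and counted, and the frame after it is still delivered, so no supervisor is needed for this kind of failure.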

@@ -60,7 +68,15 @@ async def find_dfu_device(pid: str, expected_device_count: int) -> str:
         if stdout is None and stderr is None:
             continue
         if stderr:
-            raise RuntimeError(f"Error finding dfu device: {stderr.decode()}")
+            raise BootloaderNotReady(
+                USBTarget.rear_panel,
Contributor

Is this guaranteed to always be the rear panel?
Maybe so, since it's the only USB update target for now

Member Author

Right now yeah... I should really rework it to not do this though

Contributor

@vegano1 vegano1 left a comment

This is great!
Is there a follow-up PR on the robot-server side to handle any of these errors getting propagated?

Member Author

sfoster1 commented Jul 6, 2023

This is great! Is there a follow-up PR on the robot-server side to handle any of these errors getting propagated?

Nope, that one was already merged. All the structures to forward these errors should already be in place in robot server and protocol engine.

@sfoster1 sfoster1 force-pushed the hardware-error-codes branch from 7cd8a6d to dded2ad Compare July 6, 2023 18:29
@sfoster1 sfoster1 force-pushed the hardware-error-codes branch from 5be5b16 to 38741fb Compare July 10, 2023 18:24
@sfoster1 sfoster1 merged commit b411304 into edge Jul 10, 2023
@sfoster1 sfoster1 deleted the hardware-error-codes branch July 10, 2023 21:09
3 participants