Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(update-server): Avoid bricking when given a bad name #10219

Merged
merged 3 commits into from
May 12, 2022

Conversation

SyntaxColoring
Copy link
Contributor

@SyntaxColoring SyntaxColoring commented May 5, 2022

Overview

This PR fixes bugs related to the update server's rename functionality.

Changelog

Refactors

Primarily:

  • Lots of renames and docstring rewrites. These are mostly aimed at teasing apart the different ideas of "the robot's name" more explicitly.
  • Make sure that getters only get, setters only set, and each thing only deals with a single explicit notion of "the robot's name" at once. This is related to the bug below, which was buried behind a lot of this confusion.

Fix #9960

At a high level, the shape of this bug was like this:

  1. An HTTP client sends a poisonous name to the update server.
  2. The update server stores the name in a file.
  3. The update server informs a different system process of the new name. The system process decides the new name is bad, rejects it, and raises an exception. This causes the HTTP request to come back with a 500 error.
  4. Later, the robot is rebooted. The update server starts initializing.
  5. As part of its initialization, the update server figures out what the robot's name is by reading the file. Then, as before, it, informs the other system process of whatever the name is. But the file still contains the bad name from before. So the other system process raises an exception again—only this time, instead of merely causing an HTTP 500 error, the exception propagates out of the update server's initialization function and prevents the server from starting.

See #9960 (comment) for more details.

The solution in this PR is twofold:

  • When the update server starts up, tolerate exceptions from the other system process instead of letting them take down the whole server.
  • When renaming the robot, only store the new name in the file after the other system process accepts it, so we know it's usable.

Fix #10198

Quick fix. When a request isn't valid JSON, we now handle it gracefully by returning an HTTP 400 error with a descriptive message.

Review requests

  • Do the new comments, function names, and docstrings make sense?
  • Are there any immediate things we can do to improve test coverage? This is tricky because we don't have Decoy in this project, and things aren't really set up for the kind of testing that we're used to in robot-server. I'm sure we could make it set up for testability, but refactors have their own risks.

Testing

A couple parts of this unfortunately must be tested on a real robot, since they interact with several system processes.

Before beginning testing, I strongly recommend making sure you have SSH access set up on your OT-2, in case you have to recover from a brick.

Testing that robot renaming still works, in general

Send a name to the robot.

POST /server/name
{"name": "new name here"}

Examples of names that should work:

  • otie
  • Hello World
  • 👉😎👉

Make sure the new name shows up in the Opentrons App. On macOS, you can also make sure it's correctly advertising itself over mDNS + DNS-SD by monitoring dns-sd -B. You should see the exact name that you sent.

Testing that this solved #9960

  1. SSH in.
  2. Send a formerly "poisonous" name to the robot over HTTP.
    • Names that I believe were "poisonous":
      • 1234567890123456789012345678901234567890123456789012345678901234567890 (70 digits, 70 bytes)
      • 名名名名名名名名名名名名名名名名名名名名名名名名名名名名名名 (30 kanji, 90 bytes when encoded as UTF-8)
      • Empty string, 0 bytes
    • Known issues:
      • The robot will stop advertising itself until you reboot it or restart the update server.
      • You'll get a 500 error back, but ideally this should maybe be some kind of 4xx error.
  3. Over SSH, restart the update server via systemctl restart opentrons-update-server.
  4. Make sure the server still initializes. (Check systemctl status opentrons-update-server or journalctl --boot.)
  5. Make sure the OT-2 is still discoverable, under its old name, through the Opentrons App and dns-sd -B.

Risk assessment

High. This code is difficult to meaningfully cover by tests because it depends heavily on interaction with external system stuff. The highest-risk code is the stuff in __main__.py and __init.py__. If that goes wrong somehow, it will prevent the update server from starting up, and leave a robot bricked except by heavy intervention.

@SyntaxColoring SyntaxColoring added fix PR fixes a bug update server robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). labels May 5, 2022
@codecov
Copy link

codecov bot commented May 5, 2022

Codecov Report

Merging #10219 (6f30500) into edge (4701699) will increase coverage by 0.00%.
The diff coverage is 48.97%.

Impacted file tree graph

@@           Coverage Diff           @@
##             edge   #10219   +/-   ##
=======================================
  Coverage   74.72%   74.72%           
=======================================
  Files        2111     2111           
  Lines       55691    55702   +11     
  Branches     5635     5635           
=======================================
+ Hits        41616    41626   +10     
- Misses      12915    12916    +1     
  Partials     1160     1160           
Flag Coverage Δ
app 71.49% <ø> (ø)
components 52.82% <ø> (ø)
labware-library 49.74% <ø> (ø)
protocol-designer 45.13% <ø> (ø)
react-api-client 82.48% <ø> (ø)
step-generation 86.49% <ø> (ø)
update-server 72.81% <48.97%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
update-server/otupdate/buildroot/__main__.py 0.00% <0.00%> (ø)
update-server/otupdate/common/name_management.py 35.38% <59.45%> (+5.87%) ⬆️
update-server/otupdate/buildroot/__init__.py 87.17% <100.00%> (ø)
update-server/otupdate/openembedded/__init__.py 85.41% <100.00%> (ø)

@SyntaxColoring SyntaxColoring marked this pull request as ready for review May 6, 2022 19:38
@SyntaxColoring SyntaxColoring requested a review from a team as a code owner May 6, 2022 19:38
@SyntaxColoring SyntaxColoring requested review from sanni-t and removed request for a team May 6, 2022 19:38
Copy link
Member

@sfoster1 sfoster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me behavior wise and is well commented, nice.

if the pretty hostname could not be written.
"""
try:
# We can't run `hostnamectl --pretty <name>` to write this for us
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jfyi - this is to do with the way these files are stored, as I recall. The inode we get when we open("/etc/machine-info") in this program is stored in /var/ and bind-mounted onto /etc/machine-info. That bind happens after systemd somehow caches what you get when you open /etc/machine-info at boot time before the bind-mount, which is read-only. This is why we use this style here, and I think I went with using the file directly for consistency in other places also.

We definitely could test this more by slightly restructuring or mocking to use various weird /etc/machine-info files captured from robots.

Copy link
Contributor Author

@SyntaxColoring SyntaxColoring May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh thanks, that’s super helpful. Maybe there’s a way to get it working by adjusting when systemd-hostnamed starts, so it caches the correct thing. I’ll play around with it.

Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a couple approaches to working around this today, but I didn't get anywhere.

  • I tried doing systemctl restart systemd-hostnamed before doing hostnamectl --pretty in hopes that it would pick up our file in /var instead of the underlying file in /etc, but that didn't change anything.
  • I tried editing /usr/lib/systemd/system/systemd-hostnamed.service to expand its ReadWritePaths from /etc to /etc /etc/machine-info /var/machine-info, but that didn't change anything.

I'm not sure this has to do with caching. I think it might have to do with the way systemd-hostnamed tries to overwrite the file. I suspect it tries to atomically place a new file at the /etc/machine-info path instead of modifying the file that's already there, and I guess that doesn't work with bind mounts? Here's a Stack Exchange question I just asked about part of it. 🤷

Resolve conflicts in:
* update-server/tests/common/test_name_management.py
* update-server/otupdate/openembedded/__init__.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix PR fixes a bug robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). update server
Projects
None yet
2 participants