Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(api,update-server): Synchronize robot name more tightly between between api and update-server #11175

Merged
merged 7 commits into from
Jul 26, 2022

Conversation

SyntaxColoring
Copy link
Contributor

@SyntaxColoring SyntaxColoring commented Jul 19, 2022

Overview

This PR fixes bugs that could cause a robot's name to be different depending on whether you were looking at GET /server/update/health or GET /health.

Closes #10413.

Changelog

  • In update-server, when retrieving the robot's pretty hostname, do not manually parse it from /etc/machine-info. Instead, shell out to hostnamectl --pretty status. This is to match how api has been doing it.
  • When parsing the output of hostnamectl --pretty status, do not unnecessarily strip leading and trailing whitespace.
    • Stripping whitespace might be reasonable, but I don't think this is the place to do it, because it makes it harder to keep the robot's name in sync across the various places where it's visible.
  • In update-server, whenever we rewrite /etc/machine-info to store a new name, also restart systemd-hostnamed.
    • This is necessary for the hostnamectl --pretty status calls in update-server and api to reliably pick up the new value.
  • Do not fall back to the name "opentrons-develop" if shelling out to hostnamectl fails. Instead, just let the endpoint return 500.
    • This interacted poorly with the systemd-hostnamed restart described above, and it seemed unused.

Review requests

  • Confirm that you can still rename the OT-2 through the Opentrons App.
  • Confirm that this fixes bug: GET /health and GET /server/health disagree on the robot's name #10413.
  • Do we expect the removal of the fallback name "opentrons-develop" to cause any problems on dev machines?
    • Tested by @SyntaxColoring with make -C robot-server test and make -C robot-server dev. I'm pretty sure we're okay.
  • GET /health, GET /server/update/health, and GET /server/name may now temporarily return a 500 error if they're called within a few seconds of a rename. This is due to implementation limitations that I don't think are feasibly avoidable right now. See the comments in the code. Do we expect this to cause any robot discovery problems?
    • Discussed briefly in-person with @shlokamin. Off the top of his head, he thinks it should be fine.
    • Discussed with @mcous in the thread below.

Risk assessment

Medium.

There’s a risk that the temporary 500 after a name change will confuse discovery-client and screw up robot connectivity.

There's also a risk that something not covered by CI is relying on the fallback name opentrons-develop. In that case, robot-server's GET /health call will start failing with a 500 error.

@SyntaxColoring SyntaxColoring added api Affects the `api` project update server python General Python tickets and PRs; e.g. tech debt or Python guild activities robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). and removed python General Python tickets and PRs; e.g. tech debt or Python guild activities labels Jul 19, 2022
@codecov
Copy link

codecov bot commented Jul 19, 2022

Codecov Report

Merging #11175 (d1ba8ae) into edge (bb861c5) will increase coverage by 0.00%.
The diff coverage is 48.64%.

Impacted file tree graph

@@           Coverage Diff           @@
##             edge   #11175   +/-   ##
=======================================
  Coverage   73.81%   73.81%           
=======================================
  Files        2086     2086           
  Lines       57722    57725    +3     
  Branches     5855     5855           
=======================================
+ Hits        42609    42612    +3     
  Misses      13825    13825           
  Partials     1288     1288           
Flag Coverage Δ
notify-server 89.17% <ø> (ø)
update-server 73.11% <44.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
update-server/otupdate/buildroot/__main__.py 0.00% <0.00%> (ø)
update-server/otupdate/common/control.py 92.85% <ø> (ø)
...server/otupdate/common/name_management/__init__.py 100.00% <ø> (ø)
update-server/otupdate/openembedded/__main__.py 0.00% <0.00%> (ø)
...otupdate/common/name_management/pretty_hostname.py 34.14% <29.41%> (+5.19%) ⬆️
api/src/opentrons/config/__init__.py 58.08% <58.33%> (ø)
update-server/otupdate/buildroot/__init__.py 88.88% <100.00%> (ø)
...update/common/name_management/name_synchronizer.py 97.50% <100.00%> (ø)
update-server/otupdate/openembedded/__init__.py 86.66% <100.00%> (ø)

@SyntaxColoring SyntaxColoring marked this pull request as ready for review July 19, 2022 21:43
@SyntaxColoring SyntaxColoring requested a review from a team as a code owner July 19, 2022 21:43
# Strip the trailing newline, since it's not part of the actual name value.
# TODO(mm, 2022-07-18): When we upgrade to systemd 249, use
# `hostnamectl --json` for CLI output that we can parse more robustly.
assert len(result) >= 1 and result[-1] == "\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to assert that the last character is a new line? can we just trim white spaces?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, because as far as we're concerned in this part of the code, these are valid and distinct names:

  • "foo"
  • " foo"
  • "foo "

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these valid names?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of!

There are two components to whether or not a string is valid here:

  1. Input restrictions and output guarantees made by the 3rd-party commands hostnamectl set-hostname --pretty and hostnamectl status --pretty.
  2. Input restrictions and output guarantees that we add ourselves, in our own Python code.

But (1) is not well documented by the 3rd-party tools. And (2) happens in update-server, far away from this code. So I think the most maintainable way to write this code is for it to assume that those strings can be valid, and to avoid chopping them up unnecessarily.

Copy link
Contributor

@TamarZanzouri TamarZanzouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me! added a few remarks but other than that I feel good about approving this. Should I test this on a robot?

@SyntaxColoring
Copy link
Contributor Author

SyntaxColoring commented Jul 22, 2022

Testing this with the Opentrons App and my robot connected over Wi-Fi, I'm noticing weirdness where, after rename, the robot will disappear from the devices page and never reappear. I'm trying to discern whether that's purely the existing issue of #11101, or whether this PR somehow makes it worse.

Edit: Based on further testing, I don't believe this PR somehow makes it worse. I think I was just seeing #11101.

@mcous
Copy link
Contributor

mcous commented Jul 25, 2022

GET /health, GET /server/update/health, and GET /server/name may now temporarily return a 500 error if they're called within a few seconds of a rename. This is due to implementation limitations that I don't think are feasibly avoidable right now. See the comments in the code. Do we expect this to cause any robot discovery problems?

If a Discovery Client poll happens to line up with this period of inaccessibility, the robot will be marked as "unconnectable" (as opposed to "connectable" or "unreachable") while polls return 500.

Given that this will resolve itself, and that the rename itself will look like a brand new robot to the DC, this feels acceptable to me, but will likely require closer scrutiny once we track robots more reliably

@SyntaxColoring SyntaxColoring merged commit f3ac4d9 into edge Jul 26, 2022
@SyntaxColoring SyntaxColoring deleted the robot_name_keep_api_and_update_server_in_sync branch July 26, 2022 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Affects the `api` project robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). update server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: GET /health and GET /server/health disagree on the robot's name
3 participants