Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(robot-server,system-server): Make SQL transactions behave sanely #13424

Merged
merged 2 commits into from
Aug 30, 2023

Conversation

SyntaxColoring
Copy link
Contributor

@SyntaxColoring SyntaxColoring commented Aug 29, 2023

Overview

This fixes a bug (unticketed) where SQL transactions would not actually be transactional.👌

One effect of this was that if the process were killed in the middle of a long database migration, it would leave the database in a broken half-migrated state.

Technical details

Between SQLAlchemy and the underlying SQLite database, there is a middle layer: Python's built-in sqlite3 driver, aka pysqlite. According to this note from the SQLAlchemy docs, there are known issues with it:

In its default mode of operation, SQLite features such as SERIALIZABLE isolation, transactional DDL, and SAVEPOINT support are non-functional, and in order to use these features, workarounds must be taken.

SERIALIZABLE isolation and transactional DDL are especially important for us. SERIALIZABLE isolation is just the sane transaction behavior that everybody intuitively expects. Transactional DDL is the ability to do database migrations as one long transaction, rolling back the whole thing in case any part of it fails.

Fortunately, the good folks at SQLAlchemy have provided some magic code snippets to fix it the problems. So:

  1. Use the magic code.
  2. Share that code between robot-server and system-server. We do this through a new module in server_utils: server_utils.sql_utils.

Test Plan

The original bug is annoying to trigger artificially. I've reproduced it in another PR by:

  1. Adding a new migration that adds a new column with an ALTER TABLE statement
  2. Inserting a time.sleep(120) right after that ALTER TABLE statement
  3. Running make -C robot-server dev OT_ROBOT_SERVER_persistence_directory=tests/integration/persistence_snapshots/v6.2.0_large/
  4. Running pkill make in another terminal
  5. Opening the .db file in the sqlite3 CLI and noticing that the new column was there, even though the migration was left incomplete

I've confirmed that the bug doesn't happen with this fix applied.

This is our second SQLite fixup—the first is that we have to go out of our way to enable foreign key constraints. I think these have grown to the point where they need their own tests. I’m imagining we make some temporary databases with dummy schemas, apply these fixup functions to them, and test that when we access the tables, they behave the way we want. I've omitted those tests from this PR for expedience, but I'm happy to do them in a follow-up after some other high-priority stuff. RSS-331

Review requests

Are these functions documented sufficiently so it's clear how to use them and why, even if it's not clear how they work at a low level?

Risk assessment

There's low risk that this will break anything, but there is some risk that it won't work properly to fix the bug, given how non-obvious it is. See my note in the test plan about adding unit tests later.

@SyntaxColoring SyntaxColoring requested review from fsinapi and a team August 29, 2023 22:28
@codecov
Copy link

codecov bot commented Aug 29, 2023

Codecov Report

Merging #13424 (efa80c9) into chore_release-7.0.0 (9d9e497) will increase coverage by 0.12%.
Report is 6 commits behind head on chore_release-7.0.0.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@                   Coverage Diff                   @@
##           chore_release-7.0.0   #13424      +/-   ##
=======================================================
+ Coverage                71.56%   71.68%   +0.12%     
=======================================================
  Files                     2430     2427       -3     
  Lines                    67751    67569     -182     
  Branches                  7846     7783      -63     
=======================================================
- Hits                     48486    48437      -49     
+ Misses                   17426    17306     -120     
+ Partials                  1839     1826      -13     
Flag Coverage Δ
system-server 96.02% <100.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
...ystem-server/system_server/persistence/database.py 82.35% <100.00%> (-4.02%) ⬇️

... and 38 files with indirect coverage changes

@SyntaxColoring SyntaxColoring marked this pull request as ready for review August 30, 2023 00:58
@SyntaxColoring SyntaxColoring requested a review from a team as a code owner August 30, 2023 00:58
Copy link
Member

@sfoster1 sfoster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lmao gross. Looks like a good fix to me

Copy link
Contributor

@fsinapi fsinapi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice

@SyntaxColoring SyntaxColoring merged commit 89c0a4f into chore_release-7.0.0 Aug 30, 2023
@SyntaxColoring SyntaxColoring deleted the acid_burn branch August 30, 2023 14:14
@SyntaxColoring
Copy link
Contributor Author

SyntaxColoring commented Aug 30, 2023

Discussed on a call with the RSS team, got 👍s.

TamarZanzouri pushed a commit that referenced this pull request Sep 13, 2023
…13424)

* Use SQLAlchemy's workarounds to fix transactions.

* Deduplicate with system-server.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants