Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sqlite] Do not run sqlite when autostop is imported to avoid database locked #1576

Merged
merged 6 commits into from
Jan 11, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Jan 8, 2023

A user experienced the database locked problem for her spot jobs again #1509 (detailed logs)

The problem is caused by the sqlite command is called whenever the autostop_lib module is imported, even though it is not used, creating a lot of concurrent sqlite commands.

To mitigate the problem, this PR ensures the commands are only called when needed. Also, we changed to using cursor to run the command for setting WAL mode instead of connection.

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
    • sky launch --use-spot --cloud gcp -i 1 echo hi and check the ~/.sky/skylet.log
    • sky status -r
    • sky spot launch -n test-autostop --cloud gcp "echo hi; sleep 1800; echo bye" and check sky logs sky-spot-controller... no database locked problem.
  • User's own workload.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: global_user_state also issues a statement on import; does it have the same problem (can defer)?

CREATE TABLE IF NOT EXISTS config (
key TEXT PRIMARY KEY,
value TEXT)""")
_table_created = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this have a race?

Copy link
Collaborator Author

@Michaelvll Michaelvll Jan 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is only a weak guard to reduce the overhead for calling these commands every time the function in configs.py is called in the same SkyPilot process. The main effective part for avoid concurrent sqlite commands call is to avoid running the commands every time the configs.py is imported.

For the race condition, I don't think it matters much, because

  1. This variable is a weak guard and does shared across different SkyPilot processes launched by different CLIs.
  2. We will not switch it back to False, so even if there are multiple threads calling functions in configs.py, only the first few calls will run the sqlite commands together, which might be fine.

@@ -34,7 +34,7 @@
def create_table(cursor, conn):
# Enable WAL mode to avoid locking issues.
# See: issue #1441 and PR #1509
conn.execute('PRAGMA journal_mode=WAL')
cursor.execute('PRAGMA journal_mode=WAL')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason in PR description seems truncated:

Also, we changed to using cursor to run the command for setting WAL mode instead of connection as we
As our autostop_lib

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, just updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to make the usage of conn and cursor consistent and avoid an unnecessary commit the conn.execute implies.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll. Some questions:

  1. [sqlite] Do not run sqlite when autostop is imported to avoid database locked #1576 (review)

  2. Do we know what concurrent processes are causing this issue? Is it because on the controller VM, there are many controller processes (one for each spot job), and each controller process periodically checks

(worker-230107-map2-25, pid=4090) I 01-08 18:49:45 spot_utils.py:64] === Checking the job status... ===

which imports this module?

Looking at https://github.com/skypilot-org/skypilot/blob/master/sky/skylet/job_lib.py#L619 it's from sky.skylet import job_lib, log_lib, not the same as the gist indicated:


(worker-230107-map2-25, pid=4090) I 01-08 18:49:45 spot_utils.py:64] === Checking the job status... ===
(worker-230107-map2-25, pid=4090) E 01-08 18:49:57 subprocess_utils.py:70] Traceback (most recent call last):
(worker-230107-map2-25, pid=4090) E 01-08 18:49:57 subprocess_utils.py:70]   File "<string>", line 1, in <module>
(worker-230107-map2-25, pid=4090) E 01-08 18:49:57 subprocess_utils.py:70]   File "/home/qzeng_salk_edu/mambaforge/lib/python3.8/site-packages/sky/__init__.py", line 10, in <module>
(worker-230107-map2-25, pid=4090) E 01-08 18:49:57 subprocess_utils.py:70]     from sky import backends

Maybe from sky.skylet import job_lib, log_lib triggers a sky import?

Please document the above in the relevant places in code ;)

@Michaelvll
Copy link
Collaborator Author

Q: global_user_state also issues a statement on import; does it have the same problem (can defer)?

Yes, it should be adopted to global_user_state as well, but seems the problem happens for the autostop_lib only now. Let us defer it to the future PR.

Do we know what concurrent processes are causing this issue? Is it because on the controller VM, there are many controller processes (one for each spot job), and each controller process periodically checks

Good question. I forgot to include it in the description. The problem happens when the controller process is trying to check the job status on the spot cluster, and the locked problem occurs on the remote spot cluster instead of the controller VM. Since we are importing the job_lib and log_lib with from sky.skylet import job_lib, log_lib, python will try to import sky.__init__ first, which imports the backends.cloud_vm_ray_backends where we import autostop_lib and configs. Just added the comment. PTAL

statuses = backend.get_job_status(handle, stream_logs=False)

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll!

@Michaelvll Michaelvll merged commit c927eb5 into master Jan 11, 2023
@Michaelvll Michaelvll deleted the sqlite-autostop branch January 11, 2023 03:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants