-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sqlite] Do not run sqlite when autostop is imported to avoid database locked #1576
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: global_user_state
also issues a statement on import; does it have the same problem (can defer)?
CREATE TABLE IF NOT EXISTS config ( | ||
key TEXT PRIMARY KEY, | ||
value TEXT)""") | ||
_table_created = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this have a race?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is only a weak guard to reduce the overhead for calling these commands every time the function in configs.py
is called in the same SkyPilot process. The main effective part for avoid concurrent sqlite commands call is to avoid running the commands every time the configs.py
is imported.
For the race condition, I don't think it matters much, because
- This variable is a weak guard and does shared across different SkyPilot processes launched by different CLIs.
- We will not switch it back to False, so even if there are multiple threads calling functions in
configs.py
, only the first few calls will run the sqlite commands together, which might be fine.
sky/global_user_state.py
Outdated
@@ -34,7 +34,7 @@ | |||
def create_table(cursor, conn): | |||
# Enable WAL mode to avoid locking issues. | |||
# See: issue #1441 and PR #1509 | |||
conn.execute('PRAGMA journal_mode=WAL') | |||
cursor.execute('PRAGMA journal_mode=WAL') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reason in PR description seems truncated:
Also, we changed to using cursor to run the command for setting WAL mode instead of connection as we
As our autostop_lib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, just updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is to make the usage of conn
and cursor
consistent and avoid an unnecessary commit
the conn.execute
implies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll. Some questions:
-
[sqlite] Do not run sqlite when autostop is imported to avoid database locked #1576 (review)
-
Do we know what concurrent processes are causing this issue? Is it because on the controller VM, there are many controller processes (one for each spot job), and each controller process periodically checks
(worker-230107-map2-25, pid=4090) I 01-08 18:49:45 spot_utils.py:64] === Checking the job status... ===
which imports this module?
Looking at https://github.com/skypilot-org/skypilot/blob/master/sky/skylet/job_lib.py#L619 it's from sky.skylet import job_lib, log_lib
, not the same as the gist indicated:
(worker-230107-map2-25, pid=4090) I 01-08 18:49:45 spot_utils.py:64] === Checking the job status... ===
(worker-230107-map2-25, pid=4090) E 01-08 18:49:57 subprocess_utils.py:70] Traceback (most recent call last):
(worker-230107-map2-25, pid=4090) E 01-08 18:49:57 subprocess_utils.py:70] File "<string>", line 1, in <module>
(worker-230107-map2-25, pid=4090) E 01-08 18:49:57 subprocess_utils.py:70] File "/home/qzeng_salk_edu/mambaforge/lib/python3.8/site-packages/sky/__init__.py", line 10, in <module>
(worker-230107-map2-25, pid=4090) E 01-08 18:49:57 subprocess_utils.py:70] from sky import backends
Maybe from sky.skylet import job_lib, log_lib
triggers a sky
import?
Please document the above in the relevant places in code ;)
Yes, it should be adopted to
Good question. I forgot to include it in the description. The problem happens when the controller process is trying to check the job status on the spot cluster, and the locked problem occurs on the remote spot cluster instead of the controller VM. Since we are importing the skypilot/sky/spot/spot_utils.py Line 65 in 524aa6b
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll!
A user experienced the database locked problem for her spot jobs again #1509 (detailed logs)
The problem is caused by the sqlite command is called whenever the
autostop_lib
module is imported, even though it is not used, creating a lot of concurrent sqlite commands.To mitigate the problem, this PR ensures the commands are only called when needed. Also, we changed to using
cursor
to run the command for setting WAL mode instead ofconnection
.Tested (run the relevant ones):
sky launch --use-spot --cloud gcp -i 1 echo hi
and check the~/.sky/skylet.log
sky status -r
sky spot launch -n test-autostop --cloud gcp "echo hi; sleep 1800; echo bye"
and checksky logs sky-spot-controller...
no database locked problem.