Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sqlite] Mitigate the database locked problem for skylet config #1509

Merged
merged 3 commits into from
Dec 12, 2022

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Dec 10, 2022

Describe the changes in this PR:

This PR is to mitigate the problem in #1507

Enable the WAL mode for the database we are using for autostop to eliminate the undesired locking.

After this PR, the snippet in #1507 fails 1/30, while it fails 30/30 when using the current master branch.

import time
from multiprocessing.pool import Pool

def import_skylet_config(i):
    from sky.skylet import configs
    print('Process', i)
    configs.set_config('autostop_last_active_time', str(time.time()))

if __name__ == '__main__':
    cnt = 0
    total = 30
    for i in range(total):
        try:
            with Pool(100) as p:
                p.map(import_skylet_config, range(10000))
        except Exception as e:
            print(e)
            cnt += 1
    print('Failed', cnt, 'times out of', total)

Future TODO:

  • Adopt the WAL mode to all the database we are using?

Tested (run the relevant ones):

@Michaelvll Michaelvll marked this pull request as ready for review December 10, 2022 21:07
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! There may be some perf implications of using WAL but we can observe.

From user report this line errors out:

from sky.skylet import configs

Is it possible to also revise the repro to test the above statement?

@concretevitamin
Copy link
Member

BTW: Do we know why #1458's snippet using bash doesn't repro this db-is-locked error, but the snippet using mp.Pool() does?

@Michaelvll
Copy link
Collaborator Author

Michaelvll commented Dec 11, 2022

There may be some perf implications of using WAL but we can observe.

According to the document of sqlite, there will be 1%-2% performance degradation, which sounds ok to me, if it fix the locking problem for us.
"WAL might be very slightly slower (perhaps 1% or 2% slower) than the traditional rollback-journal approach in applications that do mostly reads and seldom write."
https://www.sqlite.org/wal.html

Is it possible to also revise the repro to test the above statement?

I tried to modify the reproduce snippets to get that error, but failed. We may need still keep an eye on the problem.

BTW: Do we know why #1458 snippet using bash doesn't repro this db-is-locked error, but the snippet using mp.Pool() does?

I adds the following snippets from #1458 into the multiple process, and the database locked problem happens for the master branch. I think one reason might be that we are testing 100 times for the 100 processes, which have better chance to have two writes happens at the same time?

    from sky.skylet.autostop_lib import set_last_active_time_to_now
    set_last_active_time_to_now()

I would like to have this PR in first as it can at least mitigate the locking problem. We can keep an eye on the locking issue with the import configs. Wdyt?

@concretevitamin
Copy link
Member

Sounds good! Thanks for the details. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants