Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle the "DatabaseError: database disk image is malformed" error #7628

Merged
merged 24 commits into from
Nov 23, 2023

Conversation

kozlovsky
Copy link
Contributor

@kozlovsky kozlovsky commented Oct 9, 2023

Fixes #7623, #7037, #5252, and #1993 (the latter is the pre-ORM version of the same problem)

This PR adds handling to the error "DatabaseError: database disk image is malformed". The error is handled in all databases that Tribler has: "metadata.db", "knowledge.db", "bandwidth.db", and "tags.db" (if it presents during the migration). The database "processes.sqlite" is already handled separately.

The main goal of the PR is to restore normal Tribler functionality by deleting the corrupted database file and allowing Tribler to create a new, fresh database file. Instead of deleting the file, we can rename it as an alternative way to handle the issue. Still, I'm not sure it is worth it, as it will take some valuable disk space, and not many users will try to restore information from the corrupted database file. It is possible to extract some data from the corrupted file, but I doubt it is worth it for us to provide such recovery tools.

What the PR does:

  1. It adds tracking of the corresponding error to most places where the error can arise
  2. In case of an error, the file "<db_filename>.is_corrupted" is created near the corresponding database file.
  3. The notification message box about the database corruption fixing is displayed to the user.
  4. The Tribler Core is automatically restarting under the hood, so the user does not need to restart Tibler.

@kozlovsky kozlovsky changed the title DatabaseError: database disk image is malformed Handle the "DatabaseError: database disk image is malformed" error Oct 9, 2023
@kozlovsky kozlovsky force-pushed the fix/database_is_malformed branch 5 times, most recently from dec4465 to 0526d7e Compare October 10, 2023 06:28
@kozlovsky kozlovsky marked this pull request as ready for review October 10, 2023 06:28
@kozlovsky kozlovsky requested review from a team and drew2a and removed request for a team October 10, 2023 06:28
@drew2a
Copy link
Contributor

drew2a commented Oct 10, 2023

The error is shown to the user in the usual way; it is possible to add a special handling of the error in the GUI.

Based on your description, this shouldn't be presented to the user in the typical manner. Instead, we should display an error to the user stating "Database is malformed" and provide additional details indicating that the database file will be renamed and removed later on. Furthermore, we should prevent these errors from being sent to us.

Copy link
Contributor

@drew2a drew2a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur with @synctext's suggestion that we should resolve the issue for the user without needing to restart Tribler.

When a database corruption is detected, we should display a dialog to the user explaining the situation and assuring them that we will automatically address the problem as soon as they press the 'OK' button.

@drew2a
Copy link
Contributor

drew2a commented Oct 10, 2023

Additionally, please link to any issues that will be resolved once this PR is merged.

@kozlovsky kozlovsky force-pushed the fix/database_is_malformed branch 7 times, most recently from fada7ac to 394d7f6 Compare October 16, 2023 12:55
@kozlovsky kozlovsky requested a review from drew2a October 16, 2023 12:58
@kozlovsky
Copy link
Contributor Author

Turns out the PR fix-db-corruption logic conflicted with the previous Upgrader logic, now I fixed it. Also, I linked the PR with the relevant issues

Copy link
Contributor

@drew2a drew2a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the GUI components and the tests.

I'll proceed with reviewing the core once the Upgrader has been completed.

src/tribler/gui/tribler_window.py Outdated Show resolved Hide resolved
src/tribler/gui/defs.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/tests/test_pony_utils.py Outdated Show resolved Hide resolved
@kozlovsky
Copy link
Contributor Author

I discovered that the initially implemented in this PR corruption-handled logic was incomplete. It was implemented by handling the error in specific places of Tribler code, like opening the database connection or calling the cursor.execute() method. To implement this, a complicated code was added to the TriblerDatabase, TriblerSQLiteProvider, and TriblerPool classes that override the standard PonyORM classes for SQLite providers.

As it turns out, the corruption error is unpredictable and can randomly happen in multiple other places, like cursor.fetchone(), cursor.__next__(), connection.commit(), etc. It is unfeasible to handle all these places in the Tribler code explicitly. For that reason, I completely reimplemented the corruption-handling logic.

Now, I added a new sqlite_utils module that can replace the native sqlite3 module completely. It provides improved versions of the Connection and Cursor classes that handle the database corruption error in all places where the error can arise.

Now, by using the sqlite_utils module instead of the base sqlite3 module, it becomes possible to remove the error-handing logic from many places of Tribler code and make the code more straightforward, as the detection of the database corruption is concentrated in a single place, the sqlite_util module. Significant code was removed from the pony_utils module, and the TriblerPool class is no longer necessary. Now, using context manager in Tribler code to handle the database corruption error is not necessary as the low-level connection and cursor methods handle it.

Also, some Core-GUI interaction problems were fixed, and GUI now correctly re-connects to the restarted Core in all cases.

@drew2a drew2a self-requested a review November 22, 2023 14:51
Comment on lines 54 to 67
if isinstance(e, DatabaseIsCorrupted):
# When the database corruption is detected, we should stop the process immediately.
# Tribler GUI will restart the process and the database will be recreated.
process_manager = get_global_process_manager()
process_manager.sys_exit(EXITCODE_DATABASE_IS_CORRUPTED, e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, it is necessary. The actual reason is a bit complex to explain:

An error that is raised during the component's startup is wrapped in ComponentStartupException. Then, it is handled in a special way:

    def _reraise_startup_exception_in_separate_task(self):
        self.logger.info('Reraise startup exception in separate task')

        async def exception_reraiser():
            self.logger.info('Exception reraiser')

            e = self._startup_exception
            if isinstance(e, ComponentStartupException) and e.component.tribler_should_stop_on_component_error:
                self.logger.info('Shutdown with exit code 1')
                self.exit_code = 1
                self.shutdown_event.set()

            # the exception should be intercepted by event loop exception handler
            self.logger.info(f'Reraise startup exception: {self._startup_exception}')
            raise self._startup_exception

        self.async_group.add_task(exception_reraiser())

This logic triggers a weird asyncio behavior described in https://togithub.com/python/cpython/issues/69675. In short, if CoreExceptionHandler.unhandled_error_observer is called during the Tribler Core shutdown, the SystemExit error raised from unhandled_error_observer can be suppressed and ignored by asyncio. Apparently, unhandled_error_observer can be called inside the Task.__del__() method that ignores all re-raised exceptions, including the SystemExit error. As a result, Tribler Core finishes with exit code 1 (instead of code 99 dedicated to the database corruption error), and Tribler GUI does not restart the core.

Fixing and refactoring the logic of unhandled_error_observer can be a tricky and complicated task that is outside of the current PR scope. On the other side, raising SystemExit directly from the component's method works well and provides the desired results.

Also, raising SystemExit in the component allows TriblerCore to restart without waiting first until all components start.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please put this explanation into the code?

src/tribler/core/utilities/sqlite_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/sqlite_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/tests/test_sqlite.py Outdated Show resolved Hide resolved
src/tribler/gui/event_request_manager.py Show resolved Hide resolved
@drew2a
Copy link
Contributor

drew2a commented Nov 22, 2023

We discussed the requested changes with @kozlovsky in person.

Copy link
Contributor

@drew2a drew2a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Please, add a comment about necessity of process_manager.sys_exit() in component.py.

@kozlovsky kozlovsky merged commit 349b76d into Tribler:release/7.13 Nov 23, 2023
17 checks passed
@kozlovsky kozlovsky deleted the fix/database_is_malformed branch November 27, 2023 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants