Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the detection of running processes by including process UID to the check #7637

Closed
wants to merge 1 commit into from

Conversation

kozlovsky
Copy link
Contributor

@kozlovsky kozlovsky commented Oct 19, 2023

This PR should fix #7603 by associating a unique UID value with each process. It fixes the problem of non-unique PID values.
In addition, each process periodically (1 time per second) updates its last_alive_at time in the database. When the last_alive_at time is not updated long enough (for 5 seconds), the process is considered not running.

Now, the following situations should be handled correctly:

  1. In Flatpak, when the new GUI process can't find a currently running primary GUI process, it understands that the last_alive_at was recently updated and that the primary process is running in a separate virtual environment.
  2. When a GUI process creates a Core process and awaits for the api_port value, it understands that the correct Core process should specify the correct UID of the current GUI process. That prevents connecting with the wrong Core process.
  3. When Core and GUI processes display the recent process list, they are correctly linked to each other.

@kozlovsky kozlovsky force-pushed the fix/process_uid branch 5 times, most recently from 702a8b3 to a1bd858 Compare October 19, 2023 07:44
…ld process. Stop relying on PIDs when detecting running processes
@kozlovsky kozlovsky marked this pull request as ready for review October 19, 2023 08:07
@kozlovsky kozlovsky requested review from a team and xoriole and removed request for a team October 19, 2023 08:07
@xoriole xoriole requested a review from drew2a October 23, 2023 07:27
@drew2a
Copy link
Contributor

drew2a commented Oct 23, 2023

I've been added as a reviewer for this PR. I noticed another PR addresses the same issue. Have you (@kozlovsky, @xoriole) reached a consensus on which approach we should take?

@drew2a
Copy link
Contributor

drew2a commented Oct 23, 2023

Just a quick note: from my perspective, this PR is unlikely to resolve #7603, as it addresses a very specific flatpak problem, which is just a small part of the original issue.

The platform distribution for the original issue is as follows:

image

@kozlovsky
Copy link
Contributor Author

kozlovsky commented Oct 23, 2023

@drew2a @xoriole

First, I agree that this is not a complete fix for #7603. Essentially, this Sentry issue combines two unrelated bugs that look similar to Sentry.

  1. A Flatpak-specific bug due to simultaneous processes having the same PID.
  2. Another bug with an unclear origin.

Second, I disagree that the Flatpak issue is only a small part of the original issue. I manually traversed through all currently available reports of this sentry issue and counted how many events are related to Flatpak: https://sentry.tribler.org/organizations/tribler/issues/2003/events/latest/?project=7&referrer=latest-event

In the available reports, I counted the following:

  • 35 error reports from Flatpak environments;
  • 49 error reports from non-Flatpak environments

So, the Flatpak-related issue corresponds to 41.7% of all recorded reports. It is a pretty significant number.

By addressing the Flatpak problem in 7.13.1 with this PR, we can demystify the reports, making the second bug easier to isolate and investigate.

Regarding another PR #7621 my opinion is:

  • The "quick fix" in Fix multiple instances of Tribler on Flatpak #7621 does not look correct to me, as it introduces a race condition similar to what we already encountered before. It fixes the Flatpack-related bug but introduces a different bug due to that race condition.
  • It is theoretically possible to completely remove the lock file and rely only on QLocalSocket. But it is a pretty significant change that looks not like a simple fix but like a major rewriting that requires pretty significant efforts to implement. For that reason, it is hard to do it in a bugfix release. We should expect that this major rewrite can introduce completely new bugs and should have time to fix them in several RC releases of the following major release.

For that reason, I suggest the following plan:

  • In 7.13.1, we can solve the Flatpack issue using the fix from the current PR
  • It should allow us to investigate and fix the second (non-Flatpak) issue in the later release.
  • In the next major release, we can try switching to a new purely QLocalSocket approach for process coordination and stop relying on file locks. However, implementing it is a non-trivial task.

@drew2a
Copy link
Contributor

drew2a commented Oct 24, 2023

I haven't done a deep dive into the PR yet, but my initial impression is one of concern due to its complexity. I believe it was already intricate to the point where, I suspect, no one besides @kozlovsky fully grasped its logic. Now, it appears even more complicated with the addition of a new thread and the GUI_UID_ENV_KEY environment variable. The underlying logic has also grown in complexity.

When I approved the PR, I assumed that despite its complexity, it would address the "two instance" problem, which remains unresolved.

I'd suggest considering a rollback of the entire SQL lock implementation in favor of a slightly tweaked previous approach.

Below is a simplified example of a potential file lock solution. It's incredibly straightforward and would protect us from race conditions, which was the original intention behind implementing #7212.

import asyncio

from filelock import FileLock, Timeout

lock = FileLock("gui.lock", timeout=0)


async def run_tribler_gui():
    try:
        with lock:
            print('Tribler is running')
            await asyncio.sleep(10)  # simulation of a normal tribler run
    except Timeout:
        print("Tribler is already running. Closing the Tribler...")
        exit(1)


asyncio.run(run_tribler_gui())

Ref: https://py-filelock.readthedocs.io/en/latest/index.html

@drew2a
Copy link
Contributor

drew2a commented Oct 24, 2023

The simplified core part could be like:

import asyncio

from filelock import FileLock, Timeout

lock = FileLock("core.lock", timeout=0)
port_file_name = 'core.port'


async def run_tribler_core():
    try:
        with lock:
            print('Tribler core is running')
            print('Creating REST server..')
            # assume it has been created on port 1234
            port = 1234
            with open(port_file_name, "w+") as f:
                f.write(str(port))

            await asyncio.sleep(10)  # simulation of a normal tribler run

    except Timeout:
        print("Tribler core is already running. Closing the Tribler core...")
        exit(1)


asyncio.run(run_tribler_core())

@kozlovsky
Copy link
Contributor Author

Yes, I agree that it is possible for us to use the filelock library. I considered it before implementing SQLite-based lock implementation, but at that moment, the filelock library was considered unmaintained and had multiple open issues with different problems on different OSes. Since then, the library has been adopted by virtualenv maintainers, and multiple issues have been fixed. So it is probably safe now to use it.

With this, I doubt that the full solution will be as simple as in the suggested example:

  1. SQLite is more widespread on different systems compared to filelock, and its lock implementation can be more robust.

  2. Some possible bugs are still present in the filelock library, so by replacing SQLite-based locks with filelock locks, we can encounter new error cases.

  3. One of the reasons for Add ProcessLocker to ensure that no more than one Tribler GUI/Core process runs locally at any moment #7212 was that not only do Core processes require proper locking, but GUI processes need this as well. Originally, Tribler relied on QLocalSocket to ensure the proper "locking" of the GUI processes, but it turned out that QLocalSocket implementation has inherent race conditions that were impossible to solve without some additional mechanism. This means that with filelock, you most probably need not one, but several file locks, one for GUI processes and another for Core processes. You need to keep different data in lock-protected files for Core and GUI, and it can quickly become more complicated compared to keeping all information in a single file.

  4. As the filelock user reports, it is possible for a terminated process to leave the lock file on disk, and in that case, it can be necessary to manually delete the lock file:

    As noted by a comment at the blog post, this solution isn't "perfect", in that it's possible for the program to terminate in such a way that the lock is left in place and you have to manually delete the lock before the file becomes accessible again.

    If in Tribler we encounter this type of problem with filelock, it can greatly complicate the startup logic.

  5. The current implementation not only solves the lock problem but also maintains the list of currently running processes and the history of the last processes, which allows us to detect tricky bugs. It may be inconvenient to lose this type of valuable information.

@synctext
Copy link
Member

synctext commented Oct 25, 2023

As discussed at length: lets try to keep it simple; use a file-lock and new dependency.

Always a good idea to see if we can keep it simple, as long as we are 90% certain it will work. So a 10% chance of introducing new errors is fine, when we might keep-things-simple 🤔
@drew2a @kozlovsky @egbertbouman @xoriole Please review the upcoming code. Goal is 2 days of simple code.
Simplify database (process ID), no new thread, and start relying on the file lock.
As introduced by @kozlovsky What about the start of 2 GUI processes on flatpack, with the similar process IDs? (just users misclicking on Tribler, double startup, elusively on Flatpack: no-fix)

Lets add this idea to the Application Tester, it is the primary usage scenario of Tribler. To download something. If a user clicks on a torrent file, it will open a new instance of Tribler, and not just double clicking

@kozlovsky
Copy link
Contributor Author

Closed in favor of #7660

@kozlovsky kozlovsky closed this Nov 16, 2023
@kozlovsky kozlovsky deleted the fix/process_uid branch June 13, 2024 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants