[bug] Setting core.download:parallel causes ERROR: database is locked #13924
Hi @sykhro, thanks for your report. I have been trying to reproduce this on Windows 10 with Python 3.11, downloading opencv from ConanCenter, which has a bunch of transitive dependencies, and I haven't been able to make it fail (ran it several times). Some quick questions:
Note that I have also not been able to reproduce this with a similar setup on macOS.
> Do you mean that there are parallel OS-level jobs/tasks, running over the same cache?
No, no parallel jobs. We call this function sequentially for each build config (Debug, Release) and for two different conanfiles:

```python
import os
import subprocess


def _install_from_conanfile(
    path_to_conanfile: str,
    path_to_host_profile: str,
    path_to_build_profile: str,
    deploy_folder=None,
    environment=None,
    build_missing=False,
):
    conan_run_args = [
        "conan",
        "install",
        path_to_conanfile,
        f"--profile:host={path_to_host_profile}",
        f"--profile:build={path_to_build_profile}",
        "--generator=CMakeToolchain",
    ]
    if build_missing:
        conan_run_args.append("--build=missing")
    if deploy_folder:
        # Splice the deploy flags in just before the last argument.
        conan_run_args[-1:-1] = [
            "--deploy",
            "full_deploy",
            "--output-folder",
            deploy_folder,
        ]
    conan_res = subprocess.run(
        conan_run_args,
        env=environment if environment else os.environ,
    )
    return conan_res
```
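For illustration, the call pattern looks like this (the paths and profile names here are hypothetical placeholders, not our real ones):

```python
# Hypothetical usage: sequential installs per conanfile and per build config.
for conanfile in ("app/conanfile.py", "tests/conanfile.py"):  # placeholder paths
    for config in ("Debug", "Release"):
        _install_from_conanfile(
            conanfile,
            path_to_host_profile=f"profiles/{config.lower()}_host",  # placeholder
            path_to_build_profile="profiles/build",                  # placeholder
            build_missing=True,
        )
```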
Then yeah, maybe the high core count is the necessary ingredient to trigger this. Thanks for the feedback.
The same thing now happened on my servers. The Docker container which runs the conan commands has [...]. In my case, there was a parallel run, but each Jenkins node has its own Docker container that doesn't share anything.
Thanks @DoDoENT for the extra feedback.
Ok, I have had a very interesting debugging session, resulting in #14410. It turned out that the 1-second timeout in the DB was not enough to acquire the locks under excessive concurrency, and that resulted in this error. I didn't expect that this timeout would actually trigger when there were concurrent accesses and locks, so I spent an hour looking in all the wrong places. I have been testing raising it to 10 seconds, and it seems to be working fine now. Thanks again for the feedback!
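For reference, this is the standard sqlite3 busy timeout; a minimal sketch of what raising it means at the API level (not Conan's actual cache code):

```python
import sqlite3

# The busy timeout controls how long a connection waits for a lock held by
# another connection before raising "database is locked".
conn = sqlite3.connect("cache.sqlite3", timeout=10)  # seconds; sketch value

# Equivalently, it can be changed on an existing connection (milliseconds):
conn.execute("PRAGMA busy_timeout = 10000")
```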
We are still occasionally getting this error on our CI servers, even with Conan 2.0.10 (haven't tested with 2.0.11, but nothing in the release notes indicates that there were any changes there).
This still happens with Conan 2.0.13 on servers with 56 threads under heavy server load.
Let's try to raise the timeout to 20 seconds. Is it possible that the servers are just too loaded, so the system becomes a bit too unresponsive?
That was the most probable cause in our case.
🤔 Wouldn't it just be better to add a new config for this timeout?
I think not, because when this error happens it is quite a difficult thing to understand and connect the dots to that conf.
I think that if the system cannot sync a simple DB read/write in 20 seconds, then the problem is a different one, and the system is already degraded to a point where other things can also go wrong. But I get your point: let's give it 20 seconds for the moment, and if it happens again we will consider the conf.
Look what I found: python/cpython#70812. It seems that it is actually a "bug" in the CPython sqlite3 module. Check the file https://bugs.python.org/file42259/unintuitive_sqlite_behavior.py. There seems to be no option to fix this in <3.12, where the CPython devs added the new autocommit attribute.
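My understanding of the unintuitive behavior, as a minimal sketch (this is not the script from the bug report): with the module's default isolation_level, a DML statement implicitly opens a transaction that stays open until commit() or close(), so a second connection's write can time out:

```python
import sqlite3

c1 = sqlite3.connect("demo.db")
c1.execute("create table if not exists t (v int)")
c1.commit()

# With the default isolation_level, this INSERT implicitly opens a transaction
# that keeps holding a write lock until commit() or close().
c1.execute("insert into t values (1)")

c2 = sqlite3.connect("demo.db", timeout=0.1)
try:
    c2.execute("insert into t values (2)")  # cannot get the write lock in time
except sqlite3.OperationalError as exc:
    print(exc)  # database is locked
```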
I'm not 100% sure about this, since you use [...]. Taken from the same unintuitive_sqlite_behavior.py, you may try to add [...]
Yes, exactly, we did this in the past to avoid issues and to make sure everything was a transaction. The new autocommit from Python 3.12 is not enough, because we will not be able to require Python 3.12 for at least a few years :(, so we should try other approaches. Adding a cursor [...]
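One pre-3.12 way to "make sure everything is a transaction" (a hedged sketch of the general technique, not Conan's actual code) is to disable the module's implicit transaction handling and issue explicit, short transactions:

```python
import sqlite3

conn = sqlite3.connect("demo.db", isolation_level=None)  # true autocommit mode
conn.execute("create table if not exists t (v int)")

conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
try:
    conn.execute("insert into t values (3)")
    conn.execute("COMMIT")
except Exception:
    conn.execute("ROLLBACK")
    raise
```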
I think the proper way is to first check that the original script from the Python issue works, then modify it to have the same code as in Conan and see if it still shows database locking. I can try to do it later.
Hi @vient, did you manage to try that? I am following up on the open items for the 2.0.14 release; we'd like to release it next week.
Ok, I planned to do it later, but will try today then.
I did not manage to create a reproducer based on that CPython bug if the connection is always closed after use. I noticed one interesting effect, though: making DB access exclusive improves speed a lot in tests with 100+ threads (and it's hard to measure the effect with fewer threads). By exclusivity I mean placing the whole connection lifetime under a single process-wide lock, as in the example below.
Example:

```python
from contextlib import contextmanager
import sqlite3
import threading

DB_FILE = "test.db"

_db_lock = threading.Lock()


@contextmanager
def db_connection():
    # Serialize all DB access: hold a process-wide lock for the whole
    # lifetime of each connection.
    with _db_lock:
        connection = sqlite3.connect(DB_FILE, isolation_level=None, timeout=120)
        try:
            yield connection
        finally:
            connection.close()


with db_connection() as conn:
    cursor = conn.cursor()
    cursor.execute("create table if not exists t (k int primary key, v int)")
    with conn:
        cursor.execute("insert or ignore into t (k, v) values (1, 1), (2, 2)")


def fn():
    try:
        with db_connection() as conn:
            conn.execute("select k from t2").fetchone()  # t2 doesn't exist: raises
    except Exception:
        with db_connection() as conn2:
            conn2.execute("update t set v = 3 where k = ?", (2,))
    with db_connection() as conn3:
        conn3.execute("update t set v = 3 where k = ?", (2,))


threads = []
for _ in range(1000):
    threads.append(threading.Thread(target=fn))
for thr in threads:
    thr.start()
for thr in threads:
    thr.join()
```

Here I am creating 1000 threads. This version runs in 0.5s, and this time is stable. Removing the `_db_lock` makes it significantly slower, and the timing becomes unstable.
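The timings can be measured with something like this (a simplified, hypothetical harness, replacing the start/join loops at the end of the script above):

```python
import time

# Time the full start/join cycle of all worker threads.
start = time.perf_counter()
for thr in threads:
    thr.start()
for thr in threads:
    thr.join()
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```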
Indeed, very interesting. I wouldn't have expected this; I would have trusted the sqlite3 internals to have equally efficient locking. But that is the thing with performance and parallelism: most often it is difficult to know, intuition doesn't work, and only profiling is useful. Good investigation.
I have run it on Windows, and the savings are not as dramatic, but still large: for example, with 300 threads it reduces from 12 seconds to 6 seconds. Still a big win, it seems worth it.
Without diving into the SQLite code, I can only guess that waiting on a single lock is more efficient because the threads wait in a futex syscall and the kernel can efficiently resume one of them. SQLite on Linux uses POSIX file locks via fcntl(), which may not be implemented as efficiently? At least the locking logic in SQLite is surely more complicated. Concurrent access may be beneficial when there are many tables or mostly reads, but in this case there are only two tables, recipes and packages, and all threads perform writes.
Closed by #15029
This has now been replaced with this error (Conan v2.9.3): [...]
It happens at random, only on servers with a high number of cores (128 in our case), under high server load. Is there a config option to increase this timeout?
Not sure this timeout is needed at all. If it is there as a "just-in-case" check so Conan does not get stuck forever, it may as well be increased to 10 minutes? IMO that's better than adding a config option. I think the timeout could even be removed, because all DB operations look pretty safe; when they take long, it seems to always be because of high load from other processes, which is normal.
Environment details

Steps to reproduce

Set `core.download:parallel` to some value; in our case, this is set in global.conf. Then run `conan install ...`, which fails with `sqlite3.OperationalError: database is locked`.

I should note that `conan install` is invoked in succession by a Python script. However, no amount of `sleep()` seems to fix the issue, and unsetting `core.download:parallel` makes the problem disappear.

Logs
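For illustration, setting this in global.conf looks like the following (the value 8 is an arbitrary example; the reporter's actual value was elided):

```ini
# ~/.conan2/global.conf
core.download:parallel = 8
```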