Heavy load with 338 instances #20
Comments
Hi Thomas, Thanks for the report! I think this is coming from the new per-db snapshot feature. Can you try a custom patch? Maybe this would be enough to fix the problem:
diff --git a/powa_collector/snapshot.py b/powa_collector/snapshot.py
index d7eb5e2..c0a1b82 100644
--- a/powa_collector/snapshot.py
+++ b/powa_collector/snapshot.py
@@ -113,6 +113,7 @@ def copy_remote_data_to_repo(cls, data_name,
buf = StringIO()
try:
data_src.copy_expert("COPY (%s) TO stdout" % data_src_sql, buf)
+ data_src.execute("RELEASE src")
except psycopg2.Error as e:
src_ok = False
err = "Error retrieving datasource data %s:\n%s" % (data_name, e)
@@ -125,6 +126,7 @@ def copy_remote_data_to_repo(cls, data_name,
try:
cls.logger.debug("Calling %s..." % cleanup_sql)
data_src.execute(cleanup_sql)
+ data_src.execute("RELEASE src")
except psycopg2.Error as e:
err = "Error while calling %s:\n%s" % (cleanup_sql, e)
errors.append(err)
@@ -142,6 +144,7 @@ def copy_remote_data_to_repo(cls, data_name,
try:
# For data import the schema is now on the repository server
data_ins.copy_expert("COPY %s FROM stdin" % target_tbl_name, buf)
+ data_ins.execute("RELEASE data")
except psycopg2.Error as e:
err = "Error while inserting data:\n%s" % e
cls.logger.warning(err)
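For context, this is my understanding of why the missing RELEASE matters (not part of the patch itself): every SAVEPOINT the snapshot code issues opens a subtransaction on that connection, and without a matching RELEASE they keep piling up inside the snapshot transaction, one per datasource and per remote database. Once a backend holds more than 64 open subtransactions, its per-backend subxid cache overflows and visibility checks by other sessions have to go through pg_subtrans, which is exactly the SubtransSLRU contention you reported. A minimal SQL illustration of the pattern (not the collector's literal statements):

-- sketch: unreleased savepoints accumulate as open subtransactions
BEGIN;
SAVEPOINT src;                -- opens subtransaction #1
COPY (SELECT 1) TO STDOUT;    -- stands in for one datasource export
-- without "RELEASE src", the next datasource stacks another one:
SAVEPOINT src;                -- opens subtransaction #2 (shadows, does not reuse)
-- ...repeated for every datasource and every remote database...
RELEASE src;                  -- what the patch adds: close it right away
COMMIT;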
I'm trying it now, I'll let you know ASAP.
Things are probably better; now I'm running into memory errors. I'll let you know on Monday, I think.
OK, that's at least a first bit of good news, I hope!
Also, many sessions are now experiencing LWLock waits on BufferMapping, as well as relation extension locks. I have to check this issue first.
This one looks like just a side effect of having that many servers on a fresh install: once you hit the coalesce and purge, autovacuum should kick in and you shouldn't hit the relation extension locks anymore. For the buffer mapping there is probably not much to do, apart from lowering the snapshot frequency or excluding some databases from the per-db snapshots. Do you usually have one database per instance, or a lot?
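A hedged sketch of the frequency knob, in case it helps: in powa 4 the per-remote snapshot interval is stored on the repository in the powa_servers table (the frequency column, in seconds, as set by powa_register_server()); I'm assuming it still works that way in powa 5, so double-check the column names on your version. Run on the repository database, something like this lowers the snapshot rate, after which the collector has to pick up the change (restarting powa-collector is the blunt way):

-- assumption: powa_servers.frequency (seconds) still drives the per-server
-- snapshot interval, and id 0 is the local/repository entry
UPDATE powa_servers
   SET frequency = 600   -- e.g. one snapshot every 10 minutes instead of 5
 WHERE id > 0;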
Thanks for the feedback. I usually have one db per instance. So I'll try to solve this memory issue first.
I committed the fix for the missing RELEASE commands. It's likely not the only problem, but the savepoints clearly should be released, so this is one less thing to worry about.
Hi, with a bit more CPUs and more RAM, I got to a load average of 190. I monitored the await with "
Are there some settings to try? I didn't notice anything in the documentation, but there's a high probability I missed something.
Unfortunately, without any indication at all of why the load is high, it's hard to give advice on what to do. Can you check for any bottleneck, or maybe it's just that you have 190 connections active in pg_stat_activity? One thing you could blindly try is to disable the per-db modules; something like that should work.
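On the "check for any bottleneck" part, a plain pg_stat_activity query on the repository server is usually enough to see whether the load is just many active backends and what they are waiting on (nothing powa-specific here):

-- active sessions grouped by what they are waiting on
SELECT state,
       wait_event_type,
       wait_event,
       count(*)
  FROM pg_stat_activity
 WHERE backend_type = 'client backend'
 GROUP BY 1, 2, 3
 ORDER BY count(*) DESC;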
Unfortunately, nothing happened on the documentation side. I didn't have time to update it and everyone was asking for a release ASAP, so your only option for now is to read the powa-archivist and powa-collector code to find out what you can do.
So, I did two things:
Also, subtransaction_buffers has been RESET since your latest changes to the code. I noticed a lot of buffer alloc wait events before, which led me to increase shared_buffers significantly. Both things seem to have solved the load issue. Now I'll activate the modules one after the other. I'll let you know.
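For anyone following along, the SLRU activity that motivated those settings can be watched in the pg_stat_slru view (available since PostgreSQL 13; the exact entry name for subtransactions differs between versions, hence the pattern match):

-- cumulative hit/read counters for the subtransaction SLRU
SELECT name, blks_hit, blks_read, blks_zeroed, stats_reset
  FROM pg_stat_slru
 WHERE name ILIKE '%subtrans%';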
Just as a side note, I started again from scratch with all modules activated, and when the load goes up to 200 I have the following activity:
I'll get back to it tomorrow and let you know.
Most of the activity seems to be on the catalog snapshot. This one is known to be expensive, so it's only done once a month. For you that should mean about a dozen per day, but since you're adding all the servers at once they're all being done concurrently, which can easily explain your problem. You could try to disable it (I don't remember how exactly; there is probably a table somewhere where you can exclude databases) and see whether the rest scales. We should probably improve the behavior, e.g. only do an initial sync and let users refresh the catalogs on demand since the UI has this feature. But even if that were implemented you would still hit a massive load in your specific scenario :(
Hello, I only got back to this topic now. As discussed on IRC, I added pgBouncer between the collector and the db. This helped a lot: there are no more long-standing locks or locked sessions. The collector seems to behave correctly. I'll run some more experiments on this side and give you more detailed feedback. Thanks for your support.
With the help of pgBouncer, the load was seriously reduced and everything behaves correctly. I've set max_db_connections to 10 for the collector, so the load remains normal.
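For reference, the pgBouncer side of that is tiny. This is only a sketch under my assumptions (database name, host and ports are placeholders; the one line that matters is max_db_connections, and I kept the default session pooling since the collector holds long-lived connections and, as far as I know, uses LISTEN/NOTIFY on the repository):

; pgbouncer.ini (sketch)
[databases]
powa = host=127.0.0.1 port=5432 dbname=powa

[pgbouncer]
listen_port = 6432
pool_mode = session        ; default; safest choice for the collector
max_db_connections = 10    ; cap server connections per database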
Thanks for your support!
Thanks for the feedback. Do you have some numbers to share, to get an idea of what "a lot more important" means? Is it due to powa-team/powa-archivist#80 only, or to other things too? According to that issue it doesn't look like the extra space is due to powa 5 per se but to adding hundreds of servers at once, and the numbers there suggest about 132 MB per server, which doesn't feel like a lot. Note that I do expect powa 5 to consume more disk by default because it gathers more metrics, but I have no idea how much more on average.
It seems that my main issue regarding disk space usage was powa-team/powa-archivist#80. After a whole run overnight, powa_statements_history_current now stays at about 15 GB, instead of more than 50-60 GB. I'll be able to give you some numbers in 10 days, because that's how our retention is set up.
Oops, I forgot that I had set powa_coalesce to 50 instead of 100, to check that the aggregation work is spread over time. As that's now OK, I've gone back to 100 snapshots between aggregations, so we'll have to wait a few more days to see how it behaves.
Hello,
Just some quick feedback on using powa-collector 1.3.0 with PoWA 5.0. We have 338 PostgreSQL instances monitored by a single PoWA instance. While running, the collector generates a huge load on our server: the load average goes up to 50. I notice a lot of sessions in a waiting state with wait_event_type LWLock and wait_event SubtransSLRU.
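That observation comes straight from pg_stat_activity; something like the following (nothing powa-specific) shows how many backends are stuck on that particular wait at any moment:

-- backends currently waiting on the subtransaction SLRU
SELECT count(*)
  FROM pg_stat_activity
 WHERE wait_event_type = 'LWLock'
   AND wait_event = 'SubtransSLRU';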
Just a few minutes after upgrading our PoWA db to PostgreSQL 17, we see huge activity on the subtransaction SLRU:
I'm not sure that playing with subtransaction_buffers will help much here, as I only see hits.
Things go better after setting subtransaction_buffers to '1GB': SLRU blks_hit shows up less often. However, the load goes up even more; I have to investigate this further, probably a mix of an infrastructure issue and something else. I'll let you know. I thought it would be a good idea to share this with you.
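For completeness, this is roughly how such a change is applied; a sketch only, and note that subtransaction_buffers exists only in PostgreSQL 17+ and, as far as I know, is read at server start, so a restart is needed for it to take effect:

ALTER SYSTEM SET subtransaction_buffers = '1GB';
-- restart PostgreSQL, then verify:
SHOW subtransaction_buffers;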