Heavy load with 338 instances #20
Comments
Hi Thomas, Thanks for the report! I think this is coming from the new per-db snapshot feature. Can you try a custom patch? Maybe this would be enough to fix the problem:
diff --git a/powa_collector/snapshot.py b/powa_collector/snapshot.py
index d7eb5e2..c0a1b82 100644
--- a/powa_collector/snapshot.py
+++ b/powa_collector/snapshot.py
@@ -113,6 +113,7 @@ def copy_remote_data_to_repo(cls, data_name,
buf = StringIO()
try:
data_src.copy_expert("COPY (%s) TO stdout" % data_src_sql, buf)
+ data_src.execute("RELEASE src")
except psycopg2.Error as e:
src_ok = False
err = "Error retrieving datasource data %s:\n%s" % (data_name, e)
@@ -125,6 +126,7 @@ def copy_remote_data_to_repo(cls, data_name,
try:
cls.logger.debug("Calling %s..." % cleanup_sql)
data_src.execute(cleanup_sql)
+ data_src.execute("RELEASE src")
except psycopg2.Error as e:
err = "Error while calling %s:\n%s" % (cleanup_sql, e)
errors.append(err)
@@ -142,6 +144,7 @@ def copy_remote_data_to_repo(cls, data_name,
try:
# For data import the schema is now on the repository server
data_ins.copy_expert("COPY %s FROM stdin" % target_tbl_name, buf)
+ data_ins.execute("RELEASE data")
except psycopg2.Error as e:
err = "Error while inserting data:\n%s" % e
cls.logger.warning(err)
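For context, this is my understanding of why the missing RELEASE matters (not part of the patch itself): every SAVEPOINT the snapshot code issues opens a subtransaction on that connection, and without a matching RELEASE they keep piling up inside the snapshot transaction, one per datasource and per remote database. Once a backend holds more than 64 open subtransactions, its per-backend subxid cache overflows and visibility checks by other sessions have to go through pg_subtrans, which is exactly the SubtransSLRU contention you reported. A minimal SQL illustration of the pattern (not the collector's literal statements):

-- sketch: unreleased savepoints accumulate as open subtransactions
BEGIN;
SAVEPOINT src;                -- opens subtransaction #1
COPY (SELECT 1) TO STDOUT;    -- stands in for one datasource export
-- without "RELEASE src", the next datasource stacks another one:
SAVEPOINT src;                -- opens subtransaction #2 (shadows, does not reuse)
-- ...repeated for every datasource and every remote database...
RELEASE src;                  -- what the patch adds: close it right away
COMMIT;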
I'm trying it now, I'll let you know ASAP.
Things are probably better; now I'm running into memory errors. I'll let you know on Monday, I think.
OK, that's at least a first bit of good news, I hope!
Also, many sessions are now experiencing LWLock waits on BufferMapping, as well as relation extension locks. I have to check this issue first.
This one looks like just a side effect of having that many servers on a fresh install: once you hit the coalesce and purge, autovacuum should kick in and you shouldn't hit the relation extension locks anymore. For the buffer mapping there is probably not much to do, apart from lowering the snapshot frequency or excluding some databases from the per-db snapshots. Do you usually have one database per instance, or a lot?
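A hedged sketch of the frequency knob, in case it helps: in powa 4 the per-remote snapshot interval is stored on the repository in the powa_servers table (the frequency column, in seconds, as set by powa_register_server()); I'm assuming it still works that way in powa 5, so double-check the column names on your version. Run on the repository database, something like this lowers the snapshot rate, after which the collector has to pick up the change (restarting powa-collector is the blunt way):

-- assumption: powa_servers.frequency (seconds) still drives the per-server
-- snapshot interval, and id 0 is the local/repository entry
UPDATE powa_servers
   SET frequency = 600   -- e.g. one snapshot every 10 minutes instead of 5
 WHERE id > 0;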
Thanks for the feedback. I usually have one db per instance. So I'll try to solve this memory issue first.
I committed the fix for the missing RELEASE commands. It's likely not the only problem, but the savepoints clearly should be released, so this is one less thing to worry about.
Hi, with a bit more CPUs and more RAM, I got to a load average of 190. I monitored the await with "
Are there some settings to try? I didn't notice anything in the documentation, but there's a high probability I missed something.
Unfortunately, without any indication at all of why the load is high, it's hard to give advice on what to do. Can you check for any bottleneck, or maybe it's just that you have 190 connections active in pg_stat_activity? One thing you could blindly try is to disable the per-db modules; something like that should work.
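On the "check for any bottleneck" part, a plain pg_stat_activity query on the repository server is usually enough to see whether the load is just many active backends and what they are waiting on (nothing powa-specific here):

-- active sessions grouped by what they are waiting on
SELECT state,
       wait_event_type,
       wait_event,
       count(*)
  FROM pg_stat_activity
 WHERE backend_type = 'client backend'
 GROUP BY 1, 2, 3
 ORDER BY count(*) DESC;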
Unfortunately, nothing happened on the documentation side. I didn't have time to update it and everyone was asking for a release ASAP, so your only option for now is to read the powa-archivist and powa-collector code to find out what you can do.
So, I did two things:
Also, subtransaction_buffers has been RESET since your latest changes to the code. I noticed a lot of buffer alloc wait events before, which led me to increase shared_buffers significantly. Both things seem to have solved the load issue. Now I'll activate the modules one after the other. I'll let you know.
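For anyone following along, the SLRU activity that motivated those settings can be watched in the pg_stat_slru view (available since PostgreSQL 13; the exact entry name for subtransactions differs between versions, hence the pattern match):

-- cumulative hit/read counters for the subtransaction SLRU
SELECT name, blks_hit, blks_read, blks_zeroed, stats_reset
  FROM pg_stat_slru
 WHERE name ILIKE '%subtrans%';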
Just as a side note, I started again from scratch with all modules activated, and when the load goes up to 200 I have the following activity:
I'll get back to it tomorrow and let you know.
Most of the activity seems to be on the catalog snapshot. This one is known to be expensive, so it's only done once a month. For you that should mean about a dozen per day, but since you're adding all the servers at once they're all being done concurrently, which can easily explain your problem. You could try to disable it (I don't remember how exactly; there is probably a table somewhere where you can exclude databases) and see whether the rest scales. We should probably improve the behavior, e.g. only do an initial sync and let users refresh the catalogs on demand since the UI has this feature. But even if that were implemented you would still hit a massive load in your specific scenario :(
Hello, I only got back to this topic now. As discussed on IRC, I added pgBouncer between the collector and the db. This helped a lot: there are no more long-standing locks or locked sessions. The collector seems to behave correctly. I'll run some more experiments on this side and give you more detailed feedback. Thanks for your support.
With the help of pgBouncer, the load was seriously reduced and everything behaves correctly. I've set max_db_connections to 10 for the collector, so the load remains normal.
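For reference, the pgBouncer side of that is tiny. This is only a sketch under my assumptions (database name, host and ports are placeholders; the one line that matters is max_db_connections, and I kept the default session pooling since the collector holds long-lived connections and, as far as I know, uses LISTEN/NOTIFY on the repository):

; pgbouncer.ini (sketch)
[databases]
powa = host=127.0.0.1 port=5432 dbname=powa

[pgbouncer]
listen_port = 6432
pool_mode = session        ; default; safest choice for the collector
max_db_connections = 10    ; cap server connections per database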
Thanks for your support!
Thanks for the feedback. Do you have some numbers to share, to get an idea of what "a lot more important" means? Is it due to powa-team/powa-archivist#80 only, or to other things too? According to that issue it doesn't look like the extra space is due to powa 5 per se but to adding hundreds of servers at once, and the numbers there suggest about 132 MB per server, which doesn't feel like a lot. Note that I do expect powa 5 to consume more disk by default because it gathers more metrics, but I have no idea how much more on average.
It seems that my main issue regarding disk space usage was powa-team/powa-archivist#80. After a whole run overnight, powa_statements_history_current now stays at about 15 GB, instead of more than 50-60 GB. I'll be able to give you some numbers in 10 days, because that's how our retention is set up.
Oops, I forgot that I had set powa_coalesce to 50 instead of 100, to check that the aggregation work is spread over time. As that's now OK, I've gone back to 100 snapshots between aggregations, so we'll have to wait a few more days to see how it behaves.
Hello,
Just some quick feedback on using powa-collector 1.3.0 with PoWA 5.0. We have 338 PostgreSQL instances monitored by a single PoWA instance. While running, the collector generates a huge load on our server: the load average goes up to 50. I notice a lot of sessions in a waiting state with wait_event_type LWLock and wait_event SubtransSLRU.
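That observation comes straight from pg_stat_activity; something like the following (nothing powa-specific) shows how many backends are stuck on that particular wait at any moment:

-- backends currently waiting on the subtransaction SLRU
SELECT count(*)
  FROM pg_stat_activity
 WHERE wait_event_type = 'LWLock'
   AND wait_event = 'SubtransSLRU';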
Just a few minutes after upgrading our PoWA db to PostgreSQL 17, we see huge activity on the subtransaction SLRU:
I'm not sure that playing with subtransaction_buffers will help much here, as I only see hits.
Things go better after setting subtransaction_buffers to '1GB': SLRU blks_hit shows up less often. However, the load goes up even more; I have to investigate this further, probably a mix of an infrastructure issue and something else. I'll let you know. I thought it would be a good idea to share this with you.
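For completeness, this is roughly how such a change is applied; a sketch only, and note that subtransaction_buffers exists only in PostgreSQL 17+ and, as far as I know, is read at server start, so a restart is needed for it to take effect:

ALTER SYSTEM SET subtransaction_buffers = '1GB';
-- restart PostgreSQL, then verify:
SHOW subtransaction_buffers;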