powa_kcache_metrics_current grows enormously #42

Closed
banlex73 opened this issue Mar 17, 2021 · 12 comments

@banlex73
Contributor

No description provided.

@banlex73
Contributor Author

What I found on my PoWA repository DB:
powa_kcache_metrics_current was as big as 133 GB,
and SELECT powa_kcache_aggregate(13) threw an error:

ERROR:  duplicate key value violates unique constraint "powa_kcache_metrics_pkey"
DETAIL:  Key (srvid, coalesce_range, queryid, dbid, userid, top)=(13, ["2021-03-16 12:19:31.162797-07","2021-03-16 16:46:16.570921-07"], -9206572327289564196, 862035, 861727, t) already exists.
CONTEXT:  PL/pgSQL function powa_kcache_aggregate(integer) line 12 at SQL statement
My fix was adding ON CONFLICT DO NOTHING;
I don't know if that is an acceptable solution.

```sql
CREATE OR REPLACE FUNCTION public.powa_kcache_aggregate(_srvid integer)
 RETURNS void
 LANGUAGE plpgsql
AS $function$
DECLARE
    result bool;
    v_funcname text := 'powa_kcache_aggregate(' || _srvid || ')';
    v_rowcount bigint;
BEGIN
    PERFORM powa_log(format('running %I', v_funcname));

    PERFORM powa_prevent_concurrent_snapshot(_srvid);

    -- aggregate metrics table
    INSERT INTO public.powa_kcache_metrics (coalesce_range, srvid, queryid,
                                            top, dbid, userid, metrics,
                                            mins_in_range, maxs_in_range)
        SELECT tstzrange(min((metrics).ts), max((metrics).ts), '[]'),
        srvid, queryid, top, dbid, userid, array_agg(metrics),
        ROW(min((metrics).ts),
            min((metrics).plan_reads), min((metrics).plan_writes),
            min((metrics).plan_user_time), min((metrics).plan_system_time),
            min((metrics).plan_minflts), min((metrics).plan_majflts),
            min((metrics).plan_nswaps),
            min((metrics).plan_msgsnds), min((metrics).plan_msgrcvs),
            min((metrics).plan_nsignals),
            min((metrics).plan_nvcsws), min((metrics).plan_nivcsws),
            min((metrics).exec_reads), min((metrics).exec_writes),
            min((metrics).exec_user_time), min((metrics).exec_system_time),
            min((metrics).exec_minflts), min((metrics).exec_majflts),
            min((metrics).exec_nswaps),
            min((metrics).exec_msgsnds), min((metrics).exec_msgrcvs),
            min((metrics).exec_nsignals),
            min((metrics).exec_nvcsws), min((metrics).exec_nivcsws)
        )::powa_kcache_type,
        ROW(max((metrics).ts),
            max((metrics).plan_reads), max((metrics).plan_writes),
            max((metrics).plan_user_time), max((metrics).plan_system_time),
            max((metrics).plan_minflts), max((metrics).plan_majflts),
            max((metrics).plan_nswaps),
            max((metrics).plan_msgsnds), max((metrics).plan_msgrcvs),
            max((metrics).plan_nsignals),
            max((metrics).plan_nvcsws), max((metrics).plan_nivcsws),
            max((metrics).exec_reads), max((metrics).exec_writes),
            max((metrics).exec_user_time), max((metrics).exec_system_time),
            max((metrics).exec_minflts), max((metrics).exec_majflts),
            max((metrics).exec_nswaps),
            max((metrics).exec_msgsnds), max((metrics).exec_msgrcvs),
            max((metrics).exec_nsignals),
            max((metrics).exec_nvcsws), max((metrics).exec_nivcsws)
        )::powa_kcache_type
        FROM powa_kcache_metrics_current
        WHERE srvid = _srvid
        GROUP BY srvid, queryid, top, dbid, userid
        ON CONFLICT DO NOTHING;
```

@rjuju rjuju self-assigned this Mar 18, 2021
@rjuju
Member

rjuju commented Mar 18, 2021

Hi,

Do you have some monitoring showing what the growth looks like? If it's a constant and regular growth, then it's likely due to inadequate autovacuum tuning.
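
As a rough check (nothing PoWA-specific, just the standard pg_stat_user_tables view), something like this would show whether autovacuum keeps up with the dead tuples on that table:

```sql
-- Rough check: dead-tuple count and last autovacuum run
-- on the aggregation source table from this issue.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'powa_kcache_metrics_current';
```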

For the ON CONFLICT, it's a workaround for a bug. So sure, for now you can keep it as is, but it's a sign that something is going wrong here, and the root problem should be fixed instead.

That being said, I have no clue how that could be happening. It seems that the function is aggregating the same data twice, but that shouldn't be possible because the function removes the data it just aggregated. As the function is atomic, you should either have both aggregated the data and removed the source, OR done nothing.

Ideally, what should be done is to remove the ON CONFLICT clause, wait to see if the error happens again and, if it does, check whether the problematic record exists in the powa_kcache_metrics table.
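
For example, that check could look like the sketch below, reusing the key values from the error message above (the column names are the ones listed in the INSERT earlier in this thread):

```sql
-- Sketch only: does the conflicting key from the error message
-- already exist in the aggregated table?
SELECT srvid, coalesce_range, queryid, dbid, userid, top
FROM powa_kcache_metrics
WHERE srvid = 13
  AND queryid = -9206572327289564196
  AND dbid = 862035
  AND userid = 861727
  AND top;
```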

@banlex73
Contributor Author

Unfortunately, I don't have any monitoring in place.
I monitor 6 clusters (253 DBs), retention policy looks like:

| DBs | Retention |
| --- | --- |
| 4 | 3 days |
| 4 | 7 days |
| 245 | 2 days |
The table was bloated; after a VACUUM FULL it went down to 12 GB (133 GB before).
I use the default autovacuum settings in Postgres and nothing special on that table.
To find the root cause, I am going to remove the ON CONFLICT clause and start monitoring the table carefully to understand what causes the issue.
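
For instance, a minimal size check along those lines, sampled periodically (nothing PoWA-specific), could be:

```sql
-- Hypothetical monitoring query: log the on-disk size of the table over time.
SELECT now() AS sampled_at,
       pg_size_pretty(pg_total_relation_size('powa_kcache_metrics_current')) AS total_size;
```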

Thank you for the support.

@banlex73 banlex73 changed the title from "powa_kcache_metrics_current grows enorm" to "powa_kcache_metrics_current grows enormously" Mar 18, 2021
@banlex73
Contributor Author

banlex73 commented Mar 18, 2021

Found what is wrong with autovacuum on my PoWA repository DB: powa_take_snapshot is constantly running, so autovacuum doesn't have enough time to complete between snapshots.
[screenshot: powa_snapshot]
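
A quick way to see those snapshot calls, assuming they show up as regular queries in pg_stat_activity (the text match on powa_take_snapshot is an assumption about how the collector invokes the function):

```sql
-- Sketch: list in-progress snapshot calls and how long they have been running.
SELECT pid,
       now() - xact_start AS running_for,
       query
FROM pg_stat_activity
WHERE query ILIKE '%powa_take_snapshot%'
  AND pid <> pg_backend_pid();
```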

@rjuju
Member

rjuju commented Mar 19, 2021

What do those values actually mean?

But even if there are always one or more powa_take_snapshot() calls running, that shouldn't prevent autovacuum from working. It only means you'll have an amount of bloat equal to the maximum number of snapshots performed between two autovacuum runs. It can be a bit high, but it should stay relatively constant, and I don't think it should represent 120 GB.
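
If the bloat did keep growing anyway, one possible mitigation (a sketch, not something settled in this thread) would be more aggressive per-table autovacuum settings:

```sql
-- Hypothetical tuning: trigger autovacuum on this table after roughly 1%
-- dead tuples instead of the default 20%.
ALTER TABLE powa_kcache_metrics_current
    SET (autovacuum_vacuum_scale_factor = 0.01);
```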

Note also that if you have pg_qualstats set up and a lot of databases and/or users and/or different normalized queries, the aggregation can be quite expensive. We have a fix for that (682b753) but it's not released yet. If you have pg_stat_statements enabled (and maybe a local powa) on your repository server, you could check whether you would benefit from this patch.
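
A possible way to check, assuming pg_stat_statements is installed on the repository server (total_exec_time is the column name on PostgreSQL 13+; older versions call it total_time):

```sql
-- Sketch: find the most expensive PoWA-related statements on the repository server.
SELECT calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       rows,
       query
FROM pg_stat_statements
WHERE query ILIKE '%powa%'
ORDER BY total_exec_time DESC
LIMIT 10;
```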

@banlex73
Contributor Author

banlex73 commented Mar 19, 2021 via email

@rjuju
Member

rjuju commented Aug 31, 2022

Hi @banlex73, any news on this issue?

@banlex73
Contributor Author

banlex73 commented Aug 31, 2022 via email

@rjuju
Member

rjuju commented Sep 1, 2022

Ok, thanks!

@banlex73
Contributor Author

All good now, thank you

@rjuju
Member

rjuju commented Oct 1, 2022

Thanks a lot for the confirmation!
