
Memory leak #2020

Closed · vladiksun opened this issue Aug 8, 2024 · 11 comments

Labels: bug (Something isn't working), Stale (Stale issues/PRs)

Comments

vladiksun commented Aug 8, 2024

Describe the bug

Potential memory leak

How are you accessing AGE (Command line, driver, etc.)?

  • JDBC

What data setup do we need to do?

select ag_catalog.create_graph('test_graph');

--- create test labels ---
select ag_catalog.create_vlabel('test_graph','domain');
select ag_catalog.create_vlabel('test_graph','user');

--- create test relation ---
select ag_catalog.create_elabel('test_graph','edges');

--- generate test data, should take about 2 minutes ---
DO
$do$
    BEGIN
        FOR i IN 1..8000 LOOP
                EXECUTE format('
                select * from ag_catalog.cypher(''test_graph'',
                $$
                    CREATE (Org_level_1_%s: domain {id:"Org_level_1_%s", type:"Organization", name: "Org_level_1_%s some real name that could looooooooooooooong" ,hidden:false })
                    CREATE (Org_level_2_%s: domain {id:"Org_level_2_%s", type:"Organization", name: "Org_level_2_%s some real name that could looooooooooooooong" ,hidden:false })
                    CREATE (Org_level_3_%s: domain {id:"Org_level_3_%s", type:"Organization", name: "Org_level_3_%s some real name that could looooooooooooooong" ,hidden:false })
                    CREATE (Org_level_4_%s: domain {id:"Org_level_4_%s", type:"Organization", name: "Org_level_4_%s some real name that could looooooooooooooong" ,hidden:false })
                    CREATE (Dep_level_5_%s: domain {id:"Dep_level_5_%s", type:"Department", name: "Dep_level_5_%s some real name that could looooooooooooooong" ,hidden:false })

                    CREATE (User_%s: user { id:"User_%s", type:"User", name: "user_%s" ,hidden:false })

                    CREATE (User_%s)-[_rel_user:edges { from: ''User_%s'', to: ''Dep_level_5_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''user'', relation: ''user-domain'' }]->(Dep_level_5_%s)

                    CREATE (Dep_level_5_%s)-[_rel1:edges { from: ''Dep_level_5_%s'', to: ''Org_level_4_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_4_%s)
                    CREATE (Org_level_4_%s)-[_rel2:edges { from: ''Org_level_4_%s'', to: ''Org_level_3_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_3_%s)
                    CREATE (Org_level_3_%s)-[_rel3:edges { from: ''Org_level_3_%s'', to: ''Org_level_2_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_2_%s)
                    CREATE (Org_level_2_%s)-[_rel4:edges { from: ''Org_level_2_%s'', to: ''Org_level_1_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_1_%s)
                $$
            ) as (any_vertex ag_catalog.agtype)
            ', VARIADIC ARRAY(SELECT array_fill('iter_' || i::text, '{100}')) );
            END LOOP;
    END
$do$;

--- create test label GIN index ---
CREATE INDEX domain_gin_idx ON test_graph.domain USING GIN (properties);
CREATE INDEX user_gin_idx ON test_graph.user USING GIN (properties);

CREATE INDEX edges_gin_idx ON test_graph.edges USING GIN (properties);

What is the command that caused the error?

-- run all three scripts below in sequence multiple times and observe DB memory usage

-- create some data
select * from ag_catalog.cypher('test_graph', $$
    CREATE (d: domain { id: "test1", type: "Organization", name: "test1" , hidden: false } )
    RETURN d $$
) as (vertex agtype);

-- delete some data
select * from ag_catalog.cypher('test_graph',
                                $$
    MATCH (any_vertex: domain { id: "test1" } )
    DELETE any_vertex
    RETURN any_vertex
$$) as (catalog ag_catalog.agtype);

-- run query
select
    domain
from
    ag_catalog.cypher('test_graph',$$

        MATCH (user_vertex: user { hidden: false, id: 'User_iter_100'})
            -[e1:edges  {hidden: false, fromCollection: 'user', toCollection: 'domain', relation: 'user-domain'}]->(to1: domain {hidden: false})
            -[e2:edges*0..10 {hidden: false, fromCollection: 'domain', toCollection: 'domain', relation: 'parent'}]->(to2: domain {hidden: false})

        RETURN to2
$$) as domain(domain ag_catalog.agtype);

The database went into recovery mode, which showed up as an I/O error on the client side.

Expected behavior

No server crash occurs.

Environment (please complete the following information):

  • Postgres 14 + Apache AGE 1.5.0

Additional context

It looks like any Cypher script's result is being cached after the first run.
The second run and all subsequent runs are faster unless any data has been created/updated/deleted.
If a create/update/delete happens, the script executes slower again.
This leads to increased DB memory consumption that is never released.

For our real data (about 20 vertex labels with different connectedness, over 3 million edges), a similar script such as the one below takes around 5 seconds on the first run and around 300 ms on subsequent runs, until a create/update/delete happens.

select
    domain
from
    ag_catalog.cypher('test_graph',$$

        MATCH (user_vertex: user { hidden: false, id: 'User_iter_100'})
            -[e1:edges  {hidden: false, fromCollection: 'user', toCollection: 'domain', relation: 'user-domain'}]->(to1: domain {hidden: false})
            -[e2:edges*0..10 {hidden: false, fromCollection: 'domain', toCollection: 'domain', relation: 'parent'}]->(to2: domain {hidden: false})

        RETURN to2
$$) as domain(domain ag_catalog.agtype);
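
As a rough way to see the cold-versus-cached difference from plain SQL, the statement can be timed with clock_timestamp() inside a DO block and run twice in the same session. This is only a sketch: it assumes cypher() can be called from PL/pgSQL the same way the data-generation script above does, and it simply reuses the query shown above.

DO
$do$
    DECLARE
        t0 timestamptz := clock_timestamp();
    BEGIN
        -- same hierarchical query as above; the rows are discarded
        PERFORM domain
        FROM ag_catalog.cypher('test_graph', $$
            MATCH (user_vertex: user { hidden: false, id: 'User_iter_100'})
                -[e1:edges  {hidden: false, fromCollection: 'user', toCollection: 'domain', relation: 'user-domain'}]->(to1: domain {hidden: false})
                -[e2:edges*0..10 {hidden: false, fromCollection: 'domain', toCollection: 'domain', relation: 'parent'}]->(to2: domain {hidden: false})
            RETURN to2
        $$) as domain(domain ag_catalog.agtype);

        -- the first run reportedly takes seconds, later runs a few hundred ms
        RAISE NOTICE 'elapsed: %', clock_timestamp() - t0;
    END
$do$;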

We used a Docker container with at most 512 MB of RAM available to it to reproduce this behaviour with the provided test case.

Please take a look at the attached screenshot.
[screenshot: memory_leak]

vladiksun added the bug label on Aug 8, 2024
jrgemignani (Contributor) commented Aug 8, 2024

@vladiksun Yes, the VLE component does cache data for multiple contexts (graphs). It will only free them when:

  1. The transaction id changes (updates).
  2. The maximum number of cached contexts is reached and a new one is created; in that case one is removed (it is an LRU cache, if I recall).
  3. The delete_global_graphs function is called to purge unused graphs (see the sketch after this comment).
  4. The session exits.

But, it shouldn't crash. I will take a look.
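
For reference, item 3 can be exercised directly from SQL. The snippet below is only a sketch: it assumes the purge function is exposed as ag_catalog.age_delete_global_graphs and that it accepts either an agtype graph name or NULL to drop every cached graph context, so check the function name and signature shipped with your AGE version.

-- purge the cached global context for one graph (name passed as agtype)
SELECT * FROM ag_catalog.age_delete_global_graphs('"test_graph"'::ag_catalog.agtype);

-- purge all cached global graph contexts held by this session
SELECT * FROM ag_catalog.age_delete_global_graphs(NULL);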

jrgemignani (Contributor) commented Aug 8, 2024

@vladiksun I have not been able to reproduce a crash, but will keep trying. However, I am using the latest master branch (PG16), and there have been fixes to the VLE caching mechanism since version 1.5.0 to address crashes. If you would like to test against the latest, unsupported image, try dev_snapshot_<version>. For you, that would be dev_snapshot_PG14.
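
Before switching images, it can also help to confirm which AGE version is actually installed. A minimal check using only standard PostgreSQL catalogs (nothing AGE-specific is assumed here):

-- server version and the installed AGE extension version
SELECT version();

SELECT extname, extversion
FROM pg_extension
WHERE extname = 'age';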

jrgemignani (Contributor) commented:

@vladiksun I do see the memory leak you are talking about and have narrowed it down to a few commands.

jrgemignani (Contributor) commented:

@vladiksun PR #2028 should address most of this memory leak, although there might be more, subtler leaks.

vladiksun (Author) commented:

@jrgemignani thanks for taking a look. We will wait until it is ready then.

jrgemignani (Contributor) commented:

@vladiksun It is in the master branch right now. I'll be making PRs for the other branches today.

jrgemignani (Contributor) commented Aug 9, 2024

@vladiksun This fix is now in all branches. If you want to test the latest for the branch you are using, try dev_snapshot_<version>. For example, dev_snapshot_PG14.

vladiksun (Author) commented Aug 19, 2024

@jrgemignani I checked the fix. It works within one session. As I understand it, with more sessions there would be more memory consumption to keep the cache. For example, using 100 connections via pg_bouncer means we would run out of memory pretty soon. Is that right? In our case we have approx. 40 thousand vertices with hierarchical relations, so a typical hierarchical query takes about 5 seconds before the cache is warm, which is too long anyway.

I am curious why delete_global_graphs does not clear all the cache.
[screenshot: cypher_memory_consumption]
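
To sanity-check how much memory each pooled backend is actually holding, here is a sketch that assumes PostgreSQL 14+ and a superuser role; pg_log_backend_memory_contexts writes the target backend's memory-context breakdown into the server log rather than returning it as rows:

-- list the pooled backends connected to this database
SELECT pid, usename, application_name, state
FROM pg_stat_activity
WHERE datname = current_database()
  AND pid <> pg_backend_pid();

-- ask each of those backends to dump its memory contexts to the server log
SELECT pg_log_backend_memory_contexts(pid)
FROM pg_stat_activity
WHERE datname = current_database()
  AND pid <> pg_backend_pid();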

jrgemignani (Contributor) commented Aug 19, 2024

@vladiksun There are some other memory leaks, for example #2046, which I am currently working on, that could be contributing factors. That particular one highlights quite a few other areas that need corrective action as well. This week I will be focusing on getting patches out to deal with those.

I am hesitant to suggest that this will, or will not, fix memory issues, due to the complexity of PostgreSQL's memory system (contexts and caches). However, I will note that in my debugging of these memory leaks, I have found that PostgreSQL doesn't like to give back memory even after it is freed. It seems that once it gets it from the system, it may hold onto it for a while, just in case.

Also, keep in mind that PostgreSQL itself will cache items if it feels caching will improve performance.
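
One way to watch this from inside a session is the pg_backend_memory_contexts view (a sketch, assuming PostgreSQL 14+ where the view exists): run it before and after the Cypher workload. Memory that has been freed tends to show up as free_bytes inside a context rather than being returned to the operating system, which matches the behaviour described above.

-- largest memory contexts of the current backend
SELECT name, parent, total_bytes, used_bytes, free_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 15;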

github-actions bot commented:

This issue is stale because it has been open 60 days with no activity. Remove the "Abondoned" label or comment, or this will be closed in 14 days.

github-actions bot added the Stale label on Oct 19, 2024

github-actions bot commented Nov 2, 2024

This issue was closed because it has been stalled for a further 14 days with no activity.

github-actions bot closed this as not planned on Nov 2, 2024