kernfs_memcg: Add helpers to gather memcgroup related data #96

Merged
merged 2 commits into from
Dec 13, 2024

@imran-kn imran-kn commented Aug 1, 2024

As of now, this is just a dump of some of my bespoke debug scripts.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified label (All contributors have signed the Oracle Contributor Agreement.) Aug 1, 2024
@imran-kn imran-kn changed the title from "DRAFT: kernfs_memcg: starting work." to "kernfs_memcg: Add helpers to gather memcgroup related data" Oct 21, 2024
@imran-kn
Contributor Author

> As of now, this is just a dump of some of my bespoke debug scripts.

I have added other helpers and modified the earlier ones so that they work with other UEK versions as well.

Member

@biger410 biger410 left a comment

Thanks for creating the memcg helpers. I added a couple of comments. Besides that, please create a bug for it and put the number in the git log.

imran-kn added a commit that referenced this pull request Nov 25, 2024
@imran-kn
Contributor Author

Thanks @biger410 for reviewing this. I have addressed your review comments. Could you please have a look and let me know if you have any further feedback?

imran-kn added a commit that referenced this pull request Nov 26, 2024
Orabug: 37322867
Signed-off-by: Imran Khan <[email protected]>
@biger410
Member

The new changes look good to me. One more request: can we add a corelens module for it? It should run with either the -M option or the -A option.

@imran-kn
Contributor Author

> The new changes look good to me. One more request: can we add a corelens module for it? It should run with either the -M option or the -A option.

I have added a corelens module.

python3 -m drgn_tools.corelens ~/Local-vmcores/Workqueue-study/UEK-6-rds-issue/VMCORE -d ~/Local-vmcores/Workqueue-study/UEK-6-rds-issue/

page: 0xffffc35c881b0000 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b0040 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b0080 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b00c0 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b0100 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b0140 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b0180 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b01c0 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip
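
When the module prints thousands of such lines, a per-cgroup summary is often easier to read than the raw dump. The following is a minimal sketch, not part of drgn_tools, that parses lines in the layout shown above and counts pages per cgroup. The names `LINE_RE`, `parse_line` and `pages_per_cgroup` are hypothetical, and the pattern assumes that the cgroup path and the state field contain no spaces.

```python
import re
from collections import Counter

# Assumed line layout, taken from the sample output above:
#   page: <hex address> cgroup: <cgroup path> state: <flags> path: <file path>
# Assumption: cgroup path and state contain no whitespace; the file path may.
LINE_RE = re.compile(
    r"^page: (?P<page>0x[0-9a-f]+) "
    r"cgroup: (?P<cgroup>\S+) "
    r"state: (?P<state>\S+) "
    r"path: (?P<path>.*)$"
)


def parse_line(line):
    """Return the fields of one output line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def pages_per_cgroup(lines):
    """Count how many reported pages belong to each cgroup, skipping other lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record["cgroup"]] += 1
    return counts
```

Feeding the captured output line by line into `pages_per_cgroup` gives a `Counter` keyed by cgroup path, which can then be sorted with `most_common()` to see which cgroups are pinned by the most pages.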

@biger410
Member

biger410 commented Nov 26, 2024

Is there a paste error? How could this trigger the corelens cmd?

python3 -m drgn_tools.corelens ~/Local-vmcores/Workqueue-study/UEK-6-rds-issue/VMCORE -d ~/Local-vmcores/Workqueue-study/UEK-6-rds-issue/

And with the default of 10000, how can a user run the corelens cmd to dump all pages?

@imran-kn
Contributor Author

imran-kn commented Nov 28, 2024

> Is there a paste error? How could this trigger the corelens cmd?
>
> python3 -m drgn_tools.corelens ~/Local-vmcores/Workqueue-study/UEK-6-rds-issue/VMCORE -d ~/Local-vmcores/Workqueue-study/UEK-6-rds-issue/
>
> And with the default of 10000, how can a user run the corelens cmd to dump all pages?

Yes, it was a copy-paste error and I missed the -M part. The actual command is:
python3 -m drgn_tools.corelens ~/Local-vmcores/Workqueue-study/UEK-6-rds-issue/VMCORE -d ~/Local-vmcores/Workqueue-study/UEK-6-rds-issue/ -M kernfs_memcg

and the output looks like this:

`imran@imran-metabox:~/oracle-samples-drgn-tools/drgn-tools$ python3 -m drgn_tools.corelens ~/Local-vmcores/Workqueue-study/UEK-6-rds-issue/VMCORE -d ~/Local-vmcores/Workqueue-study/UEK-6-rds-issue/ -M kernfs_memcg
page: 0xffffc35c881b0000 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b0040 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b0080 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b00c0 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b0100 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b0140 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b0180 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

page: 0xffffc35c881b01c0 cgroup: /user.slice/user-0.slice/session-7128.scope state: CSS_ONLINE|CSS_VISIBLE path: /datastore/oracle_230124_0745EST/diag/asm/cell/scaqat06celadm02/incpkg/pkg_76/seq_1/cell/IPSPKG_20230120060027_COM_1.zip

............................................................
`

Regarding the default page count of 10K: I am using this value so that we don't spend a lot of time collecting this data. We have seen that scanning all pages can take hours, so my idea is to get information for 10K pages first and, if that does not indicate anything conclusive, scan all pages later. Let me know if this sounds okay or whether a value other than 10K would be more acceptable.

@biger410
Member

biger410 commented Dec 3, 2024

The default value looks good to me. To dump all pages, just pass the option "-m 0". I think the cmd help should make the default value, and how to dump all pages, clear to the user.
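
The "0 means no limit" convention discussed here can be sketched with `itertools.islice`. This is only an illustration of the idea, not the code in the PR; `limit_items` is a hypothetical helper name.

```python
from itertools import islice


def limit_items(items, max_items):
    """Return an iterator over at most max_items entries; 0 means no limit."""
    if max_items < 0:
        raise ValueError("max_items must be >= 0")
    if max_items == 0:
        # No cap requested: walk the whole sequence.
        return iter(items)
    # islice stops pulling from the source once the cap is reached, so an
    # expensive scan behind `items` ends early instead of running for hours.
    return islice(items, max_items)
```

Because the cap is applied lazily, a generator that walks pages one at a time is only advanced as far as the limit allows, which is what keeps the default run short.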

@imran-kn imran-kn force-pushed the kernfs_memcg_work branch 2 times, most recently from ada1a81 to fe8d6ee Compare December 4, 2024 05:31
@imran-kn
Contributor Author

imran-kn commented Dec 4, 2024

> The default value looks good to me. To dump all pages, just pass the option "-m 0". I think the cmd help should make the default value, and how to dump all pages, clear to the user.

I have modified the cmd help to provide this information, as per your suggestion. The cmd help now looks like this:
`python3 -m drgn_tools.corelens -M kernfs_memcg --help
usage: kernfs_memcg [-h] [--max MAX]

Print information related to pages, that are pinning memcgroup(s)

options:
  -h, --help         show this help message and exit
  --max MAX, -m MAX  Maximum number of pages to show. By default first 10000 such pages are shown. Use 0 to list all such pages.`

Please let me know if it looks okay.
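
For readers unfamiliar with how such help text is produced, here is a minimal standalone `argparse` sketch that yields an equivalent usage line and option description. It is an assumption about the shape of the option definition, not the module's actual code; `build_parser` is a hypothetical name, and the default of 10000 reflects the help text quoted above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the help text shown above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="kernfs_memcg",
        description="Print information related to pages, that are pinning memcgroup(s)",
    )
    parser.add_argument(
        "--max",
        "-m",
        type=int,
        default=10000,  # assumed default, taken from the quoted help text
        help="Maximum number of pages to show. By default first 10000 such "
        "pages are shown. Use 0 to list all such pages.",
    )
    return parser
```

With this definition, `build_parser().parse_args(["-m", "0"]).max` evaluates to `0`, which the caller can interpret as "no limit".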

@biger410
Member

biger410 commented Dec 4, 2024

Looks good to me. @brenns10 do you want to take a look as well?

Member

@brenns10 brenns10 left a comment

Thanks, these are looking really good. I have a few small comments; the only major thing here is the issue with PG_slab.

imran-kn added a commit that referenced this pull request Dec 12, 2024
Orabug: 37322867
Signed-off-by: Imran Khan <[email protected]>
@imran-kn
Contributor Author

@brenns10 I have addressed your review comments. Could you please have a look once more?

Member

@biger410 biger410 left a comment

Can you reorganize the patch so that the review feedback is folded into the existing commits instead of being added as new commits? Once the patch is reorganized, git push --force-with-lease can overwrite the previous commits.

 class PagesPinningMemcgroups(CorelensModule):
     """Print information related to pages, that are pinning memcgroup(s)"""

-    name = "kernfs_memcg"
+    name = "pages-pinning-memcg"
     run_when = "never"
Member

It makes sense for this to run only when the user requests it explicitly. Should we make the default behavior dump all pages pinning zombie cgroups? That looks like the more common case to me when troubleshooting zombie cgroup issues.

Contributor Author

Makes sense. I have modified it to dump all pages by default.


    def run(self, prog: Program, args: argparse.Namespace) -> None:
        get_num_mem_cgroups(prog)

Member

Can we make this more generic, like the output of cat /proc/cgroups, so that it supports not only the memory cgroup but other cgroups as well? It doesn't have to be included in this pull request; you can start a new one for it if you want.

Contributor Author

Sure. I can include this change in a different PR. I included the memcgroup numbers here because we often run into issues due to zombie memcgroups, so having their number readily available will help us decide whether we need to run the PagesPinningMemcgroups module.
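
As a reference for the follow-up PR, the layout of /proc/cgroups is a header line followed by one tab-separated row per subsystem (name, hierarchy ID, number of cgroups, enabled flag). The sketch below formats already-collected per-subsystem data in that layout. `SubsysRow` and `format_proc_cgroups` are hypothetical names, and how the values would be gathered from a vmcore is left out entirely.

```python
from typing import Iterable, NamedTuple


class SubsysRow(NamedTuple):
    """One row of a /proc/cgroups style table (illustrative container)."""

    name: str
    hierarchy: int
    num_cgroups: int
    enabled: bool


def format_proc_cgroups(rows: Iterable[SubsysRow]) -> str:
    """Render rows in the tab-separated layout used by /proc/cgroups."""
    lines = ["#subsys_name\thierarchy\tnum_cgroups\tenabled"]
    for row in rows:
        lines.append(
            f"{row.name}\t{row.hierarchy}\t{row.num_cgroups}\t{int(row.enabled)}"
        )
    return "\n".join(lines)
```

Keeping the same columns as the live file would let the output from a vmcore be compared directly with what an administrator sees on a running system.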

Add kernfs based helpers to extract memcg related information,
like number of active and inactive memcgroups, page cache pages
pinning memcgroups etc.

Orabug: 37322867
Signed-off-by: Imran Khan <[email protected]>
Orabug: 37322867
Signed-off-by: Imran Khan <[email protected]>
@imran-kn
Contributor Author

@biger410, @brenns10, I have addressed your review comments. Could you please have a look?

Member

@biger410 biger410 left a comment

Looks good to me. Thanks @imran-kn

@imran-kn imran-kn merged commit f395a8b into main Dec 13, 2024
4 checks passed