Regarding single copy via xpmem #9358

Open
arunedarath opened this issue Sep 11, 2023 · 3 comments

arunedarath commented Sep 11, 2023

Hi All,

I am using the attached program send_recv_pgm.txt to understand the data transfer with xpmem.
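
Roughly, the program does the following (a sketch, not the exact attachment; the 262144-byte count and tag 4660/0x1234 match the backtraces below, everything else is illustrative):

/* Sketch of the test: rank0 sends a 256 KiB char buffer to rank1 with a
 * plain MPI_Send/MPI_Recv, which UCX services over the xpmem transport. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define COUNT 262144              /* matches count=262144 in the backtraces */
#define TAG   0x1234              /* matches tag=4660 in the backtraces */

int main(int argc, char **argv)
{
    int rank;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(COUNT);

    if (rank == 0) {
        memset(buf, 'a', COUNT);
        MPI_Send(buf, COUNT, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, COUNT, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}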

  1. Backtrace of rank0:
    $6 mca_pml_ucx_send (buf=0x7f3330176000, count=262144, datatype=0x6020a0 <ompi_mpi_char>, dst=1, tag=4660,
    mode=MCA_PML_BASE_SEND_STANDARD, comm=0x6022a0 <ompi_mpi_comm_world>) at pml_ucx.c:944
    $7 0x00007f3336b453be in PMPI_Send (buf=0x7f3330176000, count=262144, type=0x6020a0 <ompi_mpi_char>, dest=1, tag=4660,
    comm=0x6022a0 <ompi_mpi_comm_world>) at psend.c:81
    $8 0x0000000000400c9c in main (argc=4, argv=0x7ffebda2d6f8) at send_recv.c:63

  2. Backtrace of rank1, just before it performs the memcpy from the xpmem-mapped area:
    $2 0x00007f670b89988d in ucs_memcpy_relaxed (dst=0x7f6710126000, src=0x7f67100e5000, len=262144) at /home/arun/openmpi_work/ucx/src/ucs/arch/x86_64/cpu.h:115
    $3 0x00007f670b89a981 in ucp_memcpy_pack_unpack (name=0x7f670b9a8a29 "memcpy_unpack", length=262144, data=0x7f67100e5000, buffer=0x7f6710126000) at /home/arun/openmpi_work/ucx/src/ucp/dt/dt.h:74
    $4 ucp_dt_contig_unpack (mem_type=UCS_MEMORY_TYPE_HOST, length=262144, src=0x7f67100e5000, dest=0x7f6710126000, worker=0xe98100) at /home/arun/openmpi_work/ucx/src/ucp/dt/dt_contig.h:55
    $5 ucp_datatype_iter_unpack (src=0x7f67100e5000, offset=0, length=262144, worker=0xe98100, dt_iter=0xee1218) at /home/arun/openmpi_work/ucx/src/ucp/dt/datatype_iter.inl:452
    $6 ucp_proto_rndv_progress_rkey_ptr (arg=0xe98100) at rndv/rndv_rkey_ptr.c:139
    $7 0x00007f670b2a4af6 in ucs_callbackq_spill_elems_dispatch (cbq=0xe951a0) at datastruct/callbackq.c:383
    $8 0x00007f670b2a4f6f in ucs_callbackq_proxy_callback (arg=0xe951a0) at datastruct/callbackq.c:479
    $9 0x00007f670b7e8436 in ucs_callbackq_dispatch (cbq=0xe951a0) at /home/arun/openmpi_work/ucx/src/ucs/datastruct/callbackq.h:215
    $10 0x00007f670b7f5c8b in uct_worker_progress (worker=0xe951a0) at /home/arun/openmpi_work/ucx/src/uct/api/uct.h:2778
    $11 ucp_worker_progress (worker=0xe98100) at core/ucp_worker.c:2940
    $12 0x00007f670bdf439b in mca_pml_ucx_recv (buf=0x7f6710126000, count=262144, datatype=0x6020a0 <ompi_mpi_char>, src=0, tag=4660, comm=0x6022a0 <ompi_mpi_comm_world>, mpi_status=0x0) at pml_ucx.c:646
    $13 0x00007f67199c60c5 in PMPI_Recv (buf=0x7f6710126000, count=262144, type=0x6020a0 <ompi_mpi_char>, source=0, tag=4660, comm=0x6022a0 <ompi_mpi_comm_world>, status=0x0) at precv.c:82
    $14 0x0000000000400cd1 in main (argc=4, argv=0x7ffe3a29e058) at send_recv.c:65

  3. This means the buffer from rank0 at 0x7f3330176000 is exposed to rank1 at 0x7f67100e5000, and rank1 performs the memcpy from this address.
    The printks below from xpmem confirm this; a rough sketch of the xpmem calls behind such a mapping follows them:

[264733.921190] [1512524]xpmem_fault_handler: vaddr = 7f67100e5000, seg_vaddr = 7f3330176000
[264733.921205] [1512524]xpmem_fault_handler: calling remap_pfn_range() vaddr=7f67100e5000, pfn=2019fa
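
As far as I understand, UCX drives the generic libxpmem user-space API to create this mapping. A rough sketch of the equivalent calls (variable names and the out-of-band exchange of the segment id are illustrative; UCX does the equivalent internally in its xpmem transport):

/* Sketch of the xpmem single-copy flow (illustrative; link with -lxpmem). */
#include <xpmem.h>
#include <string.h>

/* Sender side (rank0): expose the send buffer as an xpmem segment.
 * The returned segid travels to the receiver out of band (UCX carries it
 * in its rendezvous control message). */
xpmem_segid_t expose_buffer(void *send_buf, size_t size)
{
    return xpmem_make(send_buf, size, XPMEM_PERMIT_MODE, (void *)0666);
}

/* Receiver side (rank1): attach the sender's buffer and copy it once. */
int pull_buffer(xpmem_segid_t segid, size_t size, void *recv_buf)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
    if (apid < 0)
        return -1;

    struct xpmem_addr xaddr = { .apid = apid, .offset = 0 };

    /* xpmem_attach() creates the local mapping (0x7f67100e5000 above); the
     * first touch faults into xpmem_fault_handler() in the driver, which
     * remaps the sender's physical page (pfn 2019fa) into this process. */
    void *src = xpmem_attach(xaddr, size, NULL);
    if (src == (void *)-1)
        return -1;

    memcpy(recv_buf, src, size);   /* the single copy seen in the rank1 backtrace */

    xpmem_detach(src);
    xpmem_release(apid);
    return 0;
}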

  4. We can see that both rank0 and rank1 point to the same pfn = 2019fa (different virtual addresses, but the same physical page).
    This can be verified with the Linux kernel utility program "linux/tools/vm/page-types.c":
    a) For rank0

    ./page-types -p 1512523 -l | grep 7f3330176

    7f3330176       2019fa  1       __RU_l_____Ma_b____________________________
    

But for rank1 it does not show a physical page for the address "7f67100e5000":

./page-types -p 1512524 -l | grep 7f67100e5 ---> No output..

I wonder why, on rank1, the utility "linux/tools/vm/page-types.c" cannot show the physical page.
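
For reference, page-types gets the PFN by reading /proc/<pid>/pagemap, so the same lookup can be done directly for a single address with something like the sketch below (a hypothetical helper, not part of the kernel tools; it needs root, since the PFN field is zeroed without CAP_SYS_ADMIN):

/* vaddr2pfn.c: translate one virtual address of another process to a PFN
 * via /proc/<pid>/pagemap (the interface page-types uses).
 * Usage (as root):  ./vaddr2pfn 1512524 7f67100e5000 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <hex-vaddr>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t)atol(argv[1]);
    uint64_t vaddr = strtoull(argv[2], NULL, 16);
    long page_size = sysconf(_SC_PAGESIZE);

    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* pagemap holds one 64-bit entry per virtual page of the process. */
    uint64_t entry;
    off_t offset = (off_t)(vaddr / page_size) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
        perror("pread");
        return 1;
    }

    /* Bit 63 = page present; bits 0-54 = PFN (zero without CAP_SYS_ADMIN). */
    if (entry & (1ULL << 63))
        printf("vaddr 0x%" PRIx64 " -> pfn %" PRIx64 "\n",
               vaddr, entry & ((1ULL << 55) - 1));
    else
        printf("vaddr 0x%" PRIx64 ": not reported as present in pagemap\n", vaddr);

    close(fd);
    return 0;
}

If pagemap itself does not report an entry for the xpmem-attached range, this helper should show the same "not present" result that the empty page-types output implies.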

Also, gdb cannot read from this area in rank1:
(gdb) x/100x 0x7f67100e5000
0x7f67100e5000: Cannot access memory at address 0x7f67100e5000

Why does this happen? Can anyone explain this behavior?

--Arun

yosefe (Contributor) commented Sep 11, 2023

@arunedarath Probably the xpmem driver does not fully implement what is needed for page-types and gdb to see the attached memory mappings

arunedarath (Author) commented

@yosefe I am using xpmem from https://github.com/hpc/xpmem. I can see another version from ucx https://github.com/openucx/xpmem. Does the version from ucx have this capability?

yosefe (Contributor) commented Sep 11, 2023

> @yosefe I am using xpmem from https://github.com/hpc/xpmem. I can see another version from ucx https://github.com/openucx/xpmem. Does the version from ucx have this capability?

AFAIK, no
