Hi All,
I am using the program send_recv_pgm.txt to understand the data transfer with xpmem.
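For reference, here is roughly what the program does, reconstructed from the backtraces below (262144 chars, tag 4660, MPI_Send on rank0 at send_recv.c:63, MPI_Recv with MPI_STATUS_IGNORE on rank1 at send_recv.c:65). The real program apparently takes command-line arguments (argc=4 in the backtraces); this sketch hardcodes the values, and the attached send_recv_pgm.txt is the authoritative version.

/* Minimal sketch reconstructed from the backtraces; not the exact
 * attached program. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define COUNT 262144   /* matches count=262144 in the backtraces */
#define TAG   4660     /* matches tag=4660 (0x1234) */

int main(int argc, char **argv)
{
    int rank;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(COUNT);
    if (rank == 0) {
        memset(buf, 'a', COUNT);
        MPI_Send(buf, COUNT, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* status=0x0 in the backtrace corresponds to MPI_STATUS_IGNORE */
        MPI_Recv(buf, COUNT, MPI_CHAR, 0, TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank1 received %d bytes\n", COUNT);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}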
Backtrace of rank0:
#6 mca_pml_ucx_send (buf=0x7f3330176000, count=262144, datatype=0x6020a0 <ompi_mpi_char>, dst=1, tag=4660,
mode=MCA_PML_BASE_SEND_STANDARD, comm=0x6022a0 <ompi_mpi_comm_world>) at pml_ucx.c:944
#7 0x00007f3336b453be in PMPI_Send (buf=0x7f3330176000, count=262144, type=0x6020a0 <ompi_mpi_char>, dest=1, tag=4660,
comm=0x6022a0 <ompi_mpi_comm_world>) at psend.c:81
#8 0x0000000000400c9c in main (argc=4, argv=0x7ffebda2d6f8) at send_recv.c:63
Backtrace of rank1, just before it performs a memcpy from the xpmem-mapped area:
#2 0x00007f670b89988d in ucs_memcpy_relaxed (dst=0x7f6710126000, src=0x7f67100e5000, len=262144) at /home/arun/openmpi_work/ucx/src/ucs/arch/x86_64/cpu.h:115
#3 0x00007f670b89a981 in ucp_memcpy_pack_unpack (name=0x7f670b9a8a29 "memcpy_unpack", length=262144, data=0x7f67100e5000, buffer=0x7f6710126000) at /home/arun/openmpi_work/ucx/src/ucp/dt/dt.h:74
#4 ucp_dt_contig_unpack (mem_type=UCS_MEMORY_TYPE_HOST, length=262144, src=0x7f67100e5000, dest=0x7f6710126000, worker=0xe98100) at /home/arun/openmpi_work/ucx/src/ucp/dt/dt_contig.h:55
#5 ucp_datatype_iter_unpack (src=0x7f67100e5000, offset=0, length=262144, worker=0xe98100, dt_iter=0xee1218) at /home/arun/openmpi_work/ucx/src/ucp/dt/datatype_iter.inl:452
#6 ucp_proto_rndv_progress_rkey_ptr (arg=0xe98100) at rndv/rndv_rkey_ptr.c:139
#7 0x00007f670b2a4af6 in ucs_callbackq_spill_elems_dispatch (cbq=0xe951a0) at datastruct/callbackq.c:383
#8 0x00007f670b2a4f6f in ucs_callbackq_proxy_callback (arg=0xe951a0) at datastruct/callbackq.c:479
#9 0x00007f670b7e8436 in ucs_callbackq_dispatch (cbq=0xe951a0) at /home/arun/openmpi_work/ucx/src/ucs/datastruct/callbackq.h:215
#10 0x00007f670b7f5c8b in uct_worker_progress (worker=0xe951a0) at /home/arun/openmpi_work/ucx/src/uct/api/uct.h:2778
#11 ucp_worker_progress (worker=0xe98100) at core/ucp_worker.c:2940
#12 0x00007f670bdf439b in mca_pml_ucx_recv (buf=0x7f6710126000, count=262144, datatype=0x6020a0 <ompi_mpi_char>, src=0, tag=4660, comm=0x6022a0 <ompi_mpi_comm_world>, mpi_status=0x0) at pml_ucx.c:646
#13 0x00007f67199c60c5 in PMPI_Recv (buf=0x7f6710126000, count=262144, type=0x6020a0 <ompi_mpi_char>, source=0, tag=4660, comm=0x6022a0 <ompi_mpi_comm_world>, status=0x0) at precv.c:82
#14 0x0000000000400cd1 in main (argc=4, argv=0x7ffe3a29e058) at send_recv.c:65
This means the buffer of rank0 at 0x7f3330176000 is exposed to rank1 at 0x7f67100e5000, and rank1 performs a memcpy from this address.
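For context, below is a minimal sketch of the user-space xpmem API that makes this kind of cross-process mapping possible. It is a single program that forks, so the segid handshake (which UCX performs over its own channels between the two ranks) is trivial here; error handling is omitted, and it needs the xpmem kernel module plus -lxpmem to build.

/* Owner exposes a buffer; attacher maps it and reads it directly. */
#include <xpmem.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    size_t len = 262144;
    void *buf;
    posix_memalign(&buf, 4096, len);   /* xpmem works at page granularity */
    memset(buf, 'a', len);

    /* Owner side (rank0's role): expose [buf, buf+len). */
    xpmem_segid_t segid = xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *)0666);

    if (fork() == 0) {
        /* Attacher side (rank1's role): map the owner's pages. */
        xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, NULL);
        struct xpmem_addr addr = { .apid = apid, .offset = 0 };
        char *mapped = xpmem_attach(addr, len, NULL);

        /* This is the same kind of mapping ucs_memcpy_relaxed() reads from. */
        printf("attached at %p, first byte '%c'\n", (void *)mapped, mapped[0]);

        xpmem_detach(mapped);
        xpmem_release(apid);
        _exit(0);
    }

    wait(NULL);
    xpmem_remove(segid);
    free(buf);
    return 0;
}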
The printks below from xpmem prove that the mapping is established this way:
[264733.921190] [1512524]xpmem_fault_handler: vaddr = 7f67100e5000, seg_vaddr = 7f3330176000
[264733.921205] [1512524]xpmem_fault_handler: calling remap_pfn_range() vaddr=7f67100e5000, pfn=2019fa
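For illustration only, here is a hypothetical, heavily simplified fault handler of the shape these printks suggest. This is NOT the actual xpmem source and is not a standalone module; the pfn lookup in the owner's page tables is reduced to the constant from the log above.

#include <linux/mm.h>

static vm_fault_t xpmem_like_fault(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma;
    unsigned long pfn = 0x2019fa;  /* really: resolved from the owner's page tables */

    if (remap_pfn_range(vma, vmf->address & PAGE_MASK, pfn,
                        PAGE_SIZE, vma->vm_page_prot))
        return VM_FAULT_SIGBUS;

    /* remap_pfn_range() installs the PTE itself and flags the VMA
     * VM_PFNMAP | VM_IO, so we report that no struct page is involved. */
    return VM_FAULT_NOPAGE;
}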
We can see that both rank0 and rank1 point to the same pfn = 2019fa (different virtual addresses, but the same physical page).
This can be verified with the Linux kernel utility program "linux/tools/vm/page-types.c":
a) For rank0:
./page-types -p 1512523 -l | grep 7f3330176
b) For rank1, however, it does not show the physical page for the address 7f67100e5000:
./page-types -p 1512524 -l | grep 7f67100e5 ---> no output
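For reference, page-types gets the pfn from /proc/<pid>/pagemap; a minimal standalone equivalent for a single address looks roughly like this (run as root, since unprivileged readers see the PFN field zeroed since Linux 4.0):

/* Usage: ./pagemap <pid> <hex-vaddr>
 * Pagemap format: bit 63 = page present, bits 0-54 = pfn. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    char path[64];
    uint64_t entry;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <hex-vaddr>\n", argv[0]);
        return 1;
    }
    unsigned long vaddr = strtoul(argv[2], NULL, 16);
    long pagesize = sysconf(_SC_PAGESIZE);

    snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
    int fd = open(path, O_RDONLY);
    if (fd < 0 || pread(fd, &entry, 8, (vaddr / pagesize) * 8) != 8) {
        perror("pagemap");
        return 1;
    }
    if (entry & (1ULL << 63))
        printf("pfn = %llx\n", (unsigned long long)(entry & ((1ULL << 55) - 1)));
    else
        printf("page not present (entry = %llx)\n", (unsigned long long)entry);
    close(fd);
    return 0;
}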
I wonder why the utility "linux/tools/vm/page-types.c" cannot show the physical page on rank1.
Also, gdb cannot read from this area in rank1:
(gdb) x/100x 0x7f67100e5000
0x7f67100e5000: Cannot access memory at address 0x7f67100e5000
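For what it's worth, the same failure can presumably be reproduced outside gdb with process_vm_readv(2), which reads another process's memory in a similar (though not identical) way to gdb's ptrace-based inspection. The PID and address below are the ones from this report:

/* Try to read rank1's xpmem-attached area from a third process. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/uio.h>

int main(void)
{
    char buf[4096];
    struct iovec local  = { .iov_base = buf, .iov_len = sizeof(buf) };
    struct iovec remote = { .iov_base = (void *)0x7f67100e5000,
                            .iov_len  = sizeof(buf) };

    ssize_t n = process_vm_readv(1512524, &local, 1, &remote, 1, 0);
    if (n < 0)
        printf("process_vm_readv: %s\n", strerror(errno)); /* expected to fail,
                                                              per the gdb output above */
    else
        printf("read %zd bytes\n", n);
    return 0;
}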
Why does it behave this way? Can anyone explain?
--Arun