Prov: gni fi_endpoint returned 16 (Device or resource busy) when creating 70 endpoints #1351
Comments
@tonyzinger What system were you running this test on? These are really messages from uGNI.
@tonyzinger Can you attach a test case?
I set the ep_cnt value on the assumption, based on input from the rest of the group, that libfabric imposes no limit. Since Tony is seeing these errors, either there is actually some sort of limit in uGNI or we are exposing a bug in the gni provider.
@hppritcha I was doing this testing on Jupiter using the daily build of libfabric.
On Jupiter, in the directory:
This test will create 3 ranks, with each rank creating 64 (-T) pthreads. Here is the debug output for one of the pthreads that encountered the "fi_endpoint returned 16 (Device or resource busy)" error.
ajz@jupiter libfabric-1351 $ cat RMA_gni_rank_0_pth_48_tid_23112.log
[nid00008, Rank: 0, Pthread: 48] CONFIG: Executing on CPU: 0, requested CPU: 0
[nid00008, Rank: 0, Pthread: 48] INFO: [prov: 0] fabric(gni), prov_name: 'gni', prov version: 0x00010001, api version: 0x00010004
[nid00008, Rank: 0, Pthread: 48] CONFIG: fi_info: caps = 0x0100000000312204
Could you add --cpus-per-task=32 to the srun command line and see if that helps?
Sorry, I did not see this request. Adding --cpus-per-task=32 did NOT help. This test was executed on tiger.
While executing a test that creates 70 endpoints on a node, fi_endpoint returns a "Device or resource busy" error. The test creates 70 pthreads, and each pthread creates its own endpoint, so the test attempts to create about 70 endpoints (I believe the maximum number of endpoints that I can create is 63).
fi_getinfo() reports the domain attribute ep_cnt as 4294967295. This value is unrealistic.
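For context, 4294967295 is exactly 2^32 - 1, so it reads like an "unbounded" placeholder rather than a measured limit. A minimal sketch of a sanity check a test harness might apply before trusting the value; the helper name `looks_unbounded` is hypothetical and not part of libfabric:

```python
# Largest value representable in an unsigned 32-bit integer.
UINT32_MAX = 2**32 - 1


def looks_unbounded(ep_cnt):
    """Return True when ep_cnt is at or above UINT32_MAX.

    Assumption for illustration: such a value is a placeholder for
    "no limit known", not a real hardware endpoint count.
    """
    return ep_cnt >= UINT32_MAX


# Value reported for the domain attribute ep_cnt in this issue.
print(looks_unbounded(4294967295))
```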
Here is the Libfabric trace information:
2: [ 214] MR: GNI_MemRegister: Mem reg of 4096 length at addr 0x7fff66aa7000
2: [ 214] MR: GNI_MemRegister: ioctl(GNI_IOC_MEM_REGISTER) returned error - No space left on device
2: libfabric:gni:ep_ctrl:__nic_setup_irq_cq():230 [38214:57] GNI_MemRegister returned GNI_RC_ERROR_RESOURCE
2: libfabric:gni:ep_ctrl:gnix_nic_alloc():1257 [38214:57] __nic_setup_irq_cq returned Device or resource busy
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: part_dev_res_info: device resource_type: 2 not valid for this architecture
2: [ 214] JOB: GNI_GetDevResInfo: device resource_type: 2 not valid for this architecture
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: part_dev_res_info: Unable to parse '/sys/class/gni/ghal0/partitions' (rc: 1, expected: 7)
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: part_dev_res_info: Unable to parse '/sys/class/gni/ghal0/partitions' (rc: 1, expected: 7)
2: libfabric:gni:fabric:_gnix_dump_gni_res():673 [38214:57] Device Resources:
2: dev res: MDD, avail: 3894 res: 409 held: 0 total: 4095
2: dev res: CQ, avail: 1654 res: 64 held: 0 total: 2047
2: dev res: FMA, avail: 125 res: 4 held: 0 total: 127
2: dev res: CE, avail: 4 res: 0 held: 0 total: 4
2: dev res: DLA, avail: 16384 res: 1024 held: 0 total: 16384
2: dev res: TCR, avail: 60584 res: 0 held: 0 total: 16
2: dev res: DVA, avail: 4398046511104 res: 1099511627776 held: 0 total: 4398046511104
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: MDD (1) used: 192 limit: 1806
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource_type: MRT (2) not valid for this architecture
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: IOMMU (3) used: 134217728 limit: 134217728
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource_type: GART (4) not valid for this architecture
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: CQ (5) used: 384 limit: 495
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit: 123
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: RDMA (7) used: 0 limit: 18446744073709551615
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: CE (8) used: 0 limit: 1
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: DLA (9) used: 0 limit: 15360
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: SFMA (10) used: 0 limit: 123
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: DIRECT (11) used: 0 limit: 18446744073709551615
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: PCI-IOMMU (12) used: 0 limit: 18446744073709551615
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [ 214] JOB: GNI_GetJobResInfo: job resource: non-VMDH (13) used: 127 limit: 18446744073709551615
2: libfabric:gni:fabric:_gnix_dump_gni_res():689 [38214:57] Job Resources:
2: ptag[214] job res: MDD used: 192 limit: 1806
2: ptag[214] job res: IOMMU used: 134217728 limit: 134217728
2: ptag[214] job res: CQ used: 384 limit: 495
2: ptag[214] job res: FMA used: 0 limit: 123
2: ptag[214] job res: RDMA used: 0 limit: 18446744073709551615
2: ptag[214] job res: CE used: 0 limit: 1
2: ptag[214] job res: DLA used: 0 limit: 15360
2: ptag[214] job res: SFMA used: 0 limit: 123
2: ptag[214] job res: DIRECT used: 0 limit: 18446744073709551615
2: ptag[214] job res: PCI-IOMMU used: 0 limit: 18446744073709551615
2: ptag[214] job res: non-VMDH used: 127 limit: 18446744073709551615
2: libfabric:gni:ep_ctrl:_gnix_mbox_allocator_destroy():613 [38214:57]
2: libfabric:gni:ep_ctrl:__destroy_slab():390 [38214:57]
2: libfabric:gni:ep_ctrl:_gnix_cm_nic_alloc():649 [38214:57] gnix_nic_alloc returned Device or resource busy
2: libfabric:gni:ep_ctrl:_gnix_ep_nic_init():2133 [38214:57] _gnix_cm_nic_alloc returned Device or resource busy
2: libfabric:gni:ep_ctrl:gnix_ep_open():2350 [38214:57] _gnix_ep_nic_init returned -16
2: libfabric:gni:ep_ctrl:_gnix_xpmem_handle_destroy():310 [38214:57]
2: libfabric:gni:ep_ctrl:__xpmem_hndl_destruct():98 [38214:57]
2: [nid00010, Rank: 2, Pthread: 169] ltu_ep_init Line: 1149, ERROR: : fi_endpoint returned 16 (Device or resource busy)
2: [nid00010, Rank: 2, Pthread: 169] rma_execute Line: 521, ERROR: : ltu_ep_init returned 16 (Device or resource busy)
2: [ 214] JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x2a6b
2: [nid00010, Rank: 2, Pthread: 169] RESULT: RMA test Aborted.
2: [ 214] FMA: GNI_CdmAttach: FMA window size: 32768
2: [ 214] FMA: GNI_CdmAttach: NOPRIV_ERR masked
2: [ 214] JOB: GNI_CdmAttach: ptag = 214 inst_id = 5529665 fma_window = 0x0000000000000000 fma_ctrl = 0x0000000000000000
2: [ 214] CQ: GNI_CqCreate: entry_count: 1361 reqs: 1361 adjusted entries: 1395 alloc_count: 1396
2: [ 214] CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
2: [ 214] CQ: cq_create: #1 cq created, kern_cq_descr = 65, mode = 2, rd_index_ptr = 0x7fff66bc7ba0, queue = 0x7fff66bc5000, intr_mask = (nil)
2: [ 214] CQ: GNI_CqCreate: entry_count: 1361 reqs: 1361 adjusted entries: 1395 alloc_count: 1396
2: [ 214] CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
2: [ 214] CQ: cq_create: #1 cq created, kern_cq_descr = 1375, mode = 20, rd_index_ptr = 0x7fff66bc0000, queue = 0x7fff66bc1000, intr_mask = (nil)
2: [ 214] FLBTE: GNII_FlbteInit: FLBTE: tx_counter 0x7fff673cf008, chan 2, max_len -1, total 511
2: [ 214] CQ: GNI_CqCreate: entry_count: 2048 reqs: 2048 adjusted entries: 2559 alloc_count: 2560
2: [ 214] CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
2: [ 214] CQ: cq_create: #1 cq created, kern_cq_descr = 1376, mode = 5, rd_index_ptr = 0x7fff66bb6000, queue = 0x7fff66bb7000, intr_mask = 0x7fff66bcc1f8
2: [ 214] CQ: GNI_CqCreate: entry_count: 2048 reqs: 2048 adjusted entries: 2559 alloc_count: 2560
2: [ 214] CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
2: [ 214] CQ: cq_create: #1 cq created, kern_cq_descr = 1377, mode = 4, rd_index_ptr = 0x7fff66bad000, queue = 0x7fff66bae000, intr_mask = (nil)
2: [ 214] CQ: GNI_CqCreate: entry_count: 16384 reqs: 16384 adjusted entries: 16895 alloc_count: 16896
2: [ 214] CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE) returned error - Invalid argument
2: [ 214] CQ: GNI_CqCreate: GNI_IOC_CQ_CREATE with PHYS_MEM failed trying without PHYS_MEM
2: [ 214] MR: GNI_MemRegister: Mem reg of 135168 length at addr 0x7fff66b8c000
2: [ 214] CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
2: [ 214] CQ: cq_create: #1 cq created, kern_cq_descr = 1378, mode = 1, rd_index_ptr = 0x7fff66b8b000, queue = 0x7fff66b8c000, intr_mask = 0x7fff66bcc1fc
2: [ 214] CQ: GNI_CqCreate: entry_count: 16384 reqs: 16384 adjusted entries: 16895 alloc_count: 16896
2: [ 214] CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE) returned error - Invalid argument
2: [ 214] CQ: GNI_CqCreate: GNI_IOC_CQ_CREATE with PHYS_MEM failed trying without PHYS_MEM
2: [ 214] MR: GNI_MemRegister: Mem reg of 135168 length at addr 0x7fff66b6a000
2: [ 214] CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
2: [ 214] CQ: cq_create: #1 cq created, kern_cq_descr = 1379, mode = 0, rd_index_ptr = 0x7fff66b69000, queue = 0x7fff66b6a000, intr_mask = (nil)
2: libfabric:gni:ep_ctrl:__gnix_nic_tx_freelist_init():661 [38214:70]
2: libfabric:gni:ep_ctrl:_gnix_mbox_allocator_create():555 [38214:70]
2: libfabric:gni:ep_ctrl:__open_huge_page():195 [38214:70]
2: libfabric:gni:ep_ctrl:__create_slab():300 [38214:70]
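In the Job Resources dump within the trace above, IOMMU is the only bounded resource whose used value equals its limit (134217728 of 134217728). That is consistent with the GNI_MemRegister "No space left on device" failure, although the trace alone does not prove it is the cause. Below is a minimal Python sketch for flagging such lines in a saved trace. The function name `exhausted_job_resources`, the regular expression, and the treatment of 18446744073709551615 as "unlimited" are assumptions for illustration, not part of libfabric or uGNI.

```python
import re

# Matches the summary lines printed under "Job Resources:", for example
# "2: ptag[214] job res: CQ used: 384 limit: 495".
JOB_RES = re.compile(r"job res:\s+(\S+)\s+used:\s+(\d+)\s+limit:\s+(\d+)")

# Assumption: a limit of 2**64 - 1 means the resource is not capped.
UNLIMITED = 2**64 - 1


def exhausted_job_resources(log_text, threshold=1.0):
    """Return (name, used, limit) for resources with used/limit >= threshold."""
    hits = []
    for name, used, limit in JOB_RES.findall(log_text):
        used, limit = int(used), int(limit)
        if limit in (0, UNLIMITED):
            continue  # skip uncapped or unusable limits
        if used / limit >= threshold:
            hits.append((name, used, limit))
    return hits


# Lines copied from the trace in this issue.
SAMPLE = """\
2: ptag[214] job res: MDD used: 192 limit: 1806
2: ptag[214] job res: IOMMU used: 134217728 limit: 134217728
2: ptag[214] job res: CQ used: 384 limit: 495
2: ptag[214] job res: FMA used: 0 limit: 123
2: ptag[214] job res: RDMA used: 0 limit: 18446744073709551615
2: ptag[214] job res: CE used: 0 limit: 1
2: ptag[214] job res: DLA used: 0 limit: 15360
2: ptag[214] job res: SFMA used: 0 limit: 123
2: ptag[214] job res: DIRECT used: 0 limit: 18446744073709551615
2: ptag[214] job res: PCI-IOMMU used: 0 limit: 18446744073709551615
2: ptag[214] job res: non-VMDH used: 127 limit: 18446744073709551615
"""

print(exhausted_job_resources(SAMPLE))
```

Lowering the threshold, for example `exhausted_job_resources(SAMPLE, 0.75)`, also flags CQ (384 of 495), which may be useful for spotting the next resource likely to run out.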