mjmac/DAOS 8331 no agent #14288

Closed
47 commits
7c61bda
DAOS-8331 client: Telemetry dump should go to unique file paths
mjmac Apr 4, 2024
3e24b1d
incorporate review feedback, tweak variable name
mjmac Apr 20, 2024
d218fae
Merge branch 'master' into mjmac/DAOS-8331-dump_path
mjmac Apr 21, 2024
5acaa9e
Merge branch 'master' into mjmac/DAOS-8331-dump_path
mjmac Apr 24, 2024
61c497c
DAOS-9576 test: remove path to ddb src in ut (#14238)
johannlombardi Apr 24, 2024
0e0ef9c
DAOS-623 dfuse: Update dfuse thread names. (#14223)
ashleypittman Apr 24, 2024
15db9f3
DAOS-15499 dtx: cleanup DTX for failure (#14224)
Nasf-Fan Apr 24, 2024
75bcce1
DAOS-15648 test: Avoid failures with virtual NVMe (#14233)
phender Apr 24, 2024
b477a7d
DAOS-15642 test: Implement TestContainer register cleanup (#14159)
phender Apr 24, 2024
488b070
DAOS-15605 vos: Add version param to pool create (#14133)
liw Apr 25, 2024
4aab34f
DAOS-15595 cart: Remove SEP setting (#14110)
frostedcmos Apr 25, 2024
fcd7edb
DAOS-15654 control: Ignore NEW state NVMe devices when processing spa…
tanabarr Apr 25, 2024
f935fa3
DAOS-15750 test: Missing dfuse/mu_perms.py execution (#14249)
phender Apr 25, 2024
5c49895
DAOS-15747 test: Quote filenames when creating stack traces (#14246)
phender Apr 25, 2024
a54811f
DAOS-15717 bug: Fix memory leak cid 2555536 (#14231)
frostedcmos Apr 26, 2024
3de52bf
DAOS-15329 cq: Disable debug locking macros for coverity. (#14207)
ashleypittman Apr 26, 2024
a7beff9
DAOS-15622 test: enhance co_op_dup_timing() predictability (#14180)
kccain Apr 26, 2024
0ef6653
DAOS-15059 test: reduce parameter for rank_failure test (#14236)
liuxuezhao Apr 26, 2024
7ba3e52
DAOS-15749 test: Don't destroy an orphaned contianer (#14250)
phender Apr 26, 2024
28e702b
DAOS-15718 dfuse: Fix invalid read in error path. (#14237)
ashleypittman Apr 26, 2024
1a13905
DAOS-15768 test: skip cont cleanup in dmg_system_cleanup (#14264)
daltonbohning Apr 27, 2024
a811223
DAOS-15723 test: Fix coverity warning 2555531 (#14240)
tanabarr Apr 27, 2024
a034f2c
DAOS-15048 control: Display NSID only when populated in storage query…
tanabarr Apr 27, 2024
e301611
DAOS-15670 vos: SV overwrite missed tx_add_range() (#14241)
NiuYawei Apr 28, 2024
bd38606
DAOS-623 test: fix gid typo check in unit test (#14258)
mchaarawi Apr 29, 2024
5f43fcf
DAOS-13151 client: cache and reuse the attach info for the default sy…
mchaarawi Apr 29, 2024
e16d0ea
DAOS-15753 dfuse: Do not deadlock when failing to mount. (#14252)
ashleypittman Apr 29, 2024
5574a41
DAOS-14149 client: add compatible mode for libpil4dfs (#13294)
wiliamhuang Apr 29, 2024
47d5a35
DAOS-14657 test: ftest for libpil4dfs with fio (#13797)
knard38 Apr 29, 2024
0227079
DAOS-13292 control: Use cart API to detect fabric (#13989)
kjacque Apr 29, 2024
21eae71
DAOS-15745 dfuse: Add the pre_read metrics whilst holding reference. …
ashleypittman Apr 29, 2024
29e25c1
DAOS-15628 test: Verify maximum containers create with and without du…
dinghwah Apr 29, 2024
eb51ef1
DAOS-13520 control: Fix UUID filter for dmg check query (#13050)
kjacque Apr 29, 2024
85c3b19
DAOS-623 test: fix avocado run --failfast (#14253)
daltonbohning Apr 29, 2024
48e3b33
DAOS-15684 test: add test case for custom server name (#14225)
daltonbohning Apr 29, 2024
63c9b08
DAOS-14823 test: Changing scm-size for pool create (#13871)
saurabhtandan Apr 29, 2024
f5c7cb8
DAOS-15759 test: Remove utils/cr_demo (#14265)
shimizukko Apr 29, 2024
6196dce
DAOS-15659 test: fix local ftest prefix (#14173)
daltonbohning Apr 29, 2024
56103f4
DAOS-15713 chk: fix kinds of coverity issues (#14242)
Nasf-Fan Apr 30, 2024
570cd8f
DAOS-15661 object: set correct map version for layout create (#14222)
liuxuezhao Apr 30, 2024
63020d9
DAOS-15616 test: Update dfuse/find.py to work in a python venv (#14262)
phender Apr 30, 2024
7625ed1
Build(deps): Bump golang.org/x/net from 0.17.0 to 0.23.0 in /src/cont…
dependabot[bot] Apr 30, 2024
2694633
DAOS-15655 test: stop passing server group (#14201)
daltonbohning Apr 30, 2024
d684107
DAOS-15781 test: fix pool_acl and pool_groups (#14284)
daltonbohning Apr 30, 2024
b67848b
DAOS-4139 Coverity: fix Unchecked return value[2555519] (#14232)
ravalsam Apr 30, 2024
ecaecd1
DAOS-15655 test: fix set_daos_params conflict (#14286)
daltonbohning Apr 30, 2024
1a8826f
DAOS-8331 metrics: Support client metrics dump without agent
mjmac Apr 30, 2024
2 changes: 1 addition & 1 deletion SConstruct
Expand Up @@ -363,7 +363,7 @@ MINIMAL_ENV = ('HOME', 'TERM', 'SSH_AUTH_SOCK', 'http_proxy', 'https_proxy', 'PK

# Environment variables that are also kept when LD_PRELOAD is set.
PRELOAD_ENV = ('LD_PRELOAD', 'D_LOG_FILE', 'DAOS_AGENT_DRPC_DIR', 'D_LOG_MASK', 'DD_MASK',
'DD_SUBSYS', 'D_IL_MAX_EQ')
'DD_SUBSYS', 'D_IL_MAX_EQ', 'D_IL_ENFORCE_EXEC_ENV', 'D_IL_COMPATIBLE')


def scons():
Expand Down
4 changes: 4 additions & 0 deletions docs/user/filesystem.md
Expand Up @@ -1020,6 +1020,10 @@ libpil4dfs intercepting summary for ops on DFS:
[op_sum ] 5003
```

### Turn on compatible mode in libpil4dfs
In regular mode, libpil4dfs.so uses fake file descriptors (FDs) for efficiency: open() returns a fake fd to the application. If some APIs are not intercepted, applications can crash with the error "Bad File Descriptor". Compatible mode is provided to work around such situations.
Setting the environment variable "D_IL_COMPATIBLE=1" turns on compatible mode. A kernel fd allocated by dfuse is then returned to applications instead of a fake fd. This mode provides better compatibility at the cost of degraded performance in open, openat, opendir, etc. Start dfuse with "--disable-caching" to disable caching before using compatible mode.

### Child Process Inheritance

Normally child processes inherit environmental variables from parent processes. In rare cases, e.g.
Expand Down
2 changes: 0 additions & 2 deletions site_scons/prereq_tools/base.py
Expand Up @@ -1264,8 +1264,6 @@ def set_environment(self, env, needed_libs):
lib_paths.append(full_path)
# will adjust this to be a relative rpath later
env.AppendUnique(RPATH_FULL=[full_path])
# For binaries run during build
env.AppendENVPath("LD_LIBRARY_PATH", full_path)

# Ensure RUNPATH is used rather than RPATH. RPATH is deprecated
# and this allows LD_LIBRARY_PATH to override RPATH
Expand Down
32 changes: 12 additions & 20 deletions src/cart/README.env
Expand Up @@ -153,20 +153,9 @@ This file lists the environment variables used in CaRT.
If it is not set the default value of 64 is used.
Setting it to 0 disables quota

. CRT_CTX_SHARE_ADDR
Set it to non-zero to make all the contexts share one network address, in
this case CaRT will create one SEP and each context maps to one tx/rx
context pair.
When the ENV not set or set to 0 each context will create one separate SEP.

. CRT_CTX_NUM
When in scalable endpoint mode, this envariable specifies the number of contexts
that user wants to create. If CRT_CTX_NUM exceeds the OFI provider capability,
NA layer will fail to initialize. If user creates more contexts than CRT_CTX_NUM,
context creation will fail.
For regular (non scalable endpoint) mode:
- Maximum number of cart contexts is set to number of cores by default, up to 64.
- If CRT_CTX_NUM is set, this value is used instead as a limit.
If set, specifies the maximum number of CaRT contexts that may be created.
The valid range is [1, 64]; the default is 64 if unset.
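
The range rule for CRT_CTX_NUM described above can be sketched as a small parsing helper. This is only an illustration under stated assumptions: `ctx_limit_from_env` and the `CTX_NUM_*` macros are hypothetical names, not CaRT symbols, and clamping out-of-range or non-numeric input is an assumption, since the README text only states the valid range and the default.

```c
#include <assert.h>
#include <stdlib.h>

#define CTX_NUM_MIN 1  /* lower bound taken from the README text */
#define CTX_NUM_MAX 64 /* upper bound and default taken from the README text */

/*
 * Hypothetical helper: `val` is the raw value of CRT_CTX_NUM, or NULL if
 * the variable is unset. Returns a limit within [CTX_NUM_MIN, CTX_NUM_MAX].
 */
static unsigned int
ctx_limit_from_env(const char *val)
{
	char *end;
	long  n;

	if (val == NULL || *val == '\0')
		return CTX_NUM_MAX; /* unset: use the default */

	n = strtol(val, &end, 10);
	if (*end != '\0')
		return CTX_NUM_MAX; /* not a number: fall back (assumption) */

	/* Clamping instead of failing is an assumption of this sketch. */
	if (n < CTX_NUM_MIN)
		n = CTX_NUM_MIN;
	if (n > CTX_NUM_MAX)
		n = CTX_NUM_MAX;

	assert(n >= CTX_NUM_MIN && n <= CTX_NUM_MAX);
	return (unsigned int)n;
}
```

A caller would pass `getenv("CRT_CTX_NUM")` directly, since `getenv` returns NULL when the variable is unset.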

. D_FI_CONFIG
Specifies the fault injection configuration file. If this variable is not set
Expand All @@ -185,12 +174,15 @@ This file lists the environment variables used in CaRT.

. D_CLIENT_METRICS_ENABLE
When set to 1, client side metrics will be collected on each daos client, which
can be retrieved by daos_metrics -j job_id on each client.
can be retrieved by daos_metrics -j job_id on each client. Only needed if
daos_agent is not configured to enable client metrics for all connected processes.

. D_CLIENT_METRICS_RETAIN
when set to 1, client side metrics will be retained even after the job exits, i.e.
those metrics can be retrieved by daos_metrics even after job exits.

. D_CLIENT_METRICS_DUMP_PATH
Set client side metrics dump path(file) for each client, so these metrics will be
dumped to the specified file when the job exits.
When set to 1, client side metrics will be retained even after the job exits, i.e.
those metrics can be retrieved by daos_metrics even after job exits. Normally
managed by daos_agent.

. D_CLIENT_METRICS_DUMP_DIR
Set parent directory for client side metrics. Each client will write its metrics to
a file with the pattern <D_CLIENT_METRICS_DUMP_DIR>/<DAOS_JOBID>-<pid>.csv. As a
convenience, setting this variable automatically sets D_CLIENT_METRICS_ENABLE=1.
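
The file naming pattern `<D_CLIENT_METRICS_DUMP_DIR>/<DAOS_JOBID>-<pid>.csv` can be illustrated with a short formatting helper. This is a minimal sketch, not the code from this PR: `metrics_dump_path` is a hypothetical name, and the caller is assumed to supply the directory, job ID, and pid.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper: formats "<dir>/<jobid>-<pid>.csv" into buf.
 * Returns 0 on success, -1 if the result does not fit in len bytes.
 */
static int
metrics_dump_path(char *buf, size_t len, const char *dir, const char *jobid, pid_t pid)
{
	int n;

	assert(buf != NULL && dir != NULL && jobid != NULL);

	n = snprintf(buf, len, "%s/%s-%ld.csv", dir, jobid, (long)pid);
	if (n < 0 || (size_t)n >= len)
		return -1; /* encoding error or truncated output */

	return 0;
}
```

Including the pid in the name is what keeps two processes of the same job from overwriting each other's dump, which matches the first commit title in this PR ("Telemetry dump should go to unique file paths").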
15 changes: 10 additions & 5 deletions src/cart/crt_init.c
Expand Up @@ -57,8 +57,7 @@ static const char *crt_env_names[] = {
"DAOS_SIGNAL_REGISTER",
"D_CLIENT_METRICS_ENABLE",
"D_CLIENT_METRICS_RETAIN",
"D_CLIENT_METRICS_DUMP_PATH",

"D_CLIENT_METRICS_DUMP_DIR",
};

static void
Expand Down Expand Up @@ -204,8 +203,8 @@ prov_data_init(struct crt_prov_gdata *prov_data, crt_provider_t provider,
return rc;

if (crt_is_service()) {
ctx_num = CRT_SRV_CONTEXT_NUM;
max_num_ctx = CRT_SRV_CONTEXT_NUM;
ctx_num = CRT_SRV_CONTEXT_NUM;
max_num_ctx = CRT_SRV_CONTEXT_NUM;
} else {
/* Only limit the number of contexts for clients */
d_getenv_uint("CRT_CTX_NUM", &ctx_num);
Expand All @@ -220,13 +219,19 @@ prov_data_init(struct crt_prov_gdata *prov_data, crt_provider_t provider,

if (max_num_ctx > CRT_SRV_CONTEXT_NUM)
max_num_ctx = CRT_SRV_CONTEXT_NUM;

/* To be able to run on VMs */
if (max_num_ctx < CRT_SRV_CONTEXT_NUM_MIN)
max_num_ctx = CRT_SRV_CONTEXT_NUM_MIN;

D_DEBUG(DB_ALL, "Max number of contexts set to %d\n", max_num_ctx);

d_getenv_bool("CRT_CTX_SHARE_ADDR", &set_sep);
if (set_sep)
D_WARN("Unsupported SEP mode requested. Unset CRT_CTX_SHARE_ADDR\n");

if (opt && opt->cio_sep_override && opt->cio_use_sep)
D_WARN("Unsupported SEP mode requested in init options\n");

if (opt && opt->cio_use_expected_size)
max_expect_size = opt->cio_max_expected_size;

Expand Down
23 changes: 3 additions & 20 deletions src/cart/utils/crt_utils.c
Expand Up @@ -415,7 +415,6 @@ crtu_dc_mgmt_net_cfg_setenv(const char *name)
{
int rc;
char *provider;
char *crt_ctx_share_addr = NULL;
char *cli_srx_set = NULL;
char *crt_timeout = NULL;
char *d_interface;
Expand Down Expand Up @@ -443,16 +442,6 @@ crtu_dc_mgmt_net_cfg_setenv(const char *name)
if (rc != 0)
D_GOTO(cleanup, rc = d_errno2der(errno));

rc = asprintf(&crt_ctx_share_addr, "%d", crt_net_cfg_info.crt_ctx_share_addr);
if (rc < 0) {
crt_ctx_share_addr = NULL;
D_GOTO(cleanup, rc = -DER_NOMEM);
}
D_INFO("setenv CRT_CTX_SHARE_ADDR=%s\n", crt_ctx_share_addr);
rc = d_setenv("CRT_CTX_SHARE_ADDR", crt_ctx_share_addr, 1);
if (rc != 0)
D_GOTO(cleanup, rc = d_errno2der(errno));

/* If the server has set this, the client must use the same value. */
if (crt_net_cfg_info.srv_srx_set != -1) {
rc = asprintf(&cli_srx_set, "%d", crt_net_cfg_info.srv_srx_set);
Expand All @@ -464,8 +453,6 @@ crtu_dc_mgmt_net_cfg_setenv(const char *name)
rc = d_setenv("FI_OFI_RXM_USE_SRX", cli_srx_set, 1);
if (rc != 0)
D_GOTO(cleanup, rc = d_errno2der(errno));

D_DEBUG(DB_MGMT, "Using server's value for FI_OFI_RXM_USE_SRX: %s\n", cli_srx_set);
} else {
/* Client may not set it if the server hasn't. */
d_agetenv_str(&cli_srx_set, "FI_OFI_RXM_USE_SRX");
Expand Down Expand Up @@ -501,9 +488,7 @@ crtu_dc_mgmt_net_cfg_setenv(const char *name)
D_GOTO(cleanup, rc = d_errno2der(errno));
} else {
d_interface = d_interface_env;
D_DEBUG(DB_MGMT,
"Using client provided D_INTERFACE: %s\n",
d_interface);
D_DEBUG(DB_MGMT, "Using client provided D_INTERFACE: %s\n", d_interface);
}

d_agetenv_str(&d_domain_env, "D_DOMAIN");
Expand All @@ -519,16 +504,14 @@ crtu_dc_mgmt_net_cfg_setenv(const char *name)
}

D_INFO("CaRT env setup with:\n"
"\tD_INTERFACE=%s, D_DOMAIN: %s, D_PROVIDER: %s, "
"CRT_CTX_SHARE_ADDR: %s, CRT_TIMEOUT: %s\n",
d_interface, d_domain, provider, crt_ctx_share_addr, crt_timeout);
"\tD_INTERFACE=%s, D_DOMAIN: %s, D_PROVIDER: %s, CRT_TIMEOUT: %s\n",
d_interface, d_domain, provider, crt_timeout);

cleanup:
d_freeenv_str(&d_domain_env);
d_freeenv_str(&d_interface_env);
d_freeenv_str(&crt_timeout);
d_freeenv_str(&cli_srx_set);
d_freeenv_str(&crt_ctx_share_addr);
dc_put_attach_info(&crt_net_cfg_info, crt_net_cfg_resp);

return rc;
Expand Down
2 changes: 2 additions & 0 deletions src/chk/chk_common.c
Expand Up @@ -1238,6 +1238,8 @@ chk_ins_init(struct chk_instance **p_ins)
out_init:
if (rc == 0)
*p_ins = ins;
else
D_FREE(ins);

return rc;
}
Expand Down
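
The `chk_ins_init` change above follows a common constructor pattern: publish the object through the out-parameter only on success, and free it on failure so nothing leaks. A minimal standalone sketch of that pattern follows; `struct widget`, `widget_create`, and the `fail_setup` switch are hypothetical and stand in for the real instance type and its setup steps.

```c
#include <assert.h>
#include <stdlib.h>

struct widget {
	int w_state; /* placeholder field for illustration */
};

/*
 * Hypothetical constructor: allocates a widget, runs a setup step that can
 * fail (simulated by fail_setup), and hands ownership to the caller only
 * when everything succeeded.
 */
static int
widget_create(struct widget **out, int fail_setup)
{
	struct widget *w;
	int            rc = 0;

	assert(out != NULL);

	w = calloc(1, sizeof(*w));
	if (w == NULL)
		return -1;

	if (fail_setup)
		rc = -2; /* simulated setup failure */

	if (rc == 0)
		*out = w; /* success: caller now owns w */
	else
		free(w); /* failure: release it here, *out stays untouched */

	return rc;
}
```

On failure the caller's pointer keeps its previous value, so it never sees a half-initialized or freed object.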
2 changes: 1 addition & 1 deletion src/chk/chk_engine.c
Expand Up @@ -2933,7 +2933,7 @@ chk_engine_pool_start(uint64_t gen, uuid_t uuid, uint32_t phase, uint32_t flags)
D_GOTO(put, rc = (rc == -DER_NONEXIST ? 1 : rc));

if (cbk->cb_phase < phase) {
cbk->cb_phase = cbk->cb_phase;
cbk->cb_phase = phase;
/* QUEST: How to estimate the left time? */
cbk->cb_time.ct_left_time = CHK__CHECK_SCAN_PHASE__CSP_DONE - cbk->cb_phase;
rc = chk_bk_update_pool(cbk, uuid_str);
Expand Down
5 changes: 2 additions & 3 deletions src/chk/chk_leader.c
Expand Up @@ -1396,7 +1396,7 @@ chk_leader_start_pool_svc(struct chk_pool_rec *cpr)

rc = ds_rsvc_dist_start(DS_RSVC_CLASS_POOL, &psid, cpr->cpr_uuid, ranks, RDB_NIL_TERM,
cpr->cpr_healthy ? DS_RSVC_START : DS_RSVC_DICTATE,
false /* bootstrap */, 0 /* size */);
false /* bootstrap */, 0 /* size */, 0 /* vos_df_version */);

out:
d_rank_list_free(ranks);
Expand Down Expand Up @@ -3385,8 +3385,7 @@ chk_leader_prop(chk_prop_cb_t prop_cb, void *buf)
{
struct chk_property *prop = &chk_leader->ci_prop;

return prop_cb(buf, (struct chk_policy *)prop->cp_policies,
CHK_POLICY_MAX - 1, prop->cp_flags);
return prop_cb(buf, prop->cp_policies, CHK_POLICY_MAX - 1, prop->cp_flags);
}

static int
Expand Down
31 changes: 12 additions & 19 deletions src/chk/chk_upcall.c
Expand Up @@ -94,8 +94,6 @@ chk_report_upcall(uint64_t gen, uint64_t seq, uint32_t cla, uint32_t act, int re
D_ASPRINTF(report.pool_uuid, DF_UUIDF, DP_UUID(*pool));
if (report.pool_uuid == NULL)
D_GOTO(out, rc = -DER_NOMEM);
} else {
report.pool_uuid = NULL;
}

report.pool_label = pool_label;
Expand All @@ -104,8 +102,6 @@ chk_report_upcall(uint64_t gen, uint64_t seq, uint32_t cla, uint32_t act, int re
D_ASPRINTF(report.cont_uuid, DF_UUIDF, DP_UUID(*cont));
if (report.cont_uuid == NULL)
D_GOTO(out, rc = -DER_NOMEM);
} else {
report.cont_uuid = NULL;
}

report.cont_label = cont_label;
Expand All @@ -114,24 +110,18 @@ chk_report_upcall(uint64_t gen, uint64_t seq, uint32_t cla, uint32_t act, int re
D_ASPRINTF(report.objid, DF_UOID, DP_UOID(*obj));
if (report.objid == NULL)
D_GOTO(out, rc = -DER_NOMEM);
} else {
report.objid = NULL;
}

if (!daos_iov_empty(dkey)) {
D_ASPRINTF(report.dkey, DF_KEY, DP_KEY(dkey));
if (report.dkey == NULL)
D_GOTO(out, rc = -DER_NOMEM);
} else {
report.dkey = NULL;
}

if (!daos_iov_empty(akey)) {
D_ASPRINTF(report.akey, DF_KEY, DP_KEY(akey));
if (report.akey == NULL)
D_GOTO(out, rc = -DER_NOMEM);
} else {
report.akey = NULL;
}

D_ASPRINTF(report.timestamp, "%s", ctime(&tm));
Expand All @@ -150,20 +140,23 @@ chk_report_upcall(uint64_t gen, uint64_t seq, uint32_t cla, uint32_t act, int re
goto out;

report.n_act_details = rc;
} else {
report.n_act_details = 0;
report.act_details = NULL;
}

rc = ds_chk_report_upcall(&report);

out:
D_FREE(report.pool_uuid);
D_FREE(report.cont_uuid);
D_FREE(report.objid);
D_FREE(report.dkey);
D_FREE(report.akey);
D_FREE(report.timestamp);
if (report.pool_uuid != protobuf_c_empty_string)
D_FREE(report.pool_uuid);
if (report.cont_uuid != protobuf_c_empty_string)
D_FREE(report.cont_uuid);
if (report.objid != protobuf_c_empty_string)
D_FREE(report.objid);
if (report.dkey != protobuf_c_empty_string)
D_FREE(report.dkey);
if (report.akey != protobuf_c_empty_string)
D_FREE(report.akey);
if (report.timestamp != protobuf_c_empty_string)
D_FREE(report.timestamp);
chk_sg_free(report.act_details, report.n_act_details);

D_CDEBUG(rc != 0, DLOG_ERR, DLOG_INFO,
Expand Down
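
The guarded frees in `chk_report_upcall` suggest that unset string fields of the report point at the shared `protobuf_c_empty_string` sentinel rather than at heap memory, so they must not be passed to a free routine. The sketch below illustrates that idea in isolation and is not the DAOS code: `empty_string` is a local stand-in for the protobuf-c sentinel, and `struct report`, `field_set`, and `field_free` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in for a shared, statically allocated empty-string sentinel. */
static char empty_string[] = "";

struct report {
	char *pool_uuid;
	char *objid;
};

/* Mimics a generated init routine: every string field starts at the sentinel. */
static void
report_init(struct report *r)
{
	r->pool_uuid = empty_string;
	r->objid     = empty_string;
}

/* Replaces a field with a heap copy of s. Returns 0 on success, -1 on OOM. */
static int
field_set(char **f, const char *s)
{
	size_t len;
	char  *copy;

	assert(f != NULL && s != NULL);
	len  = strlen(s) + 1;
	copy = malloc(len);
	if (copy == NULL)
		return -1;
	memcpy(copy, s, len);
	*f = copy;
	return 0;
}

/* Frees a field only if it was heap allocated, then resets it to the sentinel. */
static void
field_free(char **f)
{
	if (*f != empty_string)
		free(*f);
	*f = empty_string;
}
```

Calling `field_free` on an untouched field is then a no-op instead of an invalid free of static storage.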
14 changes: 11 additions & 3 deletions src/client/api/init.c
Expand Up @@ -147,7 +147,7 @@ daos_init(void)
{
struct d_fault_attr_t *d_fault_init;
struct d_fault_attr_t *d_fault_mem = NULL;
struct d_fault_attr_t d_fault_mem_saved;
struct d_fault_attr_t d_fault_mem_saved;
int rc;

D_MUTEX_LOCK(&module_lock);
Expand Down Expand Up @@ -196,19 +196,24 @@ daos_init(void)
if (rc != 0)
D_GOTO(out_agent, rc);

/** get and cache attach info of default system */
rc = dc_mgmt_cache_attach_info(NULL);
if (rc != 0)
D_GOTO(out_job, rc);

/**
* get CaRT configuration (see mgmtModule.handleGetAttachInfo for the
* handling of NULL system names)
*/
rc = dc_mgmt_net_cfg(NULL);
if (rc != 0)
D_GOTO(out_job, rc);
D_GOTO(out_attach, rc);

/** set up event queue */
rc = daos_eq_lib_init();
if (rc != 0) {
D_ERROR("failed to initialize eq_lib: "DF_RC"\n", DP_RC(rc));
D_GOTO(out_job, rc);
D_GOTO(out_attach, rc);
}

/**
Expand Down Expand Up @@ -274,6 +279,8 @@ daos_init(void)
pl_fini();
out_eq:
daos_eq_lib_fini();
out_attach:
dc_mgmt_drop_attach_info();
out_job:
dc_job_fini();
out_agent:
Expand Down Expand Up @@ -334,6 +341,7 @@ daos_fini(void)
DF_RC"\n", DP_RC(rc));

dc_tm_fini();
dc_mgmt_drop_attach_info();
dc_agent_fini();
dc_job_fini();

Expand Down
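
The new `out_attach` label in `daos_init` slots into the existing goto-based cleanup ladder: each label undoes one step, and a failure jumps to the label for the most recently completed step so that teardown runs in reverse order. The following sketch shows that ordering with hypothetical steps; `init_all`, `note`, and the single-letter trace are illustration only and do not correspond to DAOS functions.

```c
#include <assert.h>
#include <string.h>

static char trace[64]; /* records setup (upper case) and teardown (lower case) */

static void
note(const char *s)
{
	strncat(trace, s, sizeof(trace) - strlen(trace) - 1);
}

/*
 * Hypothetical init sequence: job, attach info, event queue.
 * fail_at selects which step fails (0 means none).
 */
static int
init_all(int fail_at)
{
	int rc = 0;

	assert(fail_at >= 0);
	trace[0] = '\0';

	if (fail_at == 1) {
		rc = -1;
		goto out;
	}
	note("J"); /* job initialized */

	if (fail_at == 2) {
		rc = -1;
		goto out_job;
	}
	note("A"); /* attach info cached */

	if (fail_at == 3) {
		rc = -1;
		goto out_attach;
	}
	note("E"); /* event queue ready */

	return 0;

out_attach:
	note("a"); /* drop attach info */
out_job:
	note("j"); /* finalize job */
out:
	return rc;
}
```

This mirrors why the two later failure paths in the diff were retargeted from `out_job` to `out_attach`: once the attach info is cached, every subsequent failure has to drop it before unwinding the job setup.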