Commit

Merge remote-tracking branch 'origin/master' into tanabarr/control-reintpools-pernode

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
tanabarr committed Dec 18, 2024
2 parents c3e19e5 + f07e5da commit f53f8bd
Showing 98 changed files with 1,259 additions and 1,288 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ossf-scorecard.yml
@@ -71,6 +71,6 @@ jobs:
# Upload the results to GitHub's code scanning dashboard (optional).
# Commenting out will disable upload of results to your repo's Code Scanning dashboard
- name: "Upload to code-scanning"
uses: github/codeql-action/upload-sarif@babb554ede22fd5605947329c4d04d8e7a0b8155 # v3.27.7
uses: github/codeql-action/upload-sarif@df409f7d9260372bd5f19e5b04e83cb3c43714ae # v3.27.9
with:
sarif_file: results.sarif
4 changes: 2 additions & 2 deletions .github/workflows/rpm-build-and-test-report.yml
@@ -41,7 +41,7 @@ jobs:
esac
echo "STAGE_NAME=Build RPM on $DISTRO_NAME $DISTRO_VERSION" >> $GITHUB_ENV
- name: Test Report
uses: phoenix-actions/test-reporting@v10
uses: phoenix-actions/test-reporting@f957cd93fc2d848d556fa0d03c57bc79127b6b5e # v15
with:
artifact: ${{ env.STAGE_NAME }} test-results
name: ${{ env.STAGE_NAME }} Test Results (phoenix-actions)
@@ -60,7 +60,7 @@ jobs:
- name: Set variables
run: echo "STAGE_NAME=Functional Hardware ${{ matrix.stage }}" >> $GITHUB_ENV
- name: Test Report
uses: phoenix-actions/test-reporting@v10
uses: phoenix-actions/test-reporting@f957cd93fc2d848d556fa0d03c57bc79127b6b5e # v15
with:
artifact: ${{ env.STAGE_NAME }} test-results
name: ${{ env.STAGE_NAME }} Test Results (phoenix-actions)
2 changes: 1 addition & 1 deletion .github/workflows/trivy.yml
@@ -58,7 +58,7 @@ jobs:
trivy-config: 'utils/trivy/trivy.yaml'

- name: Upload Trivy scan results to GitHub Security tab
uses: github/codeql-action/upload-sarif@babb554ede22fd5605947329c4d04d8e7a0b8155 # v3.27.7
uses: github/codeql-action/upload-sarif@df409f7d9260372bd5f19e5b04e83cb3c43714ae # v3.27.9
with:
sarif_file: 'trivy-results.sarif'

35 changes: 0 additions & 35 deletions .github/workflows/version-checks.yml

This file was deleted.

4 changes: 2 additions & 2 deletions docs/QSG/qemu-vms.md
@@ -199,10 +199,10 @@ I follow these [steps](https://docs.daos.io/latest/QSG/setup_rhel/) to install b

5. Update config files.

Update the daos-server config file `/etc/daos/daos_server.yml` on daos-server. You may need to update "access\_points", "fabric\_iface" and "bdev\_list". Update "access\_points" accordingly if you name daos-server differently. Check if the network device has the same name as listed under "fabric\_iface". Look in the output of `lspci` for "bdev\_list". The info for our NVMe controller is like *??:??:? Non-Volatile memory controller: Red Hat, Inc. QEMU NVM Express Controller (rev 02)*. Prefix *??:??.?* is the address of the NVMe devices.
Update the daos-server config file `/etc/daos/daos_server.yml` on daos-server. You may need to update "mgmt\_svc\_replicas", "fabric\_iface" and "bdev\_list". Update "mgmt\_svc\_replicas" accordingly if you name daos-server differently. Check if the network device has the same name as listed under "fabric\_iface". Look in the output of `lspci` for "bdev\_list". The info for our NVMe controller is like *??:??:? Non-Volatile memory controller: Red Hat, Inc. QEMU NVM Express Controller (rev 02)*. Prefix *??:??.?* is the address of the NVMe devices.
```
name: daos_server
access_points:
mgmt_svc_replicas:
- daos-server
port: 10001
2 changes: 1 addition & 1 deletion docs/QSG/setup_rhel.md
@@ -273,7 +273,7 @@ Examples are available on [github](https://github.com/daos-stack/daos/tree/maste


name: daos_server
access_points:
mgmt_svc_replicas:
- server-1
port: 10001

2 changes: 1 addition & 1 deletion docs/QSG/setup_suse.md
@@ -292,7 +292,7 @@ Examples are available on [github](https://github.com/daos-stack/daos/tree/maste
An example of the daos_server.yml is presented below. Copy the modified server yaml file to all the server nodes at `/etc/daos/daos_server.yml`.

name: daos_server
access_points:
mgmt_svc_replicas:
- node-4
port: 10001

2 changes: 1 addition & 1 deletion docs/QSG/tour.md
@@ -223,7 +223,7 @@ bring-up DAOS servers and clients.
### Run dfuse

# Bring up 4 hosts server with appropriate daos_server.yml and
# access-point, reference to DAOS Set-Up
# MS replicas, reference to DAOS Set-Up
# After DAOS servers, DAOS admin and client started.

$ dmg storage format
20 changes: 11 additions & 9 deletions docs/admin/administration.md
@@ -825,9 +825,9 @@ device would remain in this state until replaced by a new device.
## System Operations
The DAOS server acting as the access point records details of engines
that join the DAOS system. Once an engine has joined the DAOS system, it is
identified by a unique system "rank". Multiple ranks can reside on the same
The DAOS server acting as the Management Service (MS) leader records details
of engines that join the DAOS system. Once an engine has joined the DAOS system,
it is identified by a unique system "rank". Multiple ranks can reside on the same
host machine, accessible via the same network address.
A DAOS system can be shutdown and restarted to perform maintenance and/or
@@ -837,14 +837,14 @@ made to the rank's metadata stored on persistent memory.
Storage reformat can also be performed after system shutdown. Pools will be
removed and storage wiped.
System commands will be handled by a DAOS Server acting as access point and
System commands will be handled by a DAOS Server acting as the MS leader and
listening on the address specified in the DMG config file "hostlist" parameter.
See
[`daos_control.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_control.yml)
for details.
At least one of the addresses in the hostlist parameters should match one of the
"access point" addresses specified in the server config file
`mgmt_svc_replicas` addresses specified in the server config file
[`daos_server.yml`](https://github.com/daos-stack/daos/blob/master/utils/config/daos_server.yml)
that is supplied when starting `daos_server` instances.
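Reviewer note: the requirement above (at least one `hostlist` address must match a `mgmt_svc_replicas` address) can be sketched as a small pre-flight check. This is a minimal illustration, not code from the DAOS tree; the helper names `hostOnly` and `hostlistCoversReplicas` are hypothetical, and comparing host names case-insensitively while ignoring the port is an assumption made for the example.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// hostOnly strips an optional ":port" suffix and lowercases the host.
// Hypothetical helper: entries without a port are returned as-is.
func hostOnly(addr string) string {
	if h, _, err := net.SplitHostPort(addr); err == nil {
		return strings.ToLower(h)
	}
	return strings.ToLower(addr)
}

// hostlistCoversReplicas reports whether at least one hostlist entry
// names a host that also appears in the MS replica list.
func hostlistCoversReplicas(hostlist, replicas []string) bool {
	seen := make(map[string]struct{}, len(replicas))
	for _, r := range replicas {
		seen[hostOnly(r)] = struct{}{}
	}
	for _, h := range hostlist {
		if _, ok := seen[hostOnly(h)]; ok {
			return true
		}
	}
	return false
}

func main() {
	hostlist := []string{"server-1:10001", "server-2:10001"}
	replicas := []string{"server-1"}
	fmt.Println(hostlistCoversReplicas(hostlist, replicas))
}
```

A check like this would only catch naming mismatches; it says nothing about whether the matching server is actually reachable.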
@@ -1028,13 +1028,15 @@ formatted again by running `dmg storage format`.
To add a new server to an existing DAOS system, one should install:
- the relevant certificates
- the server yaml file pointing to the access points of the running
DAOS system
- A copy of the relevant certificates from an existing server. All servers must
share the same set of certificates in order to provide services.
- A copy of the server yaml file from an existing server (DAOS server configurations
should be homogeneous) -- the `mgmt_svc_replicas` entry is used by the new server in
order to know which servers should handle its SystemJoin request.
The daos\_control.yml file should also be updated to include the new DAOS server.
Then starts the daos\_server via systemd and format the new server via
Then start the daos\_server via systemd and format the new server via
dmg as follows:
```
4 changes: 2 additions & 2 deletions docs/admin/common_tasks.md
@@ -9,7 +9,7 @@ This section describes some of the common tasks handled by admins at a high leve
3. Install `daos-server` and `daos-client` RPMs.
4. Generate certificate files.
5. Copy one of the example configs from `utils/config/examples` to
`/etc/daos` and adjust it based on the environment. E.g., `access_points`,
`/etc/daos` and adjust it based on the environment. E.g., `mgmt_svc_replicas`,
`class`.
6. Check that the directory where the log files will be created exists. E.g.,
`control_log_file`, `log_file` field in `engines` section.
@@ -38,7 +38,7 @@ to server hosts and `daos-client` to client hosts.
4. Generate certificate files and distribute them to all the hosts.
5. Copy one of the example configs from `utils/config/examples` to
`/etc/daos` of one of the server hosts and adjust it based on the environment.
E.g., `access_points`, `class`.
E.g., `mgmt_svc_replicas`, `class`.
6. Check that the directory where the log files will be created exists. E.g.,
`control_log_file`, `log_file` field in `engines` section.
7. Start `daos_server`.
24 changes: 11 additions & 13 deletions docs/admin/deployment.md
@@ -50,7 +50,7 @@ A recommended workflow to get up and running is as follows:
server config file (default location at `/etc/daos/daos_server.yml`) has not
yet been populated.

* Run `dmg config generate -l <hostset> -a <access_points>` across the entire
* Run `dmg config generate -l <hostset> -r <ms_replicas>` across the entire
hostset (all the storage servers that are now running the `daos_server` service
after RPM install).
The command will only generate a config if hardware setups on all the hosts are
@@ -285,7 +285,7 @@ Help Options:

[generate command options]
-l, --helper-log-file= Log file location for debug from daos_server_helper binary
-a, --access-points= Comma separated list of access point addresses
-r, --ms-replicas= Comma separated list of MS replica addresses
<ipv4addr/hostname> (default: localhost)
-e, --num-engines= Set the number of DAOS Engine sections to be populated in the
config file output. If unset then the value will be set to the
@@ -331,7 +331,7 @@ Help Options:

[generate command options]
-l, --host-list= A comma separated list of addresses <ipv4addr/hostname> to connect to
-a, --access-points= Comma separated list of access point addresses <ipv4addr/hostname>
-r, --ms-replicas= Comma separated list of MS replica addresses <ipv4addr/hostname>
to host management service (default: localhost)
-e, --num-engines= Set the number of DAOS Engine sections to be populated in the
config file output. If unset then the value will be set to the
@@ -371,8 +371,8 @@ engines:

The options that can be supplied to the config generate command are as follows:

- `--access-points` specifies the access points (identified storage servers that will host the
management service for the DAOS system across the cluster).
- `--ms-replicas` specifies the MS replicas (identified storage servers that will host the
Management Service for the DAOS system across the cluster).

- `--num-engines` specifies the number of engine sections to populate in the config file output.
If not set explicitly on the commandline, default is the number of NUMA nodes detected on the host.
@@ -502,7 +502,7 @@ core_dump_filter: 19
name: daos_server
socket_dir: /var/run/daos_server
provider: ofi+tcp
access_points:
mgmt_svc_replicas:
- localhost:10001
fault_cb: ""
hyperthreads: false
@@ -515,7 +515,7 @@ and runs until the point where a storage format is required, as expected.
[user@wolf-226 daos]$ install/bin/daos_server start -i -o ~/configs/tmp.yml
DAOS Server config loaded from /home/user/configs/tmp.yml
install/bin/daos_server logging to file /tmp/daos_server.log
NOTICE: Configuration includes only one access point. This provides no redundancy in the event of an access point failure.
NOTICE: Configuration includes only one MS replica. This provides no redundancy in the event of a MS replica failure.
DAOS Control Server v2.3.101 (pid 1211553) listening on 127.0.0.1:10001
Checking DAOS I/O Engine instance 0 storage ...
Checking DAOS I/O Engine instance 1 storage ...
@@ -821,8 +821,6 @@ To set the addresses of which DAOS Servers to task, provide either:

Where `<hostlist>` represents a slurm-style hostlist string e.g.
`foo-1[28-63],bar[256-511]`.
The first entry in the hostlist (after alphabetic then numeric sorting) will be
assumed to be the access point as set in the server configuration file.

Local configuration files stored in the user directory will be used in
preference to the default location e.g. `~/.daos_control.yml`.
@@ -1322,7 +1320,7 @@ as follows to establish 2-tier storage:
```yaml
<snip>
port: 10001
access_points: ["wolf-71"] # <----- updated
mgmt_svc_replicas: ["wolf-71"] # <----- updated
<snip>
engines:
-
@@ -1367,10 +1365,10 @@ information, please refer to the [DAOS build documentation][6].
DAOS Control Servers will need to be restarted on all hosts after updates to the server
configuration file.

Pick an odd number of hosts in the system and set `access_points` to list of that host's
hostname or IP address (don't need to specify port).
Pick an odd number (3-7) of hosts in the system and set the `mgmt_svc_replicas` list to
include the hostnames or IP addresses (don't need to specify port) of those hosts.

This will be the host which bootstraps the DAOS management service (MS).
This will be the set of servers which host the replicated DAOS management service (MS).

>The support of the optional providers is not guarantee and can be removed
>without further notification.
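Reviewer note: the replica-count guidance in this hunk (an odd number of hosts, 3 to 7) together with the single-replica NOTICE quoted earlier in this file can be summarised in a small advisory function. This is a sketch under stated assumptions, not the validation that `daos_server` performs; the function name `replicaCountAdvice` and all message strings are hypothetical.

```go
package main

import "fmt"

// replicaCountAdvice returns an advisory message for a proposed number
// of MS replicas, or "" when the count follows the documented guidance
// (odd, between 3 and 7). Hypothetical helper for illustration only.
func replicaCountAdvice(n int) string {
	switch {
	case n <= 0:
		return "no MS replicas configured"
	case n == 1:
		return "only one MS replica: no redundancy"
	case n%2 == 0:
		return "even replica count: pick an odd number"
	case n > 7:
		return "more than 7 replicas: guidance is 3-7"
	default:
		return ""
	}
}

func main() {
	for _, n := range []int{1, 3, 4, 9} {
		fmt.Printf("%d -> %q\n", n, replicaCountAdvice(n))
	}
}
```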
6 changes: 3 additions & 3 deletions docs/admin/troubleshooting.md
@@ -313,7 +313,7 @@ sudo ipcrm -M 0x10242049
1. Format the SCMs defined in the config file.
1. Generate the config file using `dmg config generate`. The various requirements will be populated without a syntax error.
1. Try starting with `allow_insecure: true`. This will rule out the credential certificate issue.
1. Verify that the `access_points` host is accessible and the port is not used.
1. Verify that the `mgmt_svc_replicas` host is accessible and the port is not used.
1. Check the `provider` entry. See the "Network Scan and Configuration" section of the admin guide for determining the right provider to use.
1. Check `fabric_iface` in `engines`. They should be available and enabled.
1. Check that `socket_dir` is writable by the daos_server.
@@ -327,7 +327,7 @@ sudo ipcrm -M 0x10242049
1. When the server configuration is changed, it's necessary to restart the agent.
1. `DER_UNREACH(-1006)`: Check the socket ID consistency between PMem and NVMe. First, determine which socket you're using with `daos_server network scan -p all`. e.g., if the interface you're using in the engine section is eth0, find which NUMA Socket it belongs to. Next, determine the disks you can use with this socket by calling `daos_server nvme scan` or `dmg storage scan`. e.g., if eth0 belongs to NUMA Socket 0, use only the disks with 0 in the Socket ID column.
1. Check the interface used in the server config (`fabric_iface`) also exists in the client and can communicate with the server.
1. Check the access_points of the agent config points to the correct server host.
1. Check the `access_points` of the agent config points to the correct server hosts.
1. Call `daos pool query` and check that the pool exists and has free space.

### Applications run slow
@@ -512,7 +512,7 @@ fabric providers.

After starting `daos_server`, ranks will be unable to join if their configuration's fabric provider
does not match that of the system. The system configuration is determined by the management service
(MS) leader node, which may be arbitrarily chosen from the configured access points.
(MS) leader node, which may be arbitrarily chosen from the configured MS replicas.

The error message will include the string: `fabric provider <provider1> does not match system provider <provider2>`

17 changes: 9 additions & 8 deletions src/cart/crt_corpc.c
@@ -1,5 +1,5 @@
/*
* (C) Copyright 2016-2023 Intel Corporation.
* (C) Copyright 2016-2024 Intel Corporation.
*
* SPDX-License-Identifier: BSD-2-Clause-Patent
*/
@@ -777,6 +777,7 @@ crt_corpc_req_hdlr(struct crt_rpc_priv *rpc_priv)
struct crt_opc_info *opc_info;
struct crt_corpc_ops *co_ops;
bool ver_match;
bool co_failout = false;
int i, rc = 0;

co_info = rpc_priv->crp_corpc_info;
@@ -906,18 +907,18 @@ crt_corpc_req_hdlr(struct crt_rpc_priv *rpc_priv)
}

forward_done:
if (rc != 0 && rpc_priv->crp_flags & CRT_RPC_FLAG_CO_FAILOUT) {
crt_corpc_complete(rpc_priv);
goto out;
}
if (rc != 0 && rpc_priv->crp_flags & CRT_RPC_FLAG_CO_FAILOUT)
co_failout = true;

/* NOOP bcast (no child and root excluded) */
if (co_info->co_child_num == 0 && co_info->co_root_excluded)
if (co_info->co_child_num == 0 && (co_info->co_root_excluded || co_failout))
crt_corpc_complete(rpc_priv);

if (co_info->co_root_excluded == 1) {
if (co_info->co_root_excluded == 1 || co_failout) {
if (co_info->co_grp_priv->gp_self == co_info->co_root) {
/* don't return error for root */
/* don't return error for root to avoid RPC_DECREF in
* fail case in crt_req_send.
*/
rc = 0;
}
D_GOTO(out, rc);
2 changes: 1 addition & 1 deletion src/cart/crt_rpc.c
@@ -1532,7 +1532,7 @@ crt_req_send(crt_rpc_t *req, crt_cb_t complete_cb, void *arg)
/* failure already reported through complete cb */
if (complete_cb != NULL)
rc = 0;
} else if (!crt_rpc_completed(rpc_priv)) {
} else {
RPC_DECREF(rpc_priv);
}
}
1 change: 1 addition & 0 deletions src/control/SConscript
@@ -140,6 +140,7 @@ def scons():

# Sets CGO_LDFLAGS for rpath options
denv.d_add_rpaths("..", True, True)
denv.require('protobufc')
denv.AppendENVPath("CGO_CFLAGS", denv.subst("$_CPPINCFLAGS"), sep=" ")
if prereqs.client_requested():
install_go_bin(denv, "daos_agent")
10 changes: 7 additions & 3 deletions src/control/cmd/daos/pretty/health.go
@@ -132,12 +132,16 @@ func printPoolHealth(out io.Writer, pi *daos.PoolInfo, verbose bool) {
fmt.Fprintf(out, "%s: %s\n", pi.Name(), strings.Join(healthStrings, ","))
}

func printContainerHealth(out io.Writer, ci *daos.ContainerInfo, verbose bool) {
func printContainerHealth(out io.Writer, pi *daos.PoolInfo, ci *daos.ContainerInfo, verbose bool) {
if ci == nil {
return
}

fmt.Fprintf(out, "%s: %s\n", ci.Name(), txtfmt.Title(ci.Health))
healthStr := txtfmt.Title(ci.Health)
if pi != nil && pi.DisabledTargets > 0 {
healthStr += " (Pool Degraded)"
}
fmt.Fprintf(out, "%s: %s\n", ci.Name(), healthStr)
}

// PrintSystemHealthInfo pretty-prints the supplied system health struct.
@@ -180,7 +184,7 @@ func PrintSystemHealthInfo(out io.Writer, shi *daos.SystemHealthInfo, verbose bo
iiiw := txtfmt.NewIndentWriter(iiw)
if len(shi.Containers[pool.UUID]) > 0 {
for _, cont := range shi.Containers[pool.UUID] {
printContainerHealth(iiiw, cont, verbose)
printContainerHealth(iiiw, pool, cont, verbose)
}
} else {
fmt.Fprintln(iiiw, "No containers in pool.")
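Reviewer note: the `printContainerHealth` change in this file can be read in isolation with the standalone sketch below. It mirrors the new logic (append " (Pool Degraded)" when the pool reports disabled targets) but swaps in a stand-in `poolInfo` struct for `daos.PoolInfo` and skips the `txtfmt.Title` call, so the type, the `containerHealthLine` name, and the `uint32` field type are assumptions made for illustration.

```go
package main

import "fmt"

// poolInfo is a stand-in for daos.PoolInfo carrying only the field the
// health line depends on (assumed type for this sketch).
type poolInfo struct {
	DisabledTargets uint32
}

// containerHealthLine builds the "<name>: <health>" line and marks the
// container when its pool has disabled targets. A nil pool leaves the
// health string unchanged, matching the nil check in the diff.
func containerHealthLine(name, health string, pi *poolInfo) string {
	healthStr := health
	if pi != nil && pi.DisabledTargets > 0 {
		healthStr += " (Pool Degraded)"
	}
	return fmt.Sprintf("%s: %s", name, healthStr)
}

func main() {
	fmt.Println(containerHealthLine("cont1", "Healthy", &poolInfo{DisabledTargets: 2}))
	fmt.Println(containerHealthLine("cont2", "Healthy", &poolInfo{}))
	fmt.Println(containerHealthLine("cont3", "Healthy", nil))
}
```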