Merge branch 'amd/avocado-version' into amd/avocado-version-92
Required-githooks: true
ashleypittman committed Feb 2, 2024
2 parents d478861 + c64dc99 commit 2c14034
Showing 302 changed files with 8,026 additions and 3,912 deletions.
1 change: 1 addition & 0 deletions .clang-format
@@ -14,6 +14,7 @@ IndentCaseLabels: false
ForEachMacros: ['d_list_for_each_entry',
'd_list_for_each_safe',
'd_list_for_each_entry_safe',
'd_list_for_each_entry_reverse',
'evt_ent_array_for_each']
PointerAlignment: Right
AlignTrailingComments: true
12 changes: 12 additions & 0 deletions debian/changelog
@@ -1,3 +1,15 @@
daos (2.5.100-15) unstable; urgency=medium
[ Ashley M. Pittman ]
* Updated pydaos install process

-- Ashley M. Pittman <[email protected]> Fri, 02 Feb 2024 09:15:00 -0800

daos (2.5.100-14) unstable; urgency=medium
[ Brian J. Murrell ]
* NOOP change to keep in parity with RPM version

-- Brian J. Murrell <[email protected]> Tue, 09 Jan 2024 13:59:01 -0500

daos (2.5.100-13) unstable; urgency=medium
[ Brian J. Murrell ]
* Update for EL 8.8 and Leap 15.5
55 changes: 54 additions & 1 deletion docs/admin/administration.md
@@ -478,6 +478,59 @@ boro-11
```
#### Exclusion and Hotplug
- Automatic exclusion of an NVMe SSD:
Automatic exclusion based on faulty criteria is the default behavior in DAOS
release 2.6. The default criteria parameters are `max_io_errs: 10` and
`max_csum_errs: <uint32_max>` (that is, eviction due to checksum errors is
effectively disabled by default).
The auto-faulty criteria can be set by adding the following YAML to the
engine section of the server config file.
```yaml
engines:
- bdev_auto_faulty:
enable: true
max_io_errs: 1
max_csum_errs: 2
```
On formatting the storage for the engine, these settings result in the
following `daos_server` log entries to indicate the parameters are written to
the engine's NVMe config:
```bash
DEBUG 13:59:29.229795 provider.go:592: BdevWriteConfigRequest: &{ForwardableRequest:{Forwarded:false} ConfigOutputPath:/mnt/daos0/daos_nvme.conf OwnerUID:10695475 OwnerGID:10695475 TierProps:[{Class:nvme DeviceList:0000:5e:00.0 DeviceFileSize:0 Tier:1 DeviceRoles:{OptionBits:0}}] HotplugEnabled:false HotplugBusidBegin:0 HotplugBusidEnd:0 Hostname:wolf-310.wolf.hpdd.intel.com AccelProps:{Engine: Options:0} SpdkRpcSrvProps:{Enable:false SockAddr:} AutoFaultyProps:{Enable:true MaxIoErrs:1 MaxCsumErrs:2} VMDEnabled:false ScannedBdevs:}
Writing NVMe config file for engine instance 0 to "/mnt/daos0/daos_nvme.conf"
```
The engine's NVMe config (produced during format) then contains the following
JSON to apply the criteria:
```json
[tanabarr@wolf-310 ~]$ cat /mnt/daos0/daos_nvme.conf
{
"daos_data": {
"config": [
{
"params": {
"enable": true,
"max_io_errs": 1,
"max_csum_errs": 2
},
"method": "auto_faulty"
...
```
These engine logfile entries indicate that the settings have been read and
applied:
```bash
01/12-13:59:41.36 wolf-310 DAOS[1299350/-1/0] bio INFO src/bio/bio_config.c:1016 bio_read_auto_faulty_criteria() NVMe auto faulty is enabled. Criteria: max_io_errs:1, max_csum_errs:2
```
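As an illustration, the generated config can also be checked programmatically. The following is a minimal Python sketch, assuming the file keeps the layout shown in the excerpt above (a `daos_data.config` list whose entries carry `method` and `params`). `read_auto_faulty` is a hypothetical helper, not part of DAOS, and the sample text only mirrors the excerpt rather than a complete config file.

```python
import json


def read_auto_faulty(conf_text):
    """Return the params of the "auto_faulty" entry, or None if it is absent.

    Hypothetical helper: the layout (daos_data -> config -> list of
    {"method": ..., "params": {...}}) is assumed from the excerpt above.
    """
    conf = json.loads(conf_text)
    for entry in conf.get("daos_data", {}).get("config", []):
        if entry.get("method") == "auto_faulty":
            return entry.get("params")
    return None


# Sample text mirroring the excerpt; a real file contains further entries.
sample = """
{
  "daos_data": {
    "config": [
      {
        "params": {"enable": true, "max_io_errs": 1, "max_csum_errs": 2},
        "method": "auto_faulty"
      }
    ]
  }
}
"""

params = read_auto_faulty(sample)
print(params)
```

On a formatted engine, the same helper could be applied to the contents of the file named by `ConfigOutputPath` in the `daos_server` log entry.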
- Manually exclude an NVMe SSD:
```bash
$ dmg storage set nvme-faulty --help
@@ -491,7 +544,7 @@ Usage:
-f, --force Do not require confirmation
```
To manually evict an NVMe SSD (auto eviction will be supported in a future release),
To manually evict an NVMe SSD (auto eviction is covered later in this section),
the device state needs to be set faulty by running the following command:
```bash
$ dmg -l boro-11 storage set nvme-faulty --uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
2 changes: 1 addition & 1 deletion docs/admin/deployment.md
@@ -123,7 +123,7 @@ Application Options:
--allow-proxy Allow proxy configuration via environment
-o, --config= Server config file path
-b, --debug Enable debug output
-j, --json enable JSON output
-j, --json Enable JSON output
-J, --json-logging Enable JSON-formatted log output
--syslog Enable logging to syslog

8 changes: 4 additions & 4 deletions src/bio/README.md
@@ -81,24 +81,24 @@ While monitoring this health data, an admin can now make the determination to ma

<a id="7"></a>
## Faulty Device Detection (SSD Eviction)
Faulty device detection and reaction can be referred to as NVMe SSD eviction. This involves all affected pool targets being marked as down and the rebuild of all affected pool targets being automatically triggered. A persistent device state is maintained in SMD and the device state is updated from NORMAL to FAULTY upon SSD eviction. The faulty device reaction will involve various SPDK cleanup, including all I/O channels released, SPDK allocations (termed 'blobs') closed, and the SPDK blobstore created on the NVMe SSD unloaded. Currently only manual SSD eviction is supported, and a future release will support automatic SSD eviction.
Faulty device detection and reaction can be referred to as NVMe SSD eviction. This involves all affected pool targets being marked as down and the rebuild of all affected pool targets being automatically triggered. A persistent device state is maintained in SMD and the device state is updated from NORMAL to FAULTY upon SSD eviction. The faulty device reaction involves various SPDK cleanup, including all I/O channels released, SPDK allocations (termed 'blobs') closed, and the SPDK blobstore created on the NVMe SSD unloaded. Automatic SSD eviction is enabled by default and can be disabled using the `bdev_auto_faulty` server config file engine parameter.
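
The threshold check behind automatic eviction can be sketched as follows. This is a minimal Python illustration, not the DAOS implementation: `AutoFaultyCriteria` and `should_evict` are hypothetical names, the defaults are those documented in the admin guide (`max_io_errs: 10`, `max_csum_errs: <uint32_max>`), and the `>=` comparison is an assumption about how the error counters are compared against the limits.

```python
from dataclasses import dataclass

UINT32_MAX = 2**32 - 1


@dataclass(frozen=True)
class AutoFaultyCriteria:
    """Hypothetical model of the bdev_auto_faulty engine settings."""
    enable: bool = True
    max_io_errs: int = 10
    max_csum_errs: int = UINT32_MAX


def should_evict(criteria, io_errs, csum_errs):
    """Return True if the error counters meet the auto-faulty criteria.

    ASSUMPTION: a device qualifies once either counter reaches its limit
    (>=); the real check in the DAOS bio module may differ.
    """
    if not criteria.enable:
        return False
    return (io_errs >= criteria.max_io_errs
            or csum_errs >= criteria.max_csum_errs)


defaults = AutoFaultyCriteria()
strict = AutoFaultyCriteria(enable=True, max_io_errs=1, max_csum_errs=2)

print(should_evict(defaults, io_errs=10, csum_errs=0))
print(should_evict(defaults, io_errs=0, csum_errs=1_000_000))
print(should_evict(strict, io_errs=0, csum_errs=2))
```

With the default limits, checksum errors alone cannot realistically trigger eviction, which matches the admin guide's note that checksum-based eviction is essentially disabled by default.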

Useful admin commands to manually evict an NVMe SSD:
- <a href="#82">dmg storage set nvme-faulty</a> [used to manually set an NVMe SSD to FAULTY (i.e., evict the device)]

<a id="8"></a>
## NVMe SSD Hot Plug

**Full NVMe hot plug capability will be available and supported in DAOS 2.0 release. Use is currently intended for testing only and is not supported for production.**
NVMe hot plug with Intel VMD devices is supported in this release.

**Full hot plug capability when using non-Intel-VMD devices is to be supported in the DAOS 2.8 release. Use is currently intended for testing only and is not supported for production.**

The NVMe hot plug feature includes device removal (an NVMe hot remove event) and device reintegration (an NVMe hotplug event) when a faulty device is replaced with a new device.

For device removal, if the device is a faulty or previously evicted device, then nothing further would be done when the device is removed. The device state would be displayed as UNPLUGGED. If a healthy device that is currently in use by DAOS is removed, then all SPDK memory stubs would be deconstructed, and the device state would also display as UNPLUGGED.

For device reintegration, if a new device is plugged in to replace a faulty device, the admin needs to issue a device replacement command. All SPDK in-memory stubs are then created and all affected pool targets are automatically reintegrated on the new device. The device state is displayed as NEW initially and as NORMAL after the replacement event has occurred. If a faulty or previously evicted device is re-plugged, the device remains evicted and its state is displayed as EVICTED. To reuse a faulty device (NOTE: this is not advised and is mainly used for testing purposes), the admin can run the same device replacement command with the new and old device IDs set to the same device ID. Reintegration will not occur on the device, as DAOS does not currently support incremental reintegration.

NVMe hot plug with Intel VMD devices is currently not supported in this release, but will be supported in a future release.

Useful admin commands to replace an evicted device:
- <a href="#83">dmg storage replace nvme</a> [used to replace an evicted device with a new device]
- <a href="#84">dmg storage replace nvme</a> [used to bring an evicted device back online (without reintegration)]