Merge upstream/release/2.6 into upstream/google/2.6 #15317

Merged 28 commits on Oct 14, 2024

Commits
1044efa
DAOS-16634 mercury: Update source build in release 2.6 with ucx patch…
jgmoore-or Oct 1, 2024
4cf4172
DAOS-16648 build: Tag 2.6.1 rc3 (#15226)
phender Oct 1, 2024
80a2e09
DAOS-9355 doc: DAOS 2.6.1 release notes (#15235)
gnailzenh Oct 4, 2024
70a7ea5
DAOS-16445 client: Add function to cycle OIDs non-sequentially (#1499…
jolivier23 Oct 4, 2024
18b9138
DAOS-16566 test: Update server/multiengine_persocket.py (#15127) (#15…
phender Oct 4, 2024
16cc101
DAOS-16480 test: Increase expected range for dirty_pages metric (#150…
phender Oct 4, 2024
9c8746e
DAOS-16577 test: remove hw tag from deployment/disk_failure.py (#1513…
daltonbohning Oct 4, 2024
b116591
DAOS-15776 test: remove DataMoverTestBase.create_pool (#15079) (#15254)
daltonbohning Oct 7, 2024
c05c486
DAOS-16298 test: improve get_clush_command timeout (#15113) (#15252)
daltonbohning Oct 7, 2024
b3868b6
DAOS-16550 test: use correct stonewall file with mdtest (#15109) (#15…
daltonbohning Oct 7, 2024
507e9e7
DAOS-16567 test: remove unused IorCommand.log_metrics (#15128) (#15246)
daltonbohning Oct 7, 2024
26f39c0
DAOS-623 test: Support running independent io sys admin steps (#15134…
daltonbohning Oct 7, 2024
1cb8101
DAOS-16540 test: include extra yaml for soak md on ssd (#15104) (#15124)
mjean308 Oct 8, 2024
3243349
DAOS-16628 client: reset eq counter to zero after fork() in IL (#1518…
wiliamhuang Oct 8, 2024
6673515
DAOS-16482 control: Increase min hugepages for single tgt count (#151…
tanabarr Oct 8, 2024
f65edd6
DAOS-15778 test: remove DataMoverTestBase.posix_local_test_paths (#14…
daltonbohning Oct 9, 2024
148267e
DAOS-16487 control: Require hostname for nvme set-faulty & replace (#…
tanabarr Oct 9, 2024
eb3b342
DAOS-16027 test: Adding daos_test REBUILD31-34 subtests (#14584) (#15…
phender Oct 9, 2024
3ebf80f
DAOS-16487 test: fix dmg c helper for set-faulty changes (#15151) (#1…
tanabarr Oct 9, 2024
a61ab80
DAOS-16447 test: set D_IL_REPORT per test (#15012) (#15251)
daltonbohning Oct 9, 2024
29cd12f
DAOS-16548 test: add ftest lint check for invalid test_ tag (#15106) …
daltonbohning Oct 10, 2024
6feb969
DAOS-16076 test: Automate dmg scale test to be run on Aurora (#14616)…
shimizukko Oct 10, 2024
2a021be
DAOS-16590 test: misc ftest/performance updates (#15144) (#15266)
daltonbohning Oct 10, 2024
a1c6beb
DAOS-16589 test: Support Functional Hardware Medium VMD stage (#15166…
phender Oct 11, 2024
daa77f4
DAOS-16667 client: bump hadoop-common from 3.3.6 to 3.4.0 (#15194) (#…
tanabarr Oct 14, 2024
d85eb82
DAOS-16509 test: replace IorTestBase.execute_cmd with run_remote (#15…
daltonbohning Oct 14, 2024
19136b1
Merge remote-tracking branch 'origin/release/2.6' into ncmurphy/googl…
techbasset Oct 14, 2024
c62b057
Merge upstream/release/2.6 into upstream/google/2.6
techbasset Oct 14, 2024
19 changes: 19 additions & 0 deletions Jenkinsfile
@@ -277,6 +277,9 @@ pipeline {
booleanParam(name: 'CI_medium_md_on_ssd_TEST',
defaultValue: false,
description: 'Run the Functional Hardware Medium MD on SSD test stage')
booleanParam(name: 'CI_medium_vmd_TEST',
defaultValue: true,
description: 'Run the Functional Hardware Medium VMD test stage')
booleanParam(name: 'CI_medium_verbs_provider_TEST',
defaultValue: false,
description: 'Run the Functional Hardware Medium Verbs Provider test stage')
@@ -310,6 +313,9 @@ pipeline {
string(name: 'FUNCTIONAL_HARDWARE_MEDIUM_VERBS_PROVIDER_LABEL',
defaultValue: 'ci_nvme5',
description: 'Label to use for 5 node Functional Hardware Medium Verbs Provider (MD on SSD) stages')
string(name: 'FUNCTIONAL_HARDWARE_MEDIUM_VMD_LABEL',
defaultValue: 'ci_vmd5',
description: 'Label to use for the Functional Hardware Medium VMD stage')
string(name: 'FUNCTIONAL_HARDWARE_MEDIUM_UCX_PROVIDER_LABEL',
defaultValue: 'ci_ofed5',
description: 'Label to use for 5 node Functional Hardware Medium UCX Provider stage')
@@ -1183,6 +1189,19 @@ pipeline {
run_if_landing: false,
job_status: job_status_internal
),
'Functional Hardware Medium VMD': getFunctionalTestStage(
name: 'Functional Hardware Medium VMD',
pragma_suffix: '-hw-medium-vmd',
label: params.FUNCTIONAL_HARDWARE_MEDIUM_VMD_LABEL,
next_version: next_version,
stage_tags: 'hw_vmd,medium',
/* groovylint-disable-next-line UnnecessaryGetter */
default_tags: startedByTimer() ? 'pr daily_regression' : 'pr',
nvme: 'auto',
run_if_pr: false,
run_if_landing: false,
job_status: job_status_internal
),
'Functional Hardware Medium Verbs Provider': getFunctionalTestStage(
name: 'Functional Hardware Medium Verbs Provider',
pragma_suffix: '-hw-medium-verbs-provider',
2 changes: 1 addition & 1 deletion TAG
@@ -1 +1 @@
2.6.1-rc2
2.6.1-rc3
6 changes: 6 additions & 0 deletions debian/changelog
@@ -1,3 +1,9 @@
daos (2.6.1-3) unstable; urgency=medium
[ Phillip Henderson ]
* Third release candidate for 2.6.1

-- Phillip Henderson <[email protected]> Tue, 01 Oct 2024 14:23:00 -0500

daos (2.6.1-2) unstable; urgency=medium
[ Phillip Henderson ]
* Second release candidate for 2.6.1
30 changes: 8 additions & 22 deletions docs/admin/administration.md
@@ -620,21 +620,17 @@ Usage:
[nvme-faulty command options]
-u, --uuid= Device UUID to set
-f, --force Do not require confirmation
-l, --host= Single host address <ipv4addr/hostname> to connect to
```

To manually evict an NVMe SSD (auto eviction is covered later in this section),
the device state needs to be set faulty by running the following command:
```bash
$ dmg -l boro-11 storage set nvme-faulty --uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
$ dmg storage set nvme-faulty --host=boro-11 --uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
NOTICE: This command will permanently mark the device as unusable!
Are you sure you want to continue? (yes/no)
yes
-------
boro-11
-------
Devices
UUID:5bd91603-d3c7-4fb7-9a71-76bc25690c19 [TrAddr:]
Targets:[] Rank:0 State:EVICTED LED:ON
set-faulty operation performed successfully on the following host: boro-11:10001
```
The device state will transition from "NORMAL" to "EVICTED", during which time the
faulty device reaction will have been triggered (all targets on the SSD will be rebuilt).
@@ -693,19 +689,14 @@ Usage:
[nvme command options]
--old-uuid= Device UUID of hot-removed SSD
--new-uuid= Device UUID of new device
--no-reint Bypass reintegration of device and just bring back online.
-l, --host= Single host address <ipv4addr/hostname> to connect to
```

To replace an evicted NVMe SSD with a new device and reintegrate it into use with
DAOS, run the following command:
```bash
$ dmg -l boro-11 storage replace nvme --old-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 --new-uuid=80c9f1be-84b9-4318-a1be-c416c96ca48b
-------
boro-11
-------
Devices
UUID:80c9f1be-84b9-4318-a1be-c416c96ca48b [TrAddr:]
Targets:[] Rank:1 State:NORMAL LED:OFF
$ dmg storage replace nvme --host=boro-11 --old-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 --new-uuid=80c9f1be-84b9-4318-a1be-c416c96ca48b
dev-replace operation performed successfully on the following host: boro-11:10001
```
The old, now replaced device will remain in an "EVICTED" state until it is unplugged.
The new device will transition from a "NEW" state to a "NORMAL" state.
@@ -716,14 +707,9 @@ In order to reuse a device that was previously set as FAULTY and evicted from th
system, an admin can run the following command (setting the old device UUID to be the
new device UUID):
```bash
$ dmg -l boro-11 storage replace nvme --old-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 --new-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
$ dmg storage replace nvme --host=boro-11 --old-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19 --new-uuid=5bd91603-d3c7-4fb7-9a71-76bc25690c19
NOTICE: Attempting to reuse a previously set FAULTY device!
-------
boro-11
-------
Devices
UUID:5bd91603-d3c7-4fb7-9a71-76bc25690c19 [TrAddr:]
Targets:[] Rank:1 State:NORMAL LED:OFF
dev-replace operation performed successfully on the following host: boro-11:10001
```
The FAULTY device will transition from an "EVICTED" state back to a "NORMAL" state,
and will again be available for use with DAOS. The use case of this command will mainly
51 changes: 51 additions & 0 deletions docs/release/release_notes.md
@@ -2,6 +2,57 @@

We are pleased to announce the release of DAOS version 2.6.

## DAOS Version 2.6.1 (2024-10-05)

The DAOS 2.6.1 release contains the following updates on top of DAOS 2.6.0:

* Mercury update for the Slingshot 11.0 host stack and other UCX provider fixes.

### Bug fixes and improvements

The DAOS 2.6.1 release includes fixes for several defects and a few changes
to the administrator interface that improve the usability of the DAOS system.

* Fix a race between an MS replica stepping up as leader and engines joining
the system; this race could cause an engine join to fail.

* Fix a race in concurrent container destroy that could cause an engine crash.

* Pool destroy returns an explicit error instead of success if there is an
in-progress destroy against the same pool.

* Fix an inconsistency between data shards and parity shards that EC
aggregation could cause.

* Enable pool list for clients.

* Running "daos|dmg pool query-targets" with a rank argument queries all
targets on that rank.

* Add a daos health check command that allows basic system health checks from the client.

* DAOS Version 2.6.0 always excludes unreachable engines reported by SWIM and schedules rebuild for
the excluded engines. This is an overreaction when a large number of engines are impacted by a
power failure or switch reboot, because data recovery is impossible in these cases. DAOS 2.6.1
introduces a new environment variable, DAOS_POOL_RF, which is set in the server yaml file for each
engine. It indicates the number of engine failures after which DAOS stops changing pool membership
and stops completing in-progress rebuilds; beyond that point, all I/O and any ongoing rebuild
simply block. Once the impacted engines are brought back, the DAOS system can finish the
in-progress rebuild and become available again. The recommendation is to set this environment
variable to 2.

* In DAOS Version 2.6.0, accessing a faulty NVMe device returns the wrong error code
to the DAOS client, which can cause the application to fail. DAOS 2.6.1 returns the
correct error code so the client can retry and eventually access the data
in degraded mode instead of failing the I/O.

* Pil4dfs fix to avoid a deadlock with the Level Zero library on Aurora, and support
for more libc functions that were not previously intercepted.

For details, please refer to the GitHub
[release/2.6 commit history](https://github.com/daos-stack/daos/commits/release/2.6)
and the associated [Jira tickets](https://jira.daos.io/) as stated in the commit messages.


## DAOS Version 2.6.0 (2024-07-26)

### General Support
6 changes: 3 additions & 3 deletions src/bio/README.md
@@ -209,7 +209,7 @@ Devices:
<a id="82"></a>
- Manually Set Device State to FAULTY: **$dmg storage set nvme-faulty**
```
$ dmg storage set nvme-faulty --uuid=9fb3ce57-1841-43e6-8b70-2a5e7fb2a1d0
$ dmg storage set nvme-faulty --host=localhost --uuid=9fb3ce57-1841-43e6-8b70-2a5e7fb2a1d0
Devices
UUID:9fb3ce57-1841-43e6-8b70-2a5e7fb2a1d0 [TrAddr:0000:8d:00.0]
Targets:[0] Rank:0 State:EVICTED
@@ -219,7 +219,7 @@ Devices
<a id="83"></a>
- Replace an evicted device with a new device: **$dmg storage replace nvme**
```
$ dmg storage replace nvme --old-uuid=9fb3ce57-1841-43e6-8b70-2a5e7fb2a1d0 --new-uuid=8131fc39-4b1c-4662-bea1-734e728c434e
$ dmg storage replace nvme --host=localhost --old-uuid=9fb3ce57-1841-43e6-8b70-2a5e7fb2a1d0 --new-uuid=8131fc39-4b1c-4662-bea1-734e728c434e
Devices
UUID:8131fc39-4b1c-4662-bea1-734e728c434e [TrAddr:0000:8d:00.0]
Targets:[0] Rank:0 State:NORMAL
@@ -229,7 +229,7 @@ Devices
<a id="84"></a>
- Reuse a previously evicted device: **$dmg storage replace nvme**
```
$ dmg storage replace nvme --old-uuid=9fb3ce57-1841-43e6-8b70-2a5e7fb2a1d0 --new-uuid=9fb3ce57-1841-43e6-8b70-2a5e7fb2a1d0
$ dmg storage replace nvme --host=localhost --old-uuid=9fb3ce57-1841-43e6-8b70-2a5e7fb2a1d0 --new-uuid=9fb3ce57-1841-43e6-8b70-2a5e7fb2a1d0
Devices
UUID:9fb3ce57-1841-43e6-8b70-2a5e7fb2a1d0 [TrAddr:0000:8a:00.0]
Targets:[0] Rank:0 State:NORMAL
90 changes: 30 additions & 60 deletions src/bio/smd.pb-c.c
@@ -2120,69 +2120,39 @@ const ProtobufCMessageDescriptor ctl__led_manage_req__descriptor =
(ProtobufCMessageInit) ctl__led_manage_req__init,
NULL,NULL,NULL /* reserved[123] */
};
static const ProtobufCFieldDescriptor ctl__dev_replace_req__field_descriptors[3] =
{
{
"old_dev_uuid",
1,
PROTOBUF_C_LABEL_NONE,
PROTOBUF_C_TYPE_STRING,
0, /* quantifier_offset */
offsetof(Ctl__DevReplaceReq, old_dev_uuid),
NULL,
&protobuf_c_empty_string,
0, /* flags */
0,NULL,NULL /* reserved1,reserved2, etc */
},
{
"new_dev_uuid",
2,
PROTOBUF_C_LABEL_NONE,
PROTOBUF_C_TYPE_STRING,
0, /* quantifier_offset */
offsetof(Ctl__DevReplaceReq, new_dev_uuid),
NULL,
&protobuf_c_empty_string,
0, /* flags */
0,NULL,NULL /* reserved1,reserved2, etc */
},
{
"no_reint",
3,
PROTOBUF_C_LABEL_NONE,
PROTOBUF_C_TYPE_BOOL,
0, /* quantifier_offset */
offsetof(Ctl__DevReplaceReq, no_reint),
NULL,
NULL,
0, /* flags */
0,NULL,NULL /* reserved1,reserved2, etc */
},
static const ProtobufCFieldDescriptor ctl__dev_replace_req__field_descriptors[2] = {
{
"old_dev_uuid", 1, PROTOBUF_C_LABEL_NONE, PROTOBUF_C_TYPE_STRING, 0, /* quantifier_offset */
offsetof(Ctl__DevReplaceReq, old_dev_uuid), NULL, &protobuf_c_empty_string, 0, /* flags */
0, NULL, NULL /* reserved1,reserved2, etc */
},
{
"new_dev_uuid", 2, PROTOBUF_C_LABEL_NONE, PROTOBUF_C_TYPE_STRING, 0, /* quantifier_offset */
offsetof(Ctl__DevReplaceReq, new_dev_uuid), NULL, &protobuf_c_empty_string, 0, /* flags */
0, NULL, NULL /* reserved1,reserved2, etc */
},
};
static const unsigned ctl__dev_replace_req__field_indices_by_name[] = {
1, /* field[1] = new_dev_uuid */
2, /* field[2] = no_reint */
0, /* field[0] = old_dev_uuid */
};
static const ProtobufCIntRange ctl__dev_replace_req__number_ranges[1 + 1] =
{
{ 1, 0 },
{ 0, 3 }
1, /* field[1] = new_dev_uuid */
0, /* field[0] = old_dev_uuid */
};
const ProtobufCMessageDescriptor ctl__dev_replace_req__descriptor =
{
PROTOBUF_C__MESSAGE_DESCRIPTOR_MAGIC,
"ctl.DevReplaceReq",
"DevReplaceReq",
"Ctl__DevReplaceReq",
"ctl",
sizeof(Ctl__DevReplaceReq),
3,
ctl__dev_replace_req__field_descriptors,
ctl__dev_replace_req__field_indices_by_name,
1, ctl__dev_replace_req__number_ranges,
(ProtobufCMessageInit) ctl__dev_replace_req__init,
NULL,NULL,NULL /* reserved[123] */
static const ProtobufCIntRange ctl__dev_replace_req__number_ranges[1 + 1] = {{1, 0}, {0, 2}};
const ProtobufCMessageDescriptor ctl__dev_replace_req__descriptor = {
PROTOBUF_C__MESSAGE_DESCRIPTOR_MAGIC,
"ctl.DevReplaceReq",
"DevReplaceReq",
"Ctl__DevReplaceReq",
"ctl",
sizeof(Ctl__DevReplaceReq),
2,
ctl__dev_replace_req__field_descriptors,
ctl__dev_replace_req__field_indices_by_name,
1,
ctl__dev_replace_req__number_ranges,
(ProtobufCMessageInit)ctl__dev_replace_req__init,
NULL,
NULL,
NULL /* reserved[123] */
};
static const ProtobufCFieldDescriptor ctl__set_faulty_req__field_descriptors[1] =
{
Expand Down
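A note on the regenerated tables in this hunk: `ctl__dev_replace_req__field_indices_by_name` lists field indices in the alphabetical order of the field names (new_dev_uuid before old_dev_uuid), which presumably lets a by-name lookup use binary search. The block below is a minimal standalone sketch of that idea and is not protobuf-c code; `field_names`, `indices_by_name`, and `find_field` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: field names in declaration order, mirroring the two
 * descriptors left after no_reint was dropped. */
static const char *const field_names[2] = {"old_dev_uuid", "new_dev_uuid"};

/* Hypothetical: indices into field_names, ordered so that the names they
 * refer to are sorted alphabetically. */
static const unsigned indices_by_name[2] = {
	1, /* field[1] = new_dev_uuid */
	0, /* field[0] = old_dev_uuid */
};

/* Binary search over the sorted index table.
 * Returns the declaration-order index of the field, or -1 if not found. */
static int
find_field(const char *name)
{
	size_t lo = 0;
	size_t hi = sizeof(indices_by_name) / sizeof(indices_by_name[0]);

	assert(name != NULL);
	while (lo < hi) {
		size_t   mid = lo + (hi - lo) / 2;
		unsigned idx = indices_by_name[mid];
		int      cmp = strcmp(name, field_names[idx]);

		if (cmp == 0)
			return (int)idx;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1;
}
```

With this sketch, `find_field("new_dev_uuid")` yields the declaration-order index 1, while the removed name `no_reint` is no longer found.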
15 changes: 6 additions & 9 deletions src/bio/smd.pb-c.h

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions src/client/dfuse/il/int_posix.c
@@ -812,6 +812,7 @@ child_hdlr(void)
DFUSE_LOG_WARNING("daos_eq_create() failed: "DF_RC, DP_RC(rc));
else
ioil_iog.iog_main_eqh = ioil_eqh;
ioil_iog.iog_eq_count = 0;
}

/* Returns true on success */
1 change: 1 addition & 0 deletions src/client/dfuse/pil4dfs/int_dfs.c
@@ -945,6 +945,7 @@ child_hdlr(void)
daos_dti_reset();
td_eqh = main_eqh = DAOS_HDL_INVAL;
context_reset = true;
d_eq_count = 0;
}

/* only free the reserved low fds when application exits or encounters error */
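Both `child_hdlr()` changes follow the same pattern: state counted in the parent is reset in the child after `fork()`. The block below is a minimal sketch of that pattern, assuming the handler is registered with `pthread_atfork()` (the registration is not shown in this diff). `eq_count`, `register_fork_handler`, and `eq_count_seen_by_child` are hypothetical stand-ins, not DAOS symbols.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-process event queue counter. */
static int eq_count;

/* Runs in the child right after fork(): queues created by the parent are
 * not usable in the child, so its count starts again from zero. */
static void
child_hdlr(void)
{
	eq_count = 0;
}

/* Register the child handler; returns 0 on success. */
static int
register_fork_handler(void)
{
	return pthread_atfork(NULL, NULL, child_hdlr);
}

/* Fork, and return the counter value the child observes (passed back through
 * its exit status), or -1 on error. The parent's counter is left untouched. */
static int
eq_count_seen_by_child(void)
{
	pid_t pid;
	int   status;

	/* An exit status only carries 8 bits. */
	assert(eq_count >= 0 && eq_count < 256);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(eq_count);
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Without the reset in the handler, the child would inherit the parent's count even though it owns none of the parent's queues.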
2 changes: 1 addition & 1 deletion src/client/java/hadoop-daos/pom.xml
@@ -15,7 +15,7 @@
<packaging>jar</packaging>

<properties>
<hadoop.version>3.3.6</hadoop.version>
<hadoop.version>3.4.0</hadoop.version>
<native.build.path>${project.basedir}/build</native.build.path>
<daos.install.path>${project.basedir}/install</daos.install.path>
</properties>
2 changes: 1 addition & 1 deletion src/common/tests_dmg_helpers.c
@@ -1393,7 +1393,7 @@ dmg_storage_set_nvme_fault(const char *dmg_config_file,
D_GOTO(out, rc = -DER_NOMEM);
}

args = cmd_push_arg(args, &argcount, " --host-list=%s ", host);
args = cmd_push_arg(args, &argcount, " --host=%s ", host);
if (args == NULL)
D_GOTO(out, rc = -DER_NOMEM);
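To make the effect of the `--host-list` to `--host` change concrete, the block below sketches how a full `dmg storage set nvme-faulty` command line could be assembled with a required host. It is a standalone illustration under stated assumptions, not the test helper itself: `build_set_faulty_cmd` is a hypothetical function, and only the `--host`, `--uuid`, and `--force` options shown in the usage text of this PR are used.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a "dmg storage set nvme-faulty" command line
 * into buf. Returns 0 on success, or -1 if host or uuid is missing or the
 * buffer is too small to hold the whole command. */
static int
build_set_faulty_cmd(char *buf, size_t len, const char *host, const char *uuid)
{
	int n;

	assert(buf != NULL);

	/* The host is now mandatory for this subcommand, so reject empty input. */
	if (host == NULL || strlen(host) == 0 || uuid == NULL || strlen(uuid) == 0)
		return -1;

	n = snprintf(buf, len, "dmg storage set nvme-faulty --host=%s --uuid=%s --force",
		     host, uuid);
	if (n < 0 || (size_t)n >= len)
		return -1; /* output error or truncated command */
	return 0;
}
```

Called with the host `boro-11` and a device UUID, the helper produces a single-host command of the same form as the updated documentation example, with `--force` added to skip the confirmation prompt.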

7 changes: 4 additions & 3 deletions src/control/cmd/dmg/json_test.go
@@ -76,10 +76,11 @@ func TestDmg_JsonOutput(t *testing.T) {
testArgs = append(testArgs, "-l", "foo.com", "-a",
test.MockPCIAddr(), "-e", "0")
case "storage set nvme-faulty":
testArgs = append(testArgs, "--force", "-u", test.MockUUID())
testArgs = append(testArgs, "--host", "foo.com", "--force", "-u",
test.MockUUID())
case "storage replace nvme":
testArgs = append(testArgs, "--old-uuid", test.MockUUID(),
"--new-uuid", test.MockUUID())
testArgs = append(testArgs, "--host", "foo.com", "--old-uuid",
test.MockUUID(), "--new-uuid", test.MockUUID())
case "storage led identify", "storage led check", "storage led clear":
testArgs = append(testArgs, test.MockUUID())
case "pool create":