Merge master into CR at 20240305 #13604

Closed
wants to merge 249 commits into from


Nasf-Fan
Contributor

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Nasf-Fan and others added 30 commits March 29, 2022 13:35
Define the shared data for the interface between the control plane and
the DAOS check engine, including the following:

DAOS global inconsistency class, the action to repair inconsistency,
DAOS check scan phases, instance status, pool status, and so on.

Signed-off-by: Fan Yong <[email protected]>
Define the dRPC protocol that is used by the control plane to control
the DAOS check engine for the following use cases:

1) Start check - DRPC_METHOD_MGMT_CHK_START
2) Stop check - DRPC_METHOD_MGMT_CHK_STOP
3) Query check progress - DRPC_METHOD_MGMT_CHK_QUERY
4) Get check parameters and property - DRPC_METHOD_MGMT_CHK_PROP
5) Execute the action to repair the specified inconsistency under
   interaction mode - DRPC_METHOD_MGMT_CHK_ACT

Signed-off-by: Fan Yong <[email protected]>
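
The five control-plane-to-engine methods listed above can be pictured as a small dispatch table. The following is only an illustrative Python sketch, not the DAOS implementation (which lives in C and Go): the `ToyCheckEngine` class, its argument names (`seq`, `action`), and the response dictionaries are assumptions made up for this example, and the enum values are placeholder strings rather than the real numeric method IDs.

```python
from enum import Enum


class ChkMethod(Enum):
    """Check-related dRPC methods named in the commit message.

    The values are placeholder strings; the real numeric method IDs
    are defined in the DAOS source and are not reproduced here.
    """
    START = "DRPC_METHOD_MGMT_CHK_START"
    STOP = "DRPC_METHOD_MGMT_CHK_STOP"
    QUERY = "DRPC_METHOD_MGMT_CHK_QUERY"
    PROP = "DRPC_METHOD_MGMT_CHK_PROP"
    ACT = "DRPC_METHOD_MGMT_CHK_ACT"


class ToyCheckEngine:
    """Hypothetical stand-in for the engine side of the protocol."""

    def __init__(self):
        self.running = False
        self.actions = []  # (inconsistency sequence, chosen action) pairs

    def handle(self, method, **args):
        """Dispatch one request and return a response dictionary."""
        if method is ChkMethod.START:
            self.running = True
            return {"status": 0}
        if method is ChkMethod.STOP:
            self.running = False
            return {"status": 0}
        if method is ChkMethod.QUERY:
            return {"status": 0, "running": self.running}
        if method is ChkMethod.PROP:
            return {"status": 0, "properties": {}}
        if method is ChkMethod.ACT:
            # Record the admin's decision for one reported inconsistency.
            self.actions.append((args["seq"], args["action"]))
            return {"status": 0}
        raise ValueError(f"unknown method: {method!r}")
```

A caller would issue `handle(ChkMethod.START)`, poll with `handle(ChkMethod.QUERY)`, and send `handle(ChkMethod.ACT, seq=..., action=...)` only when a report is waiting for a decision in interaction mode.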
Define new dRPC upcall to control plane for the following use cases:

DRPC_METHOD_CHK_LIST_POOL: obtain the known pools list from MS.

DRPC_METHOD_CHK_REG_POOL: register the (orphan) pool to MS.

DRPC_METHOD_CHK_DEREG_POOL: deregister the (dangling) pool from MS.

DRPC_METHOD_CHK_REPORT: DAOS check engine reports to the control
plane with the found inconsistency and the repair result. If the
repair action is CIA_INTERACT, then notify the control plane to
interact with the admin for the repair decision.

Signed-off-by: Fan Yong <[email protected]>
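
To make the four upcalls above concrete, here is a minimal Python sketch of what a management-service-side handler might keep track of. Everything except the method names and `CIA_INTERACT` is an assumption for illustration: the `ToyManagementService` class, the field names in a report, and the use of plain strings as pool identifiers are hypothetical.

```python
class ToyManagementService:
    """Hypothetical MS-side bookkeeping for the checker upcalls."""

    def __init__(self, known_pools):
        self.pools = set(known_pools)
        self.reports = []   # every report received from the engine
        self.pending = []   # sequence numbers awaiting an admin decision

    def list_pool(self):
        """DRPC_METHOD_CHK_LIST_POOL: return the pools known to the MS."""
        return sorted(self.pools)

    def reg_pool(self, pool_uuid):
        """DRPC_METHOD_CHK_REG_POOL: register an orphan pool."""
        self.pools.add(pool_uuid)

    def dereg_pool(self, pool_uuid):
        """DRPC_METHOD_CHK_DEREG_POOL: deregister a dangling pool."""
        self.pools.discard(pool_uuid)

    def report(self, seq, inconsistency, action, result=None):
        """DRPC_METHOD_CHK_REPORT: store a report from the check engine.

        When the repair action is CIA_INTERACT, the report is queued so
        that the admin can be asked for a repair decision.
        """
        self.reports.append({
            "seq": seq,
            "inconsistency": inconsistency,
            "action": action,
            "result": result,
        })
        if action == "CIA_INTERACT":
            self.pending.append(seq)
```

The split between `reports` and `pending` mirrors the message's distinction between reporting a repair result and asking the admin to decide.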
The check module infrastructure and bootstrap sequence.
New option for starting the engine in check mode:

-C|--check:     Start engine with check mode, global consistency check.

Signed-off-by: Fan Yong <[email protected]>
Adds a new --checker flag to `dmg system start` which will
start the ranks in a special checker mode. All ranks must
first be stopped before starting in checker mode.

Signed-off-by: Michael MacDonald <[email protected]>
Adds new dmg commands to manage/query CR:
  * dmg check start
  * dmg check stop
  * dmg check query
  * dmg check prop

Signed-off-by: Michael MacDonald <[email protected]>
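
Taken together with the `--checker` flag from the previous commit, the new subcommands suggest a session like the one sketched below. This Python snippet only builds and prints the command lines; it does not run them. The ordering is an assumption for illustration, and `dmg system stop` is included only because the earlier commit says all ranks must be stopped first. `dmg check prop` is left out because its role in the sequence is not described here.

```python
import shlex


def checker_session():
    """Return the dmg invocations for one checker session, in order."""
    return [
        ["dmg", "system", "stop"],                # ranks must be stopped first
        ["dmg", "system", "start", "--checker"],  # restart ranks in checker mode
        ["dmg", "check", "start"],                # begin the consistency check
        ["dmg", "check", "query"],                # poll progress
        ["dmg", "check", "stop"],                 # stop the check
    ]


for argv in checker_session():
    # Print a shell-safe rendering of each command line.
    print(shlex.join(argv))
```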
Ensure that deb/rpm packages know what to do with the file.

Signed-off-by: Michael MacDonald <[email protected]>
Implement control plane handlers for the following
engine checker dRPC upcalls:
  * CheckerListPools
  * CheckerRegisterPool
  * CheckerDeregisterPool

Also fixes a slow test.

Signed-off-by: Michael MacDonald <[email protected]>
Shouldn't have been merged into previous commit; causes
problems when trying to integrate with other branches.

Signed-off-by: Michael MacDonald <[email protected]>
Refactor the system package to remove the direct dependencies
on external raft/grpc packages in order to avoid bringing
them in for unrelated tools (e.g. daos, dmg, etc).

Features: control

Signed-off-by: Michael MacDonald <[email protected]>
Setting the system name in the RPC callback is racy.

Signed-off-by: Michael MacDonald <[email protected]>
The DAOS Debug Tool (ddb) is a new executable that allows a user to
navigate through a file in the VOS format. It is similar to debugfs
for ext2/3/4 and offers both a command line and interactive shell
mode. This commit is the introduction of the tool with just a couple
commands supported. For more details about the tool see
src/ddb/README.md

Signed-off-by: Ryon Jensen <[email protected]>
Broken after master was merged in.

Signed-off-by: Michael MacDonald <[email protected]>
* Move CheckReport to chk package
    - Separate report request from payload
    - Add Actions and Details lists to allow the checker to
      specify defined actions that could be taken in response
      to the inconsistency report.
    - Rename chk/check -> chk/chk to ensure unique namespaces.
  * Adjust srv/mgmt messages to use chk types directly

Signed-off-by: Michael MacDonald <[email protected]>
Provide a central repository for storing the system
checker progress. Add handler for NotifyReport upcall
to store checker reports for display to the admin. Add
new `dmg check repair` command to allow the admin to
select a repair action for an inconsistency.

Changes to chk.CheckReport message:
  * Actions -> ActChoices
  * Details -> ActDetails
  * Adds ActMsgs array for human-formatted action descriptions

TODO: Add test coverage once this settles down.

Signed-off-by: Michael MacDonald <[email protected]>
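
The report store and repair selection described above could look roughly like this. It is a Python sketch under stated assumptions, not the control-plane code: the `CheckReport` fields mirror the renamed message fields (ActChoices, ActDetails, ActMsgs), but `seq`, `msg`, `chosen`, and the `ReportStore` API are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CheckReport:
    seq: int                                              # hypothetical report identifier
    msg: str                                              # hypothetical description
    act_choices: List[str] = field(default_factory=list)  # ActChoices
    act_details: List[str] = field(default_factory=list)  # ActDetails
    act_msgs: List[str] = field(default_factory=list)     # ActMsgs (human-readable)
    chosen: Optional[str] = None                          # action picked by the admin


class ReportStore:
    """Hypothetical central store for checker reports."""

    def __init__(self):
        self._reports = {}

    def notify_report(self, report):
        """Store a report delivered by the NotifyReport upcall."""
        self._reports[report.seq] = report

    def pending(self):
        """Reports for which no repair action has been selected yet."""
        return [r for r in self._reports.values() if r.chosen is None]

    def repair(self, seq, choice_index):
        """Select one of the offered actions, as `dmg check repair` would."""
        report = self._reports[seq]
        if not 0 <= choice_index < len(report.act_choices):
            raise IndexError("choice is not among the offered actions")
        report.chosen = report.act_choices[choice_index]
        return report.chosen
```

Restricting `repair` to the indices in `act_choices` reflects the idea that the checker, not the admin, defines which actions are valid for a given inconsistency.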
Add a new `dmg faults add-checker-report` command to allow
manual injection of checker reports for prototyping and testing.
The command and associated RPCs are not compiled into release builds.

Signed-off-by: Michael MacDonald <[email protected]>
Small fix to always export the chk_pb variable so it
can be used by server and client.

Signed-off-by: Ryon Jensen <[email protected]>
  * Introduces the dump, dump_ilog, dump_dtx, load, and rm commands for the
    ddb tool.
  * Reworked the construction of the unit test lists so that the
    test function name is printed instead of a separate description.
  * Added some filtering to the unit tests.
  * Abstracted out the printing of the commands so that what is
    printed is more easily testable and to clean up the
    command functions a little.

Signed-off-by: Ryon Jensen <[email protected]>
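
The test-list rework, filtering, and printing abstraction mentioned above follow a common pattern, sketched here in Python for illustration. The actual ddb tests are not shown in this PR page, so every name below (`build_tests`, `run_tests`, the two sample tests) is hypothetical; the sketch only demonstrates deriving the label from the function name, filtering by substring, and injecting the print function so output can be captured.

```python
def build_tests(*funcs):
    """Build a test list whose labels come from the function names."""
    return [(f.__name__, f) for f in funcs]


def run_tests(tests, name_filter="", out=print):
    """Run tests whose name contains name_filter; report through out().

    Passing a different `out` callable (for example list.append) lets a
    caller capture what would have been printed.
    """
    failures = 0
    for name, func in tests:
        if name_filter not in name:
            continue
        try:
            func()
            out(f"[ OK ] {name}")
        except AssertionError:
            failures += 1
            out(f"[FAIL] {name}")
    return failures


# Two placeholder tests, named after commands introduced in the commit.
def dump_command_test():
    assert "dump".startswith("d")


def rm_command_test():
    assert len("rm") == 2
```

Calling `run_tests(build_tests(dump_command_test, rm_command_test), name_filter="dump")` would run only the first test.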
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/17/execution/node/1173/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/18/execution/node/1405/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/18/execution/node/1451/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/CR_baseline_20231208 branch 2 times, most recently from ea25607 to 923dba0 Compare March 3, 2024 06:44
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/20/execution/node/1150/log

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/21/execution/node/1173/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/CR_baseline_20231208 branch from 923dba0 to ec21d5a Compare March 5, 2024 09:48
@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/22/execution/node/288/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/22/execution/node/291/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/22/execution/node/350/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/22/execution/node/353/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/23/execution/node/319/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/23/execution/node/369/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/23/execution/node/330/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/23/execution/node/316/log

liw and others added 4 commits March 5, 2024 21:13
Signed-off-by: Li Wei <[email protected]>
Required-githooks: true
Signed-off-by: Li Wei <[email protected]>
Required-githooks: true
For lower layer primary group initialization.

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/CR_baseline_20231208 branch from ec21d5a to 191e761 Compare March 5, 2024 13:39
@Nasf-Fan Nasf-Fan changed the title Merge master into CR at 20240130 Merge master into CR at 20240305 Mar 5, 2024
Bump version and add new changelog entry for CR.

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/CR_baseline_20231208 branch from 191e761 to 8b4d4c4 Compare March 5, 2024 16:51
1. For test_lost_majority_ps_replicas, remove "rdb-pool" from two
   ranks that contain the pool service replica.

2. For test_dangling_rank_entry, the rebuild process after CR is
   not related to CR logic but may cause a test timeout, so drop it.

3. More log messages when updating the CR bookmark.

4. More class tags for ddb tests.

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/CR_baseline_20231208 branch from 8b4d4c4 to 48c6d6f Compare March 5, 2024 17:01
@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13604/26/execution/node/1452/log

@Nasf-Fan
Contributor Author

Nasf-Fan commented Mar 6, 2024

rebuild_simple timed out because of DAOS-15290.

@Nasf-Fan Nasf-Fan closed this Apr 11, 2024
@Nasf-Fan Nasf-Fan deleted the Nasf-Fan/CR_baseline_20231208 branch April 11, 2024 02:09
8 participants