Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-6063 control: event notifications from harness to MS #4032

Merged
merged 20 commits into from
Dec 23, 2020

Conversation

tanabarr
Copy link
Contributor

@tanabarr tanabarr commented Dec 7, 2020

Initial implementation of RAS event notifications between control-plane
instances and management service. Event definitions described in
daos_srv/ras.h as a central location to be accessible from control and
data planes. Rank exit event registration and processing implemented.

@tanabarr tanabarr requested a review from mjmac December 7, 2020 23:15
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@tanabarr tanabarr changed the title Tanabarr/control system event update3 DAOS-6063 control: event notifications from harness to MS Dec 8, 2020
@daosbuild1
Copy link
Collaborator

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/287/log

@daosbuild1
Copy link
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/227/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/300/log

@daosbuild1
Copy link
Collaborator

Test stage Build on Ubuntu 20.04 with Clang completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/304/log

@daosbuild1
Copy link
Collaborator

Test stage Build on CentOS 7 release completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/344/log

@daosbuild1
Copy link
Collaborator

Test stage Build on CentOS 7 debug completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/347/log

@daosbuild1
Copy link
Collaborator

Test stage Build on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/315/log

@daosbuild1
Copy link
Collaborator

Test stage Build on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/349/log

@daosbuild1
Copy link
Collaborator

Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/318/log

Base automatically changed from tanabarr/control-system-event-update2 to master December 8, 2020 19:32
@daosbuild1
Copy link
Collaborator

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/294/log

@daosbuild1
Copy link
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/231/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/309/log

@daosbuild1
Copy link
Collaborator

Test stage Build on CentOS 7 release completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/339/log

@daosbuild1
Copy link
Collaborator

Test stage Build on Ubuntu 20.04 with Clang completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/300/log

@daosbuild1
Copy link
Collaborator

Test stage Build on CentOS 7 debug completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/348/log

@daosbuild1
Copy link
Collaborator

Test stage Build on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/311/log

@daosbuild1
Copy link
Collaborator

Test stage Build on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/345/log

@daosbuild1
Copy link
Collaborator

Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/317/log

@daosbuild1
Copy link
Collaborator

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/6/execution/node/288/log

@daosbuild1
Copy link
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/6/execution/node/226/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/6/execution/node/303/log

@daosbuild1
Copy link
Collaborator

Test stage Build on CentOS 7 release completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/6/execution/node/327/log

@tanabarr tanabarr force-pushed the tanabarr/control-system-event-update3 branch from 2f2b9d1 to 9aac903 Compare December 17, 2020 15:51
@daosbuild1
Copy link
Collaborator

@tanabarr tanabarr requested review from mjmac and kjacque December 17, 2020 17:21
@daosbuild1
Copy link
Collaborator

mjmac
mjmac previously approved these changes Dec 17, 2020
Copy link
Contributor

@mjmac mjmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Few comments and things to address that can be done now or in a follow-on.

src/control/events/pubsub.go Outdated Show resolved Hide resolved
src/control/events/pubsub.go Outdated Show resolved Hide resolved
src/control/events/ras.go Outdated Show resolved Hide resolved
// EventForwarder implements the events.Handler interface, increments sequence
// number for each event forwarded and distributes requests to MS access points.
type EventForwarder struct {
seq uint64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess in that case it doesn't matter if the sequence number is reset because it's only relevant to the sender. I assume that the goal would be to retry or something if a corresponding response isn't received in some amount of time? If so, I think I would prefer to keep the retry logic in the MS client rather than adding another layer here.

I think it's fine to leave the sequence number in for now though. We may find another use for it (e.g. distinguishing between N similar events with distinct sequence numbers and N duplicates of the same event).

src/control/server/mgmt_system.go Outdated Show resolved Hide resolved
…trol-system-event-update3

Conflicts:
	src/control/common/proto/mgmt/svc.pb.go
	src/mgmt/svc.pb-c.c
@tanabarr tanabarr requested a review from mjmac December 18, 2020 23:13
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Functional_Hardware_Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/21/execution/node/1146/log

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Functional_Hardware_Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/22/execution/node/1090/log


enum ras_event_id {
RAS_RANK_EXIT = 1,
RAS_RANK_NO_RESP,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, the event ID is supposed to be a 64-byte string as specified on the wiki, no? https://wiki.hpdd.intel.com/display/DC/RAS+Events

return "daos_rank_no_response";
}

return RAS_ID_UNKNOWN_STR;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not maintain a list like that and just put everything in the JSON format itself. IMO, the list of event should only be maintained in the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...pending off-line discussion

Copy link
Contributor Author

@tanabarr tanabarr Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential solution could be to read a json file on control plane start-up parse each entry and register events with pubsub (each entry would map to a concrete events interface implementation and would have a proto message definition to be used on the C side). This seems more complex to me but it might meet your requirements.

The complexity stems from the fact that we need to share the event structures (including boilerplate RAS details and optional. Extended info which is different for specific event type/topic) between C and go code, there are various approaches for this including:
1- protobuf where we generate helper code from proto schema
2- cgo where we would have the definitions in C and access and use the native C types from go
3- generate native types on both sides from a list of identifiers in a shared location
... (there are probably more)

The current approach (implemented by me with direction from @mjmac) is to use option #3 above with specific event info definitions in proto files, then we have all the auto generated structures for specific event types in C and Go and when raising event we just fill in the details from the shared list of identifiers (in src/daos_srv/ras.h) and populate the auto generated protobuf structures with the relevant event type specific extended info.

Generating event definitions for C and Go from JSON file at runtime or compiletime is not trivial given that we need a schema that understands extended info structure for specific event types. Django framework does this for python and implementation is not trivial.

If we choose to restrict ourselves to a fixed set of event types with predefined event info structures then we could create distinct proto definitions for each event in a JSON blob at compile time. When adding a new event developer would need to:

  • update json file with new entry
  • build the protobuf message definitions from the new entry in the json file (by running scons)
  • add event creation code on C side to populate auto generated proto structs for the new event to be sent over dRPC
  • add new concrete event type (not topic) on Go side with Un/Marshal methods to convert to/from proto to native and add handler for this new event

I'm not sure this is any more developer friendly than the current implementation, part of the complexity on the go side is because we made a decision to isolate protobuf type usage from internals hence the conversions to native types.

Hope that helps clarify.

}

return RAS_ID_UNKNOWN_STR;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@mjmac
Copy link
Contributor

mjmac commented Dec 22, 2020

@johannlombardi: The overall design of the RAS events needs to be fleshed out a bit more based on how you and others on the ioserver side intend to use this feature. In the meantime, can we land what we have now so that Tom can keep making progress on DAOS-6063 and DAOS-6088? Those can be adjusted pretty easily to fit whatever the shape the new design takes.

@johannlombardi johannlombardi self-requested a review December 22, 2020 15:23
Copy link
Contributor

@johannlombardi johannlombardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine to land it and to keep the current structure (no json in io engine), but please provide a follow-on patch that changes the event ID to a char[64] as specified in the wiki. We will then flush out the details of the API on the I/O engine side after we are all back from holidays.

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@tanabarr tanabarr requested a review from a team December 23, 2020 08:14
@ashleypittman
Copy link
Contributor

This isn't my area so I'm not sure I can review it and I can't see any change in the testing behaviour.

If you've added me because you think it'll change the "dmg system stop" behaviour as discussed in #4022 then I can take a look at that and see if this solves the issue.

@tanabarr
Copy link
Contributor Author

This isn't my area so I'm not sure I can review it and I can't see any change in the testing behaviour.

If you've added me because you think it'll change the "dmg system stop" behaviour as discussed in #4022 then I can take a look at that and see if this solves the issue.

think Mike might have added you Re: check patch issues on previous runs, thanks. this won't have an effect on system stop behaviour as far as I'm aware

@mjmac mjmac merged commit baea574 into master Dec 23, 2020
@mjmac mjmac deleted the tanabarr/control-system-event-update3 branch December 23, 2020 13:09
dsikich pushed a commit to dsikich/daos that referenced this pull request Jan 5, 2021
…#4032)

Initial implementation of RAS event notifications between control-plane
instances and management service. Event definitions described in
daos_srv/ras.h as a central location to be accessible from control and
data planes. Rank exit event registration and processing implemented.

Signed-off-by: Tom Nabarro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants