-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-6063 control: event notifications from harness to MS #4032
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/60/log |
Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/287/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/227/log |
Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/300/log |
Test stage Build on Ubuntu 20.04 with Clang completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/304/log |
Test stage Build on CentOS 7 release completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/344/log |
Test stage Build on CentOS 7 debug completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/347/log |
Test stage Build on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/315/log |
Test stage Build on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/349/log |
Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/4/execution/node/318/log |
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/59/log |
Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/294/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/231/log |
Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/309/log |
Test stage Build on CentOS 7 release completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/339/log |
Test stage Build on Ubuntu 20.04 with Clang completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/300/log |
Test stage Build on CentOS 7 debug completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/348/log |
Test stage Build on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/311/log |
Test stage Build on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/345/log |
Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/5/execution/node/317/log |
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/6/execution/node/59/log |
Test stage Build RPM on CentOS 7 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/6/execution/node/288/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/6/execution/node/226/log |
Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/6/execution/node/303/log |
Test stage Build on CentOS 7 release completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/6/execution/node/327/log |
2f2b9d1
to
9aac903
Compare
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/18/execution/node/60/log |
Signed-off-by: Tom Nabarro <[email protected]>
Signed-off-by: Tom Nabarro <[email protected]>
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/19/execution/node/61/log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Few comments and things to address that can be done now or in a follow-on.
// EventForwarder implements the events.Handler interface, increments sequence | ||
// number for each event forwarded and distributes requests to MS access points. | ||
type EventForwarder struct { | ||
seq uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess in that case it doesn't matter if the sequence number is reset because it's only relevant to the sender. I assume that the goal would be to retry or something if a corresponding response isn't received in some amount of time? If so, I think I would prefer to keep the retry logic in the MS client rather than adding another layer here.
I think it's fine to leave the sequence number in for now though. We may find another use for it (e.g. distinguishing between N similar events with distinct sequence numbers and N duplicates of the same event).
Signed-off-by: Tom Nabarro <[email protected]>
…trol-system-event-update3 Conflicts: src/control/common/proto/mgmt/svc.pb.go src/mgmt/svc.pb-c.c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Test stage Functional_Hardware_Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/21/execution/node/1146/log |
…trol-system-event-update3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Test stage Functional_Hardware_Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-4032/22/execution/node/1090/log |
|
||
enum ras_event_id { | ||
RAS_RANK_EXIT = 1, | ||
RAS_RANK_NO_RESP, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, the event ID is supposed to be a 64-byte string as specified on the wiki, no? https://wiki.hpdd.intel.com/display/DC/RAS+Events
return "daos_rank_no_response"; | ||
} | ||
|
||
return RAS_ID_UNKNOWN_STR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not maintain a list like that and just put everything in the JSON format itself. IMO, the list of event should only be maintained in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...pending off-line discussion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential solution could be to read a json file on control plane start-up parse each entry and register events with pubsub (each entry would map to a concrete events interface implementation and would have a proto message definition to be used on the C side). This seems more complex to me but it might meet your requirements.
The complexity stems from the fact that we need to share the event structures (including boilerplate RAS details and optional. Extended info which is different for specific event type/topic) between C and go code, there are various approaches for this including:
1- protobuf where we generate helper code from proto schema
2- cgo where we would have the definitions in C and access and use the native C types from go
3- generate native types on both sides from a list of identifiers in a shared location
... (there are probably more)
The current approach (implemented by me with direction from @mjmac) is to use option #3 above with specific event info definitions in proto files, then we have all the auto generated structures for specific event types in C and Go and when raising event we just fill in the details from the shared list of identifiers (in src/daos_srv/ras.h) and populate the auto generated protobuf structures with the relevant event type specific extended info.
Generating event definitions for C and Go from JSON file at runtime or compiletime is not trivial given that we need a schema that understands extended info structure for specific event types. Django framework does this for python and implementation is not trivial.
If we choose to restrict ourselves to a fixed set of event types with predefined event info structures then we could create distinct proto definitions for each event in a JSON blob at compile time. When adding a new event developer would need to:
- update json file with new entry
- build the protobuf message definitions from the new entry in the json file (by running scons)
- add event creation code on C side to populate auto generated proto structs for the new event to be sent over dRPC
- add new concrete event type (not topic) on Go side with Un/Marshal methods to convert to/from proto to native and add handler for this new event
I'm not sure this is any more developer friendly than the current implementation, part of the complexity on the go side is because we made a decision to isolate protobuf type usage from internals hence the conversions to native types.
Hope that helps clarify.
} | ||
|
||
return RAS_ID_UNKNOWN_STR; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
@johannlombardi: The overall design of the RAS events needs to be fleshed out a bit more based on how you and others on the ioserver side intend to use this feature. In the meantime, can we land what we have now so that Tom can keep making progress on DAOS-6063 and DAOS-6088? Those can be adjusted pretty easily to fit whatever the shape the new design takes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine to land it and to keep the current structure (no json in io engine), but please provide a follow-on patch that changes the event ID to a char[64] as specified in the wiki. We will then flush out the details of the API on the I/O engine side after we are all back from holidays.
…trol-system-event-update3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
This isn't my area so I'm not sure I can review it and I can't see any change in the testing behaviour. If you've added me because you think it'll change the "dmg system stop" behaviour as discussed in #4022 then I can take a look at that and see if this solves the issue. |
think Mike might have added you Re: check patch issues on previous runs, thanks. this won't have an effect on system stop behaviour as far as I'm aware |
…#4032) Initial implementation of RAS event notifications between control-plane instances and management service. Event definitions described in daos_srv/ras.h as a central location to be accessible from control and data planes. Rank exit event registration and processing implemented. Signed-off-by: Tom Nabarro <[email protected]>
Initial implementation of RAS event notifications between control-plane
instances and management service. Event definitions described in
daos_srv/ras.h as a central location to be accessible from control and
data planes. Rank exit event registration and processing implemented.