CT 1998 use google protobuf to enable more flexible dictionaries #7190
Conversation
Very cool! Really appreciated the walkthrough & recording :)
I tried this out locally, and I didn't see any obvious changes in JSON-formatted logs. No blocking comments from me.
LOG_VERSION = 3

metadata_vars: Optional[Dict[str, str]] = None

nofile_codes = ["Z012", "Z013", "Z014", "Z015"]
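A minimal sketch of how a consumer of this list might gate file logging on the event code rather than on class inheritance. The function name and wiring here are illustrative, not dbt's actual code:

```python
# Codes for events fired during `dbt clean`, which must never be written to
# the log file: the file itself may be one of the things being cleaned,
# which would create a race condition.
NOFILE_CODES = ["Z012", "Z013", "Z014", "Z015"]


def should_write_to_file(event_code: str) -> bool:
    """Return True if this event is safe to write to the log file.

    Hypothetical helper: checks the explicit code list instead of the old
    NoFile class-inheritance approach.
    """
    return event_code not in NOFILE_CODES
```

The trade-off noted above applies: the list is easy to read but must be kept in sync with the event definitions by hand.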
Since we can no longer do this via class inheritance (NoFile), we need to explicitly state here which types are "no file." This is fine IMO - we don't have so many of these things - but it does feel trickier to test, now that these are no longer attributes.
(After watching the recorded walkthrough) Could we leave an inline code comment here? These events are specific to dbt clean, and we can't write file logs during clean because the log file might be one of the things that dbt is cleaning (= race condition).
I added a comment. In theory we could still use something like classes or method flags, but we'd have to get the class name from msg.info.name, do getattr(types, class_name), and then check for the existence of another class or execute a class method. So that's still an option if some other flag requirement comes along that makes it worth it.
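A hypothetical sketch of the getattr-based alternative described above, using a SimpleNamespace as a stand-in for dbt's events.types module (all names here are illustrative):

```python
from types import SimpleNamespace


class CleanupEvent:
    no_file = True  # hypothetical flag attribute on the event class


class NormalEvent:
    pass


# Stand-in for the real events.types module namespace
event_types = SimpleNamespace(CleanupEvent=CleanupEvent, NormalEvent=NormalEvent)


def is_nofile(event_name: str) -> bool:
    # Recover the event class from the name carried in msg.info.name,
    # then check for the flag attribute on it.
    cls = getattr(event_types, event_name, None)
    return bool(getattr(cls, "no_file", False))
```

As the comment says, this adds indirection for little gain while there are only four such codes.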
cool! agree that, in this case, it doesn't feel worth it — hard-coding the codes is a right-sized solution
@@ -223,10 +219,9 @@ def node_info(self):
            "node_status": str(self._event_status.get("node_status")),
            "node_started_at": self._event_status.get("started_at"),
            "node_finished_at": self._event_status.get("finished_at"),
-           "meta": meta_stringified,
+           "meta": getattr(self, "meta", {}),
nice!!
core/dbt/contracts/results.py
Outdated
msg_dict["started_at"] = self.started_at.strftime("%Y-%m-%dT%H:%M:%SZ")
if self.completed_at:
    msg_dict["completed_at"] = self.completed_at.strftime("%Y-%m-%dT%H:%M:%SZ")
return msg_dict
Is this worth a utility function? It also looks like this is different from timestamp_to_datetime_string - just based on the precision we want ("%H:%M:%S.%f" vs "%Y-%m-%dT%H:%M:%SZ") - for parity with the status quo?
I created a datetime_to_json_format_string utility method and used it here and in base_types.py. The other one (timestamp_to_datetime_string) goes in a different direction - from a timestamp to a string - and produces a shorter string.
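The utility described above presumably just wraps the strftime call shown in the diff; a minimal sketch (signature is an assumption based on the name mentioned in the comment):

```python
from datetime import datetime


def datetime_to_json_format_string(dt: datetime) -> str:
    # Second-level precision with a trailing "Z", matching the format
    # previously inlined in results.py and base_types.py.
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
```

Centralizing the format string means a future precision change (e.g. adding %f) happens in one place.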
@@ -67,7 +68,7 @@ def __exit__(self, exc_type, exc_value, traceback):
        with TimingProcessor(self.timing_info):
            fire_event(
                TimingInfoCollected(
-                   timing_info=self.timing_info.to_msg(), node_info=get_node_info()
+                   timing_info=self.timing_info.to_msg_dict(), node_info=get_node_info()
In practice, do we expect this change (nested message → nested dict) to be breaking in any way? My sense is that it looks the same, if the event is being serialized to JSON, which is our only official current interface for consuming structured logs
I think it should have the same output. In betterproto we had to provide the pre-constructed sub-messages (which is the way Python normally works), but with google protobuf you can't.
core/dbt/events/base_types.py
Outdated
except Exception as exc:
    raise Exception(f"[{class_name}]: Unable to parse dict {kwargs},\n exc: {exc}")
As discussed during the live walkthrough, we want to stop throwing an exception (and stopping the run), and instead just log a warning that we failed to serialize the event.
@gshank made the point that it's a bit weird to fire an event within the code for firing events. I don't have strong opinions here on the implementation — just that, as an end user, I would want dbt to more gracefully handle serialization errors.
The nested fire_event seems to be working okay. Added a test for it.
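A sketch of the warn-instead-of-raise shape discussed here. build_message is a hypothetical stand-in for constructing the generated protobuf message (which, as noted elsewhere in this PR, rejects type mismatches), and a plain logging call stands in for the nested fire_event:

```python
import logging

logger = logging.getLogger("dbt.events")


def build_message(class_name, kwargs):
    # Stand-in for constructing the generated protobuf message;
    # raises on unexpected fields, as google protobuf does.
    allowed = {"msg"}
    if not set(kwargs) <= allowed:
        raise ValueError(f"unknown fields: {set(kwargs) - allowed}")
    return kwargs


def serialize_event(class_name, kwargs):
    """Log a warning instead of raising when an event can't be built,
    so a serialization error doesn't halt the run."""
    try:
        return build_message(class_name, kwargs)
    except Exception as exc:
        # The real code fires a warning-level event here (a nested
        # fire_event); a plain logger warning stands in for it.
        logger.warning("[%s]: Unable to parse dict %s, exc: %s", class_name, kwargs, exc)
        return None
```

The design question raised above (an event fired from inside the event machinery) is about re-entrancy; the test added in the PR covers that path.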
Steps taken to mitigate the drawbacks of Google protobuf from above:
* We are using a wrapping class around the logging events to enable a constructor that looks more like a Python constructor, as long as only keyword arguments are used.
* The generated file is skipped in the pre-commit config.
* We can live with the awkward interfaces. It's just code.
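The wrapping class in the first bullet might look something like this sketch, where FakeGenerated stands in for a google-protobuf-generated message class (all names here are hypothetical, not dbt's actual classes):

```python
class BaseEvent:
    """Thin wrapper giving generated protobuf event classes a
    keyword-only, Python-style constructor."""

    pb_cls = None  # set by subclasses to the generated message class

    def __init__(self, *args, **kwargs):
        if args:
            # Positional args are disallowed so the wrapper can forward
            # cleanly to the generated class, which is keyword-only.
            raise TypeError("Event constructors accept keyword arguments only")
        self.pb_msg = self.pb_cls(**kwargs)


class FakeGenerated:
    # Stand-in for a class emitted by protoc.
    def __init__(self, msg=""):
        self.msg = msg


class MainReportVersion(BaseEvent):
    pb_cls = FakeGenerated
```

Call sites then read like ordinary Python (MainReportVersion(msg="...")) while the generated message stays an implementation detail.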
We can live with the awkward interfaces. It's just code.
:)
resolves #6832
Description
In order to have more flexible dictionaries in our logging events, we need to have the protobuf Struct data type. This isn't supported by betterproto. In addition, Optional is only supported with a beta version. In order to have Structs, we are switching to using google protobuf instead.
There are a number of implementation differences between betterproto and google protobuf. In google protobuf the generated "Python" classes are not very Pythonic, and the interfaces are odd. Betterproto had support for a datetime-like timestamp field, which google protobuf does not. Google protobuf can instantiate messages from nested dictionaries, and it catches type mismatches that betterproto didn't. With betterproto it was necessary to pre-construct nested messages; this can't be done with google protobuf. The interfaces for serializing are also different.
Checklist
* I have run changie new to create a changelog entry