-
Notifications
You must be signed in to change notification settings - Fork 276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Instrumentation overhaul, fixes #91 #92
Conversation
@antiguru is likely to use this code for scaling-related mechanisms, so we went back and tried to clean it up; it seems to make sense+work, with a small amount of overhead (but I have to re-measure). For the scaling, we're thinking of replaying some of the logs in a dataflow in the same computation (as you [@frankmcsherry] proposed some time ago). |
let mut buffer = self.buffer.borrow_mut(); | ||
unsafe { ::abomonation::encode(&event, &mut *buffer).unwrap(); } | ||
self.stream.borrow_mut().write_all(&buffer[..]).unwrap(); | ||
buffer.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain what's going on here? It looks like you have a second buffer into which you stage the serialized data, and then you copy it all out again. Is something else going on with the buffer that I'm missing? (( Also, the RefCell
around the writer )). Perhaps this makes sense given constraints somewhere else!
logging/src/lib.rs
Outdated
}) | ||
}; | ||
(time::precise_time_ns() as i64 - delta) as u64 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might this be a good time to sort out what notion of time we want to use in these logging channels? Way back when, I floated an alternative that we use a "timely-local" time, roughly "nanoseconds since initial synchronization" rather than "local system clock", which has some good things and some bad things.
-
Good: it meant that clock skew and such were less problematic (if two computers clocks are off by a minute, the existing approach will stall a bunch of logs, right?).
-
Bad: it is harder to correlate events in the logging stream with other events that occur in the system, outside of timely (e.g. spikes measured through other tools).
I think the bad thing could in principle be fixed by capturing the machine local time at which the synchronization happens and disseminating this to everyone (playing the role of the PRECISE_TIME_NS_DELTA
static). It would mean we need to write down the captured local time, to support correlation with other events that only have local time, but that seems doable.
Also we can ditch the time
crate, which is only included in timely for this, I think.
logging/src/lib.rs
Outdated
/* 8 */ GuardedProgress(GuardedProgressEvent), | ||
/* 9 */ CommChannels(CommChannelsEvent), | ||
/* 10 */ Input(InputEvent), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've talked a bit about this before, and I'm still not in love with it yet. :) If we treat all of the logging stuff as unstable, then no worries because we can try things out, but I think:
-
We may want separate streams, so that individual streams can be selectively (and programmatically) enabled for capture. E.g. right now, I could imagine wanting to turn off the
Schedule
events because they are noisy af. -
We may also want to support user-defined logging, at least playing out the fantasies I had with ETW back at MSR. :) If a user has higher-level begin-end events that they want to track, we probably wouldn't want to require a monolithic enum that they extend (perhaps we don't, and this enum gets locked down; no worries then).
Anyhow, just writing these down rather than putting them up as a roadblock!
logging/src/lib.rs
Outdated
} | ||
|
||
pub fn log(&self, l: L) { | ||
match self.internal { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a particular reason to have the RefCell
inside the BufferingLogger
and to use &self
, versus having the Logger
type be Rc<RefCell<BufferingLogger>>
and have this take a &mut self
? Not judging, just trying to grok whether we are intentionally hiding the potential sharing and creating a risk (e.g. a logger that logs could panic, right?), or if it would work as well with &mut
and the RefCell
lifted up a level.
logging/src/lib.rs
Outdated
buf.push((ts, setup.clone(), l)); | ||
if buf.len() >= BUFFERING_LOGGER_CAPACITY { | ||
(*pushers.borrow_mut())(LoggerBatch::Logs(&buf)); | ||
buf.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about the Push
and Pull
idioms from the communication crate? Maybe this would just be Push
, actually, but... Rather than a function that takes a &Vec<_>
which is then cleared afterwards, you use a function that takes an Option<&mut Vec<_>>
, which allows the recipient to swap the backing memory if they like (e.g. if they just want to stash the Vec
in a linked list or something), or just drain
out of it, as appropriate.
I'm trying to get my head around the code at the moment, sorry for the delay. It may be better to get an in-person run-down from you folks. I'm reading, but not entirely clear how all the parts fit together at the moment (e.g. what the environment variables are for, can/should they be abstracted out as a configuration, who actually uses which types, etc). We can either chat about these things on gitter, or just chill for the year. Not sure who is around ETHZ and is active; I wouldn't want any of you to have to work when you should be drinking glühwein. |
I mentioned this to @utaal on gitter, but one high-level thought I have about the design is: It seems like there is a bunch of functionality parallel to what timely dataflow itself does. That is, all the log data get captured and buffered up and pushed at various event writers that may or may not fire them at a TCP stream (looking at Would it not be simpler and perhaps more tasteful if each of these logs just went in to an I think the scary issue here was originally "how to prevent logging from logging itself?", but I think we could finesse this pretty easily if we wanted (e.g. a Do you all (@utaal, @antiguru) have thoughts on whether and why we might prefer hand-rolling these parts? I could imagine some good reasons, and you may have told me before and I've forgotten, but can we work through that again, then? :) |
I like the idea of using as much as possible from the timely communication library to use in the logging infrastructure - the goal here is not to come up with a second implementation of something already provided. On your second comment: My goal is to use Timely Dataflow to observe a computation, feed that data into a policy and use the output of the policy to decide how to change a computation. This means it would be desirable to have the logging stream as another On the side, it seems the current PR has sending logged events on the critical path, which we probably don't want to have ( |
Re: the critical path, this seems legit, and is also currently true of Edit: thinking a bit more: the |
// None when the logging stream is closed | ||
frontier: Option<Product<RootTimestamp, u64>>, | ||
event_pusher: P, | ||
_s: ::std::marker::PhantomData<S>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: ::std::marker::PhantomData<(S, E)>
845940a
to
1e746d4
Compare
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Moritz Hoffmann <[email protected]>
Signed-off-by: Moritz Hoffmann <[email protected]>
Signed-off-by: Moritz Hoffmann <[email protected]>
Signed-off-by: Moritz Hoffmann <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Signed-off-by: Andrea Lattuada <[email protected]>
Latest performance number of master
this branch, logging disabled
+8.4% |
From my side, this is ready to review/merge. |
Cool, thank you! I think we should land this and then keep working, but I had two quick questions first:
|
Signed-off-by: Andrea Lattuada <[email protected]>
|
Looks good to me. There are a bunch of things still to do but you have an issue open for them. I may dive in and hack on some bits here and there, but I think we need a bit of experience trying to use it before we make too many binding decisions. |
No description provided.