-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
survey of logging frameworks #1318
Comments
FoolscapFoolscap (home page and github project) is a remote messaging system I built in 2006, inspired by CapTP/VatTP and ocap principles. It is written in Python, which is a bit too permissive to really support ocap confinement, but of course I found that following the rules anyways led to better partitioning of functionality and code that was easier to read. The networking stack was based on Twisted. The primary customer was the Tahoe-LAFS distributed+encrypted storage system (which I co-authored in 2008), although it was overkill and Tahoe is now moving to a simpler system. In Foolscap, every remotely-accessible object had a "FURL", the Foolscap URL (SturdyRef-as-data), which was shaped like a web-key: pb:// def gotRef(remoteReference):
remoteReference.callRemote("hello", arg1, arg2).addCallback(handleResponse)
tub.getReference(FURL).addCallback(gotRef) or they could use a reconnect-on-failure Foolscap also includes a logging system with a detailed philosophy document. I wrote this to support several aspects of Tahoe:
The Logging APIFoolscap's core logging API was a simple function The Each message had a "facility" (topic) and "severity" (an ordered scale of importance, ranging from NOISY and OPERATIONAL, up to UNUSUAL, INFREQUENT, and CURIOUS, and topping out with WEIRD, SCARY, and BAD). Each facility+severity tuple got its own circular buffer of messages. Foolscap retained a few hundred messages in each such buffer. Each severity level has an associated recommended event rate in the docs: NOISY was expected to be used for things that emit tens or hundreds of messages per second, while BAD might only happen once per process (because it records a fatal error). If, at any time, a message was logged at The Tahoe client had a web-based UI. We added a "Report Incident" button to that UI, which simply took a string from the user and logged a WEIRD-level event. Our "how to report a problem" user docs pointed folks at this button, and instructions to locate the incident tarball and email it to us. This helped a lot. We did take some care (although probably not enough) to avoid logging user secrets, with some wrapper classes that knew how to censor their contents when passed to the logger. Each log event was given a global numerical index (total ordering across the process). This event1 = log.msg("starting upload")
log.msg("calculating filesize", parent=event1) A common pattern was to curry the parent into a (Incidentally, returning an identifier from We did not have a full Causeway-style causality graph: each log event had at most one parent. We didn't encounter a strong need for it in Tahoe. It wouldn't have been hard to add, however building a viewer for it would have taken more work. When the logging system started, it logged a special event that included the application version, PID, starting timestamp, and other machine-specific data. Each instance of Foolscap also had an "Instance ID" to help distinguish it from other processes (e.g. to resolve ties on the "who is the leader" network-level negotiation), and to detect multiple connections to the same peer (so we could shut down all but one). This data was used to scope the lognum, so aggregates of events from multiple processes could be teased apart again. It took a year before a busy server ran long enough for us to discover that our lognum was limited to a 32-bit signed integer (the We established a convention of adding a short random string as a log.msg("negotiationFailed: DuplicateConnection", level=NOISY, umid="XRFlRA") Remote AccessA single line of code would enable the "logport": a pub/sub object exposed by the logging system. When a Tahoe client or server was running in a base directory The client could also subscribe to Incidents, or retrieve previously stored Incidents. We built a "log gatherer" service, and configured our fleet of storage servers to I grew really fond of being able to "plug in" to the logport in a running process and find out what it had been doing for the last few minutes. The logport FURL was persistent (the process read an existing file before generating a new random swissnum), so once I'd cut-and-pasted the Managing the Attention Budget, Retroactive InterestingnessFoolscap's logging system derives from my experience with embedded systems, where crashes are infrequent and hard to debug. I once worked at a company whose product (a router) crashed in a datacenter in Europe. We asked the customer to install some debug firmware which had more logging enabled, in the hopes of learning enough to diagnose the problem. They were unhappy to hear us suggest that we wanted to make it crash a second time, whatever the reason. As the machine was on a private network in this datacenter, we couldn't log into it directly. We had one engineer on-site reading hexdumps over the phone to a team in the US (and that team was sleeping in a conference room during the day so they could be awake on EU hours). I vowed, never again. Debuggers are great, if you can use them. But sometimes you can't easily reproduce the problem, or interrupt a system long enough to step through it. In those situations, you want to record enough information to figure things out afterwards. When we're deep in the debugging process, and we have identified a number of interesting points in the code, where knowing the state of a variable (or merely whether it made it that far or not) is vital information, we have built up a lot of mental context about the system. We've done a lot of careful thinking, and the output is a set of maximally-useful But that's discarding a lot of work. It would be better to suppress the output, but retain the location and arguments of those calls. There are a variety of costs which leads us to remove or disable
and the amount of detail we retain influences the cost. Recording a full stack trace of an unexpected exception is helpful, especially in a utility function that might be called by dozens of unrelated callers. But many invariants happen in functions which are only ever used for one purpose, so knowing who called it doesn't add very much information. Recording the arguments of each stack frame might be far more valuable, but gets expensive quickly. Foolscap's Logging systems like Log4J typically have control knobs to set generation/recording thresholds, set through an environment variable or some other API. This allows no-longer-helpful log statements to be retained in the code (saving the programmer effort that was spent in deciding where to put them), and enables them to be re-enabled with a relaunch (without a recompile, and without editing the source to uncomment the log calls), but reduce the costs of human attention, disk space, and some portion of the CPU performance (there is still some cost to deciding to not log an event). But if something goes wrong, you still have to re-run the process (this time with logging enabled), and find a way to reproduce the failure. It seems useful to having multiple thresholds here. If the main cost is human attention, then simply hiding the messages is sufficient to avoid it. If the messages are recorded in a place where they can be easily retrieved later, then we don't need to be afraid of suppressing their display. If disk is the expensive thing, then storing messages in RAM for a few minutes, and only copying them to disk if something unusual happens afterwards (making them retroactively interesting) can reduce that cost while not entirely losing the data. During early development of a feature, we would record (and maybe display) everything. After deployment, we want to display nothing (leaving the console available for actual UI) and be conscientious in how much we store. When we're debugging a problem, we have a hunch that some otherwise-boring variable in fact is critically important. In normal operation, we wouldn't bother recording its value. The interestingness of that variable depends upon what happens later. Foolscap's circular buffers formed a nice compromise of storage vs utility. I added intermediate severity levels (CURIOUS, WEIRD) to capture events that maybe didn't indicate an unrecoverable problem, but if something did go off the rails shortly afterwards, they might provide a significant clue as to the cause. In Tahoe we used these to report integrity-hash violations, which usually indicated a malicious/faulty storage server that was feeding us corrupted data. We reacted by looking for a better share elsewhere, but if all shares were corrupted, that indicated a problem in our decoding/decryption logic. By recording the earlier suspicious events, we had more to work on if/when the full download failed. |
Google's loggingGoogle has an incredibly sphisticated and complex logging system that pervades everything, and whose output is used for many purposes beyond logging. I'm sure I never used more than 1% of the functionality, but I was exposed to some parts because the Transparency Report used parts of its output for reporting on internet traffic. The important bit of background is that the system developed the way it did because of an early decision to handle billing for ads based on the data in the log files. This led to the logging system turning into a production-critical system with requirements based on marketing, legal, and every product line in addition to debugging needs. The basic format is that log messages are a protobuf with probably on the order of a hundred fields, which can't be modified lightly because many different products and tools rely on it. There are several generations of Google's database and analysis tools designed around it because all the corporate data is there. There are processes and controls in place to limit and track who has access to which fields, to redact certain info for some uses, and to sunset sensitive data while retaining data that might be needed for the long term. Just requesting new access to existing data puts your project at the back of a queue which might be 6 months behind. Even though the data is known by all parties to be business critical, teams do still make changes to the data they record, and they have no way to find all the tools that might be dependent on them. Transparency Report lost data a few times while I was on the team because someone announced an upcoming change on a list we didn't have access to. Two lessons I take away are:
Log4JJava's log4j has served as the template for several later logging systems in Java and other languages. There are some features I want to highlight. At a high level that not everyone notices, log4j defines a polymorphic API for producers and consumers of data. This makes it possible to have config files that specify that different levels of logging from different sources are sent to persistent logs, circular buffers, network devices, or other configurable outputs defined later. The normal usage of log4j is that in each class, a distinct logger would be created that adds an identifier for the class to the log record. Since java classes and packages have a hierarchical name space this made it simple when configuring log destinations or grepping throgh logs to restrict your attention to a particular class or package, but less easy to focus on a particular task that crossed multiple packages. I don't know of a design that makes it straightforward to identify a group of log message sources that one wants to enable, other than by containing file. The general approach in log4j is that you declare a logger within a class file, and logging there is done by individual log requests that specify the level, in a way that's now familiar in javascript. The general logging interface doesn't specify that the data is text descriptions, so in Zocalo, I was able to use it to log structured data about different kinds of events. This differed from the Google logging described earlier, because each record was designed for a particular usage, and was intended to be handled by tools that knew precisely what data to expect. |
This discussion is Javascript-specific. I don't want to invent a whole new system, but rather take the best-practices and tailor them to our needs. "Focus"When I'm developing something, whether it is a new Zoe contract or a Golang feature, I want to see all the messages corresponding to a specific context. I like the convention I've seen in other JS projects, of setting a
With the idea that Documenting which tags are in use is important, and gives people a simple way with minimal tooling to dump the logs that they know they need, or that we recommend to them. Collecting logsI like loosely-coupled systems like FluentD for aggregating and shipping the log output of a program to different backends. The bottom line with these kinds of tools is that a structured format (like JSON) is used for each message, and can thus be dispatched according to its contents. slog(...) vs. console.*IMO, it is important to distinguish between logs that need analysis, and simple messages that are clues to normal users. The latter is a subset of the former. The JS
All of those Besides the These sensibilities are reflected in the simple implementation I have for a contextual Spans@kriskowal probably has more to say about this, but like @warner's And now for something completely different...Take a look at Rookout if you want to see something cool, but I'm not sure how to implement it securely. They use your source code, and implement "non-breaking breakpoints" that give you information on execution without interrupting or restarting the program. This reminds me of GDB's "noninvasive mode" that was useful during Hurd development (i.e. attach to a program and browse around without interrupting execution). My takeawaysAlways define a context-dependent Enable structured logging with a command-line flag, and punt to other systems for filtering, aggregation, and analysis. The different systems to which people want to send their logs is much more diverse than the ones you can anticipate. Support distributed tracing/spans. I don't have experience with this, but I think it's useful. |
bump I just updated my presentation above. Now it's the way I wanted it. |
structured logging in a clinical data warehouse ETL processI lead development of an ETL process where each month, we take a hospital's worth of patient data and put it into a star schema for ad-hoc query. The process ran for ~40 hours back in 2011 or so; now it's up to a couple weeks. structured_logging.py is what I came up with.
Our ETL process overall is showing its age... I gather normal best practice in this area is a big pile of stored procedures... but when I think of switching to something that lacks this sort of log, I get cold feet. GROUSE ETL (70B facts) and eliotThe GROUSE ETL process deals with an order of magnitude more data than the above process. I had it to do over again, and I came up with eventlog.py. JSON rather than CSV this time. Man, jq is cool! I also made the output compatible with eliot; I'm not sure how much of a win that was. how not to use log4jThe ad-hoc query system that we feed it into is i2b2. It uses log4j, but in a decade of use, nobody in our shop has managed to find the right level of logging. In normal production mode, it doesn't log each SQL statement. We found a DEBUG log level that does log every SQL statement, but when we enable it, it produces so much output that it overwhelms the enterprise log aggregators. |
I don't know enough about logging systems to have anything to contribute, but I'd love to learn more. :) |
As @michaelfig mentioned, I had some experience with span logging/tracing at Uber. Our transport protocol (TChannel) provided affordances for distributed context propagation and tracing at the protocol layer. The design was based on Google’s Dapper paper. The working groups for the API was called OpenTracing. The system for collecting and querying spans is called Jaeger. I believe OpenTelemetry is the next iteration on the design. Mistakes were made. I’ll describe an idealized version of what we did. Distributed context propagation is the lowest layer of the system. This is a set of conventions for copying a namespace of inbound request headers into the in-memory context token, allowing those contexts to branch in memory, allow functions to read them from context or append entries to child contexts, and then copying them to outbound request headers. So in a call graph with three “services” A calls B calls C, A can send a header to C without modification to B. This namespace is supposed to be carefully centrally managed because of its potential to magnify performance problems. The context can then carry trace headers, which are large random numbers for coördination-free collision-avoidance. These are the "trace", "span", and "parent". The scope of a span covers the entire duration of an operation on a given host, from request to response. Every request inherits the "trace" without modification from its context (or contrives one initially). Every request's "parent" is the "span" from its context. There are a lot of ways to do trace collection and they all have exotic ways of failing at scale. I don’t have a great deal of insight into how that team kept the machine running. It was a combination of push and pull. We used a combination of ELK (ElasticSearch) for application logs and Jaeger. We didn’t use Jaeger for logging as often as we could have because the ELK infrastructure was omnipresent. We did enable application logs to cross-reference Jaeger by adding spans to logs. Regarding sampling, trace headers also carried a “the user is looking” bit that prevented a trace from being downsampled in the aggregator. This is pretty important to ensure that the system works for both the “we want latency statistics and automated root cause analysis” crowds and the “we want to debug distributed systems” crowds. I think it would be really darn neat if the trace ID were treated as a sample priority figure. That way you could record a weighted running average of the lowest priority trace that got sampled and use that to extrapolate around downsampled logs and you could use the same figure coördinate load-shedding. A problem with load shedding in general is that if every call to service A generates 10 calls to service B (very common aggregation pattern), service B will shed load randomly so the probability of requests to A succeeding diminishes exponentially as load approaches a bottleneck on B. Adding a priority queue and shedding load based on the trace ID would mitigate that pathology. With that in mind, I think that trace aggregation should use a pull model. The application agent would gather traces and log entries in memory, dropping traces with lower trace numbers to make room for higher ones. For logging in general, Akshay Shah on my team wrote Zap, which has become wildly popular. The key insight is that the logger needs to strike a balance between performance and developer experience. Log sites need to be nearly free when their level or topic isn’t being observed. They need to be cheap even when they are relevant. In Go, we were able to avoid allocation entirely on the hottest paths of the networking stack because the logger didn’t allocate intermediate objects and just wrote JSON from memory. We saw a lot of cascading failures. One common mode was “system that is fast on success and slow on failure (more logging, errors larger than responses, and retries without sufficient or coördinated load shedding) is a recipe for falling hard. Dapper https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf Second to Michael’s DEBUG topics ⨉ levels. That’s strangely more common in JS than Go. |
Causewaylogging distributed communicating event loops, for post-mortem distributed debugging. First, I'll gather all the relevant links here before I try to explain anything. Causeway referencesCauseway has been used from four language:
Original Causeway for E
The stack format in our tc39 error-stacks proposal is derived from the stacktrace portion of the causeway log format. Causeway for JSPorted, and with awesome new visualizations of essentially the same log format. Click around on the interactive demos, which still work: |
Magic Wormhole timing loggingThe discussion about "spans" reminded me of a scheme I added to investigate protocol timing in magic-wormhole. I wanted to know when various messages were sent and received, and how long we spent waiting for compute-bound tasks (the cryptographic PAKE protocol), network-bound tasks (round trip times), and user-bound tasks (waiting for code entry, or for permission to proceed). Magic-wormhole involves two user clients and a central mailbox server. The two sides are not necessarily connected at the same time, and all connections will be reestablished if/when they are lost. The protocol requires about a dozen messages overall. The API consists of a t = timing.add("encryption begins")
ciphertext = encrypt(plaintext)
t.detail(size=len(ciphertext))
t.finish() The Event records a The Event also knows how to implement Python's "context manager" protocol, which was first added to try:
f = open(filename)
data = f.read()
finally:
f.close() with: with open(filename) as f:
data = f.read() For with timing.add("encryption begins") as t:
ciphertext = encrypt(plaintext)
t.detail(size=len(ciphertext)) The Timing object is always created, and all protocol steps add data to it, however the data is discarded unless the user added Event ViewerThe UI isn't great, because my web skills are pretty poor, but it was enough to let me see where we were spending time. The fully zoomed-out view looks like: Time starts on the top and progresses downwards, with '0' being the first event on the sender's side. The left column is for UI events on the sender (including how long it takes to import certain large libraries, and perform expensive cryptographic operations). The middle-left has a "sender tx" column (for when the sending client transmits a message to the mailbox server) and "sender rx" (when the sending client receives a message from the mailbox server). The middle-right has matching columns for the receiving client, and the far right has receiving client UI operations. The middle lanes represent when a message is received by the server, and when it emits the response. The event viewer cycles through the six available lanes for each successive message, to keep the arrows from overlapping too badly. The colors are rotated too. Each time a client sends a message to the server, the server records the arrival time, then immediately emits an ACK that includes both the message and the arrival time. This lets a client know how long it took for their message to arrive. If I zoom into the first few messages transmitted by the sender: we can see that the server sent The "pake" message (blue arrow at the bottom) is the first that is broadcast to both clients (all previous messages were consumed by the server, but the server's real job is to facilitate the exchange of messages between the two clients). The server receives this first "pake" message about 20ms after it is sent, and immediately broadcasts it to all connected clients (which, so far, is only the sender: the receiver hasn't even launched their application yet), which takes another 20ms. The receiving client won't get "pake" for another 4.5s. The receiving client is launched about 1.5s after the sender, and spends the next 5 seconds waiting for the user to finish typing in the code. The user hit "tab" to complete the nameplate portion of the code twice about two seconds in: both times cause a "list" message to be sent to the server, and got a "nameplates" response back. About four seconds in the user finished with the nameplate portion, which allows the client to send "claim" and "release" while the user continues to type in the secret second half of the code. The sender's "pake" message is received a moment after "claim" is processed (at this point the server has both subscribers and will forward messages immediately instead of queueing them for later). When the user finishes typing in the code, the key exchange can complete (labelled "pake2 (crypto)"). This was completed at t=6.5s, at which point the two sides started to communicate in real-time: The receiving client (on the right) sends its own "pake" message (orange), followed by an encrypted "version" message. The receiving client appears to need 50ms to get a message to the server, while the server's response returns in perhaps 5ms. This strongly suggests clock drift between the different machines, as the two directions ought to be symmetric (barring one-way congestion). This version of the protocol doesn't try to measure or correct for the clock skew, and instead relies upon good NTP sync. Once the sending client receives the "pake" message, it knows the mailbox is now open, and no longer needs the nameplate any more, so it sends a "release" to the server (which provokes a "released" response 40ms later), and begins the second phase the PAKE computation, which takes about 15ms. Once that is complete, it can send the encrypted "version" message, and then finally the first encrypted channel message (labelled "0", in magenta), followed by another ("1", in brown), which offer the encrypted file to the receiver. The receiver then spends about 600ms waiting for permission from the user: Note that message "0" was transmitted by the receiving client at t=6.75s, but was not received by the server until t=7.37s. This indicates that the client began a blocking read of stdin (waiting for the user's approval) before allowing the transmit function to complete, losing some parallelism. In this case, there was nothing to lose, because the sender couldn't start transmitting until it received permission (contained in message "1", orange), but this chart shows clearly that something wasn't working as expected. The actual file transfer starts with the "transit-connected" marker, and ends with "exit". The offset between the sender's and receiver's notions of when transit connected is further evidence of clock skew. It appears that the receiver's clock runs about 80ms fast. Fixed diagramI fixed my clock skew problem, and now this diagram shows the expected symmetry between send-to-server and receive-from-server delays: Note that the orange "pake" message and the green "version" message are sent from the server about 2ms apart, and arrive on the receiving client about 2ms apart, but on the sending client they arrive about 10ms apart. What's happening is that the "version" message arrives in the kernel at the expected time, but the (single-threaded) process is busy with the "pake2 (crypto)" step, and doesn't read from the socket until that is complete. Utility for the Agoric stackI don't think we'll be looking at millisecond-range timing charts, but the general UI of showing messages flowing between machines seems like a good jumping-off point for more sophisticated tools. If these arrows could be labelled with popups that describe the method name being delivered, and if the response message could be visually linked to the inbound message that caused it, this might be a good way to see where pipelining stalls are happening. The ideal picture has a thick bundle of parallel messages all travelling together, with a final ACK or response message coming back in the very next block. The unnecessarily-slow picture has a lot of individual back-and-forth arrows. |
Being able to see when pipelining does and does not happen may be the best way to cure overuse of |
A bizarre implication of replay-based time travel is one can insert or enable instrumentation in the code during replay that was not present or enabled during the original play --- as long as that instrumentation is completely effect free. |
Oh, neat. Now I'm imagining a lint checker that compares the source which was run on-chain against an instrumented version that a developer submits for local debugging replay. The linter could recognize logging statements and strip them out; if the resulting AST is the same as the one that created the transcript, the replay machine would accept it without complaint. And of course the replay machine will always just run the proposed version first and see if the transcript still matches; if so, any differences are still in the heap and haven't yet provoked externally-visible behavioral changes. |
CAL (PayPal's logging system)PayPal had (and probably still has) an extensive logging and analytics system called CAL (I forget what the name stands for, if I ever knew). It was awkward to use and ugly to the bone, but what it could do was actually pretty amazing. I ended up exploring a lot of its ins and outs early in my tenure at PayPal as it proved a useful tool for understanding what all the extended pieces of the PayPal system were and how they related to each other. PayPal's backend was highly (arguably excessively) service-ified -- in 2013 something like 1,100 service deployments, exposing around 12,000 distinct service endpoints. The overwhelming majority of these endpoints were internal: services whose clients were other services in the system and not themselves open to the Internet as a whole. Every external HTTP request or response, internal service call or response, and database SQL query were logged in CAL, as were any errors that resulted while servicing those calls. These log messages were generated automatically by PayPal's various inter-service messaging and database libraries, but the services themselves could additionally generate log messages of their own with whatever content the services' developers felt appropriate. All these services ran on ~40,000 machines, with each machine feeding its log stream into CAL's data warehouse, which collected it all and organized it for reflection back to the people in the organization. In addition to request parameters, time stamps, source addresses, and all the other usual stuff that logging systems typically log, each request originating from outside the system -- typically HTTP requests from a customer's web browser or mobile app, but also web applications used by PayPal customer service reps and administrators -- was tagged at the ingress point with a unique identifier called a "correlation ID" that was then propagated downstream in the metadata associated with each service request that was subsequently generated in processing that original inbound request. In addition, customer interactions (such as payments) that involved multiple choreographed interactions between the customer's browser and the PayPal backend were assigned a unique transaction ID that the browser tagged onto the message traffic in the same way. All the various request data was laundered by the logging libraries to remove any customer PII or confidential financial information, so, for example, you could see that a request that should have contained a credit card number actually did, but without being able to see what the credit card number itself was. CAL collected an enormous volume of data, which it presented via a web interface organized by date & time, service pool (i.e., the collection of server machines dedicated to some set of service deployments), or service entry point. Although you could pick any particular service call or transaction and follow its individual causal path through the system, the UI was not optimized for doing this -- you basically had to click to follow a call chain downward, but there was no graceful way to go back up. The main thing it did was a kind of what I call "Keynesian analytics": a macro scale aggregation that would show you, within a time window of your choosing, which other services a given service invoked, and how many times. However, what you got was not really a call tree -- if you saw that there were 5,000 calls to service A that resulted in 5,000 calls to service B and and 5,000 calls to service C, you could not conclude that each call to service A resulted in it calling service B and service C once each. The pattern could instead be that it called services A and B twice on some execution paths and not at all on others, or that maybe on one execution path it called service A twice and on the other it called service B twice. What you got was essentially the union of all call trees in one analytics report. It also couldn't distinguish whether a service that was called a given number of times due to traffic load or due to scheduling. For example, you might have two services each of which got invoked 300 times on date X, but service A was a batch job that ran every few minutes whereas service B was reacting to some kind of user input, so that on date Y service A would still have 300 invocations while service B might have 30,000 or none. This made resource planning tricky. Another limitation of CAL was that due to the high volume of data that was collected, data was only kept for a limited time window (a month if I recall correctly) because keeping it longer would be cost prohibitive. This meant that long term analysis could only be done on data-reduced summary records. One of the things I wanted to do was to try to tease out something in between individual service call traces and the kind of keynesian aggregates I described above, in the form of a histogram of call graph patterns. The data was there but not the engineering resources. Despite these limitations, with such a large volume of data you could see emergent patterns that might have been invisible if the data sets were less comprehensive. Some of these insights were pretty coarse, like the discovery that the rough number of SQL queries to process a payment was probably two or three orders of magnitude larger than it should be. Others were more subtle, such as a case where a service bug was causing very high rates of systematic errors that had gone unnoticed, because, due to robust retry machinery, the errors didn't show up as failed transactions and therefor got overlooked by the normal monitoring and alerting apparatus. You could get lost wallowing in CAL data. Brian Marick's
|
@FUDCo refers to the tracing package that Brian Marick did at Electric Communities. It is still operational in E-on-Java. It is explained at http://www.erights.org/elib/Tracing.html . This page is unattributed but I think it was written by Brian Marick. Frankly, though it existed in E-on-Java from before the beginning --- having been written first for Original-E --- I never made use of any more than a teeny tiny fraction of its mechanisms or API surface. Agorics used the ELib portion of E-on-Java at scale for the SunSea project, which was invisible to me. @dtribble @Chris-Hibbert , did you ever make use of this tracing system in your use of ELib? |
It's definitely the case that Brian Marick's tracing package was overengineered. When at State Software I wrote the first draft of what ultimately became Elko I incorporated a lot of stuff from ELib, including the trace package, but like @erights I only used a tiny portion of the tracing system's affordances. When I had to rewrite essentially all of Elko from scratch (except the open source pieces) at Yahoo! I stripped out all the machinery from the tracing package that I wasn't using, in the interest of having something simple enough to maintain, and never ended up wanting to put any of the stuff I took out back in. So it's possible that my concept of what the tracing package actually is at this point is that residual part that I actually used. |
I don't think I ever intereracted with this package. |
I like the circular buffer and the logport stuff from foolscap; @michaelfig is that a good fit for the opentelemetry API? |
Do you mean: can OpenTelemetry support alternative logging backends? They are trying to be provider-neutral, so if there isn't a logging provider that meets our requirements, it might (?) be possible to implement our own that does fancy stuff like that. However, OpenTelemetry doesn't yet have an API for logging. Supporting logging is one of their goals, just tracing came first, then metrics. |
FYI: @arirubinstein |
What is the Problem Being Solved?
Distributed systems make it notoriously difficult to get a picture of what's going on. While debugging one program over here, you'd like to know what's happening to that program over there, and what messages are being exchanged. You'd like to be able to run a debugger against this one component, without pausing the entire world. We add
console.log
messages to trace code we've written, delete it when we're done because the output is too noisy, and then wish we'd left it in place when a similar problem pops up a month later. We want structured logging to record more than mere strings (and use proper analysis tools rather than justgrep
). We sometimes want statistical logging, where the time spent in each operation is interesting but we're willing to decimate or average out data if it becomes too large. But much of the time we want complete logging, where every message is important.We've all worked on systems with these sorts of problems, and we've all built or adopted logging systems to address them. I'd like to collect our favorite ideas and tools here, with an eye towards building a Grand Agoric Logging System. Our platform is fairly unique in that we have deterministic execution of inbound messages. So if we log the right things, we should be able to e.g. refer to a message that was delivered on the chain last week, fetch the receiving vat's previous state from a number of chain explorer databases, check that state for accuracy against validator announcements, rebuild a local vat to the exact same JS heap, put it under the control of a debugger, and then re-deliver the message to see what happens. Like time-travelling debugging except better.
This ticket is to colect ideas and systems. Please attach comments with presentation-like writeups of:
The text was updated successfully, but these errors were encountered: