Collect warmup trace (using L0_filter) followed by a full instruction trace #5199
For an online trace, how are the two phases distinguished in analysis tools? Is a new instance of drcachesim spun up for the switch? Or are the trace entries just run together -- with some kind of marker in between, like the offline file type marker? The same question for an offline trace: though here it's easier to have completely separate output files. I assume that's the plan.
Yes, for offline traces we will have separate files. I have been focusing on offline traces and hadn't thought about online traces. I think the marker idea sounds good -- that way the analyzer has control over what it wants to do with data from the different phases.
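For illustration, a minimal sketch of how an analysis tool might react to such a phase marker, assuming the drmemtrace analysis_tool_t interface and the TRACE_MARKER_TYPE_FILTER_ENDPOINT marker this thread eventually adopts (namespace qualifiers omitted; the warmup/full handlers are placeholders):
```
// Sketch only: assumes drmemtrace's analysis_tool_t interface and the
// TRACE_MARKER_TYPE_FILTER_ENDPOINT marker discussed later in this thread.
#include "analysis_tool.h"

class phase_aware_tool_t : public analysis_tool_t {
public:
    bool
    process_memref(const memref_t &memref) override
    {
        if (memref.marker.type == TRACE_TYPE_MARKER &&
            memref.marker.marker_type == TRACE_MARKER_TYPE_FILTER_ENDPOINT) {
            // Everything before this marker is the filtered warmup phase.
            in_warmup_ = false;
            return true;
        }
        // E.g., warmup records might only warm the simulated caches while
        // post-marker records feed the full analysis.
        return in_warmup_ ? process_warmup(memref) : process_full(memref);
    }
    bool
    print_results() override
    {
        return true;
    }

private:
    bool process_warmup(const memref_t &m) { return true; } // warm caches only
    bool process_full(const memref_t &m) { return true; }   // full analysis
    bool in_warmup_ = true;
};
```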
While switching instrumentation, I need to ensure that the regs are restored. Therefore, I tried to change [...]. This is the change I made:
It looks like the generated code is bigger now due to the additional gpr restore code and the jnle instruction is using an 8-bit encoding and no longer able to address the displacement. Do we need to rewrite the instruction with a different jnle variant that can handle this displacement? Have we run into a similar problem before?
No, there won't be any problems with application branches. It is a tool-inserted short jump. You can see it here:
Probably that is the tracer skipping the clean call unless the buffer is full. Xref #56 but that requires flags.
Thanks! I have set short_reaches to false for now. Now I am running into cases where some GPRs appear to be clobbered. For example, the value in rdx here is clearly bad.
Looking back a few blocks, rdx was native here
and the next block doesn't touch it
It is used by instrumentation in the next block but is restored
Full log attached: log.0.55106.html.txt
I haven't looked at the details but if there is internal instrumentation control flow be sure that all paths are parallel, typically by putting all spills before any branches and all restores after all branches. That is one potential source of problems.
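For illustration, a sketch of that layout using the core DR spill API (dr_save_reg and dr_restore_reg are real DR routines; the registers and slots here are arbitrary):
```
#include "dr_api.h"

// Sketch: keep spills and restores "parallel" across internal control flow.
// All spills go before the first tool-inserted branch and all restores after
// the last join point, so every path through the instrumentation sees the
// same spill state.
static void
insert_counter_update(void *drcontext, instrlist_t *bb, instr_t *where)
{
    // Spill BEFORE any internal branches.
    dr_save_reg(drcontext, bb, where, DR_REG_XCX, SPILL_SLOT_1);
    dr_save_reg(drcontext, bb, where, DR_REG_XDX, SPILL_SLOT_2);

    // ... tool logic with internal jumps (e.g., a buffer-full check that
    // skips the clean call) goes here; no restores on any early-exit path ...

    // Restore AFTER all internal branches have rejoined.
    dr_restore_reg(drcontext, bb, where, DR_REG_XDX, SPILL_SLOT_2);
    dr_restore_reg(drcontext, bb, where, DR_REG_XCX, SPILL_SLOT_1);
}
```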
Ok I see now that rdx is not restored when this jrcxz branch is taken. So ideally it should be restored after the jrcxz+jmp and before it gets to jbe?
Without DR_CLEANCALL_READS_APP_CONTEXT, rcx and rdx are not restored before the clean call (which makes sense), but both are restored before the last instruction in the basic block. With DR_CLEANCALL_READS_APP_CONTEXT, rcx and rdx are restored before the clean call, but only rcx is restored before the last instruction, which leaves the tool value in rdx, causing a segfault in the next basic block where it is used. Is this due to incorrect handling in the tracer or in the instrumentation API?
Without DR_CLEANCALL_READS_APP_CONTEXT:
In the regular case, all of this happens inside [...]. Regular case:
With DR_CLEANCALL_READS_APP_CONTEXT:
Looks like I was missing DR_CLEANCALL_MULTIPATH.
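For reference, both flags go into dr_insert_clean_call_ex(); a hedged sketch with a placeholder callee and no arguments:
```
#include "dr_api.h"

// Placeholder for the tracer's buffer-handling callee.
static void
trace_buffer_full(void)
{
}

// Sketch: a clean call that reads the app context *and* is reached on only
// some of several instrumentation paths needs both flags, per this thread.
static void
insert_trace_call(void *drcontext, instrlist_t *bb, instr_t *where)
{
    dr_insert_clean_call_ex(drcontext, bb, where, (void *)trace_buffer_full,
                            static_cast<dr_cleancall_save_t>(
                                DR_CLEANCALL_READS_APP_CONTEXT | DR_CLEANCALL_MULTIPATH),
                            0 /* no args */);
}
```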
I am seeing an assert in translate.c where it hits [...]. Any idea what this assert means?
Translation failure. If it is reproducible, I would run with logging and find the instruction that fails to translate. Is it DR mangling and a case is missing in the translation code? Is it tool instrumentation?
It doesn't look like it is failing to translate a specific instruction. I think it is unable to find the mcontext->pc in the ilist. I have attached the log: translate-failure.txt |
I was clearing [...]. Clearing [...]
I now have some basic runs going through. I can start sending some PRs if the approach below looks fine. It is not perfect, but we could add this now and improve it gradually.
My main question here would be: should this feature use drbbdup for the two phases, rather than flushing? We are currently implementing #3995 using the drbbdup library to provide multiple instrumentation cases dispatched depending on the current phase: count instructions, or produce a trace. This changes the current delayed tracing feature to no longer use a flush and use drbbdup instead. Some of the problems are similar between the two:
How do other threads know a flush is in progress? Do they throw away the data and so there is a gap between the warmup and the full phase? We are currently discussing how best to handle this for #3995. Will update that issue with the current plan.
Not sure I understand this: I thought every thread's half-full buffer was discarded? Do you mean a thread might have never written its first buffer-full and so have empty output?
Yes, I had considered that initially (#3995 (comment)) and we had discussed it also (https://groups.google.com/g/dynamorio-users/c/fn6LBUxTd2o/m/98l0bK1BAQAJ). At the time we decided that flushing was okay for now since there was some work needed for bbdup. Sounds like it is in much better shape now?
They check a flag when they are in the [...]
In this case the thread was mostly idle, so the buffer had a header, and discarding it completely would create an incorrectly formatted trace.
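A conceptual sketch of that per-buffer check (the names here are stand-ins, not the actual tracer.cpp code):
```
#include <atomic>

// Hypothetical sketch of the check described above; per_thread_t and the
// buffer-writing logic are stand-ins for the real tracer state.
enum mode_t { MODE_WARMUP, MODE_FULL };
static std::atomic<int> global_mode{MODE_WARMUP};

struct per_thread_t {
    int mode_at_buffer_start = MODE_WARMUP;
    // ... buffer base/cursor pointers would live here ...
};

static void
output_buffer(per_thread_t *t)
{
    int now = global_mode.load(std::memory_order_acquire);
    if (now != t->mode_at_buffer_start) {
        // The buffer straddles the warmup->full switch.  Dropping it outright
        // is wrong when it holds only a header (the idle-thread case above);
        // the later approach inserts a marker at the first unfiltered entry
        // instead (see the sketch further below).
    }
    t->mode_at_buffer_start = now;
    // ... write entries out ...
}
```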
PR #5393 added drbbdup to drmemtrace for delayed tracing; PR #5402 is adding alternating trace-notrace windows. Per #5393 (comment) the plan is to then refactor tracer.cpp to split it into more manageable pieces, if you were interested in switching to a drbbdup-based scheme.
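For readers new to drbbdup: rather than flushing, it keeps multiple instrumented copies of each basic block and dispatches among them on a runtime encoding, so a phase switch is just a store to that encoding. A schematic of the idea, not the actual drbbdup API:
```
#include <atomic>
#include <cstdint>

// Schematic only: drbbdup generates this dispatch inline in the code cache;
// the names here are illustrative, not the library's API.
enum class tracing_phase_t : std::uintptr_t { COUNT, L0_FILTER, TRACE };

static std::atomic<tracing_phase_t> current_phase{tracing_phase_t::COUNT};

// Conceptually, each translated block begins with:
//   load current_phase
//   COUNT     -> jump to the copy instrumented for instruction counting
//   L0_FILTER -> jump to the copy with filtered-tracing instrumentation
//   TRACE     -> jump to the copy with full-tracing instrumentation
// Flipping current_phase switches every thread at its next block entry,
// with no flush and no gap between the phases.
static void
switch_phase(tracing_phase_t next)
{
    current_phase.store(next, std::memory_order_release);
}
```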
Add a separate L0_filter mode to enable switching from warmup/L0_filter mode to the unfiltered mode. Issue: #5199
I made a change to introduce an 'L0_filter' mode so that we can switch from the warmup trace to the full trace. I am only creating this if the flag [...]
I will take a look at the code. The drbbdup modes are very convenient but note that there is still some missing stability work in getting drbbdup scaled up: there is one known stability issue #5686 for scatter/gather plus at least one more similar one not yet figured out. Has drbbdup worked for your workload targets or have you also seen stability issues?
@prasun3 In your design, the final trace will be a combination of the warmup trace and full trace. Would you also want to split the two parts into two separate traces? Or would you continue to have a single combined trace? |
Added 'L0_warmup_refs' cmd line option and changes to address review feedback. Issue: #5199
Changed cmdline option from L0_warmup_refs to L0_filter_until_instrs plus minor related changes. Issue: #5199
@abhinav92003 In my first implementation I was flushing the code cache. I am redoing the code changes to use drbbdup. In the older implementation I am creating separate traces. The first thread to reach the warmup limit initiates the switch. The other threads drop any buffers that contain a mix of warmup and full-trace entries, so we have a gap between the warmup and the full trace. Since I am redoing this with drbbdup, I am open to either keeping it combined or splitting it. The problem with keeping a single file is that raw2trace gets confused between the two types of entries. I saw something about a FILTER_ENDPOINT marker in #5742. I'd be interested in learning how to insert that marker precisely in all the thread buffers.
@derekbruening sorry forgot to reply to this. We haven't done much testing with this mode. |
The record_filter tool implemented in #5675 allows filtering records in a stored offline trace. It has a feature to perform cache filtering until a certain timestamp is reached in the trace, which could be useful to create a warm-up trace preceding the full trace. In #5738, I was extending the record_filter tool to also split the trace into two parts (warmup and full trace), but we decided to go a different direction where the trace reader figures out the end of the warmup region using the new FILTER_ENDPOINT marker inserted during filtering (this obviates the need to actually separate the trace files). Since all of this filtering is done after raw2trace, we avoid some complexities related to raw2trace getting confused. I see that your design does the filtering online. One benefit I can think of is that your design also reduces tracing overhead while collecting the warm-up trace (due to writing out fewer trace entries to memory).
Ok, so in your case you have one 'full trace' raw file which is post-processed to produce a trace.gz file. Then this trace.gz file is further filtered to produce a new trace.gz file containing the following:
Is this correct? Yes, by filtering online we want to reduce tracing overhead as well as disk usage, which can be quite large at higher thread counts. When we change [...]
Documentation, help string and other minor updates. Issue: #5199
- use filter mode with align_endpoint also
- don't increment bytes_written in filter mode
- documentation update
Issue: #5199
Removed a duplicate change. Disabling warmup tests (which are known to fail) to ensure nothing else is failing. Issue: #5199
Updated test regex to handle case where tracing windows occur after the "Hello, World!" message. Issue: #5199
- moved code that compares tracing_window to local window to a separate function
- use this function to check for tracing_mode change also
- made saw_filter_endpoint_ per thread
- added tests for multi-threaded app
- added an optional parameter to increase the run duration of the pthreads test
Issue: #5199
Disabled warmup-pthreads on Windows since the pthreads exe is not built. Issue: #5199
Updated regex to make "Hit max global refs" message optional since we cannot guarantee the test won't end before that message is shown. Issue: #5199
Comments and other minor changes to address review comments. Issue: #5199
Fixed build error caused by an upstream namespace change. Issue: #5199
Removed saw_filter_endpoint_ flag Issue: #5199
Instead of discarding mixed buffers that contain both filtered and unfiltered records, look for the first unfiltered PC record and insert the marker there. Issue: #5199
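A conceptual sketch of that buffer scan (entry_t and the predicates are stand-ins for the raw offline entry format):
```
// Hypothetical sketch of the mixed-buffer handling in the commit above:
// walk the buffer, find the first unfiltered PC entry, and splice the
// FILTER_ENDPOINT marker in just before it, rather than dropping the buffer.
struct entry_t {
    bool is_pc;    // is this a PC/instruction-fetch entry?
    bool filtered; // did it go through the L0 filter?
};

static bool is_pc_entry(const entry_t *e) { return e->is_pc; }
static bool is_filtered(const entry_t *e) { return e->filtered; }

static entry_t *
find_filter_endpoint(entry_t *buf_start, entry_t *buf_end)
{
    for (entry_t *e = buf_start; e < buf_end; ++e) {
        if (is_pc_entry(e) && !is_filtered(e))
            return e; // marker is inserted before this entry
    }
    return buf_end; // no unfiltered entries yet: marker goes at the end
}
```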
- added comments clarifying the FILTER_ENDPOINT insertion code
- added namespace to doxygen comments
- changed write_trace_data to output_buffer
- use TIDFMT
Issue: #5199
Clarified why we might have filtered entries in the buffer during mode switch. Issue: #5199
Add a separate L0_filter mode to enable switching from warmup/L0_filter mode to the unfiltered mode. Added an option (-L0_filter_until_instrs) which specifies the number of instructions for the warmup trace.

In warmup mode, we filter accesses through the -L0{D,I}_filter caches. If neither -L0D_filter nor -L0I_filter is specified, then both are assumed to be true. The size of these caches can be specified using -L0{D,I}_size. The filter instructions come after the -trace_after_instrs count and before the instruction trace. This is intended to be used together with other trace options (e.g., -trace_for_instrs, -exit_after_tracing, -max_trace_size, etc.) but with the difference that a filter trace is also collected.

The filter trace and full trace are stored in a single file separated by a TRACE_MARKER_TYPE_FILTER_ENDPOINT marker. When used with windows (i.e., -retrace_every_instrs), each window contains a filter trace and a full trace. When we have windows contained in a single file, the TRACE_MARKER_TYPE_WINDOW_ID markers indicate the start of filtered records.

The following items are not addressed in this PR:
- `-align_endpoints` test is not added
- new OFFLINE_FILE_TYPE_BIMODAL type to indicate that the file has both filtered and unfiltered entries (#6164)
- items not tested:
  - handling windows with single-file all-in-one traces
  - online trace
  - ifilter only or dfilter only
  - use cases without `-trace_after_instrs`: tracing always starts in full-trace mode, so the tests fail if we use this option; need to understand why this is happening

Issue: #5199
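Putting the options above together, an illustrative invocation (the values here are arbitrary) might look like:
```
bin64/drrun -t drcachesim -offline \
    -trace_after_instrs 1000M -L0_filter_until_instrs 1M -trace_for_instrs 10M \
    -- ./app
```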
…6368) For BIMODAL traces (see #5199), the normal transition is from COUNT to L0_FILTER to TRACE. But some threads may transition from COUNT to TRACE directly, e.g., if they were not scheduled during L0_FILTER mode. In this case, the FILTER_ENDPOINT marker is currently not inserted, which leads to raw2trace failures. An example cmd line that reproduces the bug is shown below. Note that there is only one worker thread, which transitions through L0_FILTER and TRACE mode while the main thread is still in COUNT mode.
```
bin64/drrun -t drcachesim -offline \
    -trace_after_instrs 2200K -L0_filter_until_instrs 10K -max_trace_size 10K \
    -- ./threadsig 1 1000
```
Issue: #6367
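A conceptual sketch of the per-thread transition check that the fix implies (all names are stand-ins for the real tracer state):
```
// Illustrative sketch only: insert FILTER_ENDPOINT whenever a thread enters
// TRACE mode, whether it came through L0_FILTER or jumped straight from
// COUNT (the bug fixed here was the latter path skipping the marker).
enum phase_t { PHASE_COUNT, PHASE_L0_FILTER, PHASE_TRACE };

struct per_thread_t {
    phase_t phase = PHASE_COUNT;
    bool emitted_filter_endpoint = false;
};

// Stand-in for writing a marker record into this thread's buffer.
static void emit_filter_endpoint_marker(per_thread_t *t) { (void)t; }

static void
on_mode_change(per_thread_t *t, phase_t new_phase)
{
    if (new_phase == PHASE_TRACE && !t->emitted_filter_endpoint) {
        emit_filter_endpoint_marker(t);
        t->emitted_filter_endpoint = true;
    }
    t->phase = new_phase;
}
```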
Add a pointer to any prior users list discussion.
https://groups.google.com/g/dynamorio-users/c/fn6LBUxTd2o/m/aevZa4ZRAAAJ
Is your feature request related to a problem? Please describe.
It might be useful to have a "pre-trace" that can be used to warm up large caches before we start simulating instruction traces.
Describe the solution you'd like
We can have an "L0 warmup" phase just as we have a fast-forward phase (`-trace_after_instrs`) currently. During this phase, the L0_filter would be active. Once we reach the specified instruction count, we would switch to the regular trace instrumentation. An example run might look something like this:
```
drrun -t drcachesim -offline -trace_after_instrs 1000M -L0_warmup_refs 1M -- ../app
```
Do you have any implementation in mind for this feature?
This could be implemented with an approach similar to the 'trace_after_instrs' feature. We start off with the L0_filter instrumentation and then flush the code cache and enable regular trace instrumentation when the switching criterion is reached.
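A hedged sketch of that flush-based switch, using the real DR routine dr_flush_region() for a whole-cache flush (the real tracer may use a different flush variant, and flush routines have restrictions on the calling context):
```
#include "dr_api.h"

// Sketch: a mode flag consulted by the bb event, flipped once the warmup
// instruction count is reached.
static volatile bool in_warmup = true;

static void
hit_warmup_threshold(void)
{
    in_warmup = false;
    // Flush the whole code cache so blocks are rebuilt; the bb event checks
    // in_warmup and emits full-trace instrumentation for the new copies.
    dr_flush_region(NULL, (size_t)-1);
}
```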
Describe alternatives you've considered
We could just collect a really long trace but that might have higher tracing overhead, higher disk usage etc. We could use DrBBDup (#4134) but there are some TODOs still open for this feature.
Additional context