-
Notifications
You must be signed in to change notification settings - Fork 566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[drcachesim] trim trace to avoid staggered attach and detach partial-thread anomalies at ends of trace #2039
Comments
This problem occurs on attach as well: the staggered attach means that in an app with thousands of threads the threads taken over early start producing trace data well before the rest of the threads are taken over, resulting in an inaccurate initial picture of the workload. An alternative to discarding after the fact based on timestamps is to use drbbdup modes to have a non-tracing alternative mode. Only once attach has taken over all threads would we flip a switch and have the tracing mode start, simultaneously in all threads. We would use a load-acquire read of the mode flag which is fairly performant. While this is more overhead than fixing up later, it fits well with schemes that pre-populate the code cache in a lower-overhead no-tracing mode before starting tracing. There are two ways to trigger the mode switches for statically linked drmemtrace: new DR events for post-attach and pre-detach, or annotations. Annotations are more flexible as when pre-populating the tool can separate attach from tracing. We would want to implement annotations for aarchxx #1672, and solve the raw2trace issues with annotations #4141. |
Adds two new DR events: post-attach and pre-detach. These help tools align instrumentation with points where all threads are under DR control, avoiding uneven thread execution during the incremental staggered attach and detach processes. Adds some sanity tests that the events are called where they should be. Adding them to drmemtrace will be done separately. Issue: #2039
Adds two new DR events: post-attach and pre-detach. These help tools align instrumentation with points where all threads are under DR control, avoiding uneven thread execution during the incremental staggered attach and detach processes. Adds some sanity tests that the events are called where they should be, for both internal and external attach. Adding them to drmemtrace will be done separately. Issue: #2039
Adds support for multiple app copies in rseq mangling finalization's search for labels to fill in code cache PC's. Adds support for multiple rseq regions in one fragment_t by expanding the stored rseq_cs per fragment to become a list. Adds "-trace_after_instrs 5K" to the tool.drcacheoff.rseq test to exercise rseq with drbbdup. Augments the attach/detach api.rseq test to use drmemtrace, which with the forthcoming #2039 mode switches also hits this multi-rseq case. Fixes #5659
Adds support for multiple app copies in rseq mangling finalization's search for labels to fill in code cache PC's. Adds support for multiple rseq regions in one fragment_t by expanding the stored rseq_cs per fragment to become a list. Adds "-trace_after_instrs 5K" to the tool.drcacheoff.rseq test to exercise rseq with drbbdup. Augments the attach/detach api.rseq test to use drmemtrace, which with the forthcoming #2039 mode switches also hits this multi-rseq case. Fixes #5659
Fixes PR #5663 which fixed drreg restoration for drbbdup but broke it for non-drbbdup. bb->instr was NULL so the preinsert was an append, and the label as the last instruction moved the drreg restore point to after the native sequence! Here, we explicitly append to avoid confusion, and we move the native sequence to the point of the barrier instead of at the final app instr, which is where the registers are actually native. Tested on larger apps using rseq which reliably crashed without this fix. This fix also seems to fully fix drbbdup usage with drmemtrace for #2039. Issue: #5658, #2039
Fixes PR #5663 which fixed drreg restoration for drbbdup but broke it for non-drbbdup. bb->instr was NULL so the preinsert was an append, and the label as the last instruction moved the drreg restore point to after the native sequence! Here, we explicitly append to avoid confusion, and we move the native sequence to the point of the barrier instead of at the final app instr, which is where the registers are actually native. Tested on larger apps using rseq which reliably crashed without this fix. This fix also seems to fully fix drbbdup usage with drmemtrace for #2039. Issue: #5658, #2039
Adds drmemtrace_replace_file_ex_ops() with an expanded file opening function which takes in the thread id and window id, to better support the external file opener with delayed opens due to nop mode and windows. Adds a test of the function to burst_replaceall. Adds additional threads to the test and checks that the tid for each thread was seen. Issue: #2039
Adds drmemtrace_replace_file_ex_ops() with an expanded file opening function which takes in the thread id and window id, to better support the external file opener with delayed opens due to nop mode and windows. The new function takes in a struct to make it easier to extend further in the future. Adds a test of the function to burst_replaceall. Adds additional threads to the test and checks that the tid for each thread was seen, but disables that part on Windows due to #2040. Issue: #2039
Adds a new drbbdup mode to drmemtrace which performs zero instrumentation (except drwrap cleanup). This mode is used when attaching for the period prior to full control of all threads and when detaching prior to starting to let threads go native, to avoid uneven thread instrumentation during these incremental staggered processes. The mode is initially under an off-by-default new option -align_endpoints while drbbdup stability is being worked on (i#5686). Threads that did nothing during the trace-mode period are now omitted from the trace. Adds a test by adding 4 idle threads to burst_threads and ensuring they do not show up in the trace. Alignment itself was tested manually by running larger applications and analyzing the timestamp ranges in the trace. Fixes the burst_replace test to work properly with .zip output (the output regex still matched despite a printed error from raw2trace, it seems). Timestamps during detach require further work as they inaccurately imply executing after tracing mode was turned off. The next part will address this issue. Issue: #2039, #5686
Adds a new drbbdup mode to drmemtrace which performs zero instrumentation (except drwrap cleanup). This mode is used when attaching for the period prior to full control of all threads and when detaching prior to starting to let threads go native, to avoid uneven thread instrumentation during these incremental staggered processes. The mode is initially under an off-by-default new option -align_endpoints while drbbdup stability is being worked on (i#5686). Threads that did nothing during the trace-mode period are now omitted from the trace. Adds a test by adding 4 idle threads to burst_threads and ensuring they do not show up in the trace. Alignment itself was tested manually by running larger applications and analyzing the timestamp ranges in the trace. Fixes the burst_replace test to work properly with .zip output (the output regex still matched despite a printed error from raw2trace, it seems). Timestamps during detach require further work as they inaccurately imply executing after tracing mode was turned off. The next part will address this issue. Issue: #2039, #5686
Moves the timestamp+cpu headers to be added at buffer creation time instead of at buffer output time. Makes instru_t::frozen_timestamp_ use release-acquire semantics and sets it for -align_endpoints at detach time. Adds a 'minimum timestamp' recorded at post-attach time. At buffer output time, we examine the buffer's timestamp and if it was from before the minimum, we replace it with the current (not the minimum) timestamp. Tested on a large app where these changes result in much tighter timestamps matching the burst period. Issue: #2039
From #5704 (comment): we should add an invariant check that timestamps always increase monotonically through each trace. |
Moves the timestamp+cpu headers to be added at buffer creation time instead of at buffer output time. Makes instru_t::frozen_timestamp_ use release-acquire semantics and sets it for -align_endpoints at detach time. Adds a 'minimum timestamp' recorded at post-attach time. At buffer output time, we examine the buffer's timestamp and if it was from before the minimum, we replace it with the minimum. Tested on a large app where these changes result in much tighter timestamps matching the burst period. This also directly fixes #5537 which pointed out that the timestamps being at buffer output time is contradicted by the documentation. Issue: #2039, #5537 Fixes #5537
Part 4 added a minimum timestamp, but it only works for data that is not actually emitted until we are fully attached. For data emitted before that point, we'll use a too-early timestamp. We solve that here by dropping such data. Tested on a large app where these changes solve gaps at the start of the trace due to an early timestamp in one thread. Issue: #2039
Since the stability issues with drbbdup on larger apps seem limited to x86, we enable -align_endpoints by default for AArchXX. Issue: #2039
Part 4 added a minimum timestamp, but it only works for data that is not actually emitted until we are fully attached. For data emitted before that point, we'll use a too-early timestamp. We solve that here by dropping such data. It is not easy to make an automated test unfortunately. Issue: #2039
Since the stability issues with drbbdup on larger apps seem limited to x86, we enable -align_endpoints by default for AArchXX. This revealed a drbbdup issue on AArch64: OP_isb must end a bb. Added this to drbbdup's list of special opcodes, and to the bb event docs where it was missing. This fixes drcacheoff.gencode test failures. Adds more synch to the burst_replaceall test to ensure all threads are scheduled to avoid empty threads which mess up the expected output with -align_endpoints. Relaxes the burst_threadfilter template to allow empty unscheduled threads. Relaxes the burst_replace template to allow lazy file opening. Issue: #2039
Fixes an issue with refreshing the timestamp when filtering is on where the first (extended) header flag was cleared on a filter-discarded buffer. Tested by running "ctest -V --repeat-until-fail 100 -R burst_threadL0filter". Previously it failed on the 3rd test; now all 100 pass. Issue: #2039
Fixes an issue with refreshing the timestamp when filtering is on where the first (extended) header flag was cleared on a filter-discarded buffer. Tested by running "ctest -V --repeat-until-fail 100 -R burst_threadL0filter". Previously it failed on the 3rd test; now all 100 pass. Fixes #5774 Issue: #2039, #5774
With the stability issues with drbbdup seemingly solved, we enable -align_endpoints by default for x86 (it was already on for AArchXX). Fixes #2039
With the stability issues with drbbdup seemingly solved, we enable -align_endpoints by default for x86 (it was already on for AArchXX). Fixes #2039
With the stability issues with drbbdup seemingly solved, we enable -align_endpoints by default for x86 (it was already on for AArchXX). Fixes #2039
Xref #1729
Due to the asynchronous nature of stopping all the threads on detach, some
may run further and produce more data than we'd like for a bursty trace. I
already added a new option -max_trace_size to put a cap on the per-thread
trace size, but we probably also want a way to precisely throw out data
past the detach timestamp.
On the burst-threads test, without the -max_trace_size cap, the secondary
threads produce surprisingly large traces: 65MB, when the start/stop thread
only ran 4 iters of its loop. It may be worth investigating further to
ensure nothing is going wrong there: is it really just a few hundred more
iters that adds up to 65MB??
The text was updated successfully, but these errors were encountered: