Collect warmup trace (using L0_filter) followed by a full instruction trace #5199

Closed
prasun3 opened this issue Nov 10, 2021 · 29 comments · Fixed by #5717


@prasun3
Contributor

prasun3 commented Nov 10, 2021

Add a pointer to any prior users list discussion.
https://groups.google.com/g/dynamorio-users/c/fn6LBUxTd2o/m/aevZa4ZRAAAJ

Is your feature request related to a problem? Please describe.
It might be useful to have a "pre-trace" that can be used to warm up large caches before we start simulating instruction traces.

Describe the solution you'd like
We can add an "L0 warmup" phase analogous to the existing fast-forward phase (-trace_after_instrs). During this phase, the L0_filter would be active. Once the specified instruction count is reached, we would switch to the regular trace instrumentation.

An example run might look something like this:
drrun -t drcachesim -trace_after_instrs 1000M -L0_warmup_refs 1M --offline -- ../app

Do you have any implementation in mind for this feature?
This could be implemented with an approach similar to the 'trace_after_instrs' feature. We start off with the L0_filter instrumentation, then flush the code cache and enable regular trace instrumentation when the switching criterion is reached.
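The switching logic could look roughly like the sketch below. This is plain C++ modeling only the state machine; the type and member names (TracerState, warmup_refs_left, account_refs) are hypothetical stand-ins, not the actual tracer code, and the flush step is only indicated in a comment:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-phase tracer state machine: count filtered (L0)
// references during warmup, then switch to full instruction tracing once
// the budget set by a flag like -L0_warmup_refs is exhausted.
enum class TracePhase { L0_WARMUP, FULL_TRACE };

struct TracerState {
    TracePhase phase = TracePhase::L0_WARMUP;
    uint64_t warmup_refs_left;

    explicit TracerState(uint64_t warmup_refs) : warmup_refs_left(warmup_refs) {}

    // Called from a (hypothetical) buffer-fill handler with the number of
    // references collected under the L0_filter instrumentation.
    void account_refs(uint64_t refs) {
        if (phase != TracePhase::L0_WARMUP)
            return;
        if (refs >= warmup_refs_left) {
            warmup_refs_left = 0;
            // In the real client this is where the code cache would be
            // flushed (e.g. via dr_flush_region_ex) so that blocks are
            // rebuilt with the full tracing instrumentation.
            phase = TracePhase::FULL_TRACE;
        } else {
            warmup_refs_left -= refs;
        }
    }
};
```

Once the phase flips, the bb-instrumentation event would consult `phase` and emit the full trace instrumentation instead of the L0_filter version.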

Describe alternatives you've considered
We could simply collect a much longer trace, but that would incur higher tracing overhead, higher disk usage, etc. We could use DrBBDup (#4134), but some TODOs are still open for that feature.

Additional context

@derekbruening
Contributor

For an online trace, how are the two phases distinguished in analysis tools? Is a new instance of drcachesim spun up for the switch? Or are the trace entries just run together, with some kind of marker in between like the offline file type marker?

The same question applies to an offline trace, though there it's easier to have completely separate output files. I assume that's the plan.

@prasun3
Contributor Author

prasun3 commented Nov 11, 2021

Yes, for offline traces we will have separate files. I have been focusing on offline traces and hadn't thought about online traces. I think the marker idea sounds good; that way the analyzer has control over what it does with data from the different phases.
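For the online case, the phase switch could be flagged with a marker entry in the stream, along these lines. This is a sketch with made-up entry-type and constant names (ENTRY_PHASE_MARKER, PHASE_FULL_TRACE), not drcachesim's actual wire format:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical trace-entry encoding: a type tag plus a payload, with a
// dedicated marker type announcing "filtered warmup data ends here".
enum EntryType : uint8_t { ENTRY_INSTR, ENTRY_DATA, ENTRY_PHASE_MARKER };

struct TraceEntry {
    EntryType type;
    uint64_t value; // pc/address for refs, or a phase id for markers
};

constexpr uint64_t PHASE_FULL_TRACE = 1;

// Emitted once when switching from L0-filtered warmup to full tracing; an
// analyzer scanning the stream can reset or reinterpret its state here.
void emit_phase_marker(std::vector<TraceEntry> &stream) {
    stream.push_back({ENTRY_PHASE_MARKER, PHASE_FULL_TRACE});
}
```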

@prasun3
Contributor Author

prasun3 commented Dec 7, 2021

While switching instrumentation, I need to ensure that the regs are restored. Therefore, I tried to change clean_call to use DR_CLEANCALL_READS_APP_CONTEXT. After making this change, I ran into an assert: "DynamoRIO usage error : encode_cti error: target beyond 8-bit reach".

This is the change I made

diff --git a/clients/drcachesim/tracer/tracer.cpp b/clients/drcachesim/tracer/tracer.cpp
index c97ed93..4071b14 100644
--- a/clients/drcachesim/tracer/tracer.cpp
+++ b/clients/drcachesim/tracer/tracer.cpp
@@ -843,8 +843,11 @@ instrument_clean_call(void *drcontext, instrlist_t *ilist, instr_t *where,
     reg_id_t reg_tmp =
         insert_conditional_skip(drcontext, ilist, where, reg_ptr, skip_call,
                                 short_reaches, app_regs_at_skip_call);
-    dr_insert_clean_call_ex(drcontext, ilist, where, (void *)clean_call,
-                            DR_CLEANCALL_ALWAYS_OUT_OF_LINE, 0);
+    dr_insert_clean_call_ex(
+        drcontext, ilist, where, (void *)clean_call,
+        static_cast<dr_cleancall_save_t>(DR_CLEANCALL_READS_APP_CONTEXT |
+                                         DR_CLEANCALL_ALWAYS_OUT_OF_LINE),
+        0);
     insert_conditional_skip_target(drcontext, ilist, where, skip_call, reg_tmp,
                                    app_regs_at_skip_call);
     insert_conditional_skip_target(drcontext, ilist, where, skip_thread, reg_thread,

It looks like the generated code is bigger now due to the additional GPR restore code, so the jnle instruction, which uses an 8-bit encoding, can no longer reach its target. Do we need to rewrite the instruction with a jnle variant that can handle this displacement? Have we run into a similar problem before?

@derekbruening
Contributor

No, there won't be any problems with application branches. It is a tool-inserted short jump. You can see it here:

 +240  m4 @0x00007fb179b75728  e3 fe                jrcxz  @0x00007fb179b6f7c8[8byte] %rcx
...
 +403  m4 @0x00007fb179b6f7c8                       <label>

Probably that is the tracer skipping the clean call unless the buffer is full.

Xref #56 but that requires flags.
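The reach problem itself is simple arithmetic: a rel8 branch stores a signed 8-bit displacement measured from the end of the branch instruction, so it can only span [-128, 127] bytes. A sketch of the check a `short_reaches` decision has to make (not DR's encoder, just the arithmetic):

```cpp
#include <cstdint>

// A rel8 branch (e.g. jrcxz, or the 2-byte jcc forms) encodes a signed
// 8-bit displacement relative to the end of the branch instruction.
bool short_branch_reaches(intptr_t branch_end_pc, intptr_t target_pc) {
    intptr_t disp = target_pc - branch_end_pc;
    return disp >= INT8_MIN && disp <= INT8_MAX;
}
```

When the check fails, the usual fallback is the inverted-branch trampoline visible in the dumps later in this thread: a short jrcxz to a nearby label followed by an unconditional near jmp that has 32-bit reach.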

@prasun3
Contributor Author

prasun3 commented Dec 8, 2021

Thanks! I have set short_reaches to false for now. Now I am running into cases where some GPRs appear to be clobbered.

For example, the value in rdx here is clearly bad.

before instrumentation:
TAG  0x00007fbbcc372f4b
 +0    L3 @0x00007fb988452470  48 89 14 c6          mov    %rdx -> (%rsi,%rax,8)[8byte]
 +4    L3 @0x00007fb98844e980  48 83 c2 10          add    $0x0000000000000010 %rdx -> %rdx
 +8    L3 @0x00007fb9884529d8  48 8b 02             mov    (%rdx)[8byte] -> %rax
 +11   L3 @0x00007fb9884514c8  48 85 c0             test   %rax %rax
 +14   L3 @0x00007fb98844e818  74 35                jz     $0x00007fbbcc372f90
END 0x00007fbbcc372f4b

drreg_restore_app_value @0.0x00007fbbcc372f59: reg rax already native
drreg_restore_app_value @0.0x00007fbbcc372f59: reg rdx already native
..
drreg_restore_app_value @0.0x00007fbbcc372f59: reg rsi already native

priv_mcontext_t @0x00007fb988424080
	xax = 0x000000000000000e
	xbx = 0x000000006ffffeff
	xcx = 0x000000037ffffb70
	xdx = 0x2004000800001f5b

opnd_compute_address for: (%rdx)
	base => 0x2004000800001f6b
	index,scale => 0x2004000800001f6b
	disp => 0x2004000800001f6b
For SIGSEGV at cache pc 0x00007fbb886015d2, computed target read 0x2004000800001f6b
	faulting instr: mov    (%rdx)[8byte] -> %rax
** Received SIGSEGV at cache pc 0x00007fbb886015d2 in thread 55106

Looking back a few blocks, rdx was native here:

before instrumentation:
TAG  0x00007fbbcc372ea0
 +0    L3 @0x00007fb9884504e8  55                   push   %rbp %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +1    L3 @0x00007fb988450618  48 89 e5             mov    %rsp -> %rbp
 +4    L3 @0x00007fb988450550  41 57                push   %r15 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +6    L3 @0x00007fb988450468  41 56                push   %r14 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +8    L3 @0x00007fb98844dc20  41 55                push   %r13 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +10   L3 @0x00007fb98844dca0  41 54                push   %r12 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +12   L3 @0x00007fb98844e6d0  49 89 fc             mov    %rdi -> %r12
 +15   L3 @0x00007fb98844eb20  53                   push   %rbx %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +16   L3 @0x00007fb98844e0d0  48 83 ec 48          sub    $0x0000000000000048 %rsp -> %rsp
 +20   L3 @0x00007fb98844e900  0f 31                rdtsc   -> %edx %eax
 +22   L3 @0x00007fb9884503e8  48 c1 e2 20          shl    $0x0000000000000020 %rdx -> %rdx
 +26   L3 @0x00007fb988450368  89 c0                mov    %eax -> %eax
 +28   L3 @0x00007fb988450300  48 09 c2             or     %rax %rdx -> %rdx
 +31   L3 @0x00007fb98844ebb8  48 8b 05 a2 7f 22 00 mov    <rel> 0x00007fbbcc59ae68[8byte] -> %rax
 +38   L3 @0x00007fb98844ea50  48 89 15 53 78 22 00 mov    %rdx -> <rel> 0x00007fbbcc59a720[8byte]
 +45   L3 @0x00007fb98844e448  48 8d 15 94 7f 22 00 lea    <rel> 0x00007fbbcc59ae68 -> %rdx
 +52   L3 @0x00007fb98844e2c8  49 89 d6             mov    %rdx -> %r14
 +55   L3 @0x00007fb98844e9e8  4c 2b 35 22 81 22 00 sub    <rel> 0x00007fbbcc59b000[8byte] %r14 -> %r14
 +62   L3 @0x00007fb98844e980  48 85 c0             test   %rax %rax
 +65   L3 @0x00007fb98844e818  48 89 15 18 8b 22 00 mov    %rdx -> <rel> 0x00007fbbcc59ba00[8byte]
 +72   L3 @0x00007fb98844da20  4c 89 35 01 8b 22 00 mov    %r14 -> <rel> 0x00007fbbcc59b9f0[8byte]
 +79   L3 @0x00007fb98844e518  0f 84 9b 00 00 00    jz     $0x00007fbbcc372f90
END 0x00007fbbcc372ea0

after instrumentation:
...
 +274  L3 @0x00007fb98844e818  48 89 15 18 8b 22 00 mov    %rdx -> <rel> 0x00007fbbcc59ba00[8byte]
...
END 0x00007fbbcc372ea0

and the next block doesn't touch it:

before instrumentation:
TAG  0x00007fbbcc372ef5
 +0    L3 @0x00007fb988452670  48 8d 3d 64 81 22 00 lea    <rel> 0x00007fbbcc59b060 -> %rdi
 +7    L3 @0x00007fb98844e518  49 b9 d0 03 00 80 03 mov    $0x00000003800003d0 -> %r9
                               00 00 00
 +17   L3 @0x00007fb9884525f0  48 b9 70 fb ff 7f 03 mov    $0x000000037ffffb70 -> %rcx
                               00 00 00
 +27   L3 @0x00007fb988451768  41 b8 ff ff ff 6f    mov    $0x6fffffff -> %r8d
 +33   L3 @0x00007fb9884518b8  41 bb ff fd ff 6f    mov    $0x6ffffdff -> %r11d
 +39   L3 @0x00007fb988451850  bb ff fe ff 6f       mov    $0x6ffffeff -> %ebx
 +44   L3 @0x00007fb988451df8  49 01 f9             add    %rdi %r9 -> %r9
 +47   L3 @0x00007fb988451c90  41 ba 31 00 00 00    mov    $0x00000031 -> %r10d
 +53   L3 @0x00007fb988451c28  48 01 cf             add    %rcx %rdi -> %rdi
 +56   L3 @0x00007fb988452570  48 8d 35 fc 8a 22 00 lea    <rel> 0x00007fbbcc59ba30 -> %rsi
 +63   L3 @0x00007fb9884524f0  eb 25                jmp    $0x00007fbbcc372f5b
END 0x00007fbbcc372ef5

...

priv_mcontext_t @0x00007fb988424080
	xax = 0x000000000000000e
	xbx = 0x0000000000000000
	xcx = 0x0000000000000000
	xdx = 0x00007fbbcc59ae68

It is used by instrumentation in the next block but is restored:

before instrumentation:
TAG  0x00007fbbcc372f5b
 +0    L3 @0x00007fb9884524f0  48 83 f8 21          cmp    %rax $0x0000000000000021
 +4    L3 @0x00007fb98844e980  76 ea                jbe    $0x00007fbbcc372f4b
END 0x00007fbbcc372f5b

 +18   m4 @0x00007fb988451920  65 48 89 14 25 e8 00 mov    %rdx -> %gs:0x000000e8[8byte]

 +97   m4 @0x00007fb988451b58  65 48 8b 14 25 e8 00 mov    %gs:0x000000e8[8byte] -> %rdx

priv_mcontext_t @0x00007fb988424080
	xax = 0x000000000000000e
	xbx = 0x000000006ffffeff
	xcx = 0x000000037ffffb70
	xdx = 0x00007fbbcc59ae68

Full log attached: log.0.55106.html.txt

@derekbruening
Contributor

I haven't looked at the details, but if there is internal instrumentation control flow, be sure that all paths are parallel, typically by putting all spills before any branches and all restores after all branches. That is one potential source of problems.
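The rule above can be illustrated with a toy spill/restore model (hypothetical code, not drreg's): if the restore sits on only one of two paths, the path that skips it hands the next app block the tool's scratch value instead of the app value.

```cpp
#include <cstdint>

// Toy model: one app register, one spill slot, and instrumentation that may
// branch over its own body. 'take_skip' models a tool-inserted branch such
// as a jrcxz that jumps past the clean call when the buffer is not full.
struct Toy {
    uint64_t reg;
    uint64_t slot;
};

// Broken layout: the restore lives inside the skipped region, so the skip
// path reaches the app code with the tool value still live in the register.
uint64_t broken(Toy t, bool take_skip) {
    t.slot = t.reg;        // spill before the branch (fine)
    t.reg = 0xdeadbeef;    // tool scratch use
    if (!take_skip) {
        t.reg = t.slot;    // restore only on the fall-through path (bug)
    }
    return t.reg;          // value the next app block sees
}

// Parallel layout: all restores postdominate all internal branches.
uint64_t parallel(Toy t, bool take_skip) {
    t.slot = t.reg;
    t.reg = 0xdeadbeef;
    if (!take_skip) {
        /* clean-call body would run here */
    }
    t.reg = t.slot;        // restore after both paths rejoin
    return t.reg;
}
```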

@prasun3
Contributor Author

prasun3 commented Dec 14, 2021

OK, I see now that rdx is not restored when this jrcxz branch is taken. So ideally it should be restored after the jrcxz+jmp and before control reaches the jbe?

 +69   m4 @0x00007fb988451700  e3 fe                jrcxz  @0x00007fb98844dee8[8byte] %rcx

 +74   m4 @0x00007fb98844dee8                       <label>
 +74   m4 @0x00007fb988452c58  e9 fb ff ff ff       jmp    @0x00007fb988452ad8[8byte]

 +219  m4 @0x00007fb988452ad8                       <label>
 +219  m4 @0x00007fb98844e818                       <label>
 +219  m4 @0x00007fb9884514c8                       <label>
 +219  m4 @0x00007fb9884529d8  65 48 8b 0c 25 e0 00 mov    %gs:0x000000e0[8byte] -> %rcx
                               00 00
 +228  L3 @0x00007fb98844e980  76 ea                jbe    $0x00007fbbcc372f4b

Without DR_CLEANCALL_READS_APP_CONTEXT, rcx and rdx are not restored before the clean call (which makes sense), but both are restored before the last instruction in the basic block.

With DR_CLEANCALL_READS_APP_CONTEXT, rcx and rdx are restored before the clean call, but only rcx is restored before the last instruction, which leaves the tool value in rdx and causes a segfault in the next basic block where it is used.

Is this due to incorrect handling in the tracer or in the instrumentation API?

instrument_basic_block ******************

before instrumentation:
TAG  0x00007fbbcc372f5b
 +0    L3 @0x00007fb9884524f0  48 83 f8 21          cmp    %rax $0x0000000000000021
 +4    L3 @0x00007fb98844e980  76 ea                jbe    $0x00007fbbcc372f4b
END 0x00007fbbcc372f5b

drreg_event_bb_analysis @0.0x00007fbbcc372f5f: rax=1 rcx=1 rdx=1 rbx=1 rsp=1 rbp=1 rsi=1 rdi=1 r8=1 r9=1 r10=1 r11=1 r12=1 r13=1 r14=1 r15=1 flags=287
drreg_event_bb_analysis @1.0x00007fbbcc372f5b: rax=1 rcx=1 rdx=1 rbx=1 rsp=1 rbp=1 rsi=1 rdi=1 r8=1 r9=1 r10=1 r11=1 r12=1 r13=1 r14=1 r15=1 flags=0
drreg_reserve_reg_internal @1.0x00007fbbcc372f5b: spilling rcx to slot 0
spill_reg @1.0x00007fbbcc372f5b rcx 0
mutex_delete 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock                             0                              0                               0                              0                               0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
remove_process_lock 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock                             0                              0                               0                              0                               0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
	never acquired
drreg_reserve_reg_internal @1.0x00007fbbcc372f5b: spilling rdx to slot 1
spill_reg @1.0x00007fbbcc372f5b rdx 1
drreg_unreserve_register @1.0x00007fbbcc372f5b rdx
drreg_unreserve_register @1.0x00007fbbcc372f5b rcx
drreg_reserve_reg_internal @0.0x00007fbbcc372f5f: using un-restored rcx slot 0
drreg_reserve_reg_internal @0.0x00007fbbcc372f5f: rcx already spilled to slot 0
mutex_delete 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock                             0                              0                               0                              0                               0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
remove_process_lock 0 lock 0x00007fb9883e5ad0: name=dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
rank=43 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x0000000000000000
lock                             0                              0                               0                              0                               0+2 dr_client_mutex(mutex)@/home/prasun/dynamorio/dynamorio/core/lib/instrument.c:3527
	never acquired
CLEANCALL: insert clean call to 0x00007fbb883b8898
drreg_event_clean_call_insertion: restoring for cleancall to read app regs
drreg_insert_restore_all @0.0x0000000000000000: restoring rcx for app read
spill_reg @0.0x0000000000000000 rcx 2
restore_reg @0.0x0000000000000000 rcx slot=0 release=0
restore_reg @0.0x00007fbbcc372f5f rcx slot=2 release=1
drreg_insert_restore_all @0.0x0000000000000000: lazily restoring rdx
drreg_restore_reg_now @0.0x0000000000000000: restoring rdx
restore_reg @0.0x0000000000000000 rdx slot=1 release=1
insert_parameter_preparation: 0 args, 6 in-reg, 0 pre-push, 0/0 stack
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rax already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbx already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rdi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r8 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r9 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r10 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r11 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r12 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r13 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r14 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r15 already native
drreg_unreserve_register @0.0x00007fbbcc372f5f rcx
drreg_insert_restore_all @0.0x00007fbbcc372f5f: lazily restoring rcx
drreg_restore_reg_now @0.0x00007fbbcc372f5f: restoring rcx
restore_reg @0.0x00007fbbcc372f5f rcx slot=0 release=1

after instrumentation:
TAG  0x00007fbbcc372f5b
 +0    m4 @0x00007fb988451ff8  65 48 89 0c 25 e0 00 mov    %rcx -> %gs:0x000000e0[8byte]
                               00 00
 +9    m4 @0x00007fb988451ef8  65 48 8b 0c 25 20 01 mov    %gs:0x00000120[8byte] -> %rcx
                               00 00
 +18   m4 @0x00007fb988451920  65 48 89 14 25 e8 00 mov    %rdx -> %gs:0x000000e8[8byte]                # spill rdx
                               00 00
 +27   m4 @0x00007fb988451a70  48 ba 5b 1f 00 00 08 mov    $0x2004000800001f5b -> %rdx                  # populate OFFLINE_TYPE_PC
                               00 04 20
 +37   m4 @0x00007fb98844dd98  48 89 11             mov    %rdx -> (%rcx)[8byte]                        # save to buffer
 +40   m4 @0x00007fb98844da20                       <label>
 +40   m4 @0x00007fb988451530  48 8d 49 08          lea    0x08(%rcx) -> %rcx
 +44   m4 @0x00007fb9884515b0  65 48 89 0c 25 20 01 mov    %rcx -> %gs:0x00000120[8byte]
                               00 00
 +53   m4 @0x00007fb988451e78                       <label>
 +53   L3 @0x00007fb9884524f0  48 83 f8 21          cmp    %rax $0x0000000000000021
 +57   m4 @0x00007fb988452a58  65 48 8b 0c 25 20 01 mov    %gs:0x00000120[8byte] -> %rcx
                               00 00
 +66   m4 @0x00007fb988452b58  48 8b 09             mov    (%rcx)[8byte] -> %rcx
 +69   m4 @0x00007fb988451700  e3 fe                jrcxz  @0x00007fb98844dee8[8byte] %rcx              # jumps to end without restoring rdx
 +71   m4 @0x00007fb988451618  eb fe                jmp    @0x00007fb988451680[8byte]
 +73   m4 @0x00007fb988452bd8  90                   nop
 +74   m4 @0x00007fb98844dee8                       <label>
 +74   m4 @0x00007fb988452c58  e9 fb ff ff ff       jmp    @0x00007fb988452ad8[8byte]
 +79   m4 @0x00007fb988451680                       <label>
 +79   m4 @0x00007fb988451a08  65 48 89 0c 25 f0 00 mov    %rcx -> %gs:0x000000f0[8byte]
                               00 00
 +88   m4 @0x00007fb988451af0  65 48 8b 0c 25 e0 00 mov    %gs:0x000000e0[8byte] -> %rcx
                               00 00
 +97   m4 @0x00007fb988451b58  65 48 8b 14 25 e8 00 mov    %gs:0x000000e8[8byte] -> %rdx                # restore rdx
                               00 00
 +106  m4 @0x00007fb9884519a0  65 48 a3 00 00 00 00 mov    %rax -> %gs:0x00[8byte]
                               00 00 00 00
 +117  m4 @0x00007fb988451bc0  65 48 a1 20 00 00 00 mov    %gs:0x20[8byte] -> %rax
                               00 00 00 00
 +128  m4 @0x00007fb988451d78  48 89 60 18          mov    %rsp -> 0x18(%rax)[8byte]
 +132  m4 @0x00007fb988451f78  48 8b a0 48 09 00 00 mov    0x00000948(%rax)[8byte] -> %rsp
 +139  m4 @0x00007fb988452078  65 48 a1 00 00 00 00 mov    %gs:0x00[8byte] -> %rax
                               00 00 00 00
 +150  m4 @0x00007fb988452670  48 8d a4 24 f8 f6 ff lea    0xfffff6f8(%rsp) -> %rsp
                               ff
 +158  m4 @0x00007fb988451c28  e8 0b 25 1b 00       call   $0x00007fbb885fce00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +163  m4 @0x00007fb988451c90                       <label>
 +163  m4 @0x00007fb988451df8  e8 a3 df f6 ff       call   $0x00007fbb883b8898 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +168  m4 @0x00007fb988451850  e8 0b 26 1b 00       call   $0x00007fbb885fcf00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +173  m4 @0x00007fb9884518b8  65 48 a3 00 00 00 00 mov    %rax -> %gs:0x00[8byte]
                               00 00 00 00
 +184  m4 @0x00007fb988451768  65 48 a1 20 00 00 00 mov    %gs:0x20[8byte] -> %rax
                               00 00 00 00
 +195  m4 @0x00007fb9884525f0  48 8b 60 18          mov    0x18(%rax)[8byte] -> %rsp
 +199  m4 @0x00007fb98844e518  65 48 a1 00 00 00 00 mov    %gs:0x00[8byte] -> %rax
                               00 00 00 00
 +210  m4 @0x00007fb988452cd8                       <label>
 +210  m4 @0x00007fb9884517d0                       <label>
 +210  m4 @0x00007fb988451cf8  65 48 8b 0c 25 f0 00 mov    %gs:0x000000f0[8byte] -> %rcx
                               00 00
 +219  m4 @0x00007fb988452ad8                       <label>
 +219  m4 @0x00007fb98844e818                       <label>
 +219  m4 @0x00007fb9884514c8                       <label>
 +219  m4 @0x00007fb9884529d8  65 48 8b 0c 25 e0 00 mov    %gs:0x000000e0[8byte] -> %rcx
                               00 00
 +228  L3 @0x00007fb98844e980  76 ea                jbe    $0x00007fbbcc372f4b
END 0x00007fbbcc372f5b

Without DR_CLEANCALL_READS_APP_CONTEXT

after instrumentation:
TAG  0x00007f8c480a7f5b
 +0    m4 @0x00007f8a04186e78  65 48 89 0c 25 e0 00 mov    %rcx -> %gs:0x000000e0[8byte]
                               00 00
 +9    m4 @0x00007f8a04186d78  65 48 8b 0c 25 20 01 mov    %gs:0x00000120[8byte] -> %rcx
                               00 00
 +18   m4 @0x00007f8a041867d0  65 48 89 14 25 e8 00 mov    %rdx -> %gs:0x000000e8[8byte]
                               00 00
 +27   m4 @0x00007f8a04182fd0  48 ba 5b 1f 00 00 08 mov    $0x2004000800001f5b -> %rdx
                               00 04 20
 +37   m4 @0x00007f8a04186530  48 89 11             mov    %rdx -> (%rcx)[8byte]
 +40   m4 @0x00007f8a04187bd8                       <label>
 +40   m4 @0x00007f8a041865b0  48 8d 49 08          lea    0x08(%rcx) -> %rcx
 +44   m4 @0x00007f8a041864c8  65 48 89 0c 25 20 01 mov    %rcx -> %gs:0x00000120[8byte]
                               00 00
 +53   m4 @0x00007f8a04186cf8                       <label>
 +53   L3 @0x00007f8a04187370  48 83 f8 21          cmp    %rax $0x0000000000000021
 +57   m4 @0x00007f8a041879d8  65 48 8b 0c 25 20 01 mov    %gs:0x00000120[8byte] -> %rcx
                               00 00
 +66   m4 @0x00007f8a04183750  48 8b 09             mov    (%rcx)[8byte] -> %rcx
 +69   m4 @0x00007f8a04186618  e3 fe                jrcxz  @0x00007f8a04186680[8byte] %rcx
 +71   m4 @0x00007f8a04187ad8  eb fe                jmp    @0x00007f8a04186700[8byte]
 +73   m4 @0x00007f8a04187b58  90                   nop
 +74   m4 @0x00007f8a04186680                       <label>
 +74   m4 @0x00007f8a04183ab8  e9 fb ff ff ff       jmp    @0x00007f8a04183668[8byte]
 +79   m4 @0x00007f8a04186700                       <label>
 +79   m4 @0x00007f8a04186bc0  65 48 a3 00 00 00 00 mov    %rax -> %gs:0x00[8byte]
                               00 00 00 00
 +90   m4 @0x00007f8a04186a70  65 48 a1 20 00 00 00 mov    %gs:0x20[8byte] -> %rax
                               00 00 00 00
 +101  m4 @0x00007f8a04186af0  48 89 60 18          mov    %rsp -> 0x18(%rax)[8byte]
 +105  m4 @0x00007f8a04186b58  48 8b a0 48 09 00 00 mov    0x00000948(%rax)[8byte] -> %rsp
 +112  m4 @0x00007f8a04186a08  65 48 a1 00 00 00 00 mov    %gs:0x00[8byte] -> %rax
                               00 00 00 00
 +123  m4 @0x00007f8a04186df8  48 8d a4 24 f8 f6 ff lea    0xfffff6f8(%rsp) -> %rsp
                               ff
 +131  m4 @0x00007f8a04186ef8  e8 0b 25 1b 00       call   $0x00007f8c04331e00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +136  m4 @0x00007f8a04187570                       <label>
 +136  m4 @0x00007f8a04187470  48 bf 5f 7f 0a 48 8c mov    $0x00007f8c480a7f5f -> %rdi
                               7f 00 00
 +146  m4 @0x00007f8a04186c28  e8 53 e0 f6 ff       call   $0x00007f8c040ed948 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +151  m4 @0x00007f8a04186c90  e8 0b 26 1b 00       call   $0x00007f8c04331f00 %rsp -> %rsp 0xfffffff8(%rsp)[8byte]
 +156  m4 @0x00007f8a04186850  65 48 a3 00 00 00 00 mov    %rax -> %gs:0x00[8byte]
                               00 00 00 00
 +167  m4 @0x00007f8a041868b8  65 48 a1 20 00 00 00 mov    %gs:0x20[8byte] -> %rax
                               00 00 00 00
 +178  m4 @0x00007f8a04186768  48 8b 60 18          mov    0x18(%rax)[8byte] -> %rsp
 +182  m4 @0x00007f8a041874f0  65 48 a1 00 00 00 00 mov    %gs:0x00[8byte] -> %rax
                               00 00 00 00
 +193  m4 @0x00007f8a041869a0                       <label>
 +193  m4 @0x00007f8a04186920                       <label>
 +193  m4 @0x00007f8a04183668                       <label>
 +193  m4 @0x00007f8a04187a58                       <label>
 +193  m4 @0x00007f8a04187958                       <label>
 +193  m4 @0x00007f8a04183b20  65 48 8b 0c 25 e0 00 mov    %gs:0x000000e0[8byte] -> %rcx
                               00 00
 +202  m4 @0x00007f8a041878d8  65 48 8b 14 25 e8 00 mov    %gs:0x000000e8[8byte] -> %rdx
                               00 00
 +211  L3 @0x00007f8a04183518  76 ea                jbe    $0x00007f8c480a7f4b
END 0x00007f8c480a7f5b

@prasun3
Contributor Author

prasun3 commented Jan 7, 2022

In the regular case, drreg_event_bb_insert_late calls drreg_insert_restore_all, which restores both rcx and rdx. With DR_CLEANCALL_READS_APP_CONTEXT, drreg_event_clean_call_insertion has already restored rdx, so drreg_event_bb_insert_late skips this reg.

All of this happens inside drreg code, which is not aware of the control flow inserted by the tracer; the tracer, in turn, is not aware of which registers were marked native by dr_insert_clean_call_ex. So how should we handle this?

Regular case:

CLEANCALL: insert clean call to 0x00007f8c040ed948
insert_parameter_preparation: 1 args, 6 in-reg, 0 pre-push, 0/0 stack
in insert_conditional_skip_target: reg: rax
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rax already native
in insert_conditional_skip_target: reg: rbx
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rbx already native
in insert_conditional_skip_target: reg: rsp
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rsp already native
in insert_conditional_skip_target: reg: rbp
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rbp already native
in insert_conditional_skip_target: reg: rsi
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rsi already native
in insert_conditional_skip_target: reg: rdi
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg rdi already native
in insert_conditional_skip_target: reg: r8
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r8 already native
in insert_conditional_skip_target: reg: r9
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r9 already native
in insert_conditional_skip_target: reg: r10
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r10 already native
in insert_conditional_skip_target: reg: r11
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r11 already native
in insert_conditional_skip_target: reg: r12
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r12 already native
in insert_conditional_skip_target: reg: r13
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r13 already native
in insert_conditional_skip_target: reg: r14
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r14 already native
in insert_conditional_skip_target: reg: r15
drreg_restore_app_value @0.0x00007f8c480a7f5f: reg r15 already native
drreg_unreserve_register @0.0x00007f8c480a7f5f rcx
drreg_insert_restore_all @0.0x00007f8c480a7f5f: lazily restoring rcx
drreg_restore_reg_now @0.0x00007f8c480a7f5f: restoring rcx
restore_reg @0.0x00007f8c480a7f5f rcx slot=0 release=1
drreg_insert_restore_all @0.0x00007f8c480a7f5f: lazily restoring rdx
drreg_restore_reg_now @0.0x00007f8c480a7f5f: restoring rdx
restore_reg @0.0x00007f8c480a7f5f rdx slot=1 release=1

With DR_CLEANCALL_READS_APP_CONTEXT

CLEANCALL: insert clean call to 0x00007fbb883b8898
drreg_event_clean_call_insertion: restoring for cleancall to read app regs
drreg_insert_restore_all @0.0x0000000000000000: restoring rcx for app read
spill_reg @0.0x0000000000000000 rcx 2
restore_reg @0.0x0000000000000000 rcx slot=0 release=0
restore_reg @0.0x00007fbbcc372f5f rcx slot=2 release=1
drreg_insert_restore_all @0.0x0000000000000000: lazily restoring rdx                |
drreg_restore_reg_now @0.0x0000000000000000: restoring rdx                          |---> rdx restored by drreg_event_clean_call_insertion
restore_reg @0.0x0000000000000000 rdx slot=1 release=1                              |
insert_parameter_preparation: 0 args, 6 in-reg, 0 pre-push, 0/0 stack
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rax already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbx already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rbp already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rsi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg rdi already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r8 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r9 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r10 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r11 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r12 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r13 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r14 already native
drreg_restore_app_value @0.0x00007fbbcc372f5f: reg r15 already native
drreg_unreserve_register @0.0x00007fbbcc372f5f rcx
drreg_insert_restore_all @0.0x00007fbbcc372f5f: lazily restoring rcx
drreg_restore_reg_now @0.0x00007fbbcc372f5f: restoring rcx
restore_reg @0.0x00007fbbcc372f5f rcx slot=0 release=1
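The bookkeeping mismatch in the two logs above can be reduced to a toy model (hypothetical names, not drreg's actual data structures): the clean-call insertion restores rdx and records it as native, so the end-of-block pass emits no restore for it, even though the restore it relied on sits inside a region the tool's own branch can skip.

```cpp
#include <map>
#include <string>

// Toy model of drreg's lazy-restore bookkeeping: a per-register flag
// recording whether the register already holds the app value.
struct RegBook {
    std::map<std::string, bool> native;
};

// Models the clean-call insertion with DR_CLEANCALL_READS_APP_CONTEXT:
// everything is restored for the callee and marked native.
void cleancall_restore_all(RegBook &b) {
    for (auto &kv : b.native)
        kv.second = true;
}

// Models the end-of-block restore pass: it only emits restores for
// registers it does not believe are already native.
int end_of_block_restores(const RegBook &b) {
    int emitted = 0;
    for (const auto &kv : b.native)
        if (!kv.second)
            ++emitted;
    return emitted;
}
```

Nothing in this bookkeeping knows that the instructions emitted by `cleancall_restore_all` can be jumped over at run time, which is exactly the gap the question above is about.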

@prasun3
Contributor Author

prasun3 commented Jan 10, 2022

Looks like I was missing DR_CLEANCALL_MULTIPATH.

@prasun3
Contributor Author

prasun3 commented Feb 10, 2022

I am seeing an assert in translate.c where it hits ASSERT_NOT_REACHED in recreate_app_state_from_ilist. This happens when we call dr_flush_region_ex. The application is a slightly modified version of threadsig (increased data footprint, no signals). I have run with 4 and 8 threads and haven't tried other thread counts.

Any idea what this assert means?

Thread 2 "threadsig2" hit Breakpoint 1, recreate_app_state_from_ilist (tdcontext=0x7ffdb3e401c0, ilist=0x7ffdb4081be0,
    start_app=0x7ffff6275ad0 "\017\030\216@\376\377\377\017\030\216\200\375\377\377f\017oF\360f\017oN\340f\017oV\320f\017o^\300f\017of\260f\017on\240f\017ov\220f\017o~\200H\215v\200H\201", <incomplete sequence \352\200>, start_cache=0x7fffb3fc4019 "eH\211\f", <incomplete sequence \340>,
    end_cache=0x7fffb3fc480d "\314\314\314\070Y\353\263\375\177", mc=0x7ffdb3f1d020, just_pc=false, flags=16778806) at /home/prasun/dynamorio/core/translate.c:1117
1117        ASSERT_NOT_REACHED();
(gdb) where
#0  recreate_app_state_from_ilist (tdcontext=0x7ffdb3e401c0, ilist=0x7ffdb4081be0,
    start_app=0x7ffff6275ad0 "\017\030\216@\376\377\377\017\030\216\200\375\377\377f\017oF\360f\017oN\340f\017oV\320f\017o^\300f\017of\260f\017on\240f\017ov\220f\017o~\200H\215v\200H\201", <incomplete sequence \352\200>, start_cache=0x7fffb3fc4019 "eH\211\f", <incomplete sequence \340>,
    end_cache=0x7fffb3fc480d "\314\314\314\070Y\353\263\375\177", mc=0x7ffdb3f1d020, just_pc=false, flags=16778806) at /home/prasun/dynamorio/core/translate.c:1117
#1  0x00000000712064d6 in recreate_app_state_internal (tdcontext=0x7ffdb3e401c0, mcontext=0x7ffdb3f1d020, just_pc=false, owning_f=0x0, restore_memory=false) at /home/prasun/dynamorio/core/translate.c:1480
#2  0x0000000071206aa4 in recreate_app_state (tdcontext=0x7ffdb3e401c0, mcontext=0x7ffdb3f1d020, restore_memory=false, f=0x0) at /home/prasun/dynamorio/core/translate.c:1623
#3  0x00000000711dcfc1 in at_safe_spot (trec=0x7ffdb3e40dc8, mc=0x7ffdb3f1d020, desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER) at /home/prasun/dynamorio/core/synch.c:613
#4  0x00000000711de4b2 in synch_with_thread (id=152173, block=false, hold_initexit_lock=true, caller_state=THREAD_SYNCH_NONE, desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER, flags=2) at /home/prasun/dynamorio/core/synch.c:1031
#5  0x00000000711dfc42 in synch_with_all_threads (desired_synch_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER, threads_out=0x71650000 <flush_threads>, num_threads_out=0x71650008 <flush_num_threads>, cur_state=THREAD_SYNCH_NO_LOCKS_NO_XFER, flags=2)
    at /home/prasun/dynamorio/core/synch.c:1411
#6  0x0000000071080f5e in flush_fragments_synchall_start (ignored=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, exec_invalid=false) at /home/prasun/dynamorio/core/fragment.c:5953
#7  0x00000000710830fe in flush_fragments_synch_unlink_priv (dcontext=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, own_initexit_lock=false, exec_invalid=false, force_synchall=true) at /home/prasun/dynamorio/core/fragment.c:6475
#8  0x0000000071084225 in flush_fragments_in_region_start (dcontext=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, own_initexit_lock=false, free_futures=false, exec_invalid=false, force_synchall=true) at /home/prasun/dynamorio/core/fragment.c:6776
#9  0x00000000710849d0 in flush_fragments_from_region (dcontext=0x7ffdb3e3ca00, base=0x0, size=18446744073709551615, force_synchall=true, flush_completion_callback=0x7fffb3d6ea75 <change_warmup_callback(void*)>, user_data=0x0)
    at /home/prasun/dynamorio/core/fragment.c:6853
#10 0x00000000711fd774 in dr_flush_region_ex (start=0x0, size=18446744073709551615, flush_completion_callback=0x7fffb3d6ea75 <change_warmup_callback(void*)>, user_data=0x0) at /home/prasun/dynamorio/core/lib/instrument.c:6797
#11 0x00007fffb3d6ee18 in hit_warmup_threshold (next_pc=0x7ffff6275b3d "s\221H\203\372\300H\215\222\200") at /home/prasun/dynamorio/clients/drcachesim/tracer/tracer.cpp:1931
#12 0x00007fffb3d6af07 in memtrace (drcontext=0x7ffdb3e3ca00, skip_size_cap=false, next_pc=0x7ffff6275b3d "s\221H\203\372\300H\215\222\200") at /home/prasun/dynamorio/clients/drcachesim/tracer/tracer.cpp:763
#13 0x00007fffb3d6b1fb in clean_call (next_pc=0x7ffff6275b3d "s\221H\203\372\300H\215\222\200") at /home/prasun/dynamorio/clients/drcachesim/tracer/tracer.cpp:851
#14 0x00007fffb3fc47a4 in ?? ()
#15 0x00007ffff546d000 in ?? ()
#16 0x00007ffff54ee000 in ?? ()
#17 0x00007ffff5d69ef0 in ?? ()
#18 0x00007ffdb3f1e720 in ?? ()
#19 0x0000000000000000 in ?? ()

@derekbruening
Contributor

Translation failure. If it is reproducible, I would run with logging and find the instruction that fails to translate. Is it DR mangling and a case is missing in the translation code? Is it tool instrumentation?

@prasun3
Contributor Author

prasun3 commented Feb 11, 2022

It doesn't look like it is failing to translate a specific instruction. I think it is unable to find the mcontext->pc in the ilist.

I have attached the log: translate-failure.txt.

@prasun3
Contributor Author

prasun3 commented Feb 22, 2022

I was clearing op_L0_filter before calling dr_flush_region_ex. A fragment was getting re-translated on another thread while dr_flush_region_ex was in progress. Therefore the fragment length at fcache_fragment_pclookup was inconsistent with the fragment definition at recreate_app_state_from_ilist.

Clearing op_L0_filter after dr_flush_region_ex fixed the issue since the regenerated fragment is identical to the earlier fragment.

@prasun3
Contributor Author

prasun3 commented Feb 23, 2022

I now have some basic runs going through. I can start sending some PRs if the approach below looks fine. It is not perfect, but we could add this now and improve it gradually.

  • only offline traces are considered for now
  • added an option named L0_warmup_refs
  • added a separate suffix for the warmup trace, .warmup.raw instead of .raw
  • seeing some crashes with exit_after_tracing, so it is not allowed together with L0_warmup_refs for now
  • added a message to indicate when max_trace_size is reached (printed only once)
  • ignore size_cap in warmup mode since that applies only to the instruction trace
  • the first thread to reach the warmup threshold
    • writes the footer and closes the warmup trace
    • opens the instruction trace and writes its header
    • sets a flag indicating warmup has completed and the code cache flush has started
    • initiates the code cache flush
    • in the cache flush completion callback
      • clears op_L0_filter
      • clears memref_needs_full_info_
      • disables warmup instrumentation
      • enables instruction trace instrumentation
      • sets a flag indicating the cache flush is done
    • since the switch to the instruction trace occurs in the middle of a basic block, while writing the buffer to disk we need to drop entries until we get to a PC record
  • other threads drop their buffers if they see a code cache flush in progress
    • rare case where we started writing to a buffer in warmup mode and the code cache flush was completed by the time we got to the clean call that writes it to disk
    • seen only for the main thread, which is not doing much work, so its buffer contains only a header and footer
    • added handling to avoid post-processing failures
      • updated append_unit_header to look at the current value of op_L0_filter; instead it should look at how much space was reserved
      • could also drop such buffers
    • could add more detailed handling later if necessary
  • added the PC as an argument to clean_call so that we only do the code cache flush when we have the PC, which needs to be supplied to dr_redirect_execution in the mcontext
  • disabled short_reaches for now as it was causing the encode_cti assert with warmup instrumentation
    • will need a better approach long term; we can add a flag to control this for now

@derekbruening
Contributor

My main question here would be: should this feature use drbbdup for the two phases, rather than flushing?

We are currently implementing #3995 using the drbbdup library to provide multiple instrumentation cases dispatched depending on the current phase: count instructions, or produce a trace. This changes the current delayed tracing feature to no longer use a flush and use drbbdup instead.

Some of the problems are similar between the two:

  • How to flush other threads' buffers at the phase transition
  • How online traces handle phases

other threads drop their buffers if they see code cache flush in progress

How do other threads know a flush is in progress? Do they throw away the data and so there is a gap between the warmup and the full phase?

We are currently discussing how best to handle this for #3995. Will update that issue with the current plan.

rare case where we started writing to a buffer in warmup mode and the code cache flush was completed by the time we got to the clean call to write it to disk
seen only for the main thread, which is not doing much work and whose buffer contains only a header and footer

Not sure I understand this: I thought every thread's half-full buffer was discarded? Do you mean a thread might have never written its first buffer-full and so have empty output?

@prasun3
Contributor Author

prasun3 commented Feb 24, 2022

My main question here would be: should this feature use drbbdup for the two phases, rather than flushing?

Yes, I had considered that initially (#3995 (comment)) and we had discussed it also (https://groups.google.com/g/dynamorio-users/c/fn6LBUxTd2o/m/98l0bK1BAQAJ). At the time we decided that flushing was okay for now since there was some work needed for drbbdup. Sounds like it is in much better shape now?

How do other threads know a flush is in progress? Do they throw away the data and so there is a gap between the warmup and the full phase?

They check a flag when they are in the memtrace clean call. The flag is set by the thread that initiated the switch. Yes they throw away the data and there is a gap between the two phases.

Not sure I understand this: I thought every thread's half-full buffer was discarded? Do you mean a thread might have never written its first buffer-full and so have empty output?

In this case the thread was mostly idle, so the buffer had a header and discarding it completely would create an incorrectly formatted trace.

@derekbruening
Contributor

PR #5393 added drbbdup to drmemtrace for delayed tracing; PR #5402 is adding alternating trace-notrace windows. Per #5393 (comment) the plan is to then refactor tracer.cpp to split into more manageable pieces, if you were interested in switching to a drbbdup-based scheme.

prasun3 added a commit that referenced this issue Nov 8, 2022
Add a separate L0_filter mode to enable switching from warmup/L0_filter
mode to the unfiltered mode.

Issue: #5199
@prasun3 prasun3 linked a pull request Nov 8, 2022 that will close this issue
@prasun3
Contributor Author

prasun3 commented Nov 8, 2022

I made a change to introduce an 'L0_filter' mode so that we can switch from the warmup trace to the full trace. I am only creating this if the need_l0_filter_mode flag is set. I will set it in the proposed `L0_warmup` mode. How does this sound?

@derekbruening
Contributor

I will take a look at the code.

The drbbdup modes are very convenient but note that there is still some missing stability work in getting drbbdup scaled up: there is one known stability issue #5686 for scatter/gather plus at least one more similar one not yet figured out. Has drbbdup worked for your workload targets or have you also seen stability issues?

@abhinav92003
Contributor

@prasun3 In your design, the final trace will be a combination of the warmup trace and full trace. Would you also want to split the two parts into two separate traces? Or would you continue to have a single combined trace?

prasun3 added a commit that referenced this issue Nov 17, 2022
Added 'L0_warmup_refs' cmd line option and changes to address review
feedback.

Issue: #5199
prasun3 added a commit that referenced this issue Nov 21, 2022
Changed cmdline option from L0_warmup_refs to L0_filter_until_instrs
plus minor related changes.

Issue: #5199
@prasun3
Contributor Author

prasun3 commented Nov 21, 2022

@abhinav92003 In my first implementation I was flushing the code cache. I am redoing the code changes to use drbbdup. In the older implementation I am creating separate traces. The first thread to reach the warmup limit initiates the switch. The other threads drop any buffers that contain a mix of warmup and full trace. Due to this we have a gap in between the warmup and full trace.

Since I am re-doing this with drbbdup, I am open to either keeping it combined or split it. The problem with keeping a single file is that raw2trace gets confused between the two types of entries. I saw something about a FILTER_ENDPOINT marker in #5742. I'd be interested in learning how to insert that marker precisely in all the thread buffers.

Has drbbdup worked for your workload targets or have you also seen stability issues?

@derekbruening sorry forgot to reply to this. We haven't done much testing with this mode.

@abhinav92003
Contributor

The record_filter tool implemented in #5675 allows filtering records in a stored offline trace. It has a feature to perform cache filtering till a certain timestamp is reached in the trace, which could be useful to create a warm-up trace preceding the full trace.

In #5738, I was extending the record_filter tool to also split the trace into two parts (warmup and full trace), but we decided to go in a different direction where the trace reader figures out the end of the warmup region using the new FILTER_ENDPOINT marker inserted during filtering (this obviates the need to actually separate the trace files).

Since all of this filtering is done after raw2trace, we avoid some complexities related to raw2trace getting confused.

I see that your design does the filtering online. One benefit I can think of is that your design also reduces tracing overhead while collecting the warm up trace (due to writing out fewer trace entries to memory).

@prasun3
Contributor Author

prasun3 commented Dec 9, 2022

Ok, so in your case you have one 'full trace' raw file which is post-processed to produce a trace.gz file. Then this trace.gz file is further filtered to produce a new trace.gz file containing the following:

  1. filtered entries
  2. then a FILTER_ENDPOINT marker
  3. and then followed by the full trace entries

Is this correct?

Yes, by filtering online we want to reduce tracing overhead as well as disk usage which can be quite large at larger thread counts.

When we change tracing_mode inside a clean call, when do we start executing from the new instrumentation? Do we 'return' from the clean call to the new instrumentation, or does the switch happen at the next dispatch? When do the other threads do the switch? I need to understand this in order to insert the endpoint marker at the appropriate point in the trace.

prasun3 added a commit that referenced this issue Dec 29, 2022
Documentation, help string and other minor updates.

Issue: #5199
prasun3 added a commit that referenced this issue Dec 29, 2022
prasun3 added a commit that referenced this issue Dec 30, 2022
- use filter mode with align_endpoint also
- don't increment bytes_written in filter mode
- documentation update

Issue: #5199
prasun3 added a commit that referenced this issue Jan 11, 2023
prasun3 added a commit that referenced this issue Feb 27, 2023
Removed a duplicate change.
Disabling warmup tests (which are known to fail) to ensure nothing else
is failing.

Issue: #5199
prasun3 added a commit that referenced this issue Apr 17, 2023
prasun3 added a commit that referenced this issue May 8, 2023
Enabled tests that were failing due to #5906

Issue: #5199
prasun3 added a commit that referenced this issue May 8, 2023
Updated test regex to handle case where tracing windows occur after the
"Hello, World!" message.

Issue: #5199
prasun3 added a commit that referenced this issue May 8, 2023
Updated test regex to handle case where tracing windows occur after the
"Hello, World!" message.

Issue: #5199
prasun3 added a commit that referenced this issue May 23, 2023
 - moved code that compares tracing_window to local window to a separate function
   - use this function to check for tracing_mode change also
 - made saw_filter_endpoint_ per thread
 - added tests for multi-threaded app
 - added an optional parameter to increase the run duration of the pthreads test

Issue: #5199
prasun3 added a commit that referenced this issue May 23, 2023
prasun3 added a commit that referenced this issue Jun 7, 2023
prasun3 added a commit that referenced this issue Jun 7, 2023
prasun3 added a commit that referenced this issue Jun 8, 2023
Disabled warmup-pthreads on Windows since the pthreads exe is not built.

Issue: #5199
prasun3 added a commit that referenced this issue Jun 19, 2023
Updated regex to make "Hit max global refs" message optional since we
cannot guarantee the test won't end before that message is shown.

Issue: #5199
prasun3 added a commit that referenced this issue Jun 22, 2023
Comments and other minor changes to address review comments.

Issue: #5199
prasun3 added a commit that referenced this issue Jun 27, 2023
Fixed build error caused by an upstream namespace change.

Issue: #5199
prasun3 added a commit that referenced this issue Jun 27, 2023
Removed saw_filter_endpoint_ flag

Issue: #5199
prasun3 added a commit that referenced this issue Jun 30, 2023
Instead of discarding mixed buffers that contain both filtered and
unfiltered records, look for the first unfiltered PC record and insert
the marker there.

Issue: #5199
prasun3 added a commit that referenced this issue Jul 4, 2023
prasun3 added a commit that referenced this issue Jul 4, 2023
prasun3 added a commit that referenced this issue Jul 11, 2023
 - added comments clarifying the FILTER_ENDPOINT insertion code
 - added namespace to doxygen comments
 - changed write_trace_data to output_buffer
 - use TIDFMT

Issue: #5199
prasun3 added a commit that referenced this issue Jul 12, 2023
Clarified why we might have filtered entries in the buffer during mode
switch.

Issue: #5199
github-merge-queue bot pushed a commit that referenced this issue Jul 14, 2023
Add a separate L0_filter mode to enable switching from warmup/L0_filter
mode to the unfiltered mode.

Added an option (-L0_filter_until_instrs) which specifies the number of
instructions for the warmup trace.

In warmup mode, we filter accesses through the -L0{D,I}_filter caches.
If neither -L0D_filter nor -L0I_filter are specified then both are
assumed to be true. The size of these can be specified using
-L0{D,I}_size.

The filter instructions come after the -trace_after_instrs count and
before the instruction trace. This is intended to be used together with
other trace options (e.g., -trace_for_instrs, -exit_after_tracing,
-max_trace_size etc.) but with the difference that a filter trace is
also collected. The filter trace and full trace are stored in a single
file separated by a TRACE_MARKER_TYPE_FILTER_ENDPOINT marker.

When used with windows (i.e., -retrace_every_instrs), each window
contains a filter trace and a full trace. When windows are contained
in a single file, the TRACE_MARKER_TYPE_WINDOW_ID markers indicate the
start of the filtered records.

The following items are not addressed in this PR
- `-align_endpoints` test is not added
- new OFFLINE_FILE_TYPE_BIMODAL type to indicate that the file has both
filtered and unfiltered entries (#6164)
- items not tested
  - handling windows with single-file all-in-one traces
  - online trace
  - ifilter only or dfilter only
- use cases without `-trace_after_instrs`: tracing always starts in full
trace mode, so the tests fail if we use this option; need to understand
why this is happening

Issue: #5199
prasun3 added a commit that referenced this issue Oct 20, 2023
…6368)

For BIMODAL traces (see
#5199), the normal
transition is from COUNT to L0_FILTER to TRACE. But some threads may
transition from COUNT to TRACE directly, e.g., if they were not scheduled
during L0_FILTER mode. In this case, the FILTER_ENDPOINT marker is not
inserted currently, which leads to raw2trace failures.

An example cmd line that reproduces the bug is shown below. Note that
there is only one worker thread which transitions through L0_FILTER and
TRACE mode while the main thread is still in COUNT mode.

```
bin64/drrun -t drcachesim -offline \
-trace_after_instrs 2200K -L0_filter_until_instrs 10K -max_trace_size 10K -- ./threadsig 1 1000
```

Issue: #6367