Instruction markers with GCC vs MSVC x86
Theory: Function match quality can be estimated by finding instruction "markers", saving their relative location (in %), and comparing it with the other side. The more markers are placed similarly on both sides, the better the match quality.
function length
The function length can be a matching quality factor.
function proximity
The function's proximity to other functions can be a matching quality factor.
mov or push global string offsets
Static strings are typically identical.
mov or push constant value (the closer the better)
The closer the value, the better the match!
Small numbers (below 8 or so) may need to be ignored, because compilers generate noise with small numbers.
mov global data offsets into registers
It would help massively to know the symbol of the data, but can we know it?
mov, cvtsi2ss, fild dword ptr [register + offset (the closer the better)]
The closer the offset value, the better the match!
call C functions
Calling a C function which we may know the symbol for.
call 4 byte address
Calling a 4 byte address to somewhere.
It would help massively to know the function symbol name, but that is a chicken-and-egg problem...
call dword ptr [register + offset (the closer the better)]
The closer the value, the better the match! Offsets are very close.
Call operands can be optimized differently, so they are not straightforward to compare.
jumps
The number of jump instructions (of any kind) at given locations can be a matching quality factor.
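The marker idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the functions are given as plain lists of mnemonics, the helper names `relative_positions` and `placement_score` are made up for this example, and the 10 % tolerance is an arbitrary starting value.

```python
def relative_positions(mnemonics, is_marker):
    """Return the position of every marker as a percentage of the function length."""
    n = len(mnemonics)
    if n == 0:
        return []
    return [100.0 * i / n for i, m in enumerate(mnemonics) if is_marker(m)]


def placement_score(a, b, tolerance=10.0):
    """Fraction of markers that have a counterpart within `tolerance` percent.

    Both lists must be sorted ascending, which relative_positions guarantees.
    """
    if not a and not b:
        return 1.0
    i = j = matched = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matched / max(len(a), len(b))


# Hypothetical mnemonic streams of the same function built by two compilers.
gcc_side = ["push", "mov", "call", "mov", "jmp", "call", "ret"]
msvc_side = ["push", "mov", "mov", "call", "jmp", "mov", "call", "pop", "ret"]

is_call = lambda m: m == "call"
score = placement_score(relative_positions(gcc_side, is_call),
                        relative_positions(msvc_side, is_call))
print(f"call marker placement score: {score:.2f}")
```

The same scoring can be run once per marker kind (calls, jumps, string loads, and so on) and the results combined into one overall score.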
We need a way to discover all function symbols in a target executable from a source executable.
Possible directions are:
Mach-O to x86 Release.
x86 Release to x86 Debug.
Step 1
Discover the addresses of functions in the target executable(s). This simply gives a long list of function addresses without knowing what these functions are.
Potential approach 1
Crawl over the assembly instructions beginning from the entry point and enter every call and jump once to try to discover all function addresses.
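A minimal sketch of such a crawl, assuming the instructions have already been decoded into a table of `address -> (mnemonic, size, branch target)`. The table layout and the name `crawl_functions` are assumptions made for this example. A real implementation would sit on top of a disassembler.

```python
from collections import deque


def crawl_functions(instructions, entry):
    """Follow calls and jumps from `entry` and return the discovered function addresses.

    `instructions` maps address -> (mnemonic, size, target), where target is
    None for non-branching instructions and for indirect branches.
    """
    functions = {entry}
    visited = set()
    work = deque([entry])
    while work:
        addr = work.popleft()
        # Walk straight-line code until it ends or reaches code seen before.
        while addr in instructions and addr not in visited:
            visited.add(addr)
            mnemonic, size, target = instructions[addr]
            if mnemonic == "call" and target is not None:
                functions.add(target)   # a call target starts a function
                work.append(target)
            elif mnemonic.startswith("j") and target is not None:
                work.append(target)     # a jump target is code, not a new function
            if mnemonic in ("ret", "jmp"):
                break                   # control flow does not fall through
            addr += size
    return sorted(functions)


# Hypothetical toy program.
program = {
    0x1000: ("call", 5, 0x2000),
    0x1005: ("jz", 2, 0x100C),
    0x1007: ("call", 5, 0x3000),
    0x100C: ("ret", 1, None),
    0x2000: ("call", 5, 0x3000),
    0x2005: ("ret", 1, None),
    0x3000: ("ret", 1, None),
    0x4000: ("ret", 1, None),  # only reachable through an indirect call
}
print([hex(a) for a in crawl_functions(program, 0x1000)])
```

Functions that are only reached through indirect calls, such as virtual functions or callbacks, are not found this way, so this approach likely needs to be combined with one of the others.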
Potential approach 2
Crawl over all assembly instructions from top to bottom and detect function headers and footers of functions that have them.
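As a rough sketch of header detection, the snippet below scans raw code bytes for the classic x86 frame setup `push ebp; mov ebp, esp`, which has two valid encodings. The name `find_prologues` and the sample bytes are assumptions for illustration only.

```python
# push ebp = 55; mov ebp, esp = 8B EC or 89 E5 (both encodings are valid).
PROLOGUES = (b"\x55\x8b\xec", b"\x55\x89\xe5")


def find_prologues(code, base=0):
    """Return the addresses of all byte sequences that look like a frame setup."""
    hits = []
    for pattern in PROLOGUES:
        start = 0
        while (pos := code.find(pattern, start)) != -1:
            hits.append(base + pos)
            start = pos + 1
    return sorted(hits)


# Hypothetical code section: nop, function, int3 padding, function.
section = b"\x90\x55\x8b\xec\xc3\xcc\x55\x89\xe5\xc3"
print([hex(a) for a in find_prologues(section, base=0x401000)])
```

A plain byte search can also hit data or the middle of a longer instruction, and it misses functions compiled without a frame pointer, so the hits are candidates rather than confirmed function starts.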
Potential approach 3
Export a list of function addresses from a disassembler tool if it can generate it already.
Check quality
To verify the quality of the address list, compare it with the addresses known from the disassembler tool.
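One way to put numbers on this check is to treat the disassembler's list as the reference and compute precision and recall. This is a sketch, and `address_list_quality` is a name invented for this example.

```python
def address_list_quality(found, reference):
    """Compare discovered function addresses with a reference list."""
    found, reference = set(found), set(reference)
    hits = found & reference
    return {
        # Share of discovered addresses that the reference confirms.
        "precision": len(hits) / len(found) if found else 0.0,
        # Share of reference addresses that were discovered.
        "recall": len(hits) / len(reference) if reference else 0.0,
        "missed": sorted(reference - found),
        "spurious": sorted(found - reference),
    }


# Hypothetical address lists.
report = address_list_quality(
    found=[0x1000, 0x2000, 0x2500],
    reference=[0x1000, 0x2000, 0x3000, 0x4000],
)
print(report)
```

Low recall means functions are being missed. Low precision means the crawler reports addresses that are not function starts, assuming the reference list itself is complete.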
Step 2
Discover names and properties for the function addresses in the target executable(s). This gives a long list of function properties for the known function addresses.
Possible approach 1
Process the assembler instructions of the source and target executables and break them down into simplified instructions that can be matched across the executables. Pick a function of the source executable and try to match it with all functions of the target executable. Give scores for the match quality and let the best score win. Search locality and score optimizations can apply if nearby function addresses are already known.
Simplified instructions can contain, among others:
Ordered combinations of simplified universal instructions should give a function a somewhat unique fingerprint. The more complex the function is, the better the match quality can be. Simplified instructions may not compare optimally when compilers use different strategies to generate code, such as loop unrolling or wild jumps. In that case the matching strategy needs to be smart and lenient. In theory, even a bad matching score could still identify the right winner.
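A small sketch of this approach, with assumptions stated up front: the simplification table `SIMPLIFY` is invented for illustration and is not the list of simplified instructions meant above, and Python's `difflib.SequenceMatcher` stands in for whatever lenient sequence comparison ends up being used.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from x86 mnemonics to simplified universal instructions.
SIMPLIFY = {
    "mov": "MOVE", "lea": "MOVE",
    "push": "STACK", "pop": "STACK",
    "add": "ARITH", "sub": "ARITH",
    "cmp": "TEST", "test": "TEST",
    "call": "CALL", "ret": "RET",
}


def simplify(mnemonics):
    """Reduce a mnemonic list to simplified instructions; all jumps become JUMP."""
    return ["JUMP" if m.startswith("j") else SIMPLIFY.get(m, "OTHER")
            for m in mnemonics]


def best_match(source_fn, target_fns):
    """Score one source function against every target function; best score wins."""
    src = simplify(source_fn)
    scores = {
        addr: SequenceMatcher(None, src, simplify(body), autojunk=False).ratio()
        for addr, body in target_fns.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical functions: one from the source, two candidates in the target.
source = ["push", "mov", "cmp", "jne", "call", "pop", "ret"]
targets = {
    0x401000: ["push", "mov", "sub", "mov", "ret"],
    0x401020: ["push", "lea", "test", "jz", "call", "pop", "ret"],
}
winner, scores = best_match(source, targets)
print(hex(winner), scores)
```

Because the score is a ratio of matching subsequences, a function with an unrolled loop or reordered blocks still receives partial credit instead of failing outright.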
Possible approach 2
Train an AI to understand how a function body from a source executable maps to a target executable. To do this, we would need to compile programs with the relevant compilers, crawl through their symbols, and train the AI to recognize how the function instructions of the source executable translate to the target executable. After training is complete, the AI model can be used to match all functions of the target executable.
Check quality
To verify the quality of the matching, check for collisions, that is, several source functions matched to the same target function. If too many matches collide, the matching is not right.
Also, compare it with the addresses and symbols from the disassembler that have been mapped by hand already.
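The collision check could look roughly like this, assuming the matching result is a mapping from source symbol to target address. The function names and the sample symbols are hypothetical.

```python
from collections import defaultdict


def find_collisions(matches):
    """Return target addresses that more than one source function was matched to."""
    by_target = defaultdict(list)
    for source, target in matches.items():
        by_target[target].append(source)
    return {t: sorted(s) for t, s in by_target.items() if len(s) > 1}


def collision_rate(matches):
    """Fraction of source functions that share their target with another one."""
    if not matches:
        return 0.0
    colliding = sum(len(s) for s in find_collisions(matches).values())
    return colliding / len(matches)


# Hypothetical matching result: source symbol -> target address.
matches = {
    "Foo::Update": 0x401000,
    "Foo::Render": 0x401200,
    "Bar::Update": 0x401000,
    "Baz::Init": 0x401400,
}
print(find_collisions(matches), collision_rate(matches))
```

What counts as "too many" is left open here. A threshold would have to be picked by comparing the rate against the functions that were already mapped by hand.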