Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The generated parse table consumes too much memory #40

Open
travitch opened this issue May 19, 2022 · 12 comments
Open

The generated parse table consumes too much memory #40

travitch opened this issue May 19, 2022 · 12 comments

Comments

@travitch
Copy link
Contributor

travitch commented May 19, 2022

The parse tables occupy about 400MB in memory after they are constructed, as can be seen in this profile collected by @RyanGlScott: verify-RSA.saw.pdf. There are two factors to this memory consumption:

  1. The parse tables include all permutations of prefix bytes
    -- We calculate all allowed prefixes for the instruction in the first
    -- argument. This simplifies parsing at the cost of extra space.
    mkOpcodeTable :: [Def] -> ParserGen OpcodeTable
    mkOpcodeTable defs = go [] (concatMap allPrefixedOpcodes defs)
    where -- Recursive function that generates opcode table by parsing
    -- opcodes in first element of list.
  2. The parser is encoded as a graph of Haskell objects
    data OpcodeTable
    = OpcodeTable !NextOpcodeTable
    | SkipModRM !Prefixes !Def
    | ReadModRMTable !(V.Vector ModTable)
    | ReadModRMUnchecked !ModTable
    deriving (Generic)
    instance DS.NFData OpcodeTable
    -- | A NextOpcodeTable describes a table of parsers to read based on the bytes.

Addressing the former is tricky. One could use a simple DFA to parse prefix bytes separately to save an enormous amount of space. However, not all prefixes are valid for all instructions; those restrictions are currently properly encoded in the fully elaborated tables. To separate out prefix parsing, it would be necessary to add a post-parsing check to see if the parse was valid or not.

Addressing the latter might be less tricky, as we could change the representation of the tables. Another disassembler uses a mostly unboxed structure: https://github.com/travitch/dismantle/blob/48433e7ccb02924b2f4695c8c9f09fb9cfccdfc4/dismantle-tablegen/src/Dismantle/Tablegen/LinearizedTrie.hs#L34. The x86 case is a bit trickier as the parser has more states than the parsers generated by dismantle. However, we might be able to take inspiration from the more compact parse table representation and adapt it for flexdis.

@RyanGlScott
Copy link
Contributor

I've been working on a fix for (1)—that is, de-duplicating all of the different combinations of prefixes from the opcode table—for a while. I've pushed my current WIP branch here in case you'd like to follow along. That branch is a bit messy at the moment, since I'm adding a separate disassembler, named defaultX64Disassembler', that de-duplicates prefixes alongside the existing defaultX64Disassembler. I've also copied a fair bit of related functionality, giving them apostrophe suffixes as well to distinguish them from the existing functionality. Once I'm confident that the new disassembler is bug-free, I'll consolidate all of these functions.

First, here is the general approach that I'm taking in this branch:

  1. The structure of the opcode table has changed from this:

    data OpcodeTable
       = OpcodeTable !NextOpcodeTable
       | SkipModRM !Prefixes !Def
       | ReadModRMTable !(V.Vector ModTable)
       | ReadModRMUnchecked !ModTable

    To this:

    data OpcodeTable'
       = OpcodeTable' !NextOpcodeTable'
       | OpcodeTableEntry ![Def] -- Defs expecting a ModR/M byte
                          ![Def] -- Defs not expecting a ModR/M byte

    Previously, the OpcodeTable data constructor was used any time there was an instruction opcode or a prefix byte. In the new design, the OpcodeTable' data constructor is only used for instruction opcodes—prefixes are not encoded into the structure of the table at all.

    One consequence of this choice is that there can sometimes be multiple instruction definitions that use a particular set of opcodes. For instance, both the add and the vpshufb instructions use 00 as their opcode. For this reason, the OpcodeTableEntry data constructor, which represents a leaf entry in the opcode table, must contain a list of Defs rather than just a single one. In practice, these lists are quite small, and we may want to consider using a SmallArray to represent them.

    Once we have an OpcodeTableEntry, we then use the list of parsed prefixes to disambiguate among all of the potential Defs in the list. (Actually, there are two lists, since Defs with a ModR/M byte have to be disassembled in a slightly different way. In principle, we could combine these into a single list, however.) Implementing the function that performs this disambiguation is one of the main challenges of this patch—more on this in a bit.

  2. Instead of encoding all possible combinations of prefixes in the branching structure of the opcode table, we instead perform a parsing pass upfront that parses as many prefixes as possible before parsing instruction opcodes. Here is what this parsing pass looks like at the moment:

    loopPrefixBytes :: Seq.Seq Word8 -> m InstructionInstance
    loopPrefixBytes prefixBytes = do
      b <- readByte
      if |  b `elem` simplePrefixBytes
         -> loopPrefixBytes (prefixBytes Seq.|> b)
    
         |  b `elem` segPrefixBytes
         -> loopPrefixBytes (prefixBytes Seq.|> b)
    
         |  b `elem` rexPrefixBytes
         -> loopPrefixBytes (prefixBytes Seq.|> b)
    
            -- Two-byte VEX prefix
         |  b == 0xc5
         -> do b2 <- readByte
               loopPrefixBytes (prefixBytes Seq.>< Seq.fromList [b, b2])
    
            -- Three-byte VEX prefix
         |  b == 0xc4
         -> do b2 <- readByte
               b3 <- readByte
               loopPrefixBytes (prefixBytes Seq.>< Seq.fromList [b, b2, b3])
    
         |  otherwise
         -> loopOpcodes prefixBytes tr0 b -- Disassemble instruction based on opcodes

    If this looks too simplistic to work, that's because that is. More on this later.

  3. Once we have all of the prefixes and instruction opcodes, we have to use the prefixes to disambiguate among all of the possible Defs. My branch contains a validatePrefixBytes function that performs this disambiguation. The implementation is a bit too long to go into here, but among other things, it checks that:

    • An instruction actually accepts all of the parsed prefixes
    • If the instruction has a required prefix, then that prefix has been parsed
    • The instruction has appropriate address and operand sizes as dictated by the prefixes (see the validPrefix funtion)
    • etc.

    There are an enormous number of possible checks that we could put into this function—see this page for some ideas. It might be best to only implement the checks we need to disambiguate instructions and add more later if they become necessary.

  4. Remove some ugly hacks in the XML file representing all x86_64 instructions (see data/optable.xml). For instance, there are quite a few nop definitions that have dummy 66 prefixes to accommodate the way opcode table parsing currently works, accompanied by this comment:

    flexdis86/data/optable.xml

    Lines 5249 to 5254 in 7109bdc

    <!-- TODO: this is a dumb hack to deal with multi-byte nops
    get rid of it when general prefix repetition is handled -->
    <def>
    <pfx>rexr rexx rexb</pfx>
    <opc>66 66 2e 0f 1f</opc>
    <opr>M</opr>

    Since we now parse prefixes upfront, I believe this hack should no longer be necessary, so we can remove these silly 66-prefixed definitions.

    Another similar 66 prefix hack can be found in one of the definitions of xchg:

    <opc>66 90</opc>

    Similarly, I think we can just remove the 66 here now that we handle prefixes differently.

This approach was enough to get nearly the entire flexdis86 test suite to pass, save for one exception:

      pause:                                    FAIL
        Exception: TODO RGS: No parse
        CallStack (from HasCallStack):
          error, called at src/Flexdis86/Disassembler.hs:1115:10 in flexdis86-0.1.5-inplace:Flexdis86.Disassembler
        Use -p '/pause/' to rerun this test only.

Let's talk about why this happens. The prefix parsing approach that I took in step (2) makes a key assumption: bytes that can be used as prefixes will never be used as the first byte in an instruction's opcodes. After all, the only way we know how to stop disassembling prefixes is to encounter a byte that isn't in the set of known prefix bytes, which we interpret as an instruction's first opcode byte. Unfortunately, this assumption turns out not to be true. In the case of the pause instruction, we have:

flexdis86/data/optable.xml

Lines 5640 to 5645 in 7109bdc

<instruction>
<mnemonic>pause</mnemonic>
<def>
<opc>f3 90</opc>
</def>
</instruction>

But the 0xf3 byte is also used for the repz prefix:

, ("repz", ([0xf3], set prLockPrefix RepZPrefix))

As a result, we mistakenly parse pause's first instruction opcode as a prefix, which prevents us from finding pause in the opcode table later. One idea for working around this is to check if an instruction's opcodes start with bytes that could be interpreted as prefixes, and if so, "backtrack" through the list of parsed prefixes to remove the prefixes that were mistakenly classified as prefixes. This is essentially how the Haskell disassembler library handles parsing pause.

VEX prefixes (which the disassembler library does not handle, AFAICT) complicate matters further. The lds and les instructions have opcodes which clash with 0xc5 and 0xc4, the two-byte and three-byte VEX prefixes, respectively:

flexdis86/data/optable.xml

Lines 4256 to 4263 in 7109bdc

<instruction>
<mnemonic>lds</mnemonic>
<def>
<pfx>aso oso</pfx>
<opc>c5 /m=!64</opc>
<opr>Gv M</opr>
</def>
</instruction>

flexdis86/data/optable.xml

Lines 4274 to 4281 in 7109bdc

<instruction>
<mnemonic>les</mnemonic>
<def>
<pfx>aso oso</pfx>
<opc>c4 /m=!64</opc>
<opr>Gv M</opr>
</def>
</instruction>

What's more, these prefixes are expected to be followed by some number of additional bytes, so not only would need to "backtrack" over parsing the VEX prefix, we would also need to backtrack over parsing the additional bytes that follow the prefix. Moreover, since lds and les' opcodes are each one byte, the additional bytes that we mistakenly parse upfront would correspond to the operands to the instruction! Hoo boy. It's a bit scary that the flexdis86 test suite doesn't catch this.

In short, I think we need some kind of way to perform this backtracking. I have some ideas for how to do this, but before I set out on implementing this idea, I wanted to do a quick sanity check with @travitch to make sure I'm on the right path. Can you see a simpler way to solve these problems?

@travitch
Copy link
Contributor Author

I wonder if, instead of "backtracking", we could just remember the last byte in the prefixes we parsed, and incorporate those "last bytes" into the validity/opcode checks.

I also think we might need to be a bit liberal in accepting prefixes that are invalid if the parse is otherwise unambiguous (e.g., see https://repzret.org/p/repzret/)

@RyanGlScott
Copy link
Contributor

I wonder if, instead of "backtracking", we could just remember the last byte in the prefixes we parsed, and incorporate those "last bytes" into the validity/opcode checks.

I considered this, but one complication with this idea is that the opcode table encodes the invariant that every instruction is reachable by a path containing at least one OpcodeTable data constructor. If this invariant is violated, you will reach one of these error cases:

SkipModRM {} -> Left "Unexpected SkipModRM as a top-level disassemble result"
ReadModRMTable{} -> Left "Unexpected ReadModRM as a top-level disassemble result"
ReadModRMUnchecked{} -> Left "Unexpected ReadModRM as a top-level disassemble result"

(In my branch, there is a corresponding error case for OpcodeTableEntry.)

For instructions like lds and les, the opcode consists of only a single byte. If we parse this byte as a prefix, then there are no bytes remaining to use for the OpcodeTable path, so we would not be able to store these instructions in the table. The backtracking idea is the only thing I can think of for making this work. That is, encode lds and les' opcodes into the the table as normal, parse its opcode as prefix bytes, and then when we try to check that we have the lds or les instruction, backtrack over the prefix bytes so that we can look them up in the opcode table as intended.

I also think we might need to be a bit liberal in accepting prefixes that are invalid if the parse is otherwise unambiguous (e.g., see https://repzret.org/p/repzret/)

Huh, that's an interesting read. I don't think this particular example poses an issue for my branch, as it can successfully disassemble f3 c3 using the set of validity checks that I have currently. But this is a worthwhile cautionary tale to not go overboard with adding too many additional validity checks, lest we reject examples like this one.

@travitch
Copy link
Contributor Author

Regarding repz ret, I bet the data file specifies repz as an allowed prefix when it really isn't (to actually parse this correctly). I could be wrong - that is just a guess.

@RyanGlScott
Copy link
Contributor

Regarding repz ret, I bet the data file specifies repz as an allowed prefix when it really isn't (to actually parse this correctly).

This seems likely, as this prefix was added for... reasons in b625205. Nevertheless, I'm not too bothered by this, as this is a hack that would be needed in both the current and new designs. (If it were a hack that was only needed in one particular design, that would be a bit more eyebrow-raising.)

@travitch
Copy link
Contributor Author

This might be effectively backtracking, but there could be a PseudoOpcodeTable constructor that acts mostly like OpcodeTable, except it means "Dispatch on the last byte decoded from a prefix". I'm browsing through to see how to implement that

@RyanGlScott
Copy link
Contributor

RyanGlScott commented Jun 21, 2022

This might be effectively backtracking, but there could be a PseudoOpcodeTable constructor that acts mostly like OpcodeTable, except it means "Dispatch on the last byte decoded from a prefix".

That might work, although things would get complicated for instructions whose operands are parsed eagerly as prefixes due to VEX. I was thinking of instead having a newtype like this:

newtype BacktrackingByteReader m a = BacktrackingByteReader (StateT [Word8] m a)

And giving it a ByteReader instance such that when readByte is called, it will read it from the [Word8] if it is non-empty (and pop off the byte afterward), and disassemble from bytes otherwise. This would avoid needing to change any of the operand-dissembling code to be aware of backtracking, as we could continue to program against the polymorphic ByteReader interface.

@travitch
Copy link
Contributor Author

Just to record it, we talked about having a separate set of parse tables for VEX instructions that would be consulted iff the VEX prefix is in the set of parsed prefixes. That would avoid a need for backtracking in those cases.

It would be nice to know if there are cases besides pause that exist in the rest of the instruction space. pause is truly unfortunate, because it really isn't an instruction - it is just repz nop, which is treated as a special hint named pause. If there are a small set of such instructions (where they are just aliases), we may want to consider instead separating those out from the data table and just using them to guide pretty printing (rather than decoding).

Operationally, that would mean parsing f3 90 as a NOP, but noticing that it can be rendered as pause and fixing it up after the fact.

@RyanGlScott
Copy link
Contributor

RyanGlScott commented Jun 24, 2022

Just to record it, we talked about having a separate set of parse tables for VEX instructions that would be consulted iff the VEX prefix is in the set of parsed prefixes. That would avoid a need for backtracking in those cases.

As an experiment, I pushed a branch here that encodes VEX prefix bytes into the opcode table alongside the instruction opcodes to avoid any conflicts with instructions like lds and les. Here is how large the opcode table is on that branch:

λ> nextOpcodeSize' defaultX64Disassembler'
13143

For comparison, here is how large it is on the main branch:

λ> nextOpcodeSize defaultX64Disassembler
5869432

And here is how large it is after removing VEX prefixes:

λ> nextOpcodeSize' defaultX64Disassembler'
1409

@RyanGlScott
Copy link
Contributor

RyanGlScott commented Jun 24, 2022

That being said, one thing about lds and les that I did not realize until recently is that they only work in 32-bit mode, and since defaultX64Disassembler filters out instructions that don't work in 64-bit mode, they aren't even included in the table to begin with. As a result, I'm forced to partially retract my "hoo boy" comment from #40 (comment).

Speaking of which, are there other instructions whose opcodes conflict with bytes used for prefixes? Thankfully, les and lds are the only ones that conflict with VEX prefixes. What about the remaining 16 bytes used for prefixes? I audited data/optable.xml, and here are the results:

Segment prefixes

No conflicts in any of 0x26, 0x2e, 0x36, 0x3e, 0x64, or 0x65, thankfully enough.

"Simple" prefixes

  • 0x66: As noted in The generated parse table consumes too much memory #40 (comment), there were several hacky occurrences of nop and xchg that had 0x66 placed in front of their opcodes to work around limitations of the current approach to disassembling, but these hacks can be removed with the new approach.

    Another conflict is vpcmpgtd:

    <instruction>
        <mnemonic>vpcmpgtd</mnemonic>
        <class>avx</class>
        <def>
            <opc>/vex=NDS.128.66.0F.WIG 66</opc>
            <opr>Vx Hx Ux</opr>
        </def>
    </instruction>
  • 0x67: No conflicts

  • 0xf0: No conflicts

  • 0xf2: The vpslld instruction has some conflicts:

    <instruction>
      <mnemonic>vpslld</mnemonic>
      <class>avx</class>
      <def>
        <opc>/vex=NDS.128.66.0F.WIG F2</opc>
        <opr>Vx Hx Wx</opr>
      </def>
      ...
      <def>
        <opc>/vex=NDS.256.66.0F.WIG F2</opc>
        <opr>Vx Hx Wx</opr>
      </def>
      ...
    </instruction>
  • 0xf3: endbr32 and endbr64 conflict:

    <instruction>
      <mnemonic>endbr32</mnemonic>
      <def>
        <opc>F3 0F 1E FB</opc>
      </def>
    </instruction>
    
    <instruction>
      <mnemonic>endbr64</mnemonic>
      <def>
        <opc>F3 0F 1E FA</opc>
      </def>
    </instruction>

    As does vpsllq:

    <instruction>
        <mnemonic>vpsllq</mnemonic>
        <class>avx</class>
        <def>
            <opc>/vex=NDS.128.66.0F.WIG F3</opc>
            <opr>Vx Ux Wx</opr>
        </def>
        ...
    </instruction>

    And our old friend pause:

    <instruction>
        <mnemonic>pause</mnemonic>
        <def>
            <opc>f3 90</opc>
        </def>
    </instruction>

REX prefixes (64-bit mode only)

The inc instruction can have opcodes 0x40 through 0x47 and the dec instruction can have opcodes 0x48 through 0x4f, all of which are REX prefixes. Each instruction has different opcodes in 64-bit mode, however, so this issue shouldn't show up in the flexdis86 test suite for the reasons described above.

The vpclmulqdq also conflicts with the 0x44 REX prefix:

<instruction>
  <mnemonic>vpclmulqdq</mnemonic>
  <class>avx</class>
  <def>
    <opc>/vex=NDS.128.66.0F3A.WIG 44</opc>
    <opr>Vdq Hdq Wdq Ib</opr>
  </def>
</instruction>

Several of these issues would be avoided by encoding VEX prefixes into the opcode table. Moreover, the inc and dec instructions wouldn't pose issues in 64-bit mode. By my count, that only leaves endbr32, endbr64, and pause as potential sources of conflict.

@travitch
Copy link
Contributor Author

endbr32 and endbr64 are also special no-ops

RyanGlScott added a commit that referenced this issue Jun 29, 2022
Previously, the opcode lookup table would encode every possible permutation of
allowable prefixes for each instruction as a separate path. This is expensive
in both space and time, as observed in #40. The new approach taken in this
patch, as described in `Note [x86_64 disassembly]` in `Flexdis86.Disassembler`,
is to only encode the VEX prefixe and opcode bytes in the lookup table, leaving
out all other forms of prefix bytes entirely. Instead, disassembly will start
by eagerly parsing as many prefix bytes as possible, proceeding to parse opcode
bytes after the first non-prefix byte is encountered. After identifying the
possible instructions from the opcode, we will then narrow down exactly which
instruction it is by validating them against the set of parsed prefixes.

As noted in `Note [x86_64 disassembly]`, we had to add some special cases for
`nop`-like instructions—namely, `endbr32`, endbr64`, `pause`, and `xchg`—to
avoid some prefix byte–related ambiguity. The new handling for `xchg` is
more accurate than it was before, so this patch fixes #42 as a side effect.
This patch also addresses part (1) of #40 in that it should reduce the amount
of memory usage that the lookup table takes, although there is potentially more
work to be done (see part (2) of #40).
RyanGlScott added a commit that referenced this issue Jun 30, 2022
Previously, the opcode lookup table would encode every possible permutation of
allowable prefixes for each instruction as a separate path. This is expensive
in both space and time, as observed in #40. The new approach taken in this
patch, as described in `Note [x86_64 disassembly]` in `Flexdis86.Disassembler`,
is to only encode the VEX prefixe and opcode bytes in the lookup table, leaving
out all other forms of prefix bytes entirely. Instead, disassembly will start
by eagerly parsing as many prefix bytes as possible, proceeding to parse opcode
bytes after the first non-prefix byte is encountered. After identifying the
possible instructions from the opcode, we will then narrow down exactly which
instruction it is by validating them against the set of parsed prefixes.

As noted in `Note [x86_64 disassembly]`, we had to add some special cases for
`nop`-like instructions—namely, `endbr32`, endbr64`, `pause`, and `xchg`—to
avoid some prefix byte–related ambiguity. The new handling for `xchg` is
more accurate than it was before, so this patch fixes #42 as a side effect.
This patch also addresses part (1) of #40 in that it should reduce the amount
of memory usage that the lookup table takes, although there is potentially more
work to be done (see part (2) of #40).
RyanGlScott added a commit that referenced this issue Jul 1, 2022
Previously, the opcode lookup table would encode every possible permutation of
allowable prefixes for each instruction as a separate path. This is expensive
in both space and time, as observed in #40. The new approach taken in this
patch, as described in `Note [x86_64 disassembly]` in `Flexdis86.Disassembler`,
is to only encode the VEX prefixe and opcode bytes in the lookup table, leaving
out all other forms of prefix bytes entirely. Instead, disassembly will start
by eagerly parsing as many prefix bytes as possible, proceeding to parse opcode
bytes after the first non-prefix byte is encountered. After identifying the
possible instructions from the opcode, we will then narrow down exactly which
instruction it is by validating them against the set of parsed prefixes.

As noted in `Note [x86_64 disassembly]`, we had to add some special cases for
`nop`-like instructions—namely, `endbr32`, endbr64`, `pause`, and `xchg`—to
avoid some prefix byte–related ambiguity. The new handling for `xchg` is
more accurate than it was before, so this patch fixes #42 as a side effect.
This patch also addresses part (1) of #40 in that it should reduce the amount
of memory usage that the lookup table takes, although there is potentially more
work to be done (see part (2) of #40).
@RyanGlScott
Copy link
Contributor

At long last, #43 takes care of part (1). Part (2) may also be worth doing, but I imagine (1) alone will be enough to knock out most of the egregious memory usage, especially in light of #43 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants