Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llvm-pretty-bc-parser drops constant ptrtoint expression #266

Closed
RyanGlScott opened this issue Jan 29, 2024 · 3 comments · Fixed by #267
Closed

llvm-pretty-bc-parser drops constant ptrtoint expression #266

RyanGlScott opened this issue Jan 29, 2024 · 3 comments · Fixed by #267
Labels

Comments

@RyanGlScott
Copy link
Contributor

If you take this program:

@global_var = external constant [1 x i8]

define i64 @h() {
  br i1 icmp ne (i32 ptrtoint ([1 x i8]* @global_var to i32), i32 1), label %pc_1, label %pc_1
pc_1:
  ret i64 0
}

Compile it to bitcode using llvm-as and roundtrip it through llvm-pretty-bc-parser, you'll get:

λ> parseBitCodeFromFile "test2.bc" >>= either print (print . ppLLVM38 ppModule)
source_filename = "test2.ll"
target triple = "unknown-unknown-unknown-unknown-"
@global_var = external default constant [1 x i8]
define default i64 @h() {
; <label>: 0
  br i1 icmp ne ([1 x i8]* @global_var, i32 1), label %pc_1, label %pc_1
pc_1:
  ret i64 0
}

Notice that h no longer uses a constant ptrtoint expression in the first argument to icmp. This results in ill-typed LLVM code, which llvm-as rejects:

$ llvm-as test2-rt.ll 
llvm-as: test2-rt.ll:6:9: error: compare operands must have the same type
  br i1 icmp ne ([1 x i8]* @global_var, i32 1), label %pc_1, label %pc_1
        ^

This was originally noticed in #262 (comment), with a slightly different test case that uses opaque pointers.

@RyanGlScott
Copy link
Contributor Author

I suspect the culprit involves how llvm-pretty-bc-parser parses constant icmp expressions:

-- [opty, opval, opval, pred]
17 -> label "CST_CODE_CE_CMP" $ do
let field = parseField r
opty <- getType =<< field 0 numeric
op0 <- getConstantFwdRefAdjustedId t opty =<< field 1 numeric
op1 <- getConstantFwdRefAdjustedId t opty =<< field 2 numeric
let isFloat = isPrimTypeOf isFloatingPoint
cst <- if isFloat opty || isVectorOf isFloat opty
then do op <- field 3 fcmpOp
return (ConstFCmp op op0 op1)
else do op <- field 3 icmpOp
return (ConstICmp op op0 op1)
return (getTy, Typed (PrimType (Integer 1)) (ValConstExpr cst):cs)

The suspicious part is the use of the getConstantFwdRefAdjustedId function. I'm not entirely convinced that the IDs which field returns are actually adjusted. Moreover, these are the only two uses of getConstantFwdRefAdjustedId in the entire module; all other constant expressions use forwardRef to parse their arguments instead. Indeed, if I make this change:

diff --git a/src/Data/LLVM/BitCode/IR/Constants.hs b/src/Data/LLVM/BitCode/IR/Constants.hs
index 24e1600..942037e 100644
--- a/src/Data/LLVM/BitCode/IR/Constants.hs
+++ b/src/Data/LLVM/BitCode/IR/Constants.hs
@@ -8,7 +8,6 @@ module Data.LLVM.BitCode.IR.Constants where
 
 import qualified Data.LLVM.BitCode.Assert as Assert
 import           Data.LLVM.BitCode.Bitstream
-import           Data.LLVM.BitCode.IR.Values
 import           Data.LLVM.BitCode.Match
 import           Data.LLVM.BitCode.Parse
 import           Data.LLVM.BitCode.Record
@@ -345,9 +344,12 @@ parseConstantEntry t (getTy,cs) (fromEntry -> Just r) =
   -- [opty, opval, opval, pred]
   17 -> label "CST_CODE_CE_CMP" $ do
     let field = parseField r
-    opty <- getType                  =<< field 0 numeric
-    op0  <- getConstantFwdRefAdjustedId t opty =<< field 1 numeric
-    op1  <- getConstantFwdRefAdjustedId t opty =<< field 2 numeric
+    opty <- getType =<< field 0 numeric
+    ix0  <- field 1 numeric
+    ix1  <- field 2 numeric
+    cxt  <- getContext
+    let op0 = forwardRef cxt ix0 t
+    let op1 = forwardRef cxt ix1 t
 
     let isFloat = isPrimTypeOf isFloatingPoint
     cst <- if isFloat opty || isVectorOf isFloat opty

Then round-tripping through llvm-pretty-bc-parser suceeds. I'm not sure if I 100% understand why that fixes the issue (as I'm not entirely sure what an "adjusted ID" means in this context), but there's likely something to this.

@RyanGlScott
Copy link
Contributor Author

RyanGlScott commented Jan 30, 2024

I did a little more investigation into this:

  • The term "adjusted ID" refers to the particular encoding for a value's ID in the value table. There are two types of encodings: absolute encodings and relative (or adjusted) encodings, both of which are described in the LLVM documentation here. The absolute encoding assigns each value in a function a globally unique ID, whereas the relative encoding assigns each value in a function an ID relative to the current instruction. For example, the first operand in an instruction would be given a relative ID of 1, the second operand would be given a relative ID of 2, and so on.

  • To make things more confusing, not all parts of an LLVM bitcode file use the same encoding. As noted above, instruction operands definitely use the relative encoding. This is why the getValue and getValueTypePair functions (used pervasively throughout Data.LLVM.BitCode.IR.Function to parse instruction operands) are defined in terms of adjustId, which is llvm-pretty-bc-parser's implementation of the relative encoding.

    But what about constant expressions? The docs are somewhat unclear on this point, so I needed to do some further digging.

  • The commit which introduced the use of getConstantFwdRefAdjustedId to parse the operands of a constant comparison expression was a4b683e. Unfortunately, that commit doesn't have any test cases or accompanying comments explaining what the thinking was at the time. Looking at the implementations of getConstantFwdRef and getConstantFwdRefAdjustedId, however, it's easy to see that getConstantFwdRef had a glaring flaw:

    -- | Return a forward reference if the value is not in the incremental table.
    getConstantFwdRef :: ValueTable -> Type -> Int -> Parse (Typed PValue)
    getConstantFwdRef t ty n = label "getConstantFwdRef" $
    adjustId n >>= getConstantFwdRefAdjustedId t ty
    getConstantFwdRefAdjustedId :: ValueTable -> Type -> Int -> Parse (Typed PValue)
    getConstantFwdRefAdjustedId t ty n' = label "getConstantFwdRefAdjustedId" $ do
    mb <- lookupValue n'
    case mb of
    Just tv -> return tv
    -- forward reference
    Nothing -> do
    cxt <- getContext
    let ref = forwardRef cxt n' t
    return (Typed ty (typedValue ref))

    getConstantFwdRefAdjustedId is implemented in terms of lookupValue, which transitively invokes a function that performs the relative encoding. getConstantFwdRef calls adjustedId and then getConstantFwdRefAdjustedId, which means that it is performing the relative encoding twice! This is clearly wrong, and indeed, the parser would simply crash on the old version of the code that used getConstantFwdRef. While I don't think using getConstantFwdRefAdjustedId is completely correct, it's definitely more correct than using getConstantFwdRef. (We should just delete getConstantFwdRef, as I can't ever foresee a situation in which you'd want to use it.)

  • OK, back to the original question: what encoding do we use for parsing constant expressions? As mentioned above, the LLVM documentation is silent on this point, and the LLVM source code doesn't make this terribly obvious. Here is what llvm-bcanalyzer -dump has to say about the h function's constants block:

      <CONSTANTS_BLOCK NumWords=4 BlockCodeSize=4>
        <SETTYPE abbrevid=4 op0=8/>
        <CE_CAST abbrevid=6 op0=9 op1=2 op2=0/>
        <INTEGER abbrevid=5 op0=2/>
        <SETTYPE abbrevid=4 op0=7/>
        <CE_CMP op0=8 op1=2 op2=3 op3=33/>
        <SETTYPE abbrevid=4 op0=3/>
        <NULL/>
      </CONSTANTS_BLOCK>

    Here, CE_CMP corresponds to the constant icmp expression, and the IDs of the two constant expressions being compared are op1=2 and op2=3 (debug-tracing the code in Data.LLVM.BitCode.IR.Constants confirms this). Here is what the value table looks like after parsing the constants block:

      valueEntries =
        fromList
          [ ( 0
            , Typed
                { typedType = PtrTo (Array 1 (PrimType (Integer 8)))
                , typedValue = ValSymbol (Symbol "global_var")
                }
            )
          , ( 1
            , Typed
                { typedType = PtrTo (FunTy (PrimType (Integer 64)) [] False)
                , typedValue = ValSymbol (Symbol "h")
                }
            )
          , ( 2
            , Typed
                { typedType = PrimType (Integer 32)
                , typedValue =
                    ValConstExpr
                      (ConstConv
                         PtrToInt
                         Typed
                           { typedType = PtrTo (Array 1 (PrimType (Integer 8)))
                           , typedValue = ValSymbol (Symbol "global_var")
                           }
                         (PrimType (Integer 32)))
                }
            )
          , ( 3
            , Typed
                { typedType = PrimType (Integer 32) , typedValue = ValInteger 1 }
            )
          , ( 4
            , Typed
                { typedType = PrimType (Integer 1)
                , typedValue =
                    ValConstExpr
                      (ConstICmp
                         Ine
                         Typed
                           { typedType = PrimType (Integer 32)
                           , typedValue =
                               ValConstExpr
                                 (ConstConv
                                    PtrToInt
                                    Typed
                                      { typedType = PtrTo (Array 1 (PrimType (Integer 8)))
                                      , typedValue = ValSymbol (Symbol "global_var")
                                      }
                                    (PrimType (Integer 32)))
                           }
                         Typed
                           { typedType = PrimType (Integer 32) , typedValue = ValInteger 1 })
                }
            )
          , ( 5
            , Typed
                { typedType = PrimType (Integer 64) , typedValue = ValInteger 0 }
            )
          ]

    And wouldn't you know it, the entries with IDs 2 and 3 correspond exactly to i32 ptrtoint ([1 x i8]* @global_var to i32) and i32 1, which suggests that these IDs use the absolute encoding. The fact that getConstantFwdRefAdjustedId successfully parsed any values at all (the wrong values, but values nonetheless) is a fluke.

Based on these findings, I feel much more confident in concluding that the operands of constant expressions use the absolute encodings, which suggests that they should use forwardRef (which is absolute encoding–based) instead of getConstantFwdRefAdjustedId (which is relative encoding–based). I'll submit a PR with a fix.

RyanGlScott added a commit that referenced this issue Jan 30, 2024
This function is unused and, as noted in
#266 (comment),
is almost certainly incorrect.
RyanGlScott added a commit that referenced this issue Jan 30, 2024
As noted in
#266 (comment),
the operand IDs in constant expressions use an absolute encoding. This means
that parsing constant `icmp` expressions with `getConstantFwdRefAdjustedId`
(which assumes a relative encoding) is incorrect. All forms of constant
expressions besides constant `icmp` expressions use `forwardRef` (which assumes
an absolute encoding) to parse their value operands, so constant `icmp`
expressions are the odd one out.

This patch fixes the parsing of constant `icmp` expression operands by (1)
removing the `getConstantFwdRefAdjustedId` function, and (2) switching to using
`forwardRef`.

Fixes #266.
@kquick
Copy link
Member

kquick commented Jan 30, 2024

Nice analysis. I wish the in-file encoding indicated which type of encoding was being presented so that we could deterministically parse one or the other instead of just hoping we do it the same way that llvm is doing it.

RyanGlScott added a commit that referenced this issue Jan 30, 2024
This function is unused and, as noted in
#266 (comment),
is almost certainly incorrect.
RyanGlScott added a commit that referenced this issue Jan 30, 2024
As noted in
#266 (comment),
the operand IDs in constant expressions use an absolute encoding. This means
that parsing constant `icmp` expressions with `getConstantFwdRefAdjustedId`
(which assumes a relative encoding) is incorrect. All forms of constant
expressions besides constant `icmp` expressions use `forwardRef` (which assumes
an absolute encoding) to parse their value operands, so constant `icmp`
expressions are the odd one out.

This patch fixes the parsing of constant `icmp` expression operands by (1)
removing the `getConstantFwdRefAdjustedId` function, and (2) switching to using
`forwardRef`.

Fixes #266.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants