`llvm-pretty-bc-parser` drops constant `ptrtoint` expression #266

RyanGlScott · 2024-01-29T11:35:51Z

If you take this program:

@global_var = external constant [1 x i8]

define i64 @h() {
  br i1 icmp ne (i32 ptrtoint ([1 x i8]* @global_var to i32), i32 1), label %pc_1, label %pc_1
pc_1:
  ret i64 0
}

Compile it to bitcode using llvm-as and roundtrip it through llvm-pretty-bc-parser, you'll get:

λ> parseBitCodeFromFile "test2.bc" >>= either print (print . ppLLVM38 ppModule)
source_filename = "test2.ll"
target triple = "unknown-unknown-unknown-unknown-"
@global_var = external default constant [1 x i8]
define default i64 @h() {
; <label>: 0
  br i1 icmp ne ([1 x i8]* @global_var, i32 1), label %pc_1, label %pc_1
pc_1:
  ret i64 0
}

Notice that h no longer uses a constant ptrtoint expression in the first argument to icmp. This results in ill-typed LLVM code, which llvm-as rejects:

$ llvm-as test2-rt.ll 
llvm-as: test2-rt.ll:6:9: error: compare operands must have the same type
  br i1 icmp ne ([1 x i8]* @global_var, i32 1), label %pc_1, label %pc_1
        ^

This was originally noticed in #262 (comment), with a slightly different test case that uses opaque pointers.

The text was updated successfully, but these errors were encountered:

RyanGlScott · 2024-01-29T11:36:23Z

I suspect the culprit involves how llvm-pretty-bc-parser parses constant icmp expressions:

llvm-pretty-bc-parser/src/Data/LLVM/BitCode/IR/Constants.hs

Lines 345 to 360 in 39b4a5f

    
           -- [opty, opval, opval, pred] 
        
           17 -> label "CST_CODE_CE_CMP" $ do 
        
             let field = parseField r 
        
             opty <- getType                  =<< field 0 numeric 
        
             op0  <- getConstantFwdRefAdjustedId t opty =<< field 1 numeric 
        
             op1  <- getConstantFwdRefAdjustedId t opty =<< field 2 numeric 
        
             let isFloat = isPrimTypeOf isFloatingPoint 
        
             cst <- if isFloat opty || isVectorOf isFloat opty 
        
                       then do op <- field 3 fcmpOp 
        
                               return (ConstFCmp op op0 op1) 
        
                       else do op <- field 3 icmpOp 
        
                               return (ConstICmp op op0 op1) 
        
             return (getTy, Typed (PrimType (Integer 1)) (ValConstExpr cst):cs)

The suspicious part is the use of the getConstantFwdRefAdjustedId function. I'm not entirely convinced that the IDs which field returns are actually adjusted. Moreover, these are the only two uses of getConstantFwdRefAdjustedId in the entire module; all other constant expressions use forwardRef to parse their arguments instead. Indeed, if I make this change:

diff --git a/src/Data/LLVM/BitCode/IR/Constants.hs b/src/Data/LLVM/BitCode/IR/Constants.hs
index 24e1600..942037e 100644
--- a/src/Data/LLVM/BitCode/IR/Constants.hs
+++ b/src/Data/LLVM/BitCode/IR/Constants.hs
@@ -8,7 +8,6 @@ module Data.LLVM.BitCode.IR.Constants where
 
 import qualified Data.LLVM.BitCode.Assert as Assert
 import           Data.LLVM.BitCode.Bitstream
-import           Data.LLVM.BitCode.IR.Values
 import           Data.LLVM.BitCode.Match
 import           Data.LLVM.BitCode.Parse
 import           Data.LLVM.BitCode.Record
@@ -345,9 +344,12 @@ parseConstantEntry t (getTy,cs) (fromEntry -> Just r) =
   -- [opty, opval, opval, pred]
   17 -> label "CST_CODE_CE_CMP" $ do
     let field = parseField r
-    opty <- getType                  =<< field 0 numeric
-    op0  <- getConstantFwdRefAdjustedId t opty =<< field 1 numeric
-    op1  <- getConstantFwdRefAdjustedId t opty =<< field 2 numeric
+    opty <- getType =<< field 0 numeric
+    ix0  <- field 1 numeric
+    ix1  <- field 2 numeric
+    cxt  <- getContext
+    let op0 = forwardRef cxt ix0 t
+    let op1 = forwardRef cxt ix1 t
 
     let isFloat = isPrimTypeOf isFloatingPoint
     cst <- if isFloat opty || isVectorOf isFloat opty

Then round-tripping through llvm-pretty-bc-parser suceeds. I'm not sure if I 100% understand why that fixes the issue (as I'm not entirely sure what an "adjusted ID" means in this context), but there's likely something to this.

RyanGlScott · 2024-01-30T15:24:06Z

I did a little more investigation into this:

The term "adjusted ID" refers to the particular encoding for a value's ID in the value table. There are two types of encodings: absolute encodings and relative (or adjusted) encodings, both of which are described in the LLVM documentation here. The absolute encoding assigns each value in a function a globally unique ID, whereas the relative encoding assigns each value in a function an ID relative to the current instruction. For example, the first operand in an instruction would be given a relative ID of 1, the second operand would be given a relative ID of 2, and so on.
To make things more confusing, not all parts of an LLVM bitcode file use the same encoding. As noted above, instruction operands definitely use the relative encoding. This is why the getValue and getValueTypePair functions (used pervasively throughout Data.LLVM.BitCode.IR.Function to parse instruction operands) are defined in terms of adjustId, which is llvm-pretty-bc-parser's implementation of the relative encoding.

But what about constant expressions? The docs are somewhat unclear on this point, so I needed to do some further digging.

The commit which introduced the use of getConstantFwdRefAdjustedId to parse the operands of a constant comparison expression was a4b683e. Unfortunately, that commit doesn't have any test cases or accompanying comments explaining what the thinking was at the time. Looking at the implementations of getConstantFwdRef and getConstantFwdRefAdjustedId, however, it's easy to see that getConstantFwdRef had a glaring flaw:

llvm-pretty-bc-parser/src/Data/LLVM/BitCode/IR/Values.hs

Lines 22 to 37 in 39b4a5f

    
           -- | Return a forward reference if the value is not in the incremental table. 
        
           getConstantFwdRef :: ValueTable -> Type -> Int -> Parse (Typed PValue) 
        
           getConstantFwdRef t ty n = label "getConstantFwdRef" $ 
        
               adjustId n >>= getConstantFwdRefAdjustedId t ty 
        
           getConstantFwdRefAdjustedId :: ValueTable -> Type -> Int -> Parse (Typed PValue) 
        
           getConstantFwdRefAdjustedId t ty n' = label "getConstantFwdRefAdjustedId" $ do 
        
             mb <- lookupValue n' 
        
             case mb of 
        
               Just tv -> return tv 
        
               -- forward reference 
        
               Nothing -> do 
        
                 cxt <- getContext 
        
                 let ref = forwardRef cxt n' t 
        
                 return (Typed ty (typedValue ref))

getConstantFwdRefAdjustedId is implemented in terms of lookupValue, which transitively invokes a function that performs the relative encoding. getConstantFwdRef calls adjustedId and then getConstantFwdRefAdjustedId, which means that it is performing the relative encoding twice! This is clearly wrong, and indeed, the parser would simply crash on the old version of the code that used getConstantFwdRef. While I don't think using getConstantFwdRefAdjustedId is completely correct, it's definitely more correct than using getConstantFwdRef. (We should just delete getConstantFwdRef, as I can't ever foresee a situation in which you'd want to use it.)

OK, back to the original question: what encoding do we use for parsing constant expressions? As mentioned above, the LLVM documentation is silent on this point, and the LLVM source code doesn't make this terribly obvious. Here is what llvm-bcanalyzer -dump has to say about the h function's constants block:

  <CONSTANTS_BLOCK NumWords=4 BlockCodeSize=4>
    <SETTYPE abbrevid=4 op0=8/>
    <CE_CAST abbrevid=6 op0=9 op1=2 op2=0/>
    <INTEGER abbrevid=5 op0=2/>
    <SETTYPE abbrevid=4 op0=7/>
    <CE_CMP op0=8 op1=2 op2=3 op3=33/>
    <SETTYPE abbrevid=4 op0=3/>
    <NULL/>
  </CONSTANTS_BLOCK>

Here, CE_CMP corresponds to the constant icmp expression, and the IDs of the two constant expressions being compared are op1=2 and op2=3 (debug-tracing the code in Data.LLVM.BitCode.IR.Constants confirms this). Here is what the value table looks like after parsing the constants block:

  valueEntries =
    fromList
      [ ( 0
        , Typed
            { typedType = PtrTo (Array 1 (PrimType (Integer 8)))
            , typedValue = ValSymbol (Symbol "global_var")
            }
        )
      , ( 1
        , Typed
            { typedType = PtrTo (FunTy (PrimType (Integer 64)) [] False)
            , typedValue = ValSymbol (Symbol "h")
            }
        )
      , ( 2
        , Typed
            { typedType = PrimType (Integer 32)
            , typedValue =
                ValConstExpr
                  (ConstConv
                     PtrToInt
                     Typed
                       { typedType = PtrTo (Array 1 (PrimType (Integer 8)))
                       , typedValue = ValSymbol (Symbol "global_var")
                       }
                     (PrimType (Integer 32)))
            }
        )
      , ( 3
        , Typed
            { typedType = PrimType (Integer 32) , typedValue = ValInteger 1 }
        )
      , ( 4
        , Typed
            { typedType = PrimType (Integer 1)
            , typedValue =
                ValConstExpr
                  (ConstICmp
                     Ine
                     Typed
                       { typedType = PrimType (Integer 32)
                       , typedValue =
                           ValConstExpr
                             (ConstConv
                                PtrToInt
                                Typed
                                  { typedType = PtrTo (Array 1 (PrimType (Integer 8)))
                                  , typedValue = ValSymbol (Symbol "global_var")
                                  }
                                (PrimType (Integer 32)))
                       }
                     Typed
                       { typedType = PrimType (Integer 32) , typedValue = ValInteger 1 })
            }
        )
      , ( 5
        , Typed
            { typedType = PrimType (Integer 64) , typedValue = ValInteger 0 }
        )
      ]

And wouldn't you know it, the entries with IDs 2 and 3 correspond exactly to i32 ptrtoint ([1 x i8]* @global_var to i32) and i32 1, which suggests that these IDs use the absolute encoding. The fact that getConstantFwdRefAdjustedId successfully parsed any values at all (the wrong values, but values nonetheless) is a fluke.

Based on these findings, I feel much more confident in concluding that the operands of constant expressions use the absolute encodings, which suggests that they should use forwardRef (which is absolute encoding–based) instead of getConstantFwdRefAdjustedId (which is relative encoding–based). I'll submit a PR with a fix.

This function is unused and, as noted in #266 (comment), is almost certainly incorrect.

As noted in #266 (comment), the operand IDs in constant expressions use an absolute encoding. This means that parsing constant `icmp` expressions with `getConstantFwdRefAdjustedId` (which assumes a relative encoding) is incorrect. All forms of constant expressions besides constant `icmp` expressions use `forwardRef` (which assumes an absolute encoding) to parse their value operands, so constant `icmp` expressions are the odd one out. This patch fixes the parsing of constant `icmp` expression operands by (1) removing the `getConstantFwdRefAdjustedId` function, and (2) switching to using `forwardRef`. Fixes #266.

kquick · 2024-01-30T16:58:31Z

Nice analysis. I wish the in-file encoding indicated which type of encoding was being presented so that we could deterministically parse one or the other instead of just hoping we do it the same way that llvm is doing it.

This function is unused and, as noted in #266 (comment), is almost certainly incorrect.

As noted in #266 (comment), the operand IDs in constant expressions use an absolute encoding. This means that parsing constant `icmp` expressions with `getConstantFwdRefAdjustedId` (which assumes a relative encoding) is incorrect. All forms of constant expressions besides constant `icmp` expressions use `forwardRef` (which assumes an absolute encoding) to parse their value operands, so constant `icmp` expressions are the odd one out. This patch fixes the parsing of constant `icmp` expression operands by (1) removing the `getConstantFwdRefAdjustedId` function, and (2) switching to using `forwardRef`. Fixes #266.

RyanGlScott added the bug label Jan 29, 2024

RyanGlScott mentioned this issue Jan 29, 2024

Invalid pretty output ptr* #262

Open

RyanGlScott added a commit that referenced this issue Jan 30, 2024

Remove getConstantFwdRef function

3d25190

This function is unused and, as noted in #266 (comment), is almost certainly incorrect.

RyanGlScott mentioned this issue Jan 30, 2024

Parse constant cmp operand IDs using absolute encoding #267

Merged

RyanGlScott added a commit that referenced this issue Jan 30, 2024

Remove getConstantFwdRef function

8e787d9

This function is unused and, as noted in #266 (comment), is almost certainly incorrect.

RyanGlScott closed this as completed in #267 Jan 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`llvm-pretty-bc-parser` drops constant `ptrtoint` expression #266

`llvm-pretty-bc-parser` drops constant `ptrtoint` expression #266

RyanGlScott commented Jan 29, 2024

RyanGlScott commented Jan 29, 2024

RyanGlScott commented Jan 30, 2024 •

edited

Loading

kquick commented Jan 30, 2024

llvm-pretty-bc-parser drops constant ptrtoint expression #266

llvm-pretty-bc-parser drops constant ptrtoint expression #266

Comments

RyanGlScott commented Jan 29, 2024

RyanGlScott commented Jan 29, 2024

RyanGlScott commented Jan 30, 2024 • edited Loading

kquick commented Jan 30, 2024

`llvm-pretty-bc-parser` drops constant `ptrtoint` expression #266

`llvm-pretty-bc-parser` drops constant `ptrtoint` expression #266

RyanGlScott commented Jan 30, 2024 •

edited

Loading