Possible alternative to rust-cexpr #1762

jyn514 · 2020-04-10T14:48:13Z

I noticed that you commented on jethrogb/rust-cexpr#23 that you depend on cexpr and don't want it to be unmaintained. As another alternative, I'm currently working on a C compiler in Rust which doesn't have any C dependencies on LLVM. It handles

It does not currently support cross-compiling, but it will soon™.

Let me know if you're interested! I don't currently have the exact API cexpr exposes (preprocessor_expr or something like that), but I could add it without much trouble.

The text was updated successfully, but these errors were encountered:

jyn514 · 2020-04-10T14:50:12Z

This would almost certainly fix #316 and #1594 as well.

emilio · 2020-04-13T00:39:41Z

That looks nice. It would probably depend on how much bloat does it bring in compared to cexpr, but probably could be an opt-in feature otherwise if needed.

I think the biggest issues we have with macros are a bit more intrinsic to macros, like "we don't know if a name refers to a named constant because that's only really known at the time of expansion".

Also note that we only use cexpr for preprocessing so we don't really need or want full compilation. We use libclang for that.

I think the biggest use-case that's missing from cexpr (apart from maybe trying to understand some types, but that's another of the "can't really know until expansion" bits, which seem hard to solve because there may not be any expansion at all) is the "invoking a functional macro with a constant argument". That seems like it could be reasonably handled.

Can you elaborate on how bindgen would use your compiler, and which enhancements or regressions can we expect from that?

jyn514 · 2020-04-13T01:20:39Z

It would probably depend on how much bloat does it bring in compared to cexpr, but probably could be an opt-in feature otherwise if needed.

Yes, that's the major complaint I've had. I can feature-gate all the codegen, but the fewest dependencies I have right now still have a dependency on thiserror which in turn depends on syn and quote:

rcc v0.8.0 (/home/joshua/src/rust/rcc)
├── codespan v0.8.0
│   └── unicode-segmentation v1.6.0
├── counter v0.4.3
│   └── num-traits v0.2.11
│       [build-dependencies]
│       └── autocfg v1.0.0
├── hexponent v0.2.3
├── lazy_static v1.4.0
├── string-interner v0.7.1
├── target-lexicon v0.10.0
└── thiserror v1.0.14
    └── thiserror-impl v1.0.14
        ├── proc-macro2 v1.0.10 (*)
        ├── quote v1.0.3 (*)
        └── syn v1.0.17 (*)

Also note that we only use cexpr for preprocessing so we don't really need or want full compilation. We use libclang for that.

Ok, if you use default-features = false it won't pull in cranelift.

(apart from maybe trying to understand some types, but that's another of the "can't really know until expansion" bits, which seem hard to solve because there may not be any expansion at all)

Hmm that sounds interesting ... I could try expanding object macros and seeing if they form a valid expression and giving you back that type. It won't work all the time but it might work more than it does now? For function macros I could try setting all the arguments to Int(0) or something like that.

"invoking a functional macro with a constant argument". That seems like it could be reasonably handled.

That's handled now, yes.

Can you elaborate on how bindgen would use your compiler, and which enhancements or regressions can we expect from that?

Sure thing. I'd expect bindgen to use rcc to expand C macros into Rust const and statics in contexts that cexpr can't currently handle. If you give rcc the type information from clang, it will also be able to expand casts to typedefs, pointers, arrays, and structs. Note that rcc is a C compiler, not C++, so some of the types will not be possible to translate (mostly classes). rcc also can't handle bitfields, it just ignores them.

Advantages

I listed most of the advantages above but I can go into more detail:

The main advantage would be having a full preprocessor available, which would allow expanding function and object macros (possibly recursively) (Functional C macros are not expanded. #753). From looking at Provide support for C functional macros jethrogb/rust-cexpr#3 it looks like this isn't really possible with cexpr.
rcc would be more resilient to failure than rust-cexpr since it handles overflow/divide by zero/etc. in the constant folding. There's no reason this couldn't be added to cexpr, but it doesn't have these feature currently.
rcc can parse arbitrary C types, so things like #define MPI_REQUEST_NULL ((MPI_Request)0x2c000000) (which is probably some sort of typedef struct, from Provide support for C functional macros jethrogb/rust-cexpr#3 (comment)) can be parsed without trouble. I don't think this would be possible with cexpr.
rcc can lex hexadecimal floats.
rcc distinguishes between signed and unsigned integers.

Regressions

The parsing currently isn't very smart and will stack overflow if you have expressions that are too nested (more than ~200 nested parentheses). This will hopefully be fixed soon.
If rcc has a bug bindgen will also have a bug. For example [ICE] reached unreachable: bug in const_fold or parser jyn514/saltwater#291, [ICE] peek doesn't consider escaped newlines. jyn514/saltwater#314, [ICE] unput doesn't play well with consume_whitespace jyn514/saltwater#361 would cause crashes. These are all edge cases and I doubt they'll appear in real code, but they are possible. The ones I listed here are mostly asserts and I can change them to debug_assert if you like, but then the parsed expression will be incorrect.
Wide characters (L"str") aren't implemented and will currently give an error. I can change this to ignore the L and give back "str" pretty easily, but making a separate wchar_t type would be a bit more work.

I'm not sure how you currently pass expressions to cexpr, but note that if you want rcc to handle backslashes before newlines you might have to pass in multiple lines of C source code. If clang handles this it won't be an issue.

Things that won't change

Token pasting (#, ##) and __VA_ARGS__ aren't currently implemented. If these are important for bindgen I can prioritize them but otherwise I wasn't planning to implement them soon.
Integer suffixes are discarded. This would require some work on my end (I need to change around the Literal type) but is definitely feasible to fix.

In general rcc has been a hobby project up until now so I certainly don't expect you to switch away from cexpr tomorrow. I'm definitely interested in seeing it used more, though :)

emilio · 2020-04-13T02:13:27Z

Well bindgen depends on syn / quote as well so that in particular is not a massive issue. cranelift would be though :)

What we do is here:

rust-bindgen/src/clang.rs

Line 700 in f686751

pub fn cexpr_tokens(self) -> Vec<cexpr::token::Token> {

i.e., we pass the already-tokenized clang input to cexpr. That function has a single caller. We keep the already-parsed macros here:

rust-bindgen/src/ir/context.rs

Line 355 in d65100e

parsed_macros: StdHashMap<Vec<u8>, cexpr::expr::EvalResult>,

We're technically running the preprocessor with clang already, so a third option would be to somehow add / expose the same "try to expand this now" capability to libclang.

But using rcc is also interesting... I'd be curious of how it compares as-is to cexpr. I suspect it wouldn't be particularly hard to tweak those two bits and their uses to fit whatever API you want to expose in rcc :)

jyn514 · 2020-04-13T02:33:50Z

Ok great! Yes cranelift is definitely optional, that shouldn't be an issue.

Here's an idea for what rcc_tokens would look like:

   /// Gets the tokens that correspond to that cursor as  `rcc` tokens.
    pub fn rcc_tokens(self) -> Vec<rcc::Token> {
        use rcc::Token;

        self.tokens()
            .iter()
            .filter_map(|token| {
                let rcc_token = match token.kind {
                    // this looks a little tedious to write, if you like I can implement `TryFrom<&str> for Token`
                    CXToken_Punctuation => match token.spelling() {
                        "+" => Token::Plus,
                        "-" => Token::Minus,
                        "*" => Token::Star,
                        // and so on
                        other => panic!("unknown token {}", other),
                    }
                    CXToken_Literal => rcc::Lexer::new(token.spelling()).next().unwrap(),
                    CXToken_Identifier => Token::Id(token.spelling().into()),
                    CXToken_Keyword => rcc::lex::KEYWORDS.get(token.spelling()).unwrap(),
                    // rcc does not have a comment token
                    CXToken_Comment => return None,
                    _ => {
                        error!("Found unexpected token kind: {:?}", token);
                        return None;
                    }
                };
                Some(rcc_token)
            })
            .collect()
    }

It's a little different since my Tokens aren't just a span, they include the parsed value as well. I'm seeing that this makes literals a little unpleasant since you have to re-lex them. Unfortunately I probably can't change this in the short term, it's a little hard-wired into the parser.

The parsed_macros map would turn into a HashMap<InternedStr, Definition>, where Definition is a custom enum: https://docs.rs/rcc/0.8.0/src/rcc/lex/cpp.rs.html#145. Other than that it would probably look the same.

Note that this is all psuedocode because most of the API you'd need isn't currently public, I can change that in the next release of rcc.

Let me know if there's anything else you're looking for!

jyn514 · 2020-05-14T01:15:38Z

I was working on this a little bit and ran on a problem that I'm not sure the right way to solve. In rcc, the parser takes a Locatable<Token> stream, where Locatable attaches a Location to the Token. I was looking around ClangToken, but I don't see anything like a location there.

For context, locations are used for providing highlighting for errors in the C code (really they should be called Spans). Does bindgen have anything like this or should I redesign the API? I'm not sure what it would look like ... do you care about error messages at all or would you be essentially calling unwrap() on all the results?

emilio · 2020-05-14T02:10:00Z

Clang tokens do have locations though they may not be exposed right now. Should be straight-forward to expose it (cursors already expose a location) see the docs for clang_getTokenLocation.

kulp · 2022-03-16T01:23:19Z

I am inferring from saltwater@097c72d30 that this particular issue is no longer live.

Consequently I am closing the issue.

jyn514 · 2022-07-31T15:34:15Z

FWIW, I still think this is useful - would be willing to work on it some more if I could have some hope reviews would take less than 3 months between each look ...

pythongirl325 mentioned this issue Apr 26, 2020

Query-based Architecture jyn514/saltwater#382

Open

jyn514 mentioned this issue Apr 27, 2020

The lexer should only return a span and a token-kind jyn514/saltwater#384

Closed

jyn514 mentioned this issue May 13, 2020

Make the top-level API much easier to use jyn514/saltwater#431

Merged

jyn514 mentioned this issue May 14, 2020

[WIP] add saltwater backend #1782

Closed

kulp closed this as completed Mar 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible alternative to rust-cexpr #1762

Possible alternative to rust-cexpr #1762

jyn514 commented Apr 10, 2020

jyn514 commented Apr 10, 2020 •

edited

Loading

emilio commented Apr 13, 2020

jyn514 commented Apr 13, 2020 •

edited

Loading

emilio commented Apr 13, 2020

jyn514 commented Apr 13, 2020

jyn514 commented May 14, 2020

emilio commented May 14, 2020

kulp commented Mar 16, 2022

jyn514 commented Jul 31, 2022

Possible alternative to rust-cexpr #1762

Possible alternative to rust-cexpr #1762

Comments

jyn514 commented Apr 10, 2020

jyn514 commented Apr 10, 2020 • edited Loading

emilio commented Apr 13, 2020

jyn514 commented Apr 13, 2020 • edited Loading

Advantages

Regressions

Things that won't change

emilio commented Apr 13, 2020

jyn514 commented Apr 13, 2020

jyn514 commented May 14, 2020

emilio commented May 14, 2020

kulp commented Mar 16, 2022

jyn514 commented Jul 31, 2022

jyn514 commented Apr 10, 2020 •

edited

Loading

jyn514 commented Apr 13, 2020 •

edited

Loading