Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible alternative to rust-cexpr #1762

Closed
jyn514 opened this issue Apr 10, 2020 · 9 comments
Closed

Possible alternative to rust-cexpr #1762

jyn514 opened this issue Apr 10, 2020 · 9 comments

Comments

@jyn514
Copy link
Member

jyn514 commented Apr 10, 2020

I noticed that you commented on jethrogb/rust-cexpr#23 that you depend on cexpr and don't want it to be unmaintained. As another alternative, I'm currently working on a C compiler in Rust which doesn't have any C dependencies on LLVM. It handles

It does not currently support cross-compiling, but it will soon™.

Let me know if you're interested! I don't currently have the exact API cexpr exposes (preprocessor_expr or something like that), but I could add it without much trouble.

@jyn514
Copy link
Member Author

jyn514 commented Apr 10, 2020

This would almost certainly fix #316 and #1594 as well.

@emilio
Copy link
Contributor

emilio commented Apr 13, 2020

That looks nice. It would probably depend on how much bloat does it bring in compared to cexpr, but probably could be an opt-in feature otherwise if needed.

I think the biggest issues we have with macros are a bit more intrinsic to macros, like "we don't know if a name refers to a named constant because that's only really known at the time of expansion".

Also note that we only use cexpr for preprocessing so we don't really need or want full compilation. We use libclang for that.

I think the biggest use-case that's missing from cexpr (apart from maybe trying to understand some types, but that's another of the "can't really know until expansion" bits, which seem hard to solve because there may not be any expansion at all) is the "invoking a functional macro with a constant argument". That seems like it could be reasonably handled.

Can you elaborate on how bindgen would use your compiler, and which enhancements or regressions can we expect from that?

@jyn514
Copy link
Member Author

jyn514 commented Apr 13, 2020

It would probably depend on how much bloat does it bring in compared to cexpr, but probably could be an opt-in feature otherwise if needed.

Yes, that's the major complaint I've had. I can feature-gate all the codegen, but the fewest dependencies I have right now still have a dependency on thiserror which in turn depends on syn and quote:

rcc v0.8.0 (/home/joshua/src/rust/rcc)
├── codespan v0.8.0
│   └── unicode-segmentation v1.6.0
├── counter v0.4.3
│   └── num-traits v0.2.11
│       [build-dependencies]
│       └── autocfg v1.0.0
├── hexponent v0.2.3
├── lazy_static v1.4.0
├── string-interner v0.7.1
├── target-lexicon v0.10.0
└── thiserror v1.0.14
    └── thiserror-impl v1.0.14
        ├── proc-macro2 v1.0.10 (*)
        ├── quote v1.0.3 (*)
        └── syn v1.0.17 (*)

Also note that we only use cexpr for preprocessing so we don't really need or want full compilation. We use libclang for that.

Ok, if you use default-features = false it won't pull in cranelift.

(apart from maybe trying to understand some types, but that's another of the "can't really know until expansion" bits, which seem hard to solve because there may not be any expansion at all)

Hmm that sounds interesting ... I could try expanding object macros and seeing if they form a valid expression and giving you back that type. It won't work all the time but it might work more than it does now? For function macros I could try setting all the arguments to Int(0) or something like that.

"invoking a functional macro with a constant argument". That seems like it could be reasonably handled.

That's handled now, yes.

Can you elaborate on how bindgen would use your compiler, and which enhancements or regressions can we expect from that?

Sure thing. I'd expect bindgen to use rcc to expand C macros into Rust const and statics in contexts that cexpr can't currently handle. If you give rcc the type information from clang, it will also be able to expand casts to typedefs, pointers, arrays, and structs. Note that rcc is a C compiler, not C++, so some of the types will not be possible to translate (mostly classes). rcc also can't handle bitfields, it just ignores them.

Advantages

I listed most of the advantages above but I can go into more detail:

  • The main advantage would be having a full preprocessor available, which would allow expanding function and object macros (possibly recursively) (Functional C macros are not expanded. #753). From looking at Provide support for C functional macros jethrogb/rust-cexpr#3 it looks like this isn't really possible with cexpr.
  • rcc would be more resilient to failure than rust-cexpr since it handles overflow/divide by zero/etc. in the constant folding. There's no reason this couldn't be added to cexpr, but it doesn't have these feature currently.
  • rcc can parse arbitrary C types, so things like #define MPI_REQUEST_NULL ((MPI_Request)0x2c000000) (which is probably some sort of typedef struct, from Provide support for C functional macros jethrogb/rust-cexpr#3 (comment)) can be parsed without trouble. I don't think this would be possible with cexpr.
  • rcc can lex hexadecimal floats.
  • rcc distinguishes between signed and unsigned integers.

Regressions

I'm not sure how you currently pass expressions to cexpr, but note that if you want rcc to handle backslashes before newlines you might have to pass in multiple lines of C source code. If clang handles this it won't be an issue.

Things that won't change

  • Token pasting (#, ##) and __VA_ARGS__ aren't currently implemented. If these are important for bindgen I can prioritize them but otherwise I wasn't planning to implement them soon.
  • Integer suffixes are discarded. This would require some work on my end (I need to change around the Literal type) but is definitely feasible to fix.

In general rcc has been a hobby project up until now so I certainly don't expect you to switch away from cexpr tomorrow. I'm definitely interested in seeing it used more, though :)

@emilio
Copy link
Contributor

emilio commented Apr 13, 2020

Well bindgen depends on syn / quote as well so that in particular is not a massive issue. cranelift would be though :)

What we do is here:

pub fn cexpr_tokens(self) -> Vec<cexpr::token::Token> {

i.e., we pass the already-tokenized clang input to cexpr. That function has a single caller. We keep the already-parsed macros here:

parsed_macros: StdHashMap<Vec<u8>, cexpr::expr::EvalResult>,

We're technically running the preprocessor with clang already, so a third option would be to somehow add / expose the same "try to expand this now" capability to libclang.

But using rcc is also interesting... I'd be curious of how it compares as-is to cexpr. I suspect it wouldn't be particularly hard to tweak those two bits and their uses to fit whatever API you want to expose in rcc :)

@jyn514
Copy link
Member Author

jyn514 commented Apr 13, 2020

Ok great! Yes cranelift is definitely optional, that shouldn't be an issue.

Here's an idea for what rcc_tokens would look like:

   /// Gets the tokens that correspond to that cursor as  `rcc` tokens.
    pub fn rcc_tokens(self) -> Vec<rcc::Token> {
        use rcc::Token;

        self.tokens()
            .iter()
            .filter_map(|token| {
                let rcc_token = match token.kind {
                    // this looks a little tedious to write, if you like I can implement `TryFrom<&str> for Token`
                    CXToken_Punctuation => match token.spelling() {
                        "+" => Token::Plus,
                        "-" => Token::Minus,
                        "*" => Token::Star,
                        // and so on
                        other => panic!("unknown token {}", other),
                    }
                    CXToken_Literal => rcc::Lexer::new(token.spelling()).next().unwrap(),
                    CXToken_Identifier => Token::Id(token.spelling().into()),
                    CXToken_Keyword => rcc::lex::KEYWORDS.get(token.spelling()).unwrap(),
                    // rcc does not have a comment token
                    CXToken_Comment => return None,
                    _ => {
                        error!("Found unexpected token kind: {:?}", token);
                        return None;
                    }
                };
                Some(rcc_token)
            })
            .collect()
    }

It's a little different since my Tokens aren't just a span, they include the parsed value as well. I'm seeing that this makes literals a little unpleasant since you have to re-lex them. Unfortunately I probably can't change this in the short term, it's a little hard-wired into the parser.

The parsed_macros map would turn into a HashMap<InternedStr, Definition>, where Definition is a custom enum: https://docs.rs/rcc/0.8.0/src/rcc/lex/cpp.rs.html#145. Other than that it would probably look the same.

Note that this is all psuedocode because most of the API you'd need isn't currently public, I can change that in the next release of rcc.

Let me know if there's anything else you're looking for!

@jyn514
Copy link
Member Author

jyn514 commented May 14, 2020

I was working on this a little bit and ran on a problem that I'm not sure the right way to solve. In rcc, the parser takes a Locatable<Token> stream, where Locatable attaches a Location to the Token. I was looking around ClangToken, but I don't see anything like a location there.

For context, locations are used for providing highlighting for errors in the C code (really they should be called Spans). Does bindgen have anything like this or should I redesign the API? I'm not sure what it would look like ... do you care about error messages at all or would you be essentially calling unwrap() on all the results?

@emilio
Copy link
Contributor

emilio commented May 14, 2020

Clang tokens do have locations though they may not be exposed right now. Should be straight-forward to expose it (cursors already expose a location) see the docs for clang_getTokenLocation.

@kulp
Copy link
Member

kulp commented Mar 16, 2022

I am inferring from saltwater@097c72d30 that this particular issue is no longer live.

Consequently I am closing the issue.

@kulp kulp closed this as completed Mar 16, 2022
@jyn514
Copy link
Member Author

jyn514 commented Jul 31, 2022

FWIW, I still think this is useful - would be willing to work on it some more if I could have some hope reviews would take less than 3 months between each look ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants