Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Line directive emitted in middle of multi-line string which contains an ifdef or ifndef #225

Closed
njnobles opened this issue Sep 25, 2024 · 17 comments

Comments

@njnobles
Copy link
Contributor

Hi, I'm running into an issue where a line directive is being emitted in the middle of a multi-line string if there's an ifdef or ifndef directive in the middle of the string. The line directive was not emitted in Boost 1.77.0.
I suspect this issue was introduced in this PR: #140
After reverting that change, I no longer see line directives emitted in the middle of the multi-line string.
Here's an example to illustrate the issue:
test.cpp

"1
#ifdef A
    2a"
#else
    2b
    3"
#endif

The following are using default_preprocessing_hooks.
Boost 1.77.0 preprocessed output:

"1
    2 b
    3"

Boost 1.86.0 preprocessed output:

"1
    #line 5 "t_2_032.cpp"
2 b
    3"

Boost 1.86.0 output is identical to Boost 1.77.0 output if I revert #140

And for comparison, here's output from Visual Studio's cl.exe preprocessor:

#line 1 "test.cpp"
"1



    2b
    3"
#line 8 "test.cpp"
@jefftrull
Copy link
Collaborator

I'm wondering why the tokenizer doesn't catch that one... Thanks for the report!

@njnobles
Copy link
Contributor Author

To be fair, it is a strange case to run into. But my partners have a knack for abusing preprocessor functionality :)

I've created #226 which reverts the code change from #140. I left the new test from 140 in however and, interestingly, it seems to still pass in my local environment.

@jefftrull
Copy link
Collaborator

Unfortunately changes to test code aren't properly propagated in the build system - try touching test.cfg to rerun.

@njnobles
Copy link
Contributor Author

I did notice that occurring and figured out that trick. This particular test is run as default_hooks.exe. I was able to trigger it to rerun by removing the default_hooks.run and default_hooks.test files. I also ran it directly to make sure it was actually running. Seems like it passed in the PR automation as well.

@jefftrull
Copy link
Collaborator

jefftrull commented Sep 26, 2024

I was able to identify the commit that separately (from #140) fixed the problem it was addressing - it's #170. Now to understand why...

@jefftrull
Copy link
Collaborator

Interestingly it seems that multi-line string literals are not allowed in C - at least not like that. You can have raw string literals with a different syntax. Not that Wave produces an error or anything... but gcc does, in my experiments. What's your use case?

@jefftrull
Copy link
Collaborator

Some additional insight into what is happening here: in #140 I made handling of #ifdef and #ifndef consistent with if - and that's actually the issue - #if had this behavior (inserting line directives inside strings) before, and still has it, though it's not reflected in your test case at the moment.

The big question for me now is what the correct behavior should be given that multi-line string constants as you have them may be errors. Can you use raw string literals instead? Or string concatenation? Should Wave actually throw an exception?

@hkaiser
Copy link
Collaborator

hkaiser commented Sep 26, 2024

Should Wave actually throw an exception?

I'm not a language lawyer, however I'd say this is ill-formed code. https://eel.is/c++draft/lex.pptoken#2 describes what preprocessing tokens are and string literals are explicitly listed as one. OTOH, a string literal is not allowed to contain newline characters. Those can consist of "any member of the translation character set except the U+0022 quotation mark, U+005c reverse solidus, or new-line character" (see https://eel.is/c++draft/lex.string#nt:basic-s-char).

@njnobles
Copy link
Contributor Author

njnobles commented Sep 26, 2024

One of teammates recently posted an issue and fix so I'll borrow his language for our general setup:
"I'm attempting to use wave as a pre-processor for a C/PASCAL like DSL. We are trying to migrate away from a 20 year old MSVC pre-processor."

Essentially, we have a DSL that we've created and we have several hundred repos coded against that DSL. Our clients have learned to abuse the preprocessor a bit to work around some of the sort comings of our DSL. We do have some ability to fix our clients' code, but tracking down where and how often they use the preprocessor like this isn't exactly easy.

We do a test build of a handful of our clients' repos before we rollout changes to them and we caught at least one instance of them using the preprocessor like this.

Fixing this issue would certainly make my life easier 😅 But I can also understand that a 20 year old MSVC preprocessor might not have been the most spec-compliant thing 😆

@jefftrull
Copy link
Collaborator

Honestly sounds like "throw exception" might be a better action for Wave to do here - we do try to be standard compliant, and to match gcc (especially when they are standard compliant). Furthermore it would make it easier for you to track down "where and how often they use the preprocessor like this". String concatenation should give users what they need here despite the extra work. And finally, what other bugs might Wave have w.r.t. multi-line string literals? We'd be buying into a maintenance headache.

@hkaiser
Copy link
Collaborator

hkaiser commented Sep 26, 2024

I'd suggest converting your multi-line strings either into raw string literals or to convert the newlines into '\n'. Generally macro expansion is allowed to construct preprocessing tokens, those should be valid, though.

@njnobles
Copy link
Contributor Author

I think that's fair and reasonable. I would agree it's better to stay standard compliant.
The one instance of the issue in my clients' code should be fairly easy to workaround. My bigger concern are the instances we haven't found yet, but that's not a problem for you guys to worry about 😄

@jefftrull
Copy link
Collaborator

@hkaiser should we add a lexing_exception for this case? AFAICT we happily accept multiline strings, at the moment.

@hkaiser
Copy link
Collaborator

hkaiser commented Sep 28, 2024

@hkaiser should we add a lexing_exception for this case? AFAICT we happily accept multiline strings, at the moment.

I tend towards reporting this as a problem, if possible. This would have to be reported after macro expansion, however. So we would have to go back and reparse if we want to report things.

@jefftrull
Copy link
Collaborator

I feel like the problem is upstream of macro expansion, namely, that we accept multi-line strings at all. On top of that, of course, is the fact that we are currently interpreting directives inside such strings. But if the bad "string literals" themselves were errors we would never get to the expansion and generating the line directives. i.e.:

"1
    2a"

and

"1
#ifdef A
    2a"

are equal errors and should produce the same kind of "unterminated string literal" exception on the first line.

In short, I think we have a pure lexing issue. We lex the first case like this:

<UnknownToken>   (#34 ) at /tmp/bar.cpp (  1/ 1): >"<
INTLIT           (#384) at /tmp/bar.cpp (  1/ 2): >1<
NEWLINE          (#394) at /tmp/bar.cpp (  1/ 3): >\n<
SPACE            (#392) at /tmp/bar.cpp (  2/ 1): >    <
INTLIT           (#384) at /tmp/bar.cpp (  2/ 5): >2<
IDENTIFIER       (#380) at /tmp/bar.cpp (  2/ 6): >a<
<UnknownToken>   (#34 ) at /tmp/bar.cpp (  2/ 7): >"<
NEWLINE          (#394) at /tmp/bar.cpp (  2/ 8): >\n<
EOF              (#401) at /tmp/bar.cpp (  3/ 1): ><

and the second like this:

<UnknownToken>   (#34 ) at /tmp/bar.cpp (  1/ 1): >"<
INTLIT           (#384) at /tmp/bar.cpp (  1/ 2): >1<
NEWLINE          (#394) at /tmp/bar.cpp (  1/ 3): >\n<
PP_IFDEF         (#370) at /tmp/bar.cpp (  2/ 1): >#ifdef<
SPACE            (#392) at /tmp/bar.cpp (  2/ 7): > <
IDENTIFIER       (#380) at /tmp/bar.cpp (  2/ 8): >A<
NEWLINE          (#394) at /tmp/bar.cpp (  2/ 9): >\n<
SPACE            (#392) at /tmp/bar.cpp (  3/ 1): >    <
INTLIT           (#384) at /tmp/bar.cpp (  3/ 5): >2<
IDENTIFIER       (#380) at /tmp/bar.cpp (  3/ 6): >a<
<UnknownToken>   (#34 ) at /tmp/bar.cpp (  3/ 7): >"<
NEWLINE          (#394) at /tmp/bar.cpp (  3/ 8): >\n<
EOF              (#401) at /tmp/bar.cpp (  4/ 1): ><

but a proper single-line string with the same contents:

"1\n#ifdef A\n    2a"

is just:

STRINGLIT        (#390) at /tmp/bar.cpp (  1/ 1): >"1\n#ifdef A\n    2a"<
NEWLINE          (#394) at /tmp/bar.cpp (  1/22): >\n<
EOF              (#401) at /tmp/bar.cpp (  2/ 1): ><

So the lexer is getting lost right away.

@hkaiser
Copy link
Collaborator

hkaiser commented Oct 2, 2024

@jefftrull I assumed it is possible to constuct a string from macros piece by piece, i.e.

#define QUOTE "
QUOTE abc QUOTE

but other compilers don't accept that. So I think you're right, string constants must be valid before macro expansion.

@jefftrull
Copy link
Collaborator

Closing this and creating a different issue about how we accept unterminated string literals as multiple tokens

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants