This issue was moved to a discussion.

[Proposal]: Embedded Language Indicators for raw string literals #6247

Closed
1 of 4 tasks
333fred opened this issue Jun 27, 2022 · 48 comments
Comments

@333fred
Member

333fred commented Jun 27, 2022

Embedded Language Indicators for raw string literals

  • Proposed
  • Prototype: Not Started
  • Implementation: Not Started
  • Specification: Not Started

Summary

When we were designing raw string literals, we intentionally left the door open for putting a language indicator at the end of the opening """ for the multi-line form. This proposal adds the support to do that.

Motivation

In the BCL, we added StringSyntaxAttribute for applying to parameters, which allows parameters to indicate the strings passed to them contain some form of embedded language, which is then used for syntax highlighting. However, this only works for strings passed directly to the parameter. For strings first stored in a variable, the only solution is a // lang = x comment. This means that, if the IDE wants to extract a multi-line raw string literal, it cannot neatly preserve the highlighting that was used. This syntax form is intended to help bridge that gap.
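A minimal sketch of the gap described above, using only features that exist today. The class and method names (HighlightGapDemo, Matches, Direct, ViaVariable) are hypothetical, and whether a given editor honors the comment form is an assumption about tooling, not something the compiler enforces.

```csharp
using System.Diagnostics.CodeAnalysis;
using System.Text.RegularExpressions;

public static class HighlightGapDemo
{
    // Hypothetical helper: the attribute tells tooling that `pattern` holds a regex.
    public static bool Matches(
        string input,
        [StringSyntax(StringSyntaxAttribute.Regex)] string pattern)
        => Regex.IsMatch(input, pattern);

    // The literal is passed straight to the attributed parameter,
    // so tooling can tell it is a regex.
    public static bool Direct()
        => Matches("2022-06-27", @"^\d{4}-\d{2}-\d{2}$");

    public static bool ViaVariable()
    {
        // The literal is stored in a variable first, so the attribute on the
        // parameter no longer applies; today a comment is the only hint.
        // lang=regex
        var pattern = """
            ^\d{4}-\d{2}-\d{2}$
            """;
        return Matches("2022-06-27", pattern);
    }
}
```

Both methods behave identically at run time; the difference is only in what an editor can infer about the string.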

Detailed design

The existing raw string literal proposal has the following multi-line grammar:

multi_line_raw_string_literal
    : raw_string_literal_delimiter whitespace* new_line (raw_content | new_line)* new_line whitespace* raw_string_literal_delimiter
    ;

This is updated to the following:

multi_line_raw_string_literal
    : raw_string_literal_delimiter identifier? whitespace* new_line (raw_content | new_line)* new_line whitespace* raw_string_literal_delimiter
    ;

Where the identifier? token is added right after the delimiter.
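To make the updated rule concrete, here is a rough sketch of a recognizer for just the opening line of a multi-line raw string literal. It is not how the compiler lexes anything; OpeningLineSketch and TryRead are hypothetical names, and the identifier pattern is a deliberate simplification of real C# identifiers (no @ prefix, no Unicode escapes).

```csharp
#nullable enable
using System.Text.RegularExpressions;

public static class OpeningLineSketch
{
    // delimiter:  three or more double quotes
    // indicator:  optional simplified identifier directly after the delimiter
    // then:       optional spaces or tabs up to the end of the line
    private static readonly Regex OpeningLine = new Regex(
        @"^(?<delimiter>""{3,})(?<indicator>[\p{L}_][\p{L}\p{Nd}_]*)?[ \t]*$");

    // Returns true when the line is a valid opening line under the sketch;
    // `indicator` is null when no indicator is present.
    public static bool TryRead(string openingLine, out string? indicator)
    {
        Match match = OpeningLine.Match(openingLine);
        indicator = match.Success && match.Groups["indicator"].Success
            ? match.Groups["indicator"].Value
            : null;
        return match.Success;
    }
}
```

Under this sketch an opening line of `"""sql` yields the indicator sql, a bare `"""` is accepted with no indicator, and `"""c#` is rejected because # cannot appear in an identifier.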

Drawbacks

This form is not equally applicable to all string types, so it would only apply to multi-line raw string literals. Ideas on other forms that could be more broadly applied would be useful: maybe putting the identifier after the closing quote could work?

Alternatives

Unresolved questions

Design meetings

@CyrusNajmabadi
Member

CyrusNajmabadi commented Jun 27, 2022

I would not limit this to identifier, as that would not allow things like C# or even things like CS-Statement, etc. It feels like it would be something akin to raw_string_literal_delimiter not-whitespace-not-new-line+ whitespace* new_line
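A rough sketch of the looser rule suggested here, where the indicator is any run of non-whitespace characters after the delimiter. LooseIndicatorSketch and Read are hypothetical names, and for brevity the sketch returns null both when no indicator is present and when the line is malformed.

```csharp
#nullable enable

public static class LooseIndicatorSketch
{
    // Reads the indicator from the opening line of a multi-line raw string
    // literal, accepting any non-whitespace run after the quote delimiter.
    public static string? Read(string openingLine)
    {
        // Count the leading double quotes that make up the delimiter.
        int quotes = 0;
        while (quotes < openingLine.Length && openingLine[quotes] == '"')
        {
            quotes++;
        }
        if (quotes < 3)
        {
            return null; // not a raw string literal delimiter
        }

        // Trailing spaces and tabs before the newline are allowed.
        string rest = openingLine.Substring(quotes).TrimEnd(' ', '\t');
        if (rest.Length == 0)
        {
            return null; // no indicator
        }

        // Whitespace inside the indicator is not allowed.
        foreach (char c in rest)
        {
            if (char.IsWhiteSpace(c))
            {
                return null;
            }
        }
        return rest;
    }
}
```

With this rule, `"""c#` and `"""CS-Statement` both produce an indicator, which an identifier-only grammar would reject.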

@HaloFour
Contributor

I would not limit this to identifier, as that would not allow things like C# or even things like CS-Statement, etc.

Not to argue either way, but that doesn't seem to limit markdown.

@CyrusNajmabadi
Member

you can use ```c# in markdown. For example:

class Foo { }

@alrz
Member

alrz commented Jun 28, 2022

Markdown also accepts file extensions which I find easier to type.. but either way backtick is not going to be (re)considered for string literals, right?

@CyrusNajmabadi
Member

@alrz I would only support backticks if we were actually adding support for markdown (something I do want).

@vladd

vladd commented Jun 28, 2022

Why limit it to only language indicators? It could be anything, e. g. locale or color or any other editor hint.

@333fred
Member Author

333fred commented Jun 28, 2022

Sure, we won't be stopping whatever you want to put there. However, my intention with this proposal is that editors will use it to drive interior language highlighting.

@CyrusNajmabadi
Member

It's not limited to only language indicators. It's just that that's a primary consumption case.

e. g. locale or color or any other editor hint.

These are also 'language indicators' :)

@IS4Code

IS4Code commented Jul 7, 2022

Why not this?

return """ // lang = cs
class Foo { }
""";

A single-line comment would fit after """, after all. Having a Markdown-style identifier there is neat, but I would be confused if I saw it somewhere, thinking that it would somehow affect the type of the string. A comment conveys the meaning well, and it keeps the existing format. Unless you actually want to be able to programmatically extract the language information...

@CyrusNajmabadi
Member

Why not this?

Primarily verbosity. It seems esp. excessive given how markdown is commonly used to write ```c#.

A comment conveys the meaning well

If you prefer that, that's already supported. You can do both:

// lang=c#
return """
class Foo { }
""";

Or

return /* lang=c# */ """
class Foo { }
""";

Given that, we don't need an interior-form of this comment. But having a simple interior form that is much less verbose than the comment form would be nice.

@glen-84

glen-84 commented Jul 8, 2022

maybe putting the identifier after the closing quote could work

Something like this?

var example = """
    SELECT * FROM table
    """sql;

var example = """SELECT * FROM table"""sql;

I prefer that TBH. It's low-importance metadata and is more "out of the way" when it's appended. It's also "outside of the string" this way, like a tag.

Is the above technically possible? Is it possible to add whitespace before the "tag"?

var example = """
    SELECT * FROM table
    """ sql;

var example = """SELECT * FROM table""" sql;

It's cleaner/less "squashed" that way.

@CyrusNajmabadi
Member

CyrusNajmabadi commented Jul 8, 2022

@glen-84 yes, those are potential alternatives we can consider.

However, it is unlikely, as "text on outside" already has meaning today and actually affects the semantics of the string. For example, """X"""u8 means "this is a utf8 string". The point of "text on inside" is that it has no meaning to the language. It's effectively trivia used for other tools to decide what to do.
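For readers unfamiliar with the existing suffix, a small sketch of how u8 already changes what the literal means in C# 11. Utf8SuffixDemo and its members are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class Utf8SuffixDemo
{
    // With the u8 suffix the literal is UTF-8 bytes, not a string.
    public static int ByteCount()
    {
        ReadOnlySpan<byte> bytes = """X"""u8;
        return bytes.Length;
    }

    // Decoding the bytes gives back the original text.
    public static string RoundTrip()
        => Encoding.UTF8.GetString("""SELECT 1"""u8);
}
```

Because text after the closing quotes can change the literal's type like this, a tooling-only tag in the same position would have to be distinguished from a semantic suffix.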

@glen-84

glen-84 commented Jul 8, 2022

I see. In a way that could also be seen as a "tag" or "metadata", so it could make sense to extend that in a more general sense, in a way that clearly indicates its user-defined nature.

// Would this be confused with a preprocessor directive?
var example = """SELECT * FROM table"""#sql;

var example = """SELECT * FROM table"""u8#sql;

This could apply equally to regular strings.

"Tagged string literals"

@CyrusNajmabadi
Member

@glen-84 We'll keep those alternatives in mind when designing this. Thanks!

@HaloFour
Contributor

HaloFour commented Jul 8, 2022

Not to bikeshed too much, but I'm a bit torn. I do like the markdown-like approach of having the tag on the opening line of the raw literal. I think it's easier to see what the dialect is without having to find the end of the literal, plus it's familiar. But that poses a problem with single-line raw literals. Having the tag at the end for a single-line literal looks nicer, but collides with the decision to use a suffix to denote UTF-8 literals. The "tag" approach does solve that, but IMO isn't very attractive. Maybe a prefix?

var example1 = sql"""SELECT * FROM table;""";
var example2 = sql"""
    SELECT *
    FROM table;
""";

And while I'm partial to the feature it does feel a little weird that the syntax would only exist to facilitate tooling. Almost feels like something better served through source attributes.

@CyrusNajmabadi
Member

CyrusNajmabadi commented Jul 8, 2022

Outside of the string has a big problem in terms of detection and allowable syntax. Within the string, we can literally allow anything, e.g. c# (where # could easily be a problem outside of the string).

@iam3yal
Contributor

iam3yal commented Jul 11, 2022

I'm not sure whether something like the following would work but maybe quote the hint like this:

var example1 = """sql"SELECT * FROM table;"""";
var example2 = """sql"
    SELECT *
    FROM table;
"""";

If not then maybe quoting it like this:

var example1 = """"sql"SELECT * FROM table;""";
var example2 = """"sql"
    SELECT *
    FROM table;
""";

@jnm2
Contributor

jnm2 commented Jul 11, 2022

@eyalalonn That's hard to parse visually at least. The entire meaning changes if you add one more " at the end of the single-line form, which may not be visible without horizontal scrolling as you're reading the code.

For the multiline form, I prefer not having those extra quotes.

@iam3yal
Contributor

iam3yal commented Jul 11, 2022

@jnm2 I agree, so maybe, as you said, we can just do without the quoting when the multi-line form is used and have the quoting only when it's needed, like this:

var example1 = """sql"SELECT * FROM table;"""";
var example2 = """sql
    SELECT *
    FROM table;
""";

@glen-84

glen-84 commented Jul 11, 2022

6 quotation marks per string is already more than enough. 😅

@iam3yal
Contributor

iam3yal commented Jul 11, 2022

@glen-84 The more quotes you add the more power you have. 😄

@FaustVX

FaustVX commented Jul 11, 2022

@eyalalonn
This can't work
Example:

var s = """Hello"@eyalalonn""""; // your proposal will just write: @eyalalonn (but with a language named Hello 😄 )

@jnm2
Contributor

jnm2 commented Jul 11, 2022

@FaustVX Single-line raw string literals can't be used if the first or last character in the string is a double quote.
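A small sketch of the usual workaround for that restriction with today's raw string literals: when the content has to begin or end with a double quote, the multi-line form can be used instead. QuoteEdgeDemo and Quoted are hypothetical names.

```csharp
public static class QuoteEdgeDemo
{
    // The content starts and ends with a double quote, so it is written
    // on its own line between the delimiters rather than in single-line form.
    public static string Quoted() =>
        """
        "quoted"
        """;
}
```

The returned string is the eight characters "quoted", including both quote marks.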

@FaustVX

FaustVX commented Jul 11, 2022

@jnm2 Ok, I was just using @eyalalonn example

@jnm2
Contributor

jnm2 commented Jul 11, 2022

@FaustVX Given Eyal's suggested rules, we would expect it to write @eyalalonn with a language named Hello. Can you explain more on what you mean by "this can't work"?

@FaustVX

FaustVX commented Jul 11, 2022

I didn't know that a double quote at the end of the string doesn't compile.
So I thought his proposal would compile but produce something different from the raw string literal proposal.
That's what I meant by "This can't work".

@Randy-Buchholz

Whatever the format, I think the indicator should precede the content. In a case like reading from a stream, the indicator would allow redirecting the read process right away. Otherwise you would need to buffer the content until you found out what its context/language is.

@mitchdenny
Member

mitchdenny commented Nov 13, 2022

I'm curious about the benefits of this. In languages like JavaScript you can use this backtick syntax which allows you to have that string passed into a function that can perform some kind of operation on the data.

For example:

html`<p>I am some HTML</p>`

Is it being proposed that C# will allow this kind of scenario, or is the string purely for decoration purposes (and can't be interrogated at runtime)? Note I'm not necessarily advocating for this, as I find that you end up with dubious benefits over actually just passing the string into a method ;) e.g. htmlFunc("<p>I am some HTML</p>")

I guess I'm just wondering what problem is being solved?

edit: I guess the IDE could interpret them and provide an improved experience.

@CyrusNajmabadi
Member

@mitchdenny the op lists the motivations. :-)

@Korporal

Is the grammar for multi_line_raw_string_literal defined in some ANTLR file, or is the grammar represented using other tooling?

@CyrusNajmabadi
Member

@Korporal

It's defined here: https://github.com/dotnet/csharplang/blob/main/proposals/csharp-11.0/raw-string-literal.md

That's documenting the grammar; I was wondering if there is a formal, machine-readable definition file, like an ANTLR .g4 file or something.

Is there no tooling used to generate code from the grammar or facilitate development and experimentation of the grammar?

ANTLR has a repo with C# grammar files, but they only go up to C# 6.

@CyrusNajmabadi
Member

CyrusNajmabadi commented Dec 22, 2022

That is the formal definition. The spec (and then proposals) are the formal definition.

machine readable like an Antlr .g4 file or something.

We have a g4 file. But it has no intent or desire to be usable by tooling of any sort. You can see it here: https://github.com/dotnet/roslyn/blob/main/src/Compilers/CSharp/Portable/Generated/CSharp.Generated.g4

It would likely not have anything related to this construct as this construct is defined by the spec I just linked. (Note: I'm the author here, so I can answer any questions you have on this topic).

However, as per above, there is no desire or goal of having any sort of tooling generated off of these lexical specifications. We do generate our syntax model, but nothing more than that. As this is not part of our syntax model, it's not included.

@Korporal

Thanks. So what is the g4 generated from? Are there no machine-readable lexer/parser definitions that you guys consume as an input? Are you not able to, say, edit a grammar or lexer rule and quickly assess the impact, to see if it works or leads to grammatical inconsistencies or ambiguities?

@CyrusNajmabadi
Member

CyrusNajmabadi commented Dec 22, 2022

Thanks. So what is the g4 generated from?

There's a Syntax.xml file that contains our syntactic definitions.

Is there no machine readable lexer/parser definitions that you guys consume as an input?

No. There is not. :-) Such a thing isn't really useful for us. Our language is much more about what we want it to be, not about feeding into tools with their restrictions.

Are you not able like edit a grammar or lexer rule and quickly assess the impact?

No. :-) Because it's not really relevant for our compiler design. We don't want to limit ourselves to limitations often inherent in particular grammar models.

see if it works or leads to grammatical inconsistencies or ambiguities?

I don't know what a grammatical inconsistency is. Ambiguities are interesting, but easy to find. We don't really care though, as lots of stuff is ambiguous in our language and we're ok with that. :-)

@CyrusNajmabadi
Member

CyrusNajmabadi commented Dec 23, 2022

@Korporal if you have any questions about the grammar and/or syntax, def come to discord and we can totally help you out there. Thanks!

@KyouyamaKazusa0805

KyouyamaKazusa0805 commented Apr 3, 2023

I hope there will be some extra consideration for user-defined string syntax rules, so that, by defining such indicators, raw string literals can be highlighted for more than just the built-in kinds defined in StringSyntaxAttribute.
Today my only option is to write a VSIX extension project, even to highlight a really simple syntax rule. That is too heavyweight and very unfriendly to those of us who are new to Visual Studio extensions.

@CyrusNajmabadi
Member

@SunnieShine can you give an example?

@KyouyamaKazusa0805

KyouyamaKazusa0805 commented Apr 4, 2023

@CyrusNajmabadi Sorry for the incomplete description and the late reply.

For example:

using System;

string s =
    """C#
    // A test snippet for C# language
    Console.WriteLine("Hello, world!");
    """;

Console.WriteLine(s);

If we can add an indicator such as "C#", the string literal will be highlighted using C# syntax rules, right?

I want a mechanism so that indicators and their syntax highlighting rules are not limited to a few "commonly used" ones. Instead, we could use an indicator such as "abc". Although "abc" is not a built-in indicator, we could define it using Roslyn APIs (if available) to support syntax highlighting for strings marked with the indicator "abc".

string s =
    """abc
    A test string that can be highlighted as "abc" rule,
    which can be defined by us using Roslyn APIs.
    """;

Writing a Visual Studio extension to highlight this is too difficult for me because the implementation would be highly complex.

https://github.com/dotnet/roslyn/blob/db722874de8c49c326e463a71dab9c2a572aa64f/src/Features/Core/Portable/EmbeddedLanguages/RegularExpressions/LanguageServices/RegexClassifier.cs#L30

I found that Roslyn uses RegexClassifier to highlight regular expression pattern strings. To be honest, I am not familiar with IClassifier, so this would be quite difficult for me.

It would be good if the C# language or the Roslyn APIs (at the language level or compiler level) offered an equivalent but easier way to achieve this.

@CyrusNajmabadi
Member

@SunnieShine ... That's exactly what this proposal is :)

@KyouyamaKazusa0805

Ah... 🤣

Sorry. I may have missed the point of this proposal.

@KennethHoff

KennethHoff commented Apr 4, 2023

@SunnieShine Very understandable. The issue description is incredibly cryptic: the "detailed design" is practically impossible to understand for people who don't speak fluent csharp-language-standards, and it does not contain any examples of what it's proposing.

@HaloFour
Contributor

HaloFour commented Apr 4, 2023

I don't know that the proposal suggests anything other than allowing some kind of identifier to be embedded in the raw string literal. That identifier may be used to indicate a "type" of the raw string, which can be used to influence tooling like syntax highlighting, but nothing in the proposal suggests how that would work, or even that it would be a part of the Roslyn compiler aside from metadata.

@denis-taran

denis-taran commented Apr 24, 2023

This feature would be extremely beneficial if the IDE could offer syntax highlighting, autocomplete, and validation for various data types / languages such as XML, JSON, and SQL.

@Eli-Black-Work

Eli-Black-Work commented May 17, 2023

Another use case for this:

VS 2022 17.6 just released a new "spellchecker" feature that marks misspelled words in code. The spellchecker has exclusion lists that differ by language:

  • In C#, the spellchecker recognizes var as a valid word.
  • In SQL, the spellchecker recognizes ROWCOUNT as a valid word.

However, a problem arises when SQL code is embedded in C# string literals:

// Warning: "ROWCOUNT is misspelled"
string sqlQuery = "SELECT ROWCOUNT FROM users";

This proposal should solve that, as VS would know what language is in the literal string 🙂

// No spellcheck warning!
string sqlQuery = """
SELECT ROWCOUNT FROM users
"""sql;

(Although this would probably require that we be able to differentiate between different types of SQL, as different variants have different built-in functions and reserved words)

@dersia

dersia commented Aug 13, 2024

As far as I understand, there is a lot of support for the originally proposed syntax:

var s = """c#
        var t = 5;
        """;

But there is a concern about single-line raw string literals.
What if it were only supported in multi-line raw string literals? For single-line literals, devs could still use the "comment" option before or after the line. Since it does not add to the source but is only used by tooling, I think it's fine for it to be "out of view".

@sharpchen

sharpchen commented Aug 13, 2024

Will Roslyn provide semantic tokens for embedded syntax? Does it provide them for StringSyntaxAttribute today? This is not a language question, but I wish it worked out of the box so we could get semantic highlighting in all IDEs/editors that support semantic highlighting.

EDIT: we do have semantic tokens for StringSyntaxAttribute, but they are rather limited.

@colejohnson66

StringSyntaxAttribute is an attribute on a method parameter, not the string. But you can access it by examining calls to methods containing parameters attributed with it. For example, the constructor Regex(string) can be retrieved, then you examine callsites to that method. The strings in those callsites are regex strings.
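To illustrate that the attribute lives on the parameter rather than on the string, here is a sketch that reads it back with reflection. This swaps in runtime reflection for the callsite analysis described above, which tooling would do over source code instead; StringSyntaxLookup, RunQuery, and SyntaxOf are hypothetical names, and "sql" is an arbitrary custom syntax name.

```csharp
#nullable enable
using System.Diagnostics.CodeAnalysis;
using System.Reflection;

public static class StringSyntaxLookup
{
    // Hypothetical API whose parameter is marked as containing SQL.
    public static void RunQuery([StringSyntax("sql")] string query) { }

    // Returns the syntax name declared on the named parameter, or null
    // when the parameter is missing or carries no StringSyntaxAttribute.
    public static string? SyntaxOf(MethodInfo method, string parameterName)
    {
        foreach (ParameterInfo parameter in method.GetParameters())
        {
            if (parameter.Name == parameterName)
            {
                return parameter.GetCustomAttribute<StringSyntaxAttribute>()?.Syntax;
            }
        }
        return null;
    }
}
```

A call such as SyntaxOf(typeof(StringSyntaxLookup).GetMethod("RunQuery")!, "query") returns the declared syntax name; a string stored in a variable has no such parameter to inspect, which is the gap the proposal targets.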

@dotnet dotnet locked and limited conversation to collaborators Nov 19, 2024
@CyrusNajmabadi CyrusNajmabadi converted this issue into discussion #8653 Nov 19, 2024
