Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(js_formatter): explore embedded language formatting #3228

Closed
wants to merge 15 commits into from

Conversation

ah-yu
Copy link
Contributor

@ah-yu ah-yu commented Jun 17, 2024

Summary

Test Plan

@github-actions github-actions bot added A-CLI Area: CLI A-Project Area: project A-Formatter Area: formatter A-Tooling Area: internal tools L-JavaScript Language: JavaScript and super languages L-CSS Language: CSS L-JSON Language: JSON and super languages labels Jun 17, 2024
@@ -147,15 +147,16 @@ const StyledComponent1 = styled.div`
}
`;

const StyledComponent2 = styled.div`
${anInterpolation}
// TODO: reformat issue
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

codspeed-hq bot commented Jun 17, 2024

CodSpeed Performance Report

Merging #3228 will not alter performance

Comparing ah-yu:embed_fmt (b10c615) with main (519e316)

Summary

✅ 90 untouched benchmarks

@denbezrukov
Copy link
Contributor

😻😻😻

@ematipico
Copy link
Member

Wow @ah-yu!! You're the man!!! 👑

Let us know if you need any help

@Sec-ant
Copy link
Member

Sec-ant commented Jun 17, 2024

Hello @ah-yu,

Great job on taking on this task!

I've had some experience with embedded language formatting in Prettier, so please allow me to share a few thoughts.

Personally, I'm not a fan of the --embedded-language-formatting option in Prettier. When I first learned about it, I couldn't find any documentation specifying which blocks are considered "embedded" and which languages are supported. It took me a long time digging through its source code to understand the supported languages and the way they are embedded. This made me realize that the auto option doesn't provide developers with enough information to have a clear expectation of how the code will be formatted.

Moreover, due to this lack of transparency and the fact that there are only auto and off choices, this option's functionality is quite limited. In fact, there are several popular issues on Prettier's GitHub where users have suggested enhancing the configurability of this option (see prettier/prettier#4424 and prettier/prettier#5588).

Given this background, and motivated by the need to format XML embedded within JavaScript back then, I developed a Prettier plugin, prettier-plugin-embed, to address this issue. Allow me to quote from its README:

What?

This Prettier plugin (namely prettier-plugin-embed) provides a configurable solution for formatting embedded languages in the form of template literals within JavaScript or TypeScript files.

Why?

Prettier has introduced an option for formatting embedded languages, named embedded-language-formatting. However, this option only offers two modes: auto and off. This limits its functionality, as it does not permit individual formatting adjustments for specific languages. Additionally, it lacks support for customizing which languages require formatting, as well as specifying block comments or tags for embedded language identification. These constraints hinder the feature's overall usability. For in-depth discussions on this matter, see the GitHub issues: prettier/prettier#4424 and prettier/prettier#5588.

In simple terms, this plugin explicitly configures which parts of the code are considered embedded languages. In my plugin, I specified that only tagged templates like css`div { color: red; }` and template literals with preceding block comments like /* css */ `div { color: red; }` are treated as embedded language candidates. The former is a common pattern in css-in-js libs and sql-in-js libs, while the latter is flexible enough to be used for language annotation. Additionally, this plugin provides rich configurable options for users to specify which tag names or block comments correspond to which languages. For details, see https://github.com/Sec-ant/prettier-plugin-embed?tab=readme-ov-file#language-specific-options.

This plugin relies on Prettier's extensive language support plugins to successfully provide embedded code formatting for many languages. The most popular seems to be the SQL family of languages, which is also a highly requested feature by Prettier users: prettier/prettier#12139. Regarding this, I think @karlhorky can provide more insights from a developer's perspective.

Regarding your work on Biome, I see that you have referenced Prettier's approach by introducing the --embedded-language-formatting=auto|off option and hardcoding which nodes are considered embedded CSS code for initial testing. While I don't support this approach, I'm not asking you to start over. Rather, I want to provide you and the other developers with enough background knowledge to help us analyze the path we need to take more comprehensively. After all, adding a new feature is easy, but changing its behavior later is much more difficult.

One more thing, I noticed in your code comments for handling template strings, you mentioned replacing interpolation expressions with placeholders, then parsing, formatting, and finally replacing them back with the original expressions. This method is indeed what Prettier and prettier-plugin-embed use. However, I must point out that different embedded languages may require different forms of placeholders to be recognized as valid tokens by their parsers due to syntax and parser constraints. This can also be affected by the position of the interpolation. IMHO, a more reasonable approach should work like gritql's definition and handling of metavariables: https://github.com/getgrit/gritql/blob/main/CONTRIBUTING.md#language-grammars. We can treat these interpolation expressions as special tokens called metavariables and add them to the language's grammar. This way, all interpolation expressions in various positions we defined in our grammar will be valid and parsable tokens. I mentioned this in a reply to @arendjr before. Otherwise, simply replacing them with a magic string could cause issues when we want to support more languages or even just one language with multiple interpolation expressions in different places.

In all, these are my personal insights. Although I have limited knowledge of parsers and formatters, based on my experience developing the prettier-plugin-embed plugin, I am willing to offer my thoughts and suggestions on this matter. I hope Biome will carefully and thoroughly consider the implementation of this feature.

Thanks! ❤️

@karlhorky
Copy link

karlhorky commented Jun 17, 2024

Oh nice, didn't know about this PR! Cool to see more work being done in the embedded language formatting space 🙌

hardcoding which nodes are considered embedded CSS code

While I don't support this approach

Ahh yeah I would also not recommend the hardcoded / Prettier approach, it represents a pretty big downside to how Prettier is currently built.

Would be great to avoid this pitfall and offer extensibility from the start, if possible!

Regarding this, I think @karlhorky can provide more insights from a developer's perspective.

Yes, we and our students use SQL-in-JS through prettier-plugin-embed heavily, it's an essential feature for codebases which use SQL-in-JS, which is a growing use case. Lack of this feature would block us from considering migrating to Biome or any other formatter alternative.

I'm guessing there's also other popular embedded languages where the community would benefit from their inclusion, such as XML-in-JS, YAML-in-JS, Markdown-in-JS, and other supported by prettier-plugin-embed.

If Biome doesn't want to "own" and maintain the integrations of these embedded language formatters, then an alternative would be to make it easy (easier than Prettier) for community members to integrate with custom embedded languages - potentially with an approach similar to Prettier's "Support library-specific plugin" idea (which has not yet been implemented).

@Conaclos
Copy link
Member

Conaclos commented Jun 17, 2024

Oh this is a great start!

@Sec-ant

emplate literals with preceding block comments

Is this commonly used?

I definitively like the idea of configuring which template tag is associated to which language. We could have a default mapping like you did in your plugin. Have you some proposal to support this in the Biome config file?

MHO, a more reasonable approach should work like gritql's definition and handling of metavariables

This kind of variable can appear everywhere in the embedded code, it can even be part of a keyword.
How do you handle them at the CST level? Bogus nodes?

Even placeholder seems kind of erroneous. How Prettier is handling it?

@Sec-ant
Copy link
Member

Sec-ant commented Jun 17, 2024

@Conaclos

Template literals with preceding block comments

Is this commonly used?

The initiative to support this syntax is not based on how commonly this pattern is used, but rather to provide a versatile enough approach for users to annotate the embedded code in a non-intrusive way. Only supporting tag functions will force the users to introduce runtime functions simply for dev-time formatting.

Also, this VSCode extension uses block comments to add syntax highlighting support for template literals: https://marketplace.visualstudio.com/items?itemName=bierner.comment-tagged-templates

I definitively like the idea of configuring which template tag is associated to which language. We could have a default mapping like you did in your plugin. Have you some proposal to support this in the Biome config file?

I don't have one now because I see we haven't talked about it elsewhere. But I can do some research and make a proposal later. I'm not very available in the coming days.

This kind of variable can appear everywhere in the embedded code, it can even be part of a keyword.
How do you handle them at the CST level? Bogus nodes?

To be honest, I think we should limit our embedded language formatting functionality to bail out on interpolations smaller than token or node level. I would take the approach like what gritql (or astgrep) adopts to support metavariables. It seems just not reasonable to me that one would use `con${"st"} a = 42;` in their code and expect the formatter to format it properly. (But I must admit, bogus nodes are definitely worth considering, this will enable us to not fully bail out, which makes it a very competitive feature)

Even placeholder seems kind of erroneous. How Prettier is handling it?

Prettier simply uses carefully-constructed special identifier names as placeholders, or doesn't support interpolation for certain embedded languages:

@arendjr
Copy link
Contributor

arendjr commented Jun 17, 2024

One more thing, I noticed in your code comments for handling template strings, you mentioned replacing interpolation expressions with placeholders, then parsing, formatting, and finally replacing them back with the original expressions. This method is indeed what Prettier and prettier-plugin-embed use. However, I must point out that different embedded languages may require different forms of placeholders to be recognized as valid tokens by their parsers due to syntax and parser constraints. This can also be affected by the position of the interpolation. IMHO, a more reasonable approach should work like gritql's definition and handling of metavariables: https://github.com/getgrit/gritql/blob/main/CONTRIBUTING.md#language-grammars. We can treat these interpolation expressions as special tokens called metavariables and add them to the language's grammar. This way, all interpolation expressions in various positions we defined in our grammar will be valid and parsable tokens. I mentioned this in a reply to @arendjr before. Otherwise, simply replacing them with a magic string could cause issues when we want to support more languages or even just one language with multiple interpolation expressions in different places.

I can mostly second this, and indeed I expect we'll extend our parsers to support metavariables (the token kind is already there), so it would make sense if we can reuse the same ability here. But if we accept that metavariables are the placeholders, doesn't that closely match the approach already being taken here? Maybe I forgot something since we last discussed it 😅

@Sec-ant
Copy link
Member

Sec-ant commented Jun 17, 2024

But if we accept that metavariables are the placeholders, doesn't that closely match the approach already being taken here? Maybe I forgot something since we last discussed it 😅

Sorry I wasn't being clear, I meant to express that we should use placeholders defined in the grammar so the parser can correctly parse that placeholder without ambiguity or choking. Currently, the placeholders used in Prettier, or my plugin prettier-plugin-embed, and also this PR, can only be considered special chosen identifier names, which are way less powerful than metavariables that are defined in the grammar.

@ah-yu
Copy link
Contributor Author

ah-yu commented Jun 18, 2024

@Sec-ant Thank you for your feedback, really appreciate it!

Regarding your work on Biome, I see that you have referenced Prettier's approach by introducing the --embedded-language-formatting=auto|off option and hardcoding which nodes are considered embedded CSS code for initial testing

Personally I don't like the Prettier's approach, the feature shouldn't be tightly bound to some specific css-in-js libraries. And I love the approach your plugin did, using some magic comments to inform the formatter that there is a language embedded is a good idea! Looking forward to your proposal. No hurry, take your time.

We can treat these interpolation expressions as special tokens called metavariables and add them to the language's grammar.

Not sure if introducing extra complexity to the parser to serve the formatter is a good approach, as it seems like a form of coupling to me

I have never used a css-in-js library and have no experience with embedded formatting, so I am not aware of the pain points. You definitely have a better understanding than I do. If you have any ideas you want to validate, feel free to push new commits to the branch.

@Sec-ant
Copy link
Member

Sec-ant commented Jun 18, 2024

Not sure if introducing extra complexity to the parser to serve the formatter is a good approach, as it seems like a form of coupling to me

This is to provide a general and reliable way to address the template string interpolation. For example, you use biome-placeholder-0 in your code to transform this

const style = css`div {
  background: ${myColor};
}`;

into this

div {
  background: biome-placeholder-0;
}

which is still valid CSS syntax after transformation.

But if the user write this code

const style = css`div:nth-child(${myNth}) {
  background: green;
}`;

then after transformation it will become:

div:nth-child(biome-placeholder-0) {
  background: green;
}

which is not syntax correct.

Additionally, if we want to support more other languages for embedded language formatting in the future, a single solution like biome-placeholder-0 will also not apply for all of them:

const script  = js`const a = ${myValue};`
const a = biome - placeholder - 0;

We can of course use differet placeholders for different languages, that's also what Prettier and my plugin does. But it still leaves us with the first problem.

More over, from our infrastructure, we have the ability to implement embedded language linting as well. While some of the placeholders may pass the parser, they may fail some linter rules, and the diagnostics will be confusing.

So I think augmenting the grammar is the only proper way to solve this problem. And we already start implementing some kind of grammar augmentation for our plugin feature.

@ah-yu
Copy link
Contributor Author

ah-yu commented Jun 18, 2024

So I think augmenting the grammar is the only proper way to solve this problem

Indeed!

@github-actions github-actions bot added the A-Parser Area: parser label Jun 26, 2024
@ah-yu
Copy link
Contributor Author

ah-yu commented Jun 26, 2024

@Sec-ant Wonder if b10c615 is close to the implementation you proposed

@Sec-ant
Copy link
Member

Sec-ant commented Jun 26, 2024

@Sec-ant Wonder if b10c615 is close to the implementation you proposed

Yes that's basically the idea 👍.

One thing I'm not sure though is the safety and future-compatibility of the CssGritMetavariable node. If we decide to support other CSS dialects, $ will have sepcial meanings in those languages, which may call for a different handling in the grammar and conflict with the metavariable node. But I'm probably overthinking it because at least right now we only support standard CSS syntax. And from gritql and ast-grep, this $... pattern seems to work fine in most of the languages without any special escaping treatment.

@ah-yu
Copy link
Contributor Author

ah-yu commented Jun 26, 2024

$ will have sepcial meanings in those languages

Maybe Grit needs to introduce new syntax for querying in Sass.

@arendjr
Copy link
Contributor

arendjr commented Jun 26, 2024

Oh, this is great, nice update @ah-yu ! This will indeed end up helping our CSS support for GritQL as well.

Good to know: Grit actually does a clever trick under the hood, where it replaces the $ metavariable indicators with μ. The actual character they use is even language-specific to avoid syntax clashes but for almost all languages it’s μ. If you replace the character I think it should work here as well.

string_interpolations.push(element.clone());
element_iter.next();
}
let placeholder = std::format!("biome-placeholder-{}", index);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and another one, - is not allowed in a gritql metavariable name: https://docs.grit.io/language/patterns#metavariables, it uses $lowercase_snake_case

Comment on lines +306 to +309
// if there are any placeholders left, we treat it is a error and format the template as normal
if !placeholder_map.is_empty() {
return Err(FormatError::SyntaxError);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would envision use cases like the one below will be reasonable and may be common ones:

const index = 1;

const code = css`
div {
  background-image: url("./assets/images/img-${index}.png");
}`;

This is a sub-node level interpolation, I'm not sure whether the current replace_placeholder_with_template_element will recover such interpolations properly? I see currently we bail out if we don't restore all the interpolations from the placeholders, and format this as a normal js template string, which is good. But I guess we can keep this in mind and consider possible solutions? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether the current replace_placeholder_with_template_element will recover such interpolations properly?

Unfortunately no.

Currently, this method does an exact text match with placeholders. However, the code snippet you provided will produce a format element like text("./assets/images/img-$placeholder.png"). Maybe we can implement partial matching with placeholders and split the element into multiple segments, something like [text("./assets/images/img-"), expr, text(".png")].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can implement partial matching with placeholders and split the element into multiple segments, something like [text("./assets/images/img-"), expr, text(".png")]

👍 Yes, I think we don't need an exact match.

@ah-yu ah-yu mentioned this pull request Jul 2, 2024
3 tasks
@ah-yu ah-yu closed this Jul 2, 2024
@ah-yu ah-yu deleted the embed_fmt branch July 2, 2024 02:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-CLI Area: CLI A-Formatter Area: formatter A-Parser Area: parser A-Project Area: project A-Tooling Area: internal tools L-CSS Language: CSS L-JavaScript Language: JavaScript and super languages L-JSON Language: JSON and super languages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants