feat(js_formatter): explore embedded language formatting #3228

ah-yu · 2024-06-17T05:26:08Z

Summary

Test Plan

ah-yu · 2024-06-17T05:51:12Z

crates/biome_js_formatter/tests/specs/prettier/js/multiparser-css/styled-components.js

@@ -147,15 +147,16 @@ const StyledComponent1 = styled.div`
  }
 `;

-const StyledComponent2 = styled.div`
-  ${anInterpolation}
+// TODO: reformat issue


codspeed-hq · 2024-06-17T05:58:03Z

CodSpeed Performance Report

Merging #3228 will not alter performance

_{Comparing ah-yu:embed_fmt (b10c615) with main (519e316)}

Summary

✅ 90 untouched benchmarks

denbezrukov · 2024-06-17T08:17:57Z

😻😻😻

ematipico · 2024-06-17T08:27:25Z

Wow @ah-yu!! You're the man!!! 👑

Let us know if you need any help

Sec-ant · 2024-06-17T13:30:12Z

Hello @ah-yu,

Great job on taking on this task!

I've had some experience with embedded language formatting in Prettier, so please allow me to share a few thoughts.

Personally, I'm not a fan of the --embedded-language-formatting option in Prettier. When I first learned about it, I couldn't find any documentation specifying which blocks are considered "embedded" and which languages are supported. It took me a long time digging through its source code to understand the supported languages and the way they are embedded. This made me realize that the auto option doesn't provide developers with enough information to have a clear expectation of how the code will be formatted.

Moreover, due to this lack of transparency and the fact that there are only auto and off choices, this option's functionality is quite limited. In fact, there are several popular issues on Prettier's GitHub where users have suggested enhancing the configurability of this option (see prettier/prettier#4424 and prettier/prettier#5588).

Given this background, and motivated by the need to format XML embedded within JavaScript back then, I developed a Prettier plugin, prettier-plugin-embed, to address this issue. Allow me to quote from its README:

What?

This Prettier plugin (namely prettier-plugin-embed) provides a configurable solution for formatting embedded languages in the form of template literals within JavaScript or TypeScript files.

Why?

Prettier has introduced an option for formatting embedded languages, named embedded-language-formatting. However, this option only offers two modes: auto and off. This limits its functionality, as it does not permit individual formatting adjustments for specific languages. Additionally, it lacks support for customizing which languages require formatting, as well as specifying block comments or tags for embedded language identification. These constraints hinder the feature's overall usability. For in-depth discussions on this matter, see the GitHub issues: prettier/prettier#4424 and prettier/prettier#5588.

In simple terms, this plugin explicitly configures which parts of the code are considered embedded languages. In my plugin, I specified that only tagged templates like css`div { color: red; }` and template literals with preceding block comments like /* css */ `div { color: red; }` are treated as embedded language candidates. The former is a common pattern in css-in-js libs and sql-in-js libs, while the latter is flexible enough to be used for language annotation. Additionally, this plugin provides rich configurable options for users to specify which tag names or block comments correspond to which languages. For details, see https://github.com/Sec-ant/prettier-plugin-embed?tab=readme-ov-file#language-specific-options.

This plugin relies on Prettier's extensive language support plugins to successfully provide embedded code formatting for many languages. The most popular seems to be the SQL family of languages, which is also a highly requested feature by Prettier users: prettier/prettier#12139. Regarding this, I think @karlhorky can provide more insights from a developer's perspective.

Regarding your work on Biome, I see that you have referenced Prettier's approach by introducing the --embedded-language-formatting=auto|off option and hardcoding which nodes are considered embedded CSS code for initial testing. While I don't support this approach, I'm not asking you to start over. Rather, I want to provide you and the other developers with enough background knowledge to help us analyze the path we need to take more comprehensively. After all, adding a new feature is easy, but changing its behavior later is much more difficult.

One more thing, I noticed in your code comments for handling template strings, you mentioned replacing interpolation expressions with placeholders, then parsing, formatting, and finally replacing them back with the original expressions. This method is indeed what Prettier and prettier-plugin-embed use. However, I must point out that different embedded languages may require different forms of placeholders to be recognized as valid tokens by their parsers due to syntax and parser constraints. This can also be affected by the position of the interpolation. IMHO, a more reasonable approach should work like gritql's definition and handling of metavariables: https://github.com/getgrit/gritql/blob/main/CONTRIBUTING.md#language-grammars. We can treat these interpolation expressions as special tokens called metavariables and add them to the language's grammar. This way, all interpolation expressions in various positions we defined in our grammar will be valid and parsable tokens. I mentioned this in a reply to @arendjr before. Otherwise, simply replacing them with a magic string could cause issues when we want to support more languages or even just one language with multiple interpolation expressions in different places.

In all, these are my personal insights. Although I have limited knowledge of parsers and formatters, based on my experience developing the prettier-plugin-embed plugin, I am willing to offer my thoughts and suggestions on this matter. I hope Biome will carefully and thoroughly consider the implementation of this feature.

Thanks! ❤️

karlhorky · 2024-06-17T14:03:55Z

Oh nice, didn't know about this PR! Cool to see more work being done in the embedded language formatting space 🙌

hardcoding which nodes are considered embedded CSS code

While I don't support this approach

Ahh yeah I would also not recommend the hardcoded / Prettier approach, it represents a pretty big downside to how Prettier is currently built.

Would be great to avoid this pitfall and offer extensibility from the start, if possible!

Regarding this, I think @karlhorky can provide more insights from a developer's perspective.

Yes, we and our students use SQL-in-JS through prettier-plugin-embed heavily, it's an essential feature for codebases which use SQL-in-JS, which is a growing use case. Lack of this feature would block us from considering migrating to Biome or any other formatter alternative.

I'm guessing there's also other popular embedded languages where the community would benefit from their inclusion, such as XML-in-JS, YAML-in-JS, Markdown-in-JS, and other supported by prettier-plugin-embed.

If Biome doesn't want to "own" and maintain the integrations of these embedded language formatters, then an alternative would be to make it easy (easier than Prettier) for community members to integrate with custom embedded languages - potentially with an approach similar to Prettier's "Support library-specific plugin" idea (which has not yet been implemented).

Conaclos · 2024-06-17T14:19:46Z

Oh this is a great start!

@Sec-ant

emplate literals with preceding block comments

Is this commonly used?

I definitively like the idea of configuring which template tag is associated to which language. We could have a default mapping like you did in your plugin. Have you some proposal to support this in the Biome config file?

MHO, a more reasonable approach should work like gritql's definition and handling of metavariables

This kind of variable can appear everywhere in the embedded code, it can even be part of a keyword.
How do you handle them at the CST level? Bogus nodes?

Even placeholder seems kind of erroneous. How Prettier is handling it?

Sec-ant · 2024-06-17T14:38:40Z

@Conaclos

Template literals with preceding block comments

Is this commonly used?

The initiative to support this syntax is not based on how commonly this pattern is used, but rather to provide a versatile enough approach for users to annotate the embedded code in a non-intrusive way. Only supporting tag functions will force the users to introduce runtime functions simply for dev-time formatting.

Also, this VSCode extension uses block comments to add syntax highlighting support for template literals: https://marketplace.visualstudio.com/items?itemName=bierner.comment-tagged-templates

I definitively like the idea of configuring which template tag is associated to which language. We could have a default mapping like you did in your plugin. Have you some proposal to support this in the Biome config file?

I don't have one now because I see we haven't talked about it elsewhere. But I can do some research and make a proposal later. I'm not very available in the coming days.

This kind of variable can appear everywhere in the embedded code, it can even be part of a keyword.
How do you handle them at the CST level? Bogus nodes?

To be honest, I think we should limit our embedded language formatting functionality to bail out on interpolations smaller than token or node level. I would take the approach like what gritql (or astgrep) adopts to support metavariables. It seems just not reasonable to me that one would use `con${"st"} a = 42;` in their code and expect the formatter to format it properly. (But I must admit, bogus nodes are definitely worth considering, this will enable us to not fully bail out, which makes it a very competitive feature)

Even placeholder seems kind of erroneous. How Prettier is handling it?

Prettier simply uses carefully-constructed special identifier names as placeholders, or doesn't support interpolation for certain embedded languages:

CSS: https://github.com/prettier/prettier/blob/d6daeec0a382ffb9471075f839d12568a4e57100/src/language-js/embed/css.js#L18-L20
GraphQL (process the code block by block splitted by the interpolations, but bail out on certain circumstances): https://github.com/prettier/prettier/blob/d6daeec0a382ffb9471075f839d12568a4e57100/src/language-js/embed/graphql.js#L37-L40
HTML: https://github.com/prettier/prettier/blob/d6daeec0a382ffb9471075f839d12568a4e57100/src/language-js/embed/html.js#L22-L23
Markdown (do not support interpolation, only take the first literal member to the formatter): https://github.com/prettier/prettier/blob/d6daeec0a382ffb9471075f839d12568a4e57100/src/language-js/embed/markdown.js#L11-L14

arendjr · 2024-06-17T17:38:48Z

One more thing, I noticed in your code comments for handling template strings, you mentioned replacing interpolation expressions with placeholders, then parsing, formatting, and finally replacing them back with the original expressions. This method is indeed what Prettier and prettier-plugin-embed use. However, I must point out that different embedded languages may require different forms of placeholders to be recognized as valid tokens by their parsers due to syntax and parser constraints. This can also be affected by the position of the interpolation. IMHO, a more reasonable approach should work like gritql's definition and handling of metavariables: https://github.com/getgrit/gritql/blob/main/CONTRIBUTING.md#language-grammars. We can treat these interpolation expressions as special tokens called metavariables and add them to the language's grammar. This way, all interpolation expressions in various positions we defined in our grammar will be valid and parsable tokens. I mentioned this in a reply to @arendjr before. Otherwise, simply replacing them with a magic string could cause issues when we want to support more languages or even just one language with multiple interpolation expressions in different places.

I can mostly second this, and indeed I expect we'll extend our parsers to support metavariables (the token kind is already there), so it would make sense if we can reuse the same ability here. But if we accept that metavariables are the placeholders, doesn't that closely match the approach already being taken here? Maybe I forgot something since we last discussed it 😅

Sec-ant · 2024-06-17T17:48:24Z

But if we accept that metavariables are the placeholders, doesn't that closely match the approach already being taken here? Maybe I forgot something since we last discussed it 😅

Sorry I wasn't being clear, I meant to express that we should use placeholders defined in the grammar so the parser can correctly parse that placeholder without ambiguity or choking. Currently, the placeholders used in Prettier, or my plugin prettier-plugin-embed, and also this PR, can only be considered special chosen identifier names, which are way less powerful than metavariables that are defined in the grammar.

ah-yu · 2024-06-18T02:46:55Z

@Sec-ant Thank you for your feedback, really appreciate it!

Regarding your work on Biome, I see that you have referenced Prettier's approach by introducing the --embedded-language-formatting=auto|off option and hardcoding which nodes are considered embedded CSS code for initial testing

Personally I don't like the Prettier's approach, the feature shouldn't be tightly bound to some specific css-in-js libraries. And I love the approach your plugin did, using some magic comments to inform the formatter that there is a language embedded is a good idea! Looking forward to your proposal. No hurry, take your time.

We can treat these interpolation expressions as special tokens called metavariables and add them to the language's grammar.

Not sure if introducing extra complexity to the parser to serve the formatter is a good approach, as it seems like a form of coupling to me

I have never used a css-in-js library and have no experience with embedded formatting, so I am not aware of the pain points. You definitely have a better understanding than I do. If you have any ideas you want to validate, feel free to push new commits to the branch.

Sec-ant · 2024-06-18T06:13:47Z

Not sure if introducing extra complexity to the parser to serve the formatter is a good approach, as it seems like a form of coupling to me

This is to provide a general and reliable way to address the template string interpolation. For example, you use biome-placeholder-0 in your code to transform this

const style = css`div {
  background: ${myColor};
}`;

into this

div {
  background: biome-placeholder-0;
}

which is still valid CSS syntax after transformation.

But if the user write this code

const style = css`div:nth-child(${myNth}) {
  background: green;
}`;

then after transformation it will become:

div:nth-child(biome-placeholder-0) {
  background: green;
}

which is not syntax correct.

Additionally, if we want to support more other languages for embedded language formatting in the future, a single solution like biome-placeholder-0 will also not apply for all of them:

const script  = js`const a = ${myValue};`

const a = biome - placeholder - 0;

We can of course use differet placeholders for different languages, that's also what Prettier and my plugin does. But it still leaves us with the first problem.

More over, from our infrastructure, we have the ability to implement embedded language linting as well. While some of the placeholders may pass the parser, they may fail some linter rules, and the diagnostics will be confusing.

So I think augmenting the grammar is the only proper way to solve this problem. And we already start implementing some kind of grammar augmentation for our plugin feature.

ah-yu · 2024-06-18T06:58:21Z

So I think augmenting the grammar is the only proper way to solve this problem

Indeed!

ah-yu · 2024-06-26T03:48:05Z

@Sec-ant Wonder if b10c615 is close to the implementation you proposed

Sec-ant · 2024-06-26T04:52:35Z

@Sec-ant Wonder if b10c615 is close to the implementation you proposed

Yes that's basically the idea 👍.

One thing I'm not sure though is the safety and future-compatibility of the CssGritMetavariable node. If we decide to support other CSS dialects, $ will have sepcial meanings in those languages, which may call for a different handling in the grammar and conflict with the metavariable node. But I'm probably overthinking it because at least right now we only support standard CSS syntax. And from gritql and ast-grep, this $... pattern seems to work fine in most of the languages without any special escaping treatment.

ah-yu · 2024-06-26T05:16:08Z

$ will have sepcial meanings in those languages

Maybe Grit needs to introduce new syntax for querying in Sass.

arendjr · 2024-06-26T05:21:40Z

Oh, this is great, nice update @ah-yu ! This will indeed end up helping our CSS support for GritQL as well.

Good to know: Grit actually does a clever trick under the hood, where it replaces the $ metavariable indicators with μ. The actual character they use is even language-specific to avoid syntax clashes but for almost all languages it’s μ. If you replace the character I think it should work here as well.

Sec-ant · 2024-06-26T05:22:55Z

crates/biome_js_formatter/src/js/expressions/template_expression.rs

+                    string_interpolations.push(element.clone());
+                    element_iter.next();
+                }
+                let placeholder = std::format!("biome-placeholder-{}", index);


Oh, and another one, - is not allowed in a gritql metavariable name: https://docs.grit.io/language/patterns#metavariables, it uses $lowercase_snake_case

Sec-ant · 2024-06-26T07:14:51Z

crates/biome_js_formatter/src/js/expressions/template_expression.rs

+    // if there are any placeholders left, we treat it is a error and format the template as normal
+    if !placeholder_map.is_empty() {
+        return Err(FormatError::SyntaxError);
+    }


I would envision use cases like the one below will be reasonable and may be common ones:

const index = 1; const code = css` div { background-image: url("./assets/images/img-${index}.png"); }`;

This is a sub-node level interpolation, I'm not sure whether the current replace_placeholder_with_template_element will recover such interpolations properly? I see currently we bail out if we don't restore all the interpolations from the placeholders, and format this as a normal js template string, which is good. But I guess we can keep this in mind and consider possible solutions? 🤔

I'm not sure whether the current replace_placeholder_with_template_element will recover such interpolations properly?

Unfortunately no.

Currently, this method does an exact text match with placeholders. However, the code snippet you provided will produce a format element like text("./assets/images/img-$placeholder.png"). Maybe we can implement partial matching with placeholders and split the element into multiple segments, something like [text("./assets/images/img-"), expr, text(".png")].

Maybe we can implement partial matching with placeholders and split the element into multiple segments, something like [text("./assets/images/img-"), expr, text(".png")]

👍 Yes, I think we don't need an exact match.

ah-yu added 13 commits June 17, 2024 13:17

tweak testFormatLanguage trait

179fc75

change public interface of format_node

4fbc0d1

remove to_language_settings

94be1d1

rename

c127c02

format embedded language

d018dd0

add tests and run tests

b49a36d

cleanup

4e98335

add todo

61520de

add embedded_language_formatting option

ac48f2b

handle adjacent interpolations

259c5a3

cleanup

2d71a2b

add test and update snapshot

714d225

support more css-in-js conventions

a354740

github-actions bot added A-CLI Area: CLI A-Project Area: project A-Formatter Area: formatter A-Tooling Area: internal tools L-JavaScript Language: JavaScript and super languages L-CSS Language: CSS L-JSON Language: JSON and super languages labels Jun 17, 2024

ah-yu commented Jun 17, 2024

View reviewed changes

ah-yu added 2 commits June 25, 2024 13:48

Merge remote-tracking branch 'upstream/main' into embed_fmt

7be38a8

x

b10c615

github-actions bot added the A-Parser Area: parser label Jun 26, 2024

Sec-ant reviewed Jun 26, 2024

View reviewed changes

ah-yu mentioned this pull request Jul 2, 2024

📎 Embedded language formatting #3334

Open

3 tasks

ah-yu closed this Jul 2, 2024

ah-yu deleted the embed_fmt branch July 2, 2024 02:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(js_formatter): explore embedded language formatting #3228

feat(js_formatter): explore embedded language formatting #3228

ah-yu commented Jun 17, 2024

ah-yu Jun 17, 2024

codspeed-hq bot commented Jun 17, 2024 •

edited

Loading

denbezrukov commented Jun 17, 2024

ematipico commented Jun 17, 2024

Sec-ant commented Jun 17, 2024

What?

Why?

karlhorky commented Jun 17, 2024 •

edited

Loading

Conaclos commented Jun 17, 2024 •

edited

Loading

Sec-ant commented Jun 17, 2024 •

edited

Loading

arendjr commented Jun 17, 2024

Sec-ant commented Jun 17, 2024 •

edited

Loading

ah-yu commented Jun 18, 2024

Sec-ant commented Jun 18, 2024 •

edited

Loading

ah-yu commented Jun 18, 2024 •

edited

Loading

ah-yu commented Jun 26, 2024

Sec-ant commented Jun 26, 2024

ah-yu commented Jun 26, 2024

arendjr commented Jun 26, 2024

Sec-ant Jun 26, 2024

Sec-ant Jun 26, 2024

ah-yu Jun 26, 2024

Sec-ant Jun 26, 2024

feat(js_formatter): explore embedded language formatting #3228

feat(js_formatter): explore embedded language formatting #3228

Conversation

ah-yu commented Jun 17, 2024

Summary

Test Plan

ah-yu Jun 17, 2024

Choose a reason for hiding this comment

codspeed-hq bot commented Jun 17, 2024 • edited Loading

CodSpeed Performance Report

Merging #3228 will not alter performance

Summary

denbezrukov commented Jun 17, 2024

ematipico commented Jun 17, 2024

Sec-ant commented Jun 17, 2024

What?

Why?

karlhorky commented Jun 17, 2024 • edited Loading

Conaclos commented Jun 17, 2024 • edited Loading

Sec-ant commented Jun 17, 2024 • edited Loading

arendjr commented Jun 17, 2024

Sec-ant commented Jun 17, 2024 • edited Loading

ah-yu commented Jun 18, 2024

Sec-ant commented Jun 18, 2024 • edited Loading

ah-yu commented Jun 18, 2024 • edited Loading

ah-yu commented Jun 26, 2024

Sec-ant commented Jun 26, 2024

ah-yu commented Jun 26, 2024

arendjr commented Jun 26, 2024

Sec-ant Jun 26, 2024

Choose a reason for hiding this comment

Sec-ant Jun 26, 2024

Choose a reason for hiding this comment

ah-yu Jun 26, 2024

Choose a reason for hiding this comment

Sec-ant Jun 26, 2024

Choose a reason for hiding this comment

codspeed-hq bot commented Jun 17, 2024 •

edited

Loading

karlhorky commented Jun 17, 2024 •

edited

Loading

Conaclos commented Jun 17, 2024 •

edited

Loading

Sec-ant commented Jun 17, 2024 •

edited

Loading

Sec-ant commented Jun 17, 2024 •

edited

Loading

Sec-ant commented Jun 18, 2024 •

edited

Loading

ah-yu commented Jun 18, 2024 •

edited

Loading