Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linguist should understand JSX in .js files #3677

Closed
wants to merge 2 commits into from
Closed

Linguist should understand JSX in .js files #3677

wants to merge 2 commits into from

Conversation

buzinas
Copy link

@buzinas buzinas commented Jun 14, 2017

Search for .js / Search for .jsx

Search for .js with "return <div>" (the < characters are ignored, but you can see that many people uses JSX in .js files by opening a few pages).

Please, let me know what I can do to help getting this PR merged at some point.

Refs #3144

@pchaigno
Copy link
Contributor

Do you want the JSX portions of .js files to be included in statistics, or are you trying to fix the syntax highlighting?

If you're trying to recognize the JSX portions, that is not currently supported by Linguist and would require major changes. Currently, Linguist assigns a single language per file.

If you want to fix the syntax highlighting, then you should submit a pull request to atom/language-javascript, which github.com uses to highlight JavaScript code.

@Alhadis
Copy link
Collaborator

Alhadis commented Jun 14, 2017

Uh, there's nothing wrong with the syntax highlighting. At all.

JSX is not JavaScript.

Users are advised to use the correct file extension (.jsx) instead. Failing that, adding a modeline to override syntax highlighting is also feasible. But there's nothing here to "fix" concerning the actual JavaScript grammar.

@pchaigno
Copy link
Contributor

Hm, I read this a bit too quickly :/

@buzinas So you're point is that many JSX files have a .js extension and we should therefore try to recognize this extension for JSX?

@@ -1975,6 +1975,7 @@ JSX:
type: programming
group: JavaScript
extensions:
- ".js"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is .js really the primary extension for JSX? If not, please move this below .jsx as the first extension is considered the primary extension.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, it's kind of impossible to do that research on Github because it ignores some important characters - but from my knowledge much more people use .js extension for writing JSX code than .jsx.

If you go to React's official repo examples, you'll see that all of them uses JSX in .js files (and the entire ecosystem does - Redux etc). Based on that, I assume that if the creators of JSX use the .js extension, it's the primary extension for the language. But I can't ensure that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks for the confirmation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, another useful info: facebook/create-react-app#87

@buzinas
Copy link
Author

buzinas commented Jun 14, 2017

JSX is not JavaScript.

@Alhadis JSX is not JavaScript, agreed - but millions of people write JSX using the .js extension

So you're point is that many JSX files have a .js extension and we should therefore try to recognize this extension for JSX?

@pchaigno yeah, exactly that!

lildude
lildude previously approved these changes Jun 20, 2017
@pchaigno
Copy link
Contributor

Should we add a heuristic to distinguish JS and JSX files? I'm afraid the classifier alone might not do a very good job.

@buzinas
Copy link
Author

buzinas commented Jun 21, 2017

@pchaigno yeah, that sounds reasonable.

Can you help me by giving the directions on how to do that?

@pchaigno
Copy link
Contributor

You can follow the examples in heuristics.rb. You'll also need to add a test for that new heuristic in test_heuristics.rb.

The heuristic itself doesn't need to be perfect. It should be correct when it identifies a file as being JSX (or JS), but it doesn't need to identify all JSX (or JS) files.

@lildude lildude dismissed their stale review June 26, 2017 08:08

Question has been addressed, but final overall approval is still required.

@lildude
Copy link
Member

lildude commented Oct 5, 2017

Just a friendly nudge.

@buzinas
Copy link
Author

buzinas commented Oct 31, 2017

Can anyone jump in and help me with adding the heuristic? I'm having a hard time figuring out how to identify if a file is JSX or JS.

@lildude
Copy link
Member

lildude commented Oct 31, 2017

I don't know anything about JSX but I'm pretty good at pattern matching 😉. From a cursory glance at https://reactjs.org/docs/jsx-in-depth.html and files already on GitHub, it appears JSX files will almost always include React somewhere in them. Either in the form of something like React.createElement or import React from 'react'; or React = require('react'), though this last form does bring up a lot of files with just that in them, which is perfectly valid javascript and not really JSX from my understanding, so is not really a good indicator of JSX content.

I've also noticed that some form of render .... return <[htmltaghere]> or similar statement occurs in the JSX files I've found so that may be a better multiline string to match for in a regex.

@exalted
Copy link

exalted commented Oct 31, 2017

My two cents: in a JavaScript file, you could consider JSX anywhere you find a <tag …/> or <tag …>…</tag> that otherwise could be invalid JavaScript syntax (i.e. SyntaxError: Unexpected token '<').

For example in JavaScript const hello = <h1>Hello, world!</h1>; would be invalid, but it’s perfectly valid JSX.

I think it’s virtually impossible to determine JSX based on it’s location (whether inside a component's render() function, in a stateless function, in a variable assignment, or something like ReactDOM.render(<h1>Hello, world!</h1>, document.getElementById('root'));, etc.); instead JSX, as being a syntax extension to JavaScript, it could be detected/treated as such.

@buzinas
Copy link
Author

buzinas commented Oct 31, 2017

@exalted that's what I was trying, but the problem is: there can be a <tag> inside a string, and it would be a false positive, since it's a perfectly valid JavaScript.

@exalted
Copy link

exalted commented Oct 31, 2017

Maybe it's worth looking at how JSX is detected to be then converted into regular JavaScript in tools like babel?

@exalted
Copy link

exalted commented Oct 31, 2017

@buzinas I don’t think you can look for <tag> in a JS file and call it a day.

@buzinas
Copy link
Author

buzinas commented Oct 31, 2017

That's another thing that I tried, but since Babel parses the JavaScript code as an AST, that would be too complex for a heuristic.

@exalted
Copy link

exalted commented Oct 31, 2017

I am not sure what do you mean by heuristic in this context, however I don’t think a RegEx approach suffice, sadly.

I haven‘t worked with linguist, so I shouldn’t judge but, if an AST is not a viable option here, I think what we have left with is to take a best guess approach. I am sure that doesn't hold for long though and you’d have to add many edge cases to guess better and better…

@lildude
Copy link
Member

lildude commented Jan 12, 2018

Nudge.

@zacanger
Copy link

zacanger commented Jan 19, 2018

Hey, just a friendly bump. I currently have PRs into private projects with js linguist-language=JSX in gitattributes in the hope that will fix highlighting in diffs, but having this in would be much nicer/less hacky.

Just a suggestion: maybe detecting either an import or require of React and also any xml-like tag would be safe? React has to be imported/required in any file that has JSX. It wouldn't cover less-common cases like Preact/Inferno/Choo/FooLib that also use JSX, and has the chance of picking up JS that uses React but not JSX with tag-like things in strings, but JSX highlighting works fine for regular JS so that seems (to me) to be reasonable.

@reyronald
Copy link

Jumping in here.

I believe Linguist can already syntax JSX files correctly with no problems, as you know, everything inside a JSX files is valid JavaScript, except for the HTML-like tags, so... would it be too crazy of an idea to use the already working JSX grammar to parse JS files by default instead and call it a day? I think it would have no actual implications, correct me on this?

In fact, this is what a lot of devs are doing manually now like @zacanger just stated, not only for Linguist, but for Atom too (see https://discuss.atom.io/t/how-do-i-make-atom-recognize-a-file-with-extension-x-as-language-y/26539?u=wliu).

By the way, I'm coming from here: atom/language-javascript#220 (comment), in which I made a similar argument, just in case anyone wants to check that out.

@lildude
Copy link
Member

lildude commented Feb 11, 2018

would it be too crazy of an idea to use the already working JSX grammar to parse JS files by default instead and call it a day?

I don't think so, there is however one teeeeny weeeny heeeawg fat problem... #3044

In short, the version of language-babel we currently use is pinned to a really old version with a known issue because the maintainer changed some of the regexes in the grammar to those only supported by the Oniguruma engine used in Atom, but not GitHub.com, and isn't prepared to entertain GitHub.com compatibility because "It works in atom..."

If you know of a method of automatically converting the Oniguruma-only regexes to be PCRE-friendly or know another PCRE-friendly grammar that is actively maintained (we looked into switching to https://github.com/babel/babel-sublime but found it lacking - see #3775) we'll definitely entertain the solution to our current problem so we can consider your suggestion.

scytacki added a commit to concord-consortium/portal-pages that referenced this pull request Jun 19, 2018
github-linguist/linguist#3677
My understanding is that github uses linguist to identify what type of source file.
And there are two problems:
- The syntax highlighter used by linguist when it gets the wrong file type doesn't work with githubs regex engine.
- And Linguist doesn't correctly identify the jsx files as jsx.

I saw this gitattributes fix mentioned, so I'm giving it a try.
@JonAbrams
Copy link

I think we can all agree that it's at least common, if not standard, for projects to have jsx code in .js files. Every code editor that supports JSX, supports this just fine. What's the downside of adding JSX support for .js files?

One reason why projects haven't adopted .jsx is due to the fact that codebases often have a mix of source code files, some with JSX, some not. It's a pain to have to change the file's extension when JSX happens to be added to an existing source file. Plus, it complicates scripts to have glob/regexes that match both .js and .jsx (I do realize the irony here).

Can we plz merge this for the benefit of people that love both JSX and Github? Arguing over whether .jsx or .js is more popular, and how to measure it, is pure bike-shedding.

@pchaigno
Copy link
Contributor

pchaigno commented Sep 8, 2018

@JonAbrams Have you read the above discussion? We are not arguing over whether .js JSX files are popular enough. However, if we were to merge this pull request as-is, we would likely end up with .js files randomly recognized as JS or JSX, because the Bayesian classifier is unable to properly distinguish them.

As discussed above, we could use a heuristic rule (regular expression) as a first filter before the Bayesian classifier. There doesn't seem to be an obvious way to distinguish the two languages, but if anyone as an idea that may work, we'd welcome a pull request!

In my opinion, the best option would be that mentioned by @reyronald: we would use a single grammar for both JS and JSX. JSX files wouldn't be detected as such, but they would at least be correctly highlighted. Distinguishing between JS and JSX files inside the grammar is easier because we're not limited by the computational power of regular expressions. @Alhadis mentioned in #4030 (Comment) that he's working on such a JS/JSX grammar, but that's obviously a huge task and I don't know if he found time to make progress.

@Alhadis
Copy link
Collaborator

Alhadis commented Sep 8, 2018

@Alhadis mentioned in #4030 (Comment) that he's working on such a JS/JSX grammar, but that's obviously a huge task and I don't know if he found time to make progress.

Not just a huge task, but slow as well. Much of this has to do with the fact that Atom doesn't live reload grammars as they're being worked on†, reducing me to reloading the entire workspace to see changes. This cumbersome workflow, in addition to the predicted complexity of the grammar code, has motivated me to start work on a compiler for an intermediate grammar format optimised for reading/writing regexp-based grammars.

The long-term goal is to have a maintainable format which can be kept updated by the Babel maintainers, the TC39 committee, or whoever else is guaranteed to keep it updated. The compiler won't be limited to TextMate-style grammars alone, meaning it'll generate output for CodeMirror and Pygments as well.

If I were to continue the way I started, with frequent repetition of CSON blocks and hacky idioms, I 100% guarantee the grammar would become unmaintainable in future.

(† — A less easily-justified reason is that half this year I've been limited to working on an antique laptop running on OS which Atom doesn't support, and I only recently scabbed a MacBook from a friend to resume Atom-related projects. But enough of that)

@stale
Copy link

stale bot commented Nov 6, 2018

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

@stale stale bot added the Stale label Nov 6, 2018
@stale
Copy link

stale bot commented Nov 20, 2018

This pull request has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

@stale stale bot closed this Nov 20, 2018
@Alhadis Alhadis reopened this Nov 20, 2018
@stale stale bot removed the Stale label Nov 20, 2018
@Qix-
Copy link

Qix- commented Nov 20, 2018

For what it's worth, Facebook (more specifically, Dan Abramov) decrees that the .jsx extension is no longer relevant since the pre-Babel days. His reasoning is pretty solid.

facebook/create-react-app#87 (comment)

This should happen as .js files with JSX are currently monotone, of course. Having the JSX statistic would also be nice, but honestly since Javascript is an amalgamation of bizarre syntax features and proposals and whatnot, the statistics become a bit less important in my opinion.

I'd even suggest that this PR is merged to allow syntax highlighting and a new ticket is opened about fixing the statistics in a new PR. This would get the broken highlighting fixed sooner.

@Alhadis
Copy link
Collaborator

Alhadis commented Nov 21, 2018

I'm not sure why we're even bothering to differentiate between JS and JSX anymore, actually. Since it falls under the usage statistics of the former, we may as well roll it into the JavaScript entry and spare ourselves the hassle over semantics.

Since JSX is a superset of JavaScript, it should be possible for one grammar to accommodate both JS and sugary extensions like embedded XML and type annotations. Moreover, Linguist has no way of knowing if something like this is a JS or JSX file:

export const foo = "Valid ES6";

So in a large codebase, many files are likely to be flagged as ordinary JS because they don't use tag syntax, meaning users end up with a jarring classification like "20% JavaScript, 80% JSX" or what-have-you. And Linguist has no way of identifying files based on the presence of another language in the repository (which would be helpful, as it would vastly improve its interpretation of C++/C).

@stale
Copy link

stale bot commented Dec 22, 2018

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

@stale stale bot added the Stale label Dec 22, 2018
@Qix-
Copy link

Qix- commented Dec 22, 2018

Not stale, my god I hate these bots...

@stale stale bot removed the Stale label Dec 22, 2018
@Alhadis Alhadis self-assigned this Dec 23, 2018
@Alhadis
Copy link
Collaborator

Alhadis commented Dec 23, 2018

Self-assigning this to shut Stalebot up. 👍 See #4358

@Qix-
Copy link

Qix- commented Dec 23, 2018

Thank you @Alhadis :)

@lildude
Copy link
Member

lildude commented Jan 13, 2021

This is now resolved by moving to using the treesitter JavaScript grammar GitHub uses in #5133. GitHub.com will reflect this after the next release due some time this month.

Closing.

@lildude lildude closed this Jan 13, 2021
@buzinas buzinas deleted the jsx-for-js-extension branch January 13, 2021 15:40
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants