XML strategy #4222
Now we're talking. Personally, I'd broaden the strategy a little so it leaves room for future additions, if needed. E.g., naming it Interestingly enough, this is actually what Other potential additions might be:
These are highly-distinctive patterns which could safely be considered a "last resort" if no other strategy with higher precedence matches.
I'd prefer to start with something small (like the 132M XML files 😝 ) and extend it later if it works well. We can always change the name then since it's not exposed in the API anyway. I can see the benefit of doing the same for HTML and Roff in the future, but are there really that many PHP extensions we don't recognize?
Ah, that changes things then. 😉
Not really, and not for PostScript either. But any unrecognised text file which begins with an unmistakable signature like
I shouldn't think there'd be a problem... it's fewer lines than we currently read when searching for the modeline 😄
@lildude Okay to merge then?
Almost ready for merging. One little request: can you add a mention of this new strategy to the list in the "How Linguist Works" section of the README?
@lildude Done.
LGTM.
Whoopsie. Just noticed the failing tests appear to be legit failures.
This new strategy detects XML files using the root tag (`<?xml version=`). It runs after all other generative strategies (strategies that can generate new candidate languages, such as the Extension strategy) to avoid misclassifying files from languages that are subsets of XML.
Since strategies usually have only a couple of tests, there is little benefit in having one test file per strategy. Instead, test_strategies.rb contains tests for all strategies in lib/linguist/strategies/.
app.config was classified as a Data file but should be detected as an XML file. The second test file is an example of a file whose extension is not associated with XML.
@lildude The test failures were caused by a few new sample files which did not have an XML root tag (
LGTM now. Thanks for this. Much appreciated.
Following discussion in #2780, this pull request adds a new XML Strategy to detect XML files using their expected root tag `<?xml version=`.

Description
The XML strategy runs after all other generative strategies (strategies that can generate new candidate languages, such as the Extension strategy) to avoid misclassifying files from languages that are subsets of XML (we have a few, such as XSLT). To that effect, the XML strategy only runs if previous strategies were unable to identify candidate languages.
This pull request will have the side effect of making Linguist read the content (which I limited to the first 2 lines) of all files whose language we previously weren't identifying. @lildude Not sure if that could be an issue on github.com's side?
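
For readers skimming the thread, the behaviour described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from this pull request: the module name `XMLStrategySketch`, the constant names, and the use of plain strings for content and language names are all assumptions made so the example runs on its own, without Linguist's blob or language classes.

```ruby
# Hypothetical stand-in for the strategy described above; not Linguist's actual code.
module XMLStrategySketch
  # Number of leading lines inspected (the description limits reads to 2 lines).
  SEARCH_SCOPE = 2

  # Signature mentioned in the description.
  XML_DECLARATION = /<\?xml version=/

  # content    - the file content as a String (assumption: a real strategy
  #              would receive a blob object instead).
  # candidates - language names proposed by earlier strategies.
  #
  # Returns the candidates unchanged if earlier strategies found any,
  # otherwise ["XML"] when the signature appears in the first lines,
  # otherwise an empty Array.
  def self.call(content, candidates = [])
    # Only act as a last resort, so XML-based languages such as XSLT
    # identified by earlier strategies are not overridden.
    return candidates unless candidates.empty?

    header = content.each_line.first(SEARCH_SCOPE).join
    header.match?(XML_DECLARATION) ? ["XML"] : []
  end
end
```

With this sketch, a file starting with `<?xml version="1.0"?>` and no prior candidates yields `["XML"]`, a file that an earlier strategy already identified as XSLT keeps its `["XSLT"]` candidate, and a declaration that first appears on line 3 is ignored because only two lines are read.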
The new fixture file is from danpal/OpenSAML and is licensed under Apache v2.0.
I've also merged all tests for strategies (files under `lib/linguist/strategies/`) into a single file. Each strategy only has one or two tests anyway.

Fixes #2780.
/cc @jamesqo @Alhadis