Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML strategy #4222

Merged
merged 3 commits into from
Aug 16, 2018
Merged

XML strategy #4222

merged 3 commits into from
Aug 16, 2018

Conversation

pchaigno
Copy link
Contributor

@pchaigno pchaigno commented Aug 4, 2018

Following discussion in #2780, this pull request adds a new XML Strategy to detect XML files using their expected root tag <?xml version=.

Description

The XML strategy runs after all other generative strategies (strategies that can generate new candidate languages, such as the Extension strategy) to avoid misclassifying files from languages that are subsets of XML (we have a few, such as XSLT). To that effect, the XML strategy only runs if previous strategies were unable to identify candidate languages.

This pull request will have the side effect of making Linguist read the content (which I limited to the 2 first lines) of all files whose language we previously weren't identifying. @lildude Not sure if that could be an issue on github.com's side?

The new fixture file is from danpal/OpenSAML and is licensed under Apache v2.0.

I've also merged all tests for strategies (files under lib/linguist/strategies/) into a single file. Each strategy only has one or two tests anyway.

  • I am adding new or changing current functionality
    • I have added or updated the tests for the new or changed functionality.

Fixes #2780.

/cc @jamesqo @Alhadis

@pchaigno pchaigno mentioned this pull request Aug 4, 2018
4 tasks
@Alhadis
Copy link
Collaborator

Alhadis commented Aug 5, 2018

Now we're talking.

Personally, I'd broaden the strategy a little so it leaves room for future additions, if needed. E.g., naming it Strategy::Signature might be better than making it XML-specific.

Interestingly enough, this is actually what file-icons is already doing as a fallback for files which don't match any recognised file extensions. It works a treat.

Other potential additions might be:

  • HTML: /^<!DOCTYPE html>/i
  • Roff: /^'\\"\ [tre]+(?=\s|$)/
  • PHP: /^<\?php/
  • PostScript: /^%!PS/

These are highly-distinctive patterns which could safely be considered a "last resort" if no other strategy with higher precedence matches.

@pchaigno
Copy link
Contributor Author

pchaigno commented Aug 5, 2018

Personally, I'd broaden the strategy a little so it leaves room for future additions, if needed. E.g., naming it Strategy::Signature might be better than making it XML-specific.

I'd prefer to start with something small (like the 132M XML files 😝 ) and extend it later if it works well. We can always change the name then since it's not exposed in the API anyway.

I can see the benefit of doing the same for HTML and Roff in the future, but are there really that many PHP extensions we don't recognize?

@Alhadis
Copy link
Collaborator

Alhadis commented Aug 5, 2018

We can always change the name then since it's not exposed in the API anyway.

Ah, that changes things then. 😉

... but are there really that many PHP extensions we don't recognize?

Not really, and not for PostScript either. But any unrecognised text file which begins with an unmistakable signature like <?php or %!PS can be reasonably assumed to be the format in question, IMHO.

@lildude
Copy link
Member

lildude commented Aug 11, 2018

@lildude Not sure if that could be an issue on github.com's side?

I shouldn't think there'd be a problem... it's few lines than we currently read when searching for the modeline 😄

@pchaigno
Copy link
Contributor Author

@lildude Okay to merge then?

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost ready for merging. One little request: can you add a mention to this new strategy in the list in the "How Linguist Works" section of the readme.

@pchaigno
Copy link
Contributor Author

@lildude Done.

lildude
lildude previously approved these changes Aug 12, 2018
Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@lildude lildude dismissed their stale review August 12, 2018 14:49

Whoopsie. Just noticed failing tests appear to be ligit failures

This new strategy detects XML files using the root tag
(`<?xml version=`). It runs after all other generative strategies
(strategies that can generate new candidate languages, such as the
Extension strategy) to avoid misclassifying files from languages that
are subsets of XML.
Since strategies usually have only a couple of tests, there's few
benefits from having one test file per strategy. Instead
test_strategies.rb contains tests for all strategies in
lib/linguist/strategies/.
app.config was classified as a Data file but should be detected as an
XML file. The second test file is an example of file whose extension
is not associated to XML.
@pchaigno
Copy link
Contributor Author

@lildude The test failure were caused by a few new sample files which did not have a XML root tag (XML/GMOculus.project.gmx and XML/obj_control.object.gmx); I added those to the whitelist.

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now. Thanks for this. Much appreciated.

@pchaigno pchaigno merged commit ff004ea into github-linguist:master Aug 16, 2018
@pchaigno pchaigno deleted the xml-strategy branch August 16, 2018 17:19
@smola smola mentioned this pull request Jan 28, 2019
4 tasks
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Auto-detecting XML files based on the header
3 participants