Expose guess_lexer() functionality to guess lexers for text. #83

Closed
mweinelt opened this issue May 28, 2020 · 9 comments
Labels
enhancement New feature or request
Comments

@mweinelt
Contributor

Syntax highlighting is very important to maintain readability, but alas people are lazy.

This is where the guess_lexer function comes in. Do you think autodetection of the lexer using this is feasible?

pygments.lexers.guess_lexer(text, **options)

    Return a Lexer subclass instance that’s guessed from the text in text. For that, the analyse_text() method of every known lexer class is called with the text as argument, and the lexer which returned the highest value will be instantiated and returned.

    pygments.util.ClassNotFound is raised if no lexer thinks it can handle the content.
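
To make the selection rule quoted above concrete, here is a minimal sketch of "call analyse_text() on every candidate and keep the highest score". The `Toy*` classes and `toy_guess` are hypothetical stand-ins written for illustration; they are not Pygments code, and the real `guess_lexer()` may differ in details.

```python
class ToyLexer:
    """Hypothetical stand-in for a lexer class; not part of Pygments."""
    aliases = ["text"]

    @staticmethod
    def analyse_text(text):
        return 0.0


class ToyIniLexer(ToyLexer):
    aliases = ["ini"]

    @staticmethod
    def analyse_text(text):
        # Crude heuristic: a [section] header on the first line looks like INI.
        first = text.lstrip().split("\n", 1)[0].rstrip()
        return 0.8 if first.startswith("[") and first.endswith("]") else 0.0


class ToyPythonLexer(ToyLexer):
    aliases = ["python"]

    @staticmethod
    def analyse_text(text):
        # Crude heuristic: a shebang line that mentions python.
        first = text.split("\n", 1)[0]
        return 1.0 if first.startswith("#!") and "python" in first else 0.0


def toy_guess(text, lexer_classes):
    """Instantiate the class with the highest score; raise if nothing scores above 0."""
    best_score, best_cls = 0.0, None
    for cls in lexer_classes:
        score = cls.analyse_text(text)
        if score > best_score:
            best_score, best_cls = score, cls
    if best_cls is None:
        # Mirrors the documented behaviour of raising when no lexer matches.
        raise ValueError("no lexer thinks it can handle the content")
    return best_cls()


candidates = [ToyIniLexer, ToyPythonLexer]
print(toy_guess("[flake8]\nmax-line-length = 96\n", candidates).aliases[0])
print(toy_guess("#!/usr/bin/env python3\n", candidates).aliases[0])
```

Because each class scores the text independently, a single over-eager `analyse_text()` can win for short or ambiguous input.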

Steck already guesses based on mimetype so pastes submitted via web could benefit from a similar feature? https://github.com/supakeen/steck/blob/master/steck.py#L68

@supakeen
Owner

Mrm. I've mostly been against auto-guessing because of false positives but I guess we can give it a whirl. Would you suggest adding a 'Magic Guess' language to the dropdown and setting it as default?

You are correct that it is more consistent with steck.

@supakeen supakeen added the enhancement New feature or request label May 29, 2020
@supakeen supakeen added this to the 1.2.0 milestone May 29, 2020
@supakeen supakeen changed the title Expose guess_lexer() functionality Expose guess_lexer() functionality to guess lexers for text. May 29, 2020
@mweinelt
Contributor Author

Yeah, a dropdown option would be what I'm after. It need not be default until we have a good feeling about the functionality though.

@supakeen
Owner

Mrm, if it's not the default at first, it would become the default very shortly afterwards, since it's a solution to the problem of people not selecting lexers.

The other option would be to have people either select the lexer on view or repaste with a certain lexer.

Do you want to take a look or shall I fix this up over the weekend?

@mweinelt
Contributor Author

I can give this a try.

@mweinelt
Contributor Author

I did a quick test drive and the results were quite bad. It often fell back to mime (PEM certificates, short Python files) and tsql (INI, TOML).

--- a/pinnwand/handler/website.py
+++ b/pinnwand/handler/website.py
@@ -5,6 +5,7 @@ from datetime import datetime
 
 import docutils.core
 import tornado.web
+from pygments.lexers import guess_lexer
 
 from pinnwand import database, path, utility, error
 
@@ -174,6 +175,16 @@ class CreateAction(Base):
                 raise error.ValidationError()
 
             for (lexer, raw, filename) in zip(lexers, raws, filenames):
+                log.info(f"CreateAction.post: lexer is {lexer}")
+                if lexer == 'AUTO':
+                    try:
+                        lexer = guess_lexer(raw).aliases[0]
+                        log.info(f"CreateAction.post: guessed lexer is {lexer}")
+                    except ValueError:
+                        # Fall back to plain text
+                        log.info(f"CreateAction.post: guess lexer fallback text")
+                        lexer = "text"
+
                 if lexer not in utility.list_languages():
                     log.info("CreateAction.post: a file had an invalid lexer")
                     raise error.ValidationError()
diff --git a/pinnwand/template/part/lexer-select.html b/pinnwand/template/part/lexer-select.html
index 5e6ed35..571c4c4 100644
--- a/pinnwand/template/part/lexer-select.html
+++ b/pinnwand/template/part/lexer-select.html
@@ -1,4 +1,5 @@
 <select name="lexer">
+    <option value="AUTO">Autodetect</option>
     {% if handler.application.configuration.preferred_lexers %}
         {% for key in handler.application.configuration.preferred_lexers %}
             <option value="{{ key }}"{% if selected == key %} selected="selected"{% end %}>{{ lexers[key] }}</option>

@supakeen
Owner

supakeen commented May 30, 2020

Does it perform better if a lower limit is added, for example 'only guess if at least n characters of text have been provided', and otherwise just use the 'text' lexer?
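
A minimal sketch of that lower limit, assuming a hypothetical helper name `choose_lexer` and an arbitrary threshold of 64 characters (neither comes from pinnwand). The guesser is passed in so the logic can be tried with a stub; by default it uses `pygments.lexers.guess_lexer`.

```python
def choose_lexer(raw, min_length=64, guess=None, fallback="text"):
    """Return a lexer alias, guessing only when enough text was provided."""
    # Too little text: do not guess at all.
    if len(raw.strip()) < min_length:
        return fallback

    if guess is None:
        # Imported lazily so the function can be exercised without Pygments.
        from pygments.lexers import guess_lexer as guess

    try:
        return guess(raw).aliases[0]
    except ValueError:
        # pygments.util.ClassNotFound is a subclass of ValueError.
        return fallback
```

Whether any fixed threshold helps is an open question: in the samples later in this thread, some files are misdetected only after more text has been read.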

@mweinelt
Contributor Author

Sampled this repository using this script:

guess.py

    #!/usr/bin/env python3

    from pathlib import Path

    from pygments.lexers import guess_lexer


    def test(path):
        with open(path) as handle:
            try:
                raw = handle.read()
            except UnicodeDecodeError:
                # Skip files that are not decodable text.
                return

        total = len(raw)
        last = None
        for i in range(total):
            try:
                lexer = guess_lexer(raw[0:i])
            except ValueError:
                # ClassNotFound subclasses ValueError: nothing matched this prefix.
                continue

            # Report only the positions where the guessed lexer class changes.
            if not isinstance(lexer, type(last)):
                last = lexer
                print(f"| {str(path)} | {i} | {total} | {lexer.aliases[0]} |")


    if __name__ == "__main__":
        print("| Path | Position | Total Length | Lexer |")
        print("|------|----------|--------|-------|")

        files = [path for path in Path('.').rglob('*.*')]

        for path in files:
            if not path.is_file():
                continue
            if str(path).startswith('venv'):
                continue
            test(path)

| Path | Position | Total Length | Lexer |
|------|----------|--------|-------|
| .flake8 | 0 | 96 | mime |
| .flake8 | 8 | 96 | tsql |
| .flake8 | 9 | 96 | ini |
| .gitignore | 0 | 1347 | mime |
| .gitignore | 62 | 1347 | tsql |
| .pre-commit-config.yaml | 0 | 562 | mime |
| .pre-commit-config.yaml | 141 | 562 | tsql |
| .pylintrc | 0 | 16660 | mime |
| .pylintrc | 8 | 16660 | tsql |
| .pylintrc | 9 | 16660 | ini |
| .travis.yml | 0 | 236 | mime |
| .travis.yml | 11 | 236 | as3 |
| AUTHORS.rst | 0 | 88 | mime |
| AUTHORS.rst | 18 | 88 | sql |
| AUTHORS.rst | 76 | 88 | tsql |
| CHANGELOG.rst | 0 | 1593 | mime |
| CHANGELOG.rst | 22 | 1593 | sql |
| CHANGELOG.rst | 32 | 1593 | mysql |
| README.rst | 0 | 3676 | mime |
| README.rst | 2 | 3676 | rst |
| README.rst | 904 | 3676 | mysql |
| README.rst | 1603 | 3676 | ssp |
| default.nix | 0 | 466 | mime |
| default.nix | 12 | 466 | python2 |
| mypy.ini | 0 | 745 | mime |
| mypy.ini | 6 | 745 | tsql |
| mypy.ini | 7 | 745 | ini |
| pinnwand.service-example | 0 | 214 | mime |
| pinnwand.service-example | 6 | 214 | tsql |
| pinnwand.service-example | 7 | 214 | ini |
| pinnwand.toml-example | 0 | 2292 | mime |
| pinnwand.toml-example | 43 | 2292 | mysql |
| poetry.lock | 0 | 39510 | mime |
| poetry.lock | 10 | 39510 | tsql |
| poetry.lock | 12 | 39510 | ini |
| pyproject.toml | 0 | 1180 | mime |
| pyproject.toml | 14 | 1180 | ini |
| requirements.txt | 0 | 1904 | mime |
| guess.py | 0 | 926 | mime |
| guess.py | 21 | 926 | python |
| guess.py | 44 | 926 | python2 |
| .git/hooks/applypatch-msg.sample | 0 | 536 | mime |
| .git/hooks/applypatch-msg.sample | 67 | 536 | bash |
| .git/hooks/commit-msg.sample | 0 | 954 | mime |
| .git/hooks/commit-msg.sample | 67 | 954 | bash |
| .git/hooks/fsmonitor-watchman.sample | 0 | 4706 | mime |
| .git/hooks/fsmonitor-watchman.sample | 66 | 4706 | perl |
| .git/hooks/post-update.sample | 0 | 247 | mime |
| .git/hooks/post-update.sample | 67 | 247 | bash |
| .git/hooks/pre-applypatch.sample | 0 | 482 | mime |
| .git/hooks/pre-applypatch.sample | 67 | 482 | bash |
| .git/hooks/pre-commit.sample | 0 | 1701 | mime |
| .git/hooks/pre-commit.sample | 67 | 1701 | bash |
| .git/hooks/pre-merge-commit.sample | 0 | 474 | mime |
| .git/hooks/pre-merge-commit.sample | 67 | 474 | bash |
| .git/hooks/pre-push.sample | 0 | 1406 | mime |
| .git/hooks/pre-push.sample | 67 | 1406 | bash |
| .git/hooks/pre-rebase.sample | 0 | 5007 | mime |
| .git/hooks/pre-rebase.sample | 67 | 5007 | bash |
| .git/hooks/pre-receive.sample | 0 | 602 | mime |
| .git/hooks/pre-receive.sample | 67 | 602 | bash |
| .git/hooks/prepare-commit-msg.sample | 0 | 1703 | mime |
| .git/hooks/prepare-commit-msg.sample | 67 | 1703 | bash |
| .git/hooks/update.sample | 0 | 3693 | mime |
| .git/hooks/update.sample | 67 | 3693 | bash |
| doc/autodoc.rst | 0 | 520 | mime |
| doc/autodoc.rst | 2 | 520 | rst |
| doc/changelog.rst | 0 | 30 | mime |
| doc/changelog.rst | 2 | 30 | rst |
| doc/conf.py | 0 | 5239 | mime |
| doc/conf.py | 245 | 5239 | sql |
| doc/conf.py | 571 | 5239 | python2 |
| doc/configuration.rst | 0 | 4418 | mime |
| doc/configuration.rst | 2 | 4418 | rst |
| doc/configuration.rst | 59 | 4418 | mysql |
| doc/index.rst | 0 | 2046 | mime |
| doc/index.rst | 20 | 2046 | sql |
| doc/index.rst | 66 | 2046 | mysql |
| doc/installation.rst | 0 | 5047 | mime |
| doc/installation.rst | 2 | 5047 | rst |
| doc/installation.rst | 86 | 5047 | mysql |
| doc/tricks.rst | 0 | 830 | mime |
| doc/tricks.rst | 2 | 830 | rst |
| doc/tricks.rst | 70 | 830 | mysql |
| doc/tricks.rst | 676 | 830 | ssp |
| doc/usage.rst | 0 | 8276 | mime |
| doc/usage.rst | 2 | 8276 | rst |
| doc/usage.rst | 51 | 8276 | mysql |
| doc/usage.rst | 3731 | 8276 | ssp |
| doc/_static/license.svg | 0 | 949 | mime |
| doc/_static/license.svg | 253 | 949 | xml |
| pinnwand/__main__.py | 0 | 42 | mime |
| pinnwand/__main__.py | 29 | 42 | python2 |
| pinnwand/command.py | 0 | 3035 | mime |
| pinnwand/command.py | 169 | 3035 | sql |
| pinnwand/command.py | 175 | 3035 | python2 |
| pinnwand/configuration.py | 0 | 863 | mime |
| pinnwand/configuration.py | 159 | 863 | xml |
| pinnwand/database.py | 0 | 4072 | mime |
| pinnwand/database.py | 7 | 4072 | python2 |
| pinnwand/error.py | 0 | 163 | mime |
| pinnwand/error.py | 159 | 163 | perl6 |
| pinnwand/http.py | 0 | 2475 | mime |
| pinnwand/http.py | 7 | 2475 | python2 |
| pinnwand/path.py | 0 | 315 | mime |
| pinnwand/path.py | 83 | 315 | python2 |
| pinnwand/utility.py | 0 | 3336 | mime |
| pinnwand/utility.py | 19 | 3336 | python2 |
| pinnwand/handler/__init__.py | 0 | 79 | mime |
| pinnwand/handler/__init__.py | 29 | 79 | python2 |
| pinnwand/handler/api_curl.py | 0 | 2202 | mime |
| pinnwand/handler/api_curl.py | 7 | 2202 | python2 |
| pinnwand/handler/api_deprecated.py | 0 | 5934 | mime |
| pinnwand/handler/api_deprecated.py | 7 | 5934 | python2 |
| pinnwand/handler/api_v1.py | 0 | 2788 | mime |
| pinnwand/handler/api_v1.py | 7 | 2788 | python2 |
| pinnwand/handler/website.py | 0 | 14536 | mime |
| pinnwand/handler/website.py | 7 | 14536 | python2 |
| pinnwand/page/about.rst | 0 | 349 | mime |
| pinnwand/page/about.rst | 14 | 349 | sql |
| pinnwand/page/expiry.rst | 0 | 112 | mime |
| pinnwand/page/expiry.rst | 16 | 112 | sql |
| pinnwand/page/removal.rst | 0 | 497 | mime |
| pinnwand/page/removal.rst | 18 | 497 | sql |
| pinnwand/static/pinnwand.css | 0 | 20093 | mime |
| pinnwand/static/pinnwand.css | 92 | 20093 | css+lasso |
| pinnwand/static/pinnwand.css | 198 | 20093 | tsql |
| pinnwand/static/pinnwand.css | 990 | 20093 | css+lasso |
| pinnwand/static/pinnwand.css | 6405 | 20093 | tsql |
| pinnwand/static/pinnwand.js | 0 | 2578 | mime |
| pinnwand/static/pinnwand.js | 171 | 2578 | sql |
| pinnwand/static/pinnwand.js | 1220 | 2578 | tsql |
| pinnwand/template/create.html | 0 | 2652 | mime |
| pinnwand/template/create.html | 10 | 2652 | django |
| pinnwand/template/create.html | 320 | 2652 | xml+django |
| pinnwand/template/error.html | 0 | 129 | mime |
| pinnwand/template/error.html | 10 | 129 | django |
| pinnwand/template/error.html | 85 | 129 | xml+django |
| pinnwand/template/layout.html | 0 | 733 | mime |
| pinnwand/template/layout.html | 58 | 733 | django |
| pinnwand/template/layout.html | 66 | 733 | xml+django |
| pinnwand/template/restructuredtextpage.html | 0 | 96 | mime |
| pinnwand/template/restructuredtextpage.html | 10 | 96 | django |
| pinnwand/template/restructuredtextpage.html | 85 | 96 | xml+django |
| pinnwand/template/show.html | 0 | 1680 | mime |
| pinnwand/template/show.html | 10 | 1680 | django |
| pinnwand/template/show.html | 250 | 1680 | xml+django |
| pinnwand/template/part/lexer-select.html | 0 | 637 | mime |
| pinnwand/template/part/lexer-select.html | 66 | 637 | xml |
| pinnwand/template/part/lexer-select.html | 130 | 637 | xml+django |
| test/test_command.py | 0 | 2396 | mime |
| test/test_command.py | 7 | 2396 | python2 |
| test/test_database.py | 0 | 264 | mime |
| test/test_database.py | 21 | 264 | python2 |
| test/test_http_api.py | 0 | 11677 | mime |
| test/test_http_api.py | 7 | 11677 | python2 |
| test/test_http_curl.py | 0 | 6977 | mime |
| test/test_http_curl.py | 7 | 6977 | python2 |
| test/test_http_website.py | 0 | 9284 | mime |
| test/test_http_website.py | 7 | 9284 | python2 |
| test/test_utility.py | 0 | 98 | mime |
| test/test_utility.py | 21 | 98 | python2 |

Basically I think we should just fall back to text when we find out it is either mime or tsql.
Also .travis.yml is not recognized as YAML, probably because the indent is off.
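
A minimal sketch of that fallback, applied to the alias returned by the guess. The names `UNRELIABLE_ALIASES` and `filter_guess` are hypothetical and only illustrate the idea; the sample inputs are the final guesses for three files from the table above.

```python
# Aliases that showed up as frequent false positives in the sample above.
UNRELIABLE_ALIASES = frozenset({"mime", "tsql"})


def filter_guess(alias, unreliable=UNRELIABLE_ALIASES, fallback="text"):
    """Return `fallback` when the guessed alias is one we do not trust."""
    return fallback if alias in unreliable else alias


# Final guesses taken from the table: .gitignore, mypy.ini, requirements.txt
for path, guessed in [(".gitignore", "tsql"), ("mypy.ini", "ini"), ("requirements.txt", "mime")]:
    print(path, "->", filter_guess(guessed))
```

This only hides the two known offenders; misdetections such as mysql or ssp for reStructuredText files would pass through unchanged.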

@mweinelt
Contributor Author

The same test on a per line basis:

| Path | Line | Total Lines | Lexer |
|------|------|-------------|-------|
| .flake8 | 0 | 6 | mime |
| .flake8 | 1 | 6 | tsql |
| .flake8 | 2 | 6 | ini |
| .gitignore | 0 | 119 | mime |
| .gitignore | 3 | 119 | tsql |
| .pre-commit-config.yaml | 0 | 27 | mime |
| .pre-commit-config.yaml | 8 | 27 | tsql |
| .pylintrc | 0 | 551 | mime |
| .pylintrc | 1 | 551 | tsql |
| .pylintrc | 2 | 551 | ini |
| .travis.yml | 0 | 14 | mime |
| .travis.yml | 1 | 14 | as3 |
| AUTHORS.rst | 0 | 6 | mime |
| AUTHORS.rst | 4 | 6 | sql |
| CHANGELOG.rst | 0 | 51 | mime |
| CHANGELOG.rst | 4 | 51 | mysql |
| README.rst | 0 | 133 | mime |
| README.rst | 1 | 133 | rst |
| README.rst | 29 | 133 | mysql |
| README.rst | 70 | 133 | ssp |
| default.nix | 0 | 24 | mime |
| default.nix | 1 | 24 | python2 |
| mypy.ini | 0 | 32 | mime |
| mypy.ini | 1 | 32 | tsql |
| mypy.ini | 2 | 32 | ini |
| pinnwand.service-example | 0 | 10 | mime |
| pinnwand.service-example | 1 | 10 | tsql |
| pinnwand.service-example | 2 | 10 | ini |
| pinnwand.toml-example | 0 | 39 | mime |
| pinnwand.toml-example | 1 | 39 | mysql |
| poetry.lock | 0 | 835 | mime |
| poetry.lock | 1 | 835 | tsql |
| poetry.lock | 2 | 835 | ini |
| pyproject.toml | 0 | 62 | mime |
| pyproject.toml | 2 | 62 | ini |
| requirements.txt | 0 | 29 | mime |
| guess.py | 0 | 42 | mime |
| guess.py | 1 | 42 | python |
| guess.py | 3 | 42 | python2 |
| guess-lines.py | 0 | 43 | mime |
| guess-lines.py | 1 | 43 | python |
| guess-lines.py | 3 | 43 | python2 |
| .git/hooks/applypatch-msg.sample | 0 | 16 | mime |
| .git/hooks/applypatch-msg.sample | 1 | 16 | bash |
| .git/hooks/commit-msg.sample | 0 | 25 | mime |
| .git/hooks/commit-msg.sample | 1 | 25 | bash |
| .git/hooks/fsmonitor-watchman.sample | 0 | 174 | mime |
| .git/hooks/fsmonitor-watchman.sample | 1 | 174 | perl |
| .git/hooks/post-update.sample | 0 | 9 | mime |
| .git/hooks/post-update.sample | 1 | 9 | bash |
| .git/hooks/pre-applypatch.sample | 0 | 15 | mime |
| .git/hooks/pre-applypatch.sample | 1 | 15 | bash |
| .git/hooks/pre-commit.sample | 0 | 50 | mime |
| .git/hooks/pre-commit.sample | 1 | 50 | bash |
| .git/hooks/pre-merge-commit.sample | 0 | 14 | mime |
| .git/hooks/pre-merge-commit.sample | 1 | 14 | bash |
| .git/hooks/pre-push.sample | 0 | 54 | mime |
| .git/hooks/pre-push.sample | 1 | 54 | bash |
| .git/hooks/pre-rebase.sample | 0 | 170 | mime |
| .git/hooks/pre-rebase.sample | 1 | 170 | bash |
| .git/hooks/pre-receive.sample | 0 | 25 | mime |
| .git/hooks/pre-receive.sample | 1 | 25 | bash |
| .git/hooks/prepare-commit-msg.sample | 0 | 43 | mime |
| .git/hooks/prepare-commit-msg.sample | 1 | 43 | bash |
| .git/hooks/update.sample | 0 | 129 | mime |
| .git/hooks/update.sample | 1 | 129 | bash |
| doc/autodoc.rst | 0 | 42 | mime |
| doc/autodoc.rst | 1 | 42 | rst |
| doc/changelog.rst | 0 | 2 | mime |
| doc/changelog.rst | 1 | 2 | rst |
| doc/conf.py | 0 | 178 | mime |
| doc/conf.py | 9 | 178 | sql |
| doc/conf.py | 15 | 178 | python2 |
| doc/configuration.rst | 0 | 122 | mime |
| doc/configuration.rst | 1 | 122 | rst |
| doc/configuration.rst | 5 | 122 | mysql |
| doc/index.rst | 0 | 66 | mime |
| doc/index.rst | 4 | 66 | mysql |
| doc/installation.rst | 0 | 135 | mime |
| doc/installation.rst | 1 | 135 | rst |
| doc/installation.rst | 6 | 135 | mysql |
| doc/tricks.rst | 0 | 35 | mime |
| doc/tricks.rst | 1 | 35 | rst |
| doc/tricks.rst | 6 | 35 | mysql |
| doc/tricks.rst | 30 | 35 | ssp |
| doc/usage.rst | 0 | 249 | mime |
| doc/usage.rst | 1 | 249 | rst |
| doc/usage.rst | 9 | 249 | mysql |
| doc/usage.rst | 115 | 249 | ssp |
| doc/_static/license.svg | 0 | 1 | mime |
| pinnwand/__init__.py | 0 | 1 | mime |
| pinnwand/__main__.py | 0 | 4 | mime |
| pinnwand/__main__.py | 1 | 4 | python2 |
| pinnwand/command.py | 0 | 116 | mime |
| pinnwand/command.py | 5 | 116 | python2 |
| pinnwand/configuration.py | 0 | 9 | mime |
| pinnwand/configuration.py | 3 | 9 | xml |
| pinnwand/database.py | 0 | 156 | mime |
| pinnwand/database.py | 1 | 156 | python2 |
| pinnwand/error.py | 0 | 6 | mime |
| pinnwand/error.py | 5 | 6 | perl6 |
| pinnwand/http.py | 0 | 83 | mime |
| pinnwand/http.py | 1 | 83 | python2 |
| pinnwand/path.py | 0 | 15 | mime |
| pinnwand/path.py | 2 | 15 | python2 |
| pinnwand/utility.py | 0 | 112 | mime |
| pinnwand/utility.py | 1 | 112 | python2 |
| pinnwand/handler/__init__.py | 0 | 2 | mime |
| pinnwand/handler/__init__.py | 1 | 2 | python2 |
| pinnwand/handler/api_curl.py | 0 | 71 | mime |
| pinnwand/handler/api_curl.py | 1 | 71 | python2 |
| pinnwand/handler/api_deprecated.py | 0 | 197 | mime |
| pinnwand/handler/api_deprecated.py | 1 | 197 | python2 |
| pinnwand/handler/api_v1.py | 0 | 96 | mime |
| pinnwand/handler/api_v1.py | 1 | 96 | python2 |
| pinnwand/handler/website.py | 0 | 449 | mime |
| pinnwand/handler/website.py | 1 | 449 | python2 |
| pinnwand/page/about.rst | 0 | 12 | mime |
| pinnwand/page/about.rst | 4 | 12 | sql |
| pinnwand/page/expiry.rst | 0 | 6 | mime |
| pinnwand/page/expiry.rst | 4 | 6 | sql |
| pinnwand/page/removal.rst | 0 | 7 | mime |
| pinnwand/page/removal.rst | 4 | 7 | sql |
| pinnwand/static/pinnwand.css | 0 | 605 | mime |
| pinnwand/static/pinnwand.css | 8 | 605 | css+lasso |
| pinnwand/static/pinnwand.css | 15 | 605 | tsql |
| pinnwand/static/pinnwand.css | 67 | 605 | css+lasso |
| pinnwand/static/pinnwand.css | 373 | 605 | tsql |
| pinnwand/static/pinnwand.js | 0 | 85 | mime |
| pinnwand/static/pinnwand.js | 5 | 85 | sql |
| pinnwand/static/pinnwand.js | 32 | 85 | tsql |
| pinnwand/template/create.html | 0 | 58 | mime |
| pinnwand/template/create.html | 1 | 58 | django |
| pinnwand/template/create.html | 9 | 58 | xml+django |
| pinnwand/template/error.html | 0 | 8 | mime |
| pinnwand/template/error.html | 1 | 8 | django |
| pinnwand/template/error.html | 4 | 8 | xml+django |
| pinnwand/template/layout.html | 0 | 26 | mime |
| pinnwand/template/layout.html | 4 | 26 | xml+django |
| pinnwand/template/restructuredtextpage.html | 0 | 8 | mime |
| pinnwand/template/restructuredtextpage.html | 1 | 8 | django |
| pinnwand/template/restructuredtextpage.html | 6 | 8 | xml+django |
| pinnwand/template/show.html | 0 | 51 | mime |
| pinnwand/template/show.html | 1 | 51 | django |
| pinnwand/template/show.html | 7 | 51 | xml+django |
| pinnwand/template/part/lexer-select.html | 0 | 13 | mime |
| pinnwand/template/part/lexer-select.html | 2 | 13 | xml |
| pinnwand/template/part/lexer-select.html | 3 | 13 | xml+django |
| test/__init__.py | 0 | 1 | mime |
| test/test_command.py | 0 | 90 | mime |
| test/test_command.py | 1 | 90 | python2 |
| test/test_database.py | 0 | 8 | mime |
| test/test_database.py | 1 | 8 | python2 |
| test/test_http_api.py | 0 | 430 | mime |
| test/test_http_api.py | 1 | 430 | python2 |
| test/test_http_curl.py | 0 | 257 | mime |
| test/test_http_curl.py | 1 | 257 | python2 |
| test/test_http_website.py | 0 | 324 | mime |
| test/test_http_website.py | 1 | 324 | python2 |
| test/test_utility.py | 0 | 6 | mime |
| test/test_utility.py | 1 | 6 | python2 |

@supakeen
Owner

supakeen commented May 30, 2020

Some interesting results here; it seems like we can do two or three things.

  1. Ignore the mime/tsql lexers as these are rarely used on big instances.
  2. Artificially score 'up' certain lexers
  3. Do it ourselves.

I'd say 1 or 2 have the preference. There's a 4th which would be based on filename but these are rarely supplied.

We likely also want to treat Python 2 as Python (3) by default.
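
Options 1 and 2, plus the Python 2 remapping, could be combined roughly as follows. This is a sketch under assumptions: `weighted_guess`, its parameters, and the `fake_lexer` helper are hypothetical names, and the candidates only need an `aliases` list and a class-level `analyse_text()`, which is the interface Pygments lexer classes expose.

```python
def weighted_guess(text, lexer_classes, excluded=("mime", "tsql"),
                   boosts=None, remap=None, fallback="text"):
    """Score candidates ourselves, skipping excluded aliases and applying boosts."""
    boosts = boosts or {}
    remap = {"python2": "python"} if remap is None else remap

    best_score, best_alias = 0.0, fallback
    for cls in lexer_classes:
        if not cls.aliases:
            continue
        alias = cls.aliases[0]
        if alias in excluded:
            # Option 1: ignore lexers that produce too many false positives.
            continue
        # Option 2: artificially score certain lexers up (or down).
        score = cls.analyse_text(text) * boosts.get(alias, 1.0)
        if score > best_score:
            best_score, best_alias = score, alias

    # Treat Python 2 as Python (3) by default.
    return remap.get(best_alias, best_alias)


def fake_lexer(alias, score):
    """Build a hypothetical lexer-like class that always returns `score`."""
    return type("Fake", (), {
        "aliases": [alias],
        "analyse_text": staticmethod(lambda text: score),
    })


candidates = [fake_lexer("tsql", 0.9), fake_lexer("ini", 0.5), fake_lexer("python2", 0.6)]
print(weighted_guess("import os\n", candidates))
print(weighted_guess("[section]\n", candidates, boosts={"ini": 2.0}))
```

The boost values would have to be tuned against real pastes; the numbers here are placeholders.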

mweinelt added a commit to mweinelt/pinnwand that referenced this issue May 30, 2020
mweinelt added a commit to mweinelt/pinnwand that referenced this issue May 30, 2020
mweinelt added a commit to mweinelt/pinnwand that referenced this issue May 31, 2020