URL detection result of find method changed in v3? #351

Closed
sunadoi opened this issue Oct 6, 2021 · 3 comments · Fixed by #353 or #354
Labels: i18n (Internationalization), pending-merge

sunadoi commented Oct 6, 2021

When using the find method to detect URLs, I found that the detection results differ in v3 when there is no space before or after the URL (a minimal reproduction sketch follows the examples below).

v2

// URL
foo http://example.com bar
foo http://example.combar
foohttp://example.com bar
foohttp://example.combar
テストhttp://example.comテスト

v3

// URL
foo http://example.com bar
foo http://example.combar
foohttp://example.com bar

// Not URL
foohttp://example.combar
テストhttp://example.comテスト
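For reference, this is how the difference shows up with linkify's find (a minimal sketch; it assumes the standard find export, with match counts taken from the report above):

import { find } from 'linkifyjs';

find('foo http://example.com bar').length;    // 1 in both v2 and v3
find('foohttp://example.com bar').length;     // 1 in both v2 and v3
find('foohttp://example.combar').length;      // 1 in v2, 0 in v3
find('テストhttp://example.comテスト').length;  // 1 in v2, 0 in v3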

Is this expected behavior?
If so, I would like to see the following fix, even if it only applies to multi-byte characters, because we often write like this in Japanese.

// URL
foo http://example.com bar
foo http://example.combar
foohttp://example.com bar
テストhttp://example.comテスト

// Not URL
foohttp://example.combar

ref: #315

nfrasser (Owner) commented Oct 7, 2021

Hi @sunadoi, thanks for reporting.

The reasons for this regression in v3 are a bit complex, and relate to the extended parsing I added to support Internationalized Domain Names (IDN). The parser now recognizes テスト as a word, whereas in v2 those characters were treated as unknown symbols. The parser is greedy (it tries to identify the longest possible tokens, without backtracking), and since there is no delimiting whitespace it treats テストhttp as a single word and the rest as an invalid URL.
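To illustrate the greedy scan (a hypothetical, simplified sketch; this is not linkify's actual scanner): once テスト counts as word characters, a longest-match pass consumes テストhttp as one token and never backtracks to recover the http scheme.

// Hypothetical sketch of a greedy word scanner (not linkify's real code)
const isWordChar = (ch) => /[\p{L}\p{N}]/u.test(ch);

function nextToken(input, i) {
  if (!isWordChar(input[i])) return { type: 'symbol', value: input[i], end: i + 1 };
  let j = i;
  while (j < input.length && isWordChar(input[j])) j++; // longest match, no backtracking
  return { type: 'word', value: input.slice(i, j), end: j };
}

nextToken('テストhttp://example.com', 0);
// -> { type: 'word', value: 'テストhttp', end: 7 }
// With テストhttp consumed as a single word, no token sequence can start a URL here.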

I believe I can fix this by making a distinction in the parser between ASCII words and non-ASCII words. Unfortunately, because of the ambiguity in examples like these, the best I can get with this plugin is the following (I used {{}} to mark which portions of text will be identified as links; a rough test sketch follows the examples):

foo {{http://example.com}} bar
foo {{http://example.combar}}
foohttp://{{example.com}} bar
テスト{{http://example.comテスト}}
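Written as a rough test sketch against find (hypothetical expectations; assumes match objects expose a value string):

import { find } from 'linkifyjs';

find('foo http://example.com bar')[0].value;    // 'http://example.com'
find('foo http://example.combar')[0].value;     // 'http://example.combar'
find('foohttp://example.com bar')[0].value;     // 'example.com' (scheme absorbed by the fused word)
find('テストhttp://example.comテスト')[0].value; // 'http://example.comテスト'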

I hope that works for you, because unfortunately I cannot think of a good strategy to cover all edge cases like this.

nfrasser self-assigned this Oct 7, 2021
nfrasser added this to the 3.0 milestone Oct 7, 2021
sunadoi (Author) commented Oct 8, 2021

@nfrasser

Thank you for your kind explanation.
The fix you suggested works for me.
I'm looking forward to it 😄

nfrasser (Owner) commented
Fixed in the latest v4 release.
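A quick check against v4 (assuming the same find API; the expected output follows the fix described in this thread):

import { find } from 'linkifyjs'; // v4

find('テストhttp://example.comテスト').map((m) => m.value);
// -> ['http://example.comテスト']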
