Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SourceRange.prettyRange: escape sourceName #92

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

lehmacdj
Copy link

@lehmacdj lehmacdj commented Mar 5, 2022

Fixes #91 by escaping filenames. I think technically we would only have to escape - and %, but I escaped all of the special syntax to avoid inconsistency.

Comment on lines 142 to 152
-- if the source name contains special characters it can lead to ambiguity when
-- a filename exactly matches a fragment of syntax of the range
escapeSourceName :: String -> String
escapeSourceName = concatMap escapeChar
where
escapeChar '-' = "%-"
escapeChar '%' = "%%"
escapeChar ':' = "%:"
escapeChar ';' = "%;"
escapeChar x = [x]

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actally thinking of percent encoding like in URIs.
But it could be just for the characters you mention. I think that's cleaner, because it's a recognizable format and people will know what's going on.
https://en.wikipedia.org/wiki/Percent-encoding

@lehmacdj lehmacdj force-pushed the escape-sourceName branch from 9ec55b0 to 50672f5 Compare March 6, 2022 00:36
@lehmacdj
Copy link
Author

lehmacdj commented Mar 6, 2022

Ah, I didn't think of that because at least - generally isn't escaped in URIs. Agree that using something more standard is better than rolling something custom here though for sure. I hand coded the encoding because I couldn't find a library that it would make sense to pull in to do this.

@jgm
Copy link
Owner

jgm commented Mar 6, 2022

Hand-coding is the way to go; I want dependencies to be light.
It would be good to measure the performance impact of this, though. And it might turn out to be faster to check the string first for the special characters, to avoid the need to concatMap in most cases.
Am I right that we could get away with just escaping two characters, % and :? - will always occur after an unescaped :, right?

I'm still not quite sure whether this sort of ad hoc URI escaping is the right approach. Alternatively we could do backslash escapes. Maybe that would be more intuitive.

@lehmacdj lehmacdj force-pushed the escape-sourceName branch from deb5672 to 09b9ebb Compare March 12, 2022 22:14
@lehmacdj
Copy link
Author

I wrote some benchmarks for measuring the impact of the source name on parsing time:

Before (on commit fa3e4e5 with the last commit applied over it)
All
  name impact
    no special characters
      1:  OK (2.04s)
        135 ms ± 8.8 ms
      5:  OK (1.79s)
        254 ms ±  23 ms
      10: OK (3.44s)
        495 ms ±  14 ms
      20: OK (2.65s)
        883 ms ±  27 ms
    special characters
      1:  OK (2.03s)
        134 ms ± 7.0 ms
      5:  OK (1.81s)
        257 ms ±  14 ms
      10: OK (3.31s)
        470 ms ±  21 ms
      20: OK (2.66s)
        889 ms ±  60 ms

All 8 tests passed (19.76s)
On the most recent commit 09b9ebb
All
  name impact
    no special characters
      1:  OK (2.05s)
        136 ms ± 9.8 ms
      5:  OK (1.83s)
        260 ms ±  18 ms
      10: OK (1.53s)
        517 ms ±  33 ms
      20: OK (2.76s)
        919 ms ±  44 ms
    special characters
      1:  OK (2.39s)
        159 ms ± 9.5 ms
      5:  OK (2.75s)
        390 ms ±  23 ms
      10: OK (2.31s)
        770 ms ±  39 ms
      20: OK (8.70s)
        1.24 s ±  48 ms

All 8 tests passed (24.35s)

So it looks like performance doesn't degrade very heavily unless there are special characters in the filename. For reference, these tests are using source names of length 50 times the number for the test, while parsing sample.md also used for the generic parse test. A filename of length 250 is about as long as can be expected, and unless there are special characters in the string there is only a 5ms performance impact relative to before from this change.

I haven't tested variants to see if the overhead can be made lower (especially for filenames involving the special characters), but can maybe do that at somepoint in the future if you still think that is necessary.

Am I right that we could get away with just escaping two characters, % and :? - will always occur after an unescaped :, right?

Yes, I believe this would work. Only escaping a few of the characters seems a little bit adhoc to me, as it means that the meaning of - or ; is dependent on when there is a previously escaped : though, which feels a little bit gross.

Backslash escapes seem fine too.

@lehmacdj
Copy link
Author

One more relevant comparison is parsing sample.md without extracting source location information which takes just 34ms:

All
  parse sample.md
    commonmark default: OK (2.25s)
       35 ms ± 3.1 ms

All 1 tests passed (2.25s)

@jgm
Copy link
Owner

jgm commented Mar 13, 2022

Thanks for checking this. I'm not too worried about it if it doesn't significantly affect the most common case, where filenames don't have these special characters.

I guess it's just a matter of figuring out what to make escapable and what sort of escaping to use:

  • backslash escapes
  • percent escapes
  • percent-encoding like in URIs
  • render filename with show (in this case we'd also have to do it if the filename contains ")

@jgm
Copy link
Owner

jgm commented Mar 13, 2022

Maybe backslash-escapes are the most straightforward. However, this would mean that backslashes in the filename would always be escaped (doubled); I suppose this could be an issue for Windows users?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

show :: SourceRange -> String is ambiguous
2 participants