SourceRange.prettyRange: escape sourceName #92

lehmacdj · 2022-03-05T21:59:43Z

Fixes #91 by escaping filenames. I think technically we would only have to escape - and %, but I escaped all of the special syntax to avoid inconsistency.

jgm · 2022-03-05T22:31:44Z

commonmark/src/Commonmark/Types.hs

+-- if the source name contains special characters it can lead to ambiguity when
+-- a filename exactly matches a fragment of syntax of the range
+escapeSourceName :: String -> String
+escapeSourceName = concatMap escapeChar
+  where
+    escapeChar '-' = "%-"
+    escapeChar '%' = "%%"
+    escapeChar ':' = "%:"
+    escapeChar ';' = "%;"
+    escapeChar x = [x]
+


I was actally thinking of percent encoding like in URIs.
But it could be just for the characters you mention. I think that's cleaner, because it's a recognizable format and people will know what's going on.
https://en.wikipedia.org/wiki/Percent-encoding

lehmacdj · 2022-03-06T00:37:43Z

Ah, I didn't think of that because at least - generally isn't escaped in URIs. Agree that using something more standard is better than rolling something custom here though for sure. I hand coded the encoding because I couldn't find a library that it would make sense to pull in to do this.

jgm · 2022-03-06T05:17:56Z

Hand-coding is the way to go; I want dependencies to be light.
It would be good to measure the performance impact of this, though. And it might turn out to be faster to check the string first for the special characters, to avoid the need to concatMap in most cases.
Am I right that we could get away with just escaping two characters, % and :? - will always occur after an unescaped :, right?

I'm still not quite sure whether this sort of ad hoc URI escaping is the right approach. Alternatively we could do backslash escapes. Maybe that would be more intuitive.

lehmacdj · 2022-03-12T22:16:00Z

I wrote some benchmarks for measuring the impact of the source name on parsing time:

Before (on commit fa3e4e5 with the last commit applied over it)

All
  name impact
    no special characters
      1:  OK (2.04s)
        135 ms ± 8.8 ms
      5:  OK (1.79s)
        254 ms ±  23 ms
      10: OK (3.44s)
        495 ms ±  14 ms
      20: OK (2.65s)
        883 ms ±  27 ms
    special characters
      1:  OK (2.03s)
        134 ms ± 7.0 ms
      5:  OK (1.81s)
        257 ms ±  14 ms
      10: OK (3.31s)
        470 ms ±  21 ms
      20: OK (2.66s)
        889 ms ±  60 ms

All 8 tests passed (19.76s)

On the most recent commit 09b9ebb

All
  name impact
    no special characters
      1:  OK (2.05s)
        136 ms ± 9.8 ms
      5:  OK (1.83s)
        260 ms ±  18 ms
      10: OK (1.53s)
        517 ms ±  33 ms
      20: OK (2.76s)
        919 ms ±  44 ms
    special characters
      1:  OK (2.39s)
        159 ms ± 9.5 ms
      5:  OK (2.75s)
        390 ms ±  23 ms
      10: OK (2.31s)
        770 ms ±  39 ms
      20: OK (8.70s)
        1.24 s ±  48 ms

All 8 tests passed (24.35s)

So it looks like performance doesn't degrade very heavily unless there are special characters in the filename. For reference, these tests are using source names of length 50 times the number for the test, while parsing sample.md also used for the generic parse test. A filename of length 250 is about as long as can be expected, and unless there are special characters in the string there is only a 5ms performance impact relative to before from this change.

I haven't tested variants to see if the overhead can be made lower (especially for filenames involving the special characters), but can maybe do that at somepoint in the future if you still think that is necessary.

Am I right that we could get away with just escaping two characters, % and :? - will always occur after an unescaped :, right?

Yes, I believe this would work. Only escaping a few of the characters seems a little bit adhoc to me, as it means that the meaning of - or ; is dependent on when there is a previously escaped : though, which feels a little bit gross.

Backslash escapes seem fine too.

lehmacdj · 2022-03-12T22:18:34Z

One more relevant comparison is parsing sample.md without extracting source location information which takes just 34ms:

All
  parse sample.md
    commonmark default: OK (2.25s)
       35 ms ± 3.1 ms

All 1 tests passed (2.25s)

jgm · 2022-03-13T00:37:11Z

Thanks for checking this. I'm not too worried about it if it doesn't significantly affect the most common case, where filenames don't have these special characters.

I guess it's just a matter of figuring out what to make escapable and what sort of escaping to use:

backslash escapes
percent escapes
percent-encoding like in URIs
render filename with show (in this case we'd also have to do it if the filename contains ")

jgm · 2022-03-13T00:38:30Z

Maybe backslash-escapes are the most straightforward. However, this would mean that backslashes in the filename would always be escaped (doubled); I suppose this could be an issue for Windows users?

jgm reviewed Mar 5, 2022

View reviewed changes

SourceRange.prettyRange: escape sourceName

50672f5

lehmacdj force-pushed the escape-sourceName branch from 9ec55b0 to 50672f5 Compare March 6, 2022 00:36

add benchmarks for SourceName escaping

09b9ebb

lehmacdj force-pushed the escape-sourceName branch from deb5672 to 09b9ebb Compare March 12, 2022 22:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SourceRange.prettyRange: escape sourceName #92

SourceRange.prettyRange: escape sourceName #92

lehmacdj commented Mar 5, 2022

jgm Mar 5, 2022

lehmacdj commented Mar 6, 2022

jgm commented Mar 6, 2022

lehmacdj commented Mar 12, 2022

lehmacdj commented Mar 12, 2022

jgm commented Mar 13, 2022

jgm commented Mar 13, 2022

SourceRange.prettyRange: escape sourceName #92

Are you sure you want to change the base?

SourceRange.prettyRange: escape sourceName #92

Conversation

lehmacdj commented Mar 5, 2022

jgm Mar 5, 2022

Choose a reason for hiding this comment

lehmacdj commented Mar 6, 2022

jgm commented Mar 6, 2022

lehmacdj commented Mar 12, 2022

lehmacdj commented Mar 12, 2022

jgm commented Mar 13, 2022

jgm commented Mar 13, 2022