
Non-RegExp patterns #2595

Open
RunDevelopment opened this issue Oct 16, 2020 · 7 comments

Comments

@RunDevelopment
Member

RunDevelopment commented Oct 16, 2020

Motivation

Regexes are easy to use and powerful enough for most of the highlighting problems we face. However, sometimes regexes are not powerful enough.

Since some tokens can only be matched by a context-free grammar, we have usually resorted to a number of tricks, ranging from supporting only a finite number of recursion steps to relying on the usual surroundings of the token. We use these tricks out of necessity, but what we really need in those cases is something more powerful than regexes.

Description

I want to propose that we relax grammars to allow regex-like objects. The basic idea is that Prism's matching algorithm only uses the exec and lastIndex properties of RegExp objects, so any object that implements those properties can be used.

Speaking in types, I want to change:

interface GrammarToken {
  pattern: RegExp;
  ...
}
interface Grammar {
  [name: string]: RegExp | GrammarToken | Array<RegExp | GrammarToken>;
}

to:

interface RegExpLike { // RegExp trivially implements this interface
  lastIndex: number;
  exec(value: string): RegExpExecArray | null;
}
interface GrammarToken {
  pattern: RegExpLike;
  ...
}
interface Grammar {
  [name: string]: RegExpLike | GrammarToken | Array<RegExpLike | GrammarToken>;
}

This will allow us to implement custom matchers that can be more powerful than regexes.
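For illustration (a minimal sketch of my own, not something any grammar actually ships with), even a hand-written object satisfying RegExpLike would do, e.g. a matcher for balanced (...) groups, which no plain regex can express in general:

const balancedParens: RegExpLike = {
	lastIndex: 0,
	exec(value: string): RegExpExecArray | null {
		// try every "(" from lastIndex onwards as a potential start
		for (let start = value.indexOf("(", this.lastIndex); start !== -1; start = value.indexOf("(", start + 1)) {
			let depth = 0;
			for (let i = start; i < value.length; i++) {
				if (value[i] === "(") {
					depth++;
				} else if (value[i] === ")" && --depth === 0) {
					const result = [value.slice(start, i + 1)] as RegExpExecArray;
					result.index = start;
					result.input = value;
					this.lastIndex = i + 1;
					return result;
				}
			}
		}
		this.lastIndex = 0;
		return null;
	}
};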

Required changes to Prism Core

The only thing we have to change is how we try to enable the global flag here.

Edit: This idea requires no changes to Core if we require a global: true property in the RegExpLike interface.
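For reference, here is a rough sketch (the names are my own; this is not the actual Core source) of what that check amounts to: Core only needs to make sure lastIndex is honoured, i.e. that the pattern behaves like a regex with the g flag.

function ensureGlobal(pattern: RegExp | RegExpLike): RegExp | RegExpLike {
	if (pattern instanceof RegExp && !pattern.global) {
		// plain RegExp without /g: rebuild it so lastIndex is respected
		return new RegExp(pattern.source, pattern.flags + "g");
	}
	// RegExpLike objects would guarantee global semantics themselves,
	// e.g. via the required `global: true` property mentioned in the edit above
	return pattern;
}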

@Golmote
Contributor

Golmote commented Oct 17, 2020

Hi! I like how this is presumably such a small change in the core mechanism while still giving the grammars more power.

I quickly scanned through the original issue that led to this idea. But could you maybe give an explicit example of the problem it would solve and what the exec function would look like?

@RunDevelopment
Member Author

But could you maybe give an explicit example of the problem it would solve and what the exec function would look like?

Yes, I'll add an example of nested XML today. I originally wanted to add it yesterday but I gotta sleep at some point.

@RunDevelopment
Member Author

RunDevelopment commented Oct 17, 2020

Before I show the example: Implementing the exec function isn't easy because of the required lastIndex logic. This logic has to be implemented by all regex-like matchers, so I moved it to its own function:

toRegExpLike:
function toRegExpLike(matchFn: (input: string, offset: number) => RegExpExecArray | null): RegExpLike {
	return {
		global: true,
		lastIndex: 0,
		exec(value: string): RegExpExecArray | null {
			// sanitize lastIndex the same way RegExp.prototype.exec does:
			// non-numbers, NaN (lastIndex !== lastIndex), and negative values become 0
			let lastIndex: number = this.lastIndex;
			if (typeof lastIndex !== "number" || lastIndex !== lastIndex || lastIndex < 0) {
				lastIndex = 0;
			}
			lastIndex = Math.floor(lastIndex);

			if (lastIndex <= value.length) {
				const result = matchFn(value, lastIndex);
				if (result) {
					// advance lastIndex past the match, like a /g regex would
					this.lastIndex = result.index + result[0].length;
					return result;
				}
			}

			// no match: reset lastIndex, again mirroring RegExp behavior
			this.lastIndex = 0;
			return null;
		}
	}
}

This function allows us to implement matchers as pure functions.
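For instance (a made-up example, not something I intend to ship), a pure matcher for nested /* ... */ comments, which plain regexes cannot handle, could be wrapped like this:

const nestedComment = toRegExpLike((input, offset) => {
	const start = input.indexOf("/*", offset);
	if (start === -1) {
		return null;
	}
	let depth = 0;
	for (let i = start; i < input.length - 1; i++) {
		if (input[i] === "/" && input[i + 1] === "*") {
			depth++;
			i++; // skip the "*" we just consumed
		} else if (input[i] === "*" && input[i + 1] === "/") {
			i++; // skip the "/" we just consumed
			if (--depth === 0) {
				const result = [input.slice(start, i + 1)] as RegExpExecArray;
				result.index = start;
				result.input = input;
				return result;
			}
		}
	}
	return null; // unterminated comment
});

nestedComment.exec("a /* b /* c */ d */ e"); // matches "/* b /* c */ d */"

The toRegExpExecArray helper below would shorten the array construction.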

Here is the promised example for nested XML structures:

Example:
function toRegExpExecArray(matches: string[], input: string, index: number): RegExpExecArray {
	const array = matches as RegExpExecArray;
	array.input = input;
	array.index = index;
	return array;
}

const enum TagType {
	OPEN,
	CLOSE,
	SELF_CLOSE,
}
function getTagType(tag: string): TagType {
	if (tag[1] === "/") {
		// </ ...>
		return TagType.CLOSE;
	} else if (tag[tag.length - 2] === "/") {
		// <... />
		return TagType.SELF_CLOSE;
	} else {
		// <...>
		return TagType.OPEN;
	}
}

function matchXML(input: string, offset: number): RegExpExecArray | null {
	// matches a single XML tag: opening, closing, or self-closing
	const tag = /<\/?(?!\d)([^\s>\/=$<%]+)(?:\s(?:\s*[^\s>\/=]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s'">=]+(?=[\s>]))|(?=[\s/>])))+)?\s*\/?>/g;

	tag.lastIndex = offset;
	let initialMatch = tag.exec(input);
	while (initialMatch) {
		let type = getTagType(initialMatch[0]);
		if (type === TagType.SELF_CLOSE) {
			return initialMatch;
		} else if (type === TagType.OPEN) {
			// keep a stack of currently open tags until the initial tag is closed again
			const stack: RegExpExecArray[] = [initialMatch];
			let m;
			while ((m = tag.exec(input))) {
				type = getTagType(m[0]);

				if (type === TagType.OPEN) {
					stack.push(m);
				} else if (type === TagType.CLOSE) {
					const open = stack.pop();
					if (open[1] !== m[1]) {
						// mismatched tag names. Abort
						break;
					}
					if (stack.length === 0) {
						// we just found the closing tag to the initial opening tag
						const matches = [
							input.substring(open.index, m.index + m[0].length) // whole match
						];
						return toRegExpExecArray(matches, input, open.index);
					}
				}
			}
		}

		// move pattern to the next index
		tag.lastIndex = initialMatch.index + 1;
		initialMatch = tag.exec(input);
	}

	return null;
}

A regex-like object is created by calling toRegExpLike(matchXML).

Example usage in NodeJS shell
> var r = toRegExpLike(matchXML)
undefined
> r.exec(`foo <foo> <br/> </foo> <div/> <<BAZ/> </g> <fan> <b> </fan> </b>`)

[
  '<foo> <br/> </foo>',
  input: 'foo <foo> <br/> </foo> <div/> <<BAZ/> </g> <fan> <b> </fan> </b>',
  index: 4
]
> r.exec(`foo <foo> <br/> </foo> <div/> <<BAZ/> </g> <fan> <b> </fan> </b>`)

[
  '<div/>',
  'div',
  index: 23,
  input: 'foo <foo> <br/> </foo> <div/> <<BAZ/> </g> <fan> <b> </fan> </b>',
  groups: undefined
]
> r.exec(`foo <foo> <br/> </foo> <div/> <<BAZ/> </g> <fan> <b> </fan> </b>`)

[
  '<BAZ/>',
  'BAZ',
  index: 31,
  input: 'foo <foo> <br/> </foo> <div/> <<BAZ/> </g> <fan> <b> </fan> </b>',
  groups: undefined
]
> r.exec(`foo <foo> <br/> </foo> <div/> <<BAZ/> </g> <fan> <b> </fan> </b>`)

null
> r.exec(`foo <foo> <br/> </foo> <div/> <<BAZ/> </g> <fan> <b> </fan> </b>`)

[
  '<foo> <br/> </foo>',
  input: 'foo <foo> <br/> </foo> <div/> <<BAZ/> </g> <fan> <b> </fan> </b>',
  index: 4
]

Both toRegExpLike and toRegExpExecArray are reusable across matchers.


Even though I'm the one proposing it, I don't like my current matcher implementation. It's not as declarative as regexes and fairly complex, which makes it easy for bugs to hide. Ideally, we would have some way to create the AST of a CF grammar and a function/class that creates a matcher for that grammar.
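To make that a bit more concrete, the declarative side could look roughly like this (just a sketch of the shape; none of these names exist anywhere yet):

type Rule =
	| { kind: "regex"; pattern: RegExp }   // terminal
	| { kind: "sequence"; rules: Rule[] }  // A B C
	| { kind: "choice"; rules: Rule[] }    // A | B | C
	| { kind: "ref"; name: string };       // reference to another rule; this is what enables recursion

interface CFGrammar {
	start: string;
	rules: { [name: string]: Rule };
}

// hypothetical factory; generating an efficient matcher is the actual hard part
declare function createMatcher(grammar: CFGrammar): RegExpLike;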

I also want to point out that these matchers have a worst-case complexity of Ω(n^2) assuming that the formal language the matcher accepts is CF.

@karlhorky
Contributor

Interesting approach! I don't know enough about building tooling on language grammars to know what could help here, but I do wonder if some inspiration could come from prior art, such as the TextMate grammar approach? (e.g. https://macromates.com/manual/en/language_grammars)

Shiki uses these for highlighting, such as this TSX language definition: https://github.com/shikijs/shiki/blob/master/packages/languages/data/tsx.tmLanguage.json

Repl.it demo: https://repl.it/repls/GroundedEmbellishedEmbeds

@mAAdhaTTah
Member

I think the main question I have about this, and this applies broadly to a lot of (great!) suggestions you have for Prism, is the tradeoff between bundle size & correctness. My understanding is the original implementation was done w/ regex because it would enable us to keep the project small while producing relatively accurate (but not perfect) highlighting. If we start taking advantage of these sorts of things, we potentially start blowing up the size of the language definition in order to improve highlighting correctness.

That isn't to say I'm opposed to this change, but it means we should be aware of what the trade-offs are here. I think it would also be helpful, in advance of something like this, to add a GH bot that tells us the bundle size impacts of our changes so we can discuss those trade-offs with data as they come in.

@RunDevelopment
Member Author

the tradeoff between bundle size & correctness.

Yeah. I usually value correctness more than bundle size, so please do hold me back if I went overboard.

That being said, when I propose (or implement) these ideas that are motivated by one or a few specific cases (e.g. #2190, #2002), I usually think in terms of how well the idea will scale. I.e. if it took 1kb to implement logic that saves 100 bytes every time it's used, then it might be worth it.

I also want to say that most of these ideas are motivated by cutting down complexity. This one is too. This might seem contradictory because implementing these matchers as described above is no easy feat.
My end game with this issue is to implement something like an LL parser. We would supply a CF grammar and the matcher would be generated for us. Easy to use and as declarative as regexes. Implementing this won't be easy, but we only have to do it once. I also intend this to replace constructs like this that I created out of necessity.

add a GH bot that tells us the bundle size impacts of our changes so we can discuss those tradeoffs with data as they come in.

#1787

@mAAdhaTTah
Member

@RunDevelopment Duh, yes, I knew I had an issue about that. Let me look into doing that soon. Maybe we should also consider migrating from TravisCI to GH Actions in the process.
