Specifying a default token #88
I think we're going to need an example I'm afraid :)
Updated the example
I think our recommendation is to use negative lookahead here. You could use an

I don't think it's trivial to extend Moo to support a "default token", because of how RegExp-based tokenizers work. 😕
I have a little trouble locating the section with the logic that triggers "invalid syntax" in the end, so perhaps you'll forgive my ignorance in the next paragraph :). Wouldn't it be possible to add all tokens that have no matching rule to a 'defaultToken' buffer until a token does match a rule, and store it before the new rule is tokenized?
I'm afraid I'm not sure what you mean! Does negative lookahead not work for you?
It has to be a double negative lookahead, since I already use one to find all connected characters except `$`, `${` or `#`. I'm afraid my regex foo is not up to that task. See here: https://regex101.com/r/rHZNr0/1

I'll try to explain my train of thought a little better. When the regex tokenizer finds a character it has no match for, it throws the "invalid syntax" error, right? At that point I'd at least have access to the offset of the offending character, so I should be able to store the offending character and advance the stream. Store the characters that match no rule until a token is found that's part of a rule, then tokenize the stored characters as a default token (`text` in my case). The stream would never throw an "invalid syntax" error since all characters are valid when a
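For what it's worth, that buffering idea can be sketched in plain JavaScript, independent of Moo's actual internals (the `tokenize` function, the rule list, and the `text` fallback name below are all made up for illustration): try each rule's sticky regex at the current offset, and collect any characters no rule matches into a buffer that is flushed as a single default token.

```javascript
// Minimal illustration of a "default token" fallback, independent of Moo.
// Each rule is tried at the current offset with a sticky regex; characters
// no rule matches are buffered and emitted as a single 'text' token.
function tokenize(input, rules, defaultType = 'text') {
  const tokens = [];
  let pos = 0;
  let buffer = '';
  const flush = () => {
    if (buffer) {
      tokens.push({ type: defaultType, value: buffer, offset: pos - buffer.length });
      buffer = '';
    }
  };
  outer: while (pos < input.length) {
    for (const [type, re] of rules) {
      re.lastIndex = pos;           // sticky regexes anchor at the offset
      const m = re.exec(input);
      if (m) {
        flush();                    // emit buffered text before the real token
        tokens.push({ type, value: m[0], offset: pos });
        pos += m[0].length;
        continue outer;
      }
    }
    buffer += input[pos];           // no rule matched: buffer one character
    pos += 1;
  }
  flush();
  return tokens;
}

// Hypothetical rule set: only dollar signs and integers are "real" tokens.
const rules = [
  ['dollar', /\$/y],
  ['int', /\d+/y],
];
```

This never throws "invalid syntax"; anything unmatched simply becomes text. The trade-off the maintainers mention is real, though: a character-at-a-time fallback defeats the single-combined-RegExp strategy that makes Moo fast.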
@tjvr I believe this is just asking for a mode that switches from

@moranje instead of trying to use double-negative lookahead, try making escape sequences a separate token; then you can just add their syntax (e.g.,
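To illustrate the separate-escape-token suggestion with plain sticky regexes (this is a toy, not Moo itself; the token names are mine): once `\\.` is its own rule tried before the text rule, the text rule never needs lookbehind, only a negative lookahead for `$` and `\`.

```javascript
// Sketch of "escape as its own token". Because '\\.' consumes an escape
// pair first, the text rule just stops before any bare '$' or backslash —
// no lookbehind required.
const patterns = [
  ['escape', /\\./y],                // e.g. \$ or \\ — tried first
  ['dollar', /\$/y],
  ['text', /(?:(?![$\\])[^])+/y],    // run of anything except $ or backslash
];

function lex(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    const before = pos;
    for (const [type, re] of patterns) {
      re.lastIndex = pos;
      const m = re.exec(input);
      if (m) {
        tokens.push({ type, value: m[0] });
        pos += m[0].length;
        break;
      }
    }
    if (pos === before) throw new Error('invalid syntax at ' + pos);
  }
  return tokens;
}
```

A postprocessor can then join adjacent `text` and unescaped `escape` tokens back into one text node, which is what the grammar below does.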
@nathan Thanks for helping out. The problem is that I really need a negative lookbehind (which will be coming to the ES spec) so the negative lookahead is really a hack to a sort of negative lookbehind. I can't wrap my head around how tokenizing the escape sequence would help me I've been trying another thing to get around this which didn't work either Create my own default token category |
@moranje Use the EBNF qualifiers; that's what they're there for.

```
@{%
const moo = require('moo')

const lexer = moo.states({
  main: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    escape: /\\./,
    text: { match: /(?:(?!\$|\\.)[^])+/, lineBreaks: true },
  },
  nested: {
    int: { match: /\d+/, next: 'nested2' },
  },
  nested2: {
    colon: { match: ':', next: 'nested3' },
    close: { match: /}/, pop: 1 },
  },
  nested3: {
    text: { match: /(?:(?![$}]|\\.)[^])+/, lineBreaks: true },
    escape: /\\./,
    open: { match: /\${/, push: 'nested' },
    close: { match: /}/, pop: 1 },
  },
  unnested: {
    int: { match: /\d+/, pop: 1 },
  },
})
%}

@lexer lexer

expression
  -> text (subst text):* {% ([first, rest]) => [first, ...[].concat(...rest)] %}

text
  -> tp:* {% ([parts]) => ({type: 'text', data: parts.join('')}) %}

tp
  -> %text {% ([t]) => t %}
   | %escape {% ([e]) => e.value.charAt(1) %}

subst
  -> %dollar %int {% ([_, data]) => ({type: 'subst', data: +data.value}) %}
   | %open %int (%colon expression):? %close {% ([_, data, alt]) => ({type: 'subst', data: +data.value, alternate: alt && alt[1]}) %}
```
I can't for the life of me figure out how to make this work... I have tried every conceivable way to parse the following text.

I know this isn't really related to the issue anymore, but I'll share the code with you; maybe I can lean on your knowledge to close this issue in favour of the alternative you provided, if I can get it working. Thanks for your patience, both of you.

```
# grammer.ne
@preprocessor typescript

@{%
import lexer from './lexer';
import {
  snippet,
  tabstop,
  placeholder,
  text as textNode,
  escaped
} from './grammer-helper';
%}

# Use moo tokenizer
@lexer lexer

# Snippet ::= Element:+
Snippet ->
  Element:+ {% snippet %}

# Element ::= Tabstop | Placeholder | Text
Element ->
  Tabstop {% id %}
  | Placeholder {% id %}
  | Text {% id %}

# Tabstop ::= "$" [0-9]+
Tabstop ->
  # No modifier
  %dollar %int {% tabstop %}

# Placeholder ::= "${" ( [0-9]+ ":" Snippet ) "}"
Placeholder ->
  %open %int %colon Snippet %close {% placeholder %}

# Text ::= .:+
Text ->
  TextPartial:+ {% textNode %}

TextPartial ->
  %text {% id %}
  | %escape {% escaped %}
```
```ts
// lexer.ts
import moo from 'moo';

const lexer = moo.states({
  main: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    escape: /\\./,
    // Matches any character except "$" and "\."
    text: { match: /(?:(?!\$|\\.)[^])+/, lineBreaks: true }
  },
  unnested: {
    int: { match: /[0-9]+/, pop: true }
  },
  nested: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    colon: { match: /:/, next: 'args' },
    int: /[0-9]+/
  },
  args: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    close: { match: /\}/, pop: true },
    colon: /:/,
    escape: /\\./,
    // Matches any character except "$", "\.", ":" and "}"
    text: { match: /(?:(?!\$|\\.|:|\})[^])+/, lineBreaks: true }
  }
});

export default lexer;
```
```ts
// grammer-helper.ts
import { cloneDeep } from 'lodash';
import uniqid from 'uniqid';

let tracker = {
  list: [],
  queue: []
};

// *********************************
// * Main functions
// *********************************

export function snippet([body]) {
  // Prevent the tracker object from being reused between parsing calls
  let copy = cloneDeep(tracker);
  tracker.list = [];
  tracker.queue = [];

  return {
    type: 'Snippet',
    body,
    tracker: copy
  };
}

export function tabstop([dollar, int]) {
  let trackerId = track('list');

  return astNode(
    'Tabstop',
    {
      trackerId,
      modifier: null,
      int: int.value
    },
    dollar,
    int
  );
}

export function placeholder([open, int, colon, snippet, close]) {
  let trackerId = track('list');
  storeNestedTrackers(snippet.tracker);

  return astNode(
    'Placeholder',
    {
      trackerId,
      modifier: null,
      int: int.value,
      body: snippet.body
    },
    open,
    close
  );
}

export function text([text]) {
  let first = text[0];
  let last = text[text.length - 1];

  return astNode(
    'Text',
    {
      value: escape(text.map(partial => partial.value).join(''))
    },
    first,
    last
  );
}

export function escaped([escaped]) {
  // Unescape token
  return Object.assign(escaped, { value: escaped.value.charAt(1) });
}

// *********************************
// * Helper functions
// *********************************

function track(type: string) {
  let trackerId = uniqid();
  tracker[type].push(trackerId);
  return trackerId;
}

function storeNestedTrackers(nestedTracker) {
  tracker.list.push(...nestedTracker.list);
  tracker.queue.push(...nestedTracker.queue);
}

function astNode(type, object: any, first, last?) {
  if (!last) last = first;
  return Object.assign({ type }, object, location(first, last));
}

function location(start, end) {
  return {
    start: start.offset,
    end: end.offset + end.value.length,
    loc: {
      start: {
        line: start.line,
        column: start.col
      },
      end: {
        line: end.line,
        column: end.col + end.value.length
      }
    }
  };
}

function escape(json: string) {
  return json
    .replace(/\n/g, '\\n')
    .replace(/\t/g, '\\t')
    .replace(/\"/g, '\\"');
}
```
The example I posted above parses that text:

```
[ [ { type: 'text', data: 'Text \\' },
    { type: 'subst', data: 1 },
    { type: 'text', data: ' Text.' } ] ]
```

It shouldn't be too hard to modify your own code to do the same thing.
Closing this, as the workaround @nathan supplied solved my grievances (for which I am grateful). I still feel that a default token is the more elegant solution to this problem. Again, thanks for your help!

Martien
Reopening (#98). |
Hi, I was wondering if there is a way to have a default token (similar to VSCode's Monarch), or whether you'd consider implementing one.
I have written a tokenizer for a text-based language, where every character that isn't matched by a rule is (part of) a text token. However, it's hard to express "anything not tokenized" in a regular expression.
So something along the lines of
I'd be willing to give a more in-depth example if that would help, but since this is a conceptual question I think this may be enough.
Use case
I have written a parser for a snippet language, which might look like this.
The tokenizer to accomplish this looks like this.
What I'm trying to accomplish is having everything that's not matched become one single token until one of my other rules matches, without the catch-all rule "eating away" at my other rules. So with the above rule in place I expect to get the following text tokens.
So far so good. This works for now. However if a user of this language would like to use a regular dollar sign in his text, I would like a way to tokenize escaped dollars as text. See example:
The latter is hard to do without breaking the first requirement, but much easier when you have a `defaultToken` option (with negative lookbehind or lookahead).

Martien
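For what it's worth, a lookahead-only pattern can come close when escapes are folded into the same text rule. The sketch below is plain JavaScript, not Moo's API; `splitText` and the token names are made up for illustration. The alternation `\\.|(?!\$)[^]` consumes an escape pair (so `\$` stays inside text) or any character that is not a bare `$`:

```javascript
// Lookahead-only approximation of a catch-all text rule that still lets
// users escape dollar signs: '\\.' eats an escape pair, '(?!\$)[^]' eats
// any other character except an unescaped '$'.
const TEXT = /(?:\\.|(?!\$)[^])+/y;

function splitText(input) {
  // Illustrative helper: alternate text runs and '$<digits>' substitutions.
  const parts = [];
  let pos = 0;
  while (pos < input.length) {
    TEXT.lastIndex = pos;
    const m = TEXT.exec(input);
    if (m) {
      parts.push({ type: 'text', value: m[0] });
      pos += m[0].length;
    } else {
      // A bare '$' starts a substitution; grab '$' plus any digits.
      const sub = /\$\d*/y;
      sub.lastIndex = pos;
      const s = sub.exec(input);
      parts.push({ type: 'subst', value: s[0] });
      pos += s[0].length;
    }
  }
  return parts;
}
```

The remaining wart is the one discussed above: the escaped pair arrives still backslashed inside the text token, so a postprocessing step has to unescape it.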