Specifying a default token #88

Closed

moranje opened this issue Jul 2, 2018 · 14 comments

Comments
@moranje
Contributor

moranje commented Jul 2, 2018

Hi, I was wondering if there is a way to have a default token (similar to VS Code's Monarch), or whether you'd consider implementing one.

I have written a tokenizer for a text-based language, so every character that isn't matched by another rule is (part of) a text token. However, it's hard to express "anything not tokenized" in a regular expression.

So something along the lines of

const lexer = moo.states({
  main: {
    defaultToken: 'text',
    // ... rules
  },

  context: {
    defaultToken: 'text',
    // ... rules
  }
})

I'd be willing to give a more in-depth example if that would help, but since this is a conceptual question I think this may be enough.

Use case

I have written a parser for a snippet language, which might look like this.

Please insert your $1, with everything you've ${2:got${3: to give}}. If that's okay, of course, answer: ${4|yes,no|} # This is a comment.

The tokenizer to accomplish this looks like this:

const lexer = moo.states({
  main: {
    lineComment: /^#[^\n]*\n?/,
    inlineComment: /#[^\n]*/,
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },

    // More rules

    // Matches any character except "${", "$" and "#"
    text: { match: /(?:(?!(?:\$\{|\$|#))[^])+/, lineBreaks: true }
  },

  // More states and rules
})

What I'm trying to accomplish is having everything that's not otherwise matched end up as one single token, lasting until one of my other rules matches, without the catch-all rule "eating away" at my other rules. So with the above rule in place I expect to get the following text tokens.

1. "Please insert your "
2. ", with everyhting you've "
3. "got"
4. " to give"
5. ". If that's okay, of course, answer: "
6. "yes"
7. "no"
8. " "

So far so good; this works for now. However, if a user of this language would like to use a literal dollar sign in their text, I would like a way to tokenize escaped dollars as text. See example:

The $1 will cost you \\$ $2

The latter is hard to do without breaking the first requirement, but it would be much easier with a defaultToken option (combined with a negative lookbehind or lookahead).

Martien

@tjvr
Collaborator

tjvr commented Jul 2, 2018

I think we're going to need an example I'm afraid :)

@moranje
Contributor Author

moranje commented Jul 2, 2018

Updated the example

@tjvr
Collaborator

tjvr commented Jul 2, 2018

I think our recommendation is to use negative lookahead here.

You could use an error token, but that will match everything to the end of the buffer; we don't "resume" lexing at the first recognised token.

I don't think it's trivial to extend Moo to support a "default token", because of how RegExp-based tokenizers work. 😕
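
For illustration, a minimal sketch of that error-token behaviour, using moo's moo.error helper (rule names here are just examples):

const moo = require('moo')

// Sketch only: once nothing else matches, the `myError` rule emits a
// single token containing the entire rest of the buffer; lexing does
// not resume at the next recognised token.
const lexer = moo.compile({
  dollar: '$',
  int: /\d+/,
  myError: moo.error
})

lexer.reset('$1 anything else here')
// → dollar "$", int "1", then one myError token " anything else here"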

@moranje
Contributor Author

moranje commented Jul 2, 2018

I'm having a little trouble locating the section with the logic that triggers the "invalid syntax" error, so perhaps you'll forgive my ignorance in the next paragraph :).

Wouldn't it be possible to buffer all characters that match no rule into a 'defaultToken' until a character does match a rule, and emit the buffered token just before the newly matched one?

@tjvr
Collaborator

tjvr commented Jul 2, 2018

I'm afraid I'm not sure what you mean!

Does negative lookahead not work for you?

@moranje
Contributor Author

moranje commented Jul 3, 2018

It has to be a double negative lookahead, since I already use one to find all consecutive characters except $, ${ or #. I'm afraid my regex-fu is not up to that task.

See here: https://regex101.com/r/rHZNr0/1

I'll try to explain my train of thought a little better.

When the regex tokenizer finds a character it has no match for, it throws the "invalid syntax" error, right? At that point I'd at least have access to the offset of the offending character, so I should be able to store the offending character and advance the stream: keep storing the characters that match no rule until a token is found that is part of a rule, then tokenize the stored characters as a default token ('text' in my case). The stream would never throw an "invalid syntax" error, since all characters are valid when a defaultToken is specified. Does that make any sense?
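
In pseudocode, the behaviour I'm imagining might look something like this (a hypothetical sketch, nothing to do with moo's actual internals):

// Hypothetical sketch of the proposed default-token behaviour.
// `rules` is an array of { type, regex } with sticky (/y) regexes;
// `defaultType` names the catch-all token.
function tokenizeWithDefault(input, rules, defaultType) {
  const tokens = []
  let pos = 0
  let buffer = ''
  while (pos < input.length) {
    const match = rules
      .map(r => { r.regex.lastIndex = pos; return { type: r.type, m: r.regex.exec(input) } })
      .find(x => x.m)
    if (match) {
      // Flush any buffered "unmatched" characters as a default token first
      if (buffer) { tokens.push({ type: defaultType, value: buffer }); buffer = '' }
      tokens.push({ type: match.type, value: match.m[0] })
      pos += match.m[0].length
    } else {
      buffer += input[pos++] // no rule matched here: buffer and advance
    }
  }
  if (buffer) tokens.push({ type: defaultType, value: buffer })
  return tokens
}

// tokenizeWithDefault('a $1 b', [{ type: 'int', regex: /\$\d+/y }], 'text')
// → text "a ", int "$1", text " b"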

@nathan
Collaborator

nathan commented Jul 3, 2018

@tjvr I believe this is just asking for a mode that switches from y-mode RegExps to refutable g-mode RegExps and emits an intermediate token when matches aren't adjacent (and after the last match, if there are remaining characters).
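
Something like this sketch, say (not moo's implementation; the combined pattern and token names are illustrative):

// With a global (/g) regexp, exec() is allowed to skip ahead to the next
// match; the skipped-over gap becomes an intermediate "text" token.
const combined = /\$\{|\$|#[^\n]*/g
function gModeTokens(input) {
  const tokens = []
  let last = 0
  let m
  while ((m = combined.exec(input))) {
    if (m.index > last) tokens.push({ type: 'text', value: input.slice(last, m.index) })
    tokens.push({ type: 'match', value: m[0] }) // real moo would name each rule
    last = combined.lastIndex
  }
  if (last < input.length) tokens.push({ type: 'text', value: input.slice(last) })
  return tokens
}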

@moranje instead of trying to use double-negative lookahead, try making escape sequences a separate token; then you can just add their syntax (e.g., \\.) to your negative lookahead.

@moranje
Contributor Author

moranje commented Jul 4, 2018

@nathan Thanks for helping out. The problem is that I really need a negative lookbehind (which is coming to the ES spec), so the negative lookahead is really a hack to get a sort of negative lookbehind. I can't wrap my head around how tokenizing the escape sequence would help me 1. match as many characters as possible as long as they don't match $, ${ and #, and 2. match $, ${ and # when preceded by \\.

Test case

I've been trying another thing to get around this, which didn't work either:

Create my own default token category text: /[\s\S]/ and concatenate adjacent tokens in nearley's grammar. That doesn't work because having one or more of the same token as a grammar rule is very ambiguous (n!, where n is the number of tokens). It seems like the place to work this out is at the tokenizer level.

@nathan
Collaborator

nathan commented Jul 5, 2018

@moranje Use the EBNF qualifiers; that's what they're there for.

@{%
const moo = require('moo')
const lexer = moo.states({
  main: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    escape: /\\./,
    text: { match: /(?:(?!\$|\\.)[^])+/, lineBreaks: true },
  },
  nested: {
    int: {match: /\d+/, next: 'nested2'},
  },
  nested2: {
    colon: {match: ':', next: 'nested3'},
    close: {match: /}/, pop: 1},
  },
  nested3: {
    text: { match: /(?:(?![$}]|\\.)[^])+/, lineBreaks: true },
    escape: /\\./,
    open: {match: /\${/, push: 'nested'},
    close: {match: /}/, pop: 1},
  },
  unnested: {
    int: {match: /\d+/, pop: 1},
  },
})
%}
@lexer lexer

expression
 -> text (subst text):* {% ([first, rest]) => [first, ...[].concat(...rest)] %}
text
 -> tp:* {% ([parts]) => ({type: 'text', data: parts.join('')}) %}
tp
 -> %text {% ([t]) => t %}
  | %escape {% ([e]) => e.value.charAt(1) %}
subst
 -> %dollar %int {% ([_, data]) => ({type: 'subst', data: +data.value}) %}
  | %open %int (%colon expression):? %close {% ([_, data, [__, alternate]]) => ({type: 'subst', data: +data.value, alternate}) %}
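
Assuming the grammar above is saved as snippet.ne and compiled with nearleyc to grammar.js (file names are illustrative), you can exercise it like this:

// npm install nearley moo && nearleyc snippet.ne -o grammar.js
const nearley = require('nearley')
const grammar = require('./grammar')

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar))
parser.feed("Please insert your $1, answer: ${2:yes}")
console.log(JSON.stringify(parser.results[0], null, 2))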

@moranje
Contributor Author

moranje commented Jul 7, 2018

I can't for the life of me figure out how to make this work... I have tried every conceivable way to parse the following text.

Text \\$1 Text.

I know this isn't really related to the issue anymore, but I'll share the code with you; maybe I can lean on your knowledge to get the alternative you provided working, and then close this issue. Thank you both for your patience.

# grammer.ne

@preprocessor typescript

@{%
import lexer from './lexer';
import {
  snippet,
  tabstop,
  placeholder,
  text as textNode,
  escaped
} from './grammer-helper';
%}

# Use moo tokenizer
@lexer lexer

# Snippet ::= Element:+
Snippet ->
  Element:+ {% snippet %}

# Element ::= Tabstop | Placeholder | Text
Element ->
  Tabstop {% id %}
  | Placeholder {% id %}
  | Text {% id %}

# Tabstop ::= "$" [0-9]+
Tabstop ->
  # No modifier
  %dollar %int {% tabstop %}

# Placeholder ::= "${" ( [0-9]+ ":" Snippet ) "}"
Placeholder ->
  %open %int %colon Snippet %close {% placeholder %}

# Text ::= .:+
Text ->
  TextPartial:+ {% textNode %}

TextPartial ->
  %text {% id %}
  | %escape {% escaped %}

// lexer.ts

import moo from 'moo';

const lexer = moo.states({
  main: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    escape: /\\./,

    // Matches any character except "$" and "\\."
    text: { match: /(?:(?!\$|\\.)[^])+/, lineBreaks: true }
  },

  unnested: {
    int: { match: /[0-9]+/, pop: true }
  },

  nested: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    colon: { match: /:/, next: 'args' },
    int: /[0-9]+/
  },

  args: {
    open: { match: /\${/, push: 'nested' },
    dollar: { match: /\$/, push: 'unnested' },
    close: { match: /\}/, pop: true },
    colon: /:/,
    escape: /\\./,

    // Matches any character except "$", "\\.", ":" and "}"
    text: { match: /(?:(?!\$|\\.|:|\})[^])+/, lineBreaks: true }
  }
});

export default lexer;

// grammer-helper.ts

import { cloneDeep } from 'lodash';
import uniqid from 'uniqid';

let tracker = {
  list: [],
  queue: []
};

// *********************************
// * Main functions
// *********************************

export function snippet([body]) {
  // Prevent the tracker object from being reused between parsing calls
  let copy = cloneDeep(tracker);
  tracker.list = [];
  tracker.queue = [];

  return {
    type: 'Snippet',
    body,
    tracker: copy
  };
}

export function tabstop([dollar, int]) {
  let trackerId = track('list');

  return astNode(
    'Tabstop',
    {
      trackerId,
      modifier: null,
      int: int.value
    },
    dollar,
    int
  );
}

export function placeholder([open, int, colon, snippet, close]) {
  let trackerId = track('list');
  storeNestedTrackers(snippet.tracker);

  return astNode(
    'Placeholder',
    {
      trackerId,
      modifier: null,
      int: int.value,
      body: snippet.body
    },
    open,
    close
  );
}

export function text([text]) {
  let first = text[0];
  let last = text[text.length - 1];

  return astNode(
    'Text',
    {
      value: escape(text.map(partial => partial.value).join(''))
    },
    first,
    last
  );
}

export function escaped([escaped]) {
  // Unescape token
  return Object.assign(escaped, { value: escaped.value.charAt(1) });
}

// *********************************
// * Helper functions
// *********************************

function track(type: string) {
  let trackerId = uniqid();

  tracker[type].push(trackerId);

  return trackerId;
}

function storeNestedTrackers(nestedTracker) {
  tracker.list.push(...nestedTracker.list);
  tracker.queue.push(...nestedTracker.queue);
}

function astNode(type, object: any, first, last?) {
  if (!last) last = first;

  return Object.assign({ type }, object, location(first, last));
}

function location(start, end) {
  return {
    start: start.offset,
    end: end.offset + end.value.length,
    loc: {
      start: {
        line: start.line,
        column: start.col
      },
      end: {
        line: end.line,
        column: end.col + end.value.length
      }
    }
  };
}

function escape(json: string) {
  return json
    .replace(/\n/g, '\\n')
    .replace(/\t/g, '\\t')
    .replace(/\"/g, '\\"');
}

@nathan
Collaborator

nathan commented Jul 7, 2018

I can't for the life of me figure out how to make this work... I have tried every conceivable way to parse the following text.

Text \\$1 Text.

The example I posted above parses that text:

[ [ { type: 'text', data: 'Text \\' },
    { type: 'subst', data: 1 },
    { type: 'text', data: ' Text.' } ] ]

It shouldn't be too hard to modify your own code to do the same thing.

@moranje
Contributor Author

moranje commented Aug 16, 2018

Closing this, as the workaround @nathan supplied solved my grievances (for which I am grateful). I still feel that a default token would be a more elegant solution to this problem.

Again, thanks for your help!

Martien

@nathan
Collaborator

nathan commented Aug 23, 2018

Reopening (#98).

@tjvr tjvr closed this as completed Sep 1, 2018
@tjvr
Collaborator

tjvr commented Sep 1, 2018

Merged #98.
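
For anyone finding this later, usage should look something like this (see moo's README for the fallback-token details; rule names here are illustrative):

const moo = require('moo')

// The fallback rule catches any input the other rules don't match,
// which is the "default token" this issue asked for.
const lexer = moo.compile({
  dollar: /\$\d+/,
  text: moo.fallback
})

lexer.reset('The $1 will cost you money')
// → text "The ", dollar "$1", text " will cost you money"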

