Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lexer: use Crystal::Token::Kind enum instead of symbols #11616

Merged
merged 8 commits into from
Mar 3, 2022

Conversation

HertzDevil
Copy link
Contributor

This PR removes one of the biggest uses of symbol variables in the compiler.

The symbol solution is attractive, but ultimately I believe it is better if we can see all possible token kinds at a glance. We also gain the ability to define instance methods on Crystal::Token::Kind, e.g. #assignment_operator?, as opposed to global methods with Symbol arguments.

spec/compiler/lexer/lexer_comment_spec.cr Show resolved Hide resolved
Comment on lines 139 to 153
{% for member in @type.constants %}
when {{ @type.constant(member) }}
{% if member.starts_with?("OP_") %}
{% parts = member.split("_") %}
{{ parts.map { |ch| operator1[ch] || "" }.join("") }}
{% elsif member.starts_with?("MAGIC_") %}
{{ "__#{member[6..-1].id}__" }}
{% else %}
{{ member.stringify }}
{% end %}
{% end %}
else
value.to_s
end
{% end %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this generative implementation. I would prefer to spell everything out.
This is much more complex and harder to understand than a simple list of when OP_CARET then "^". It ties constant names closely to their string representation which reduces flexibility. I'd like more flexibility and to consider some improvements to operator names.
And in the end it only saves a couple of LOC, where the saved LOCs would be super trivial.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like ¯\_(ツ)_/¯

The result is compile-time and can be inspected via {% debug %} if wanted.

I'd like more flexibility and to consider some improvements to operator names.

I even think that operator names shouldn't be "clever". This approach ensures that there is no ambiguity or bikeshedding (beyond the names of each character, anyway)

{% debug %}
case self
in EOF
  "EOF"
in SPACE
  "SPACE"
in NEWLINE
  "NEWLINE"
in IDENT
  "IDENT"
in CONST
  "CONST"
in INSTANCE_VAR
  "INSTANCE_VAR"
in CLASS_VAR
  "CLASS_VAR"
in CHAR
  "CHAR"
in STRING
  "STRING"
in SYMBOL
  "SYMBOL"
in NUMBER
  "NUMBER"
in UNDERSCORE
  "UNDERSCORE"
in COMMENT
  "COMMENT"
in DELIMITER_START
  "DELIMITER_START"
in DELIMITER_END
  "DELIMITER_END"
in STRING_ARRAY_START
  "STRING_ARRAY_START"
in INTERPOLATION_START
  "INTERPOLATION_START"
in SYMBOL_ARRAY_START
  "SYMBOL_ARRAY_START"
in STRING_ARRAY_END
  "STRING_ARRAY_END"
in GLOBAL
  "GLOBAL"
in GLOBAL_MATCH_DATA_INDEX
  "GLOBAL_MATCH_DATA_INDEX"
in MAGIC_DIR
  "__DIR__"
in MAGIC_END_LINE
  "__END_LINE__"
in MAGIC_FILE
  "__FILE__"
in MAGIC_LINE
  "__LINE__"
in MACRO_LITERAL
  "MACRO_LITERAL"
in MACRO_EXPRESSION_START
  "MACRO_EXPRESSION_START"
in MACRO_CONTROL_START
  "MACRO_CONTROL_START"
in MACRO_VAR
  "MACRO_VAR"
in MACRO_END
  "MACRO_END"
in OP_BANG
  "!"
in OP_BANG_EQ
  "!="
in OP_BANG_TILDE
  "!~"
in OP_DOLLAR_QUESTION
  "$?"
in OP_DOLLAR_TILDE
  "$~"
in OP_PERCENT
  "%"
in OP_PERCENT_EQ
  "%="
in OP_PERCENT_CURLYR
  "%}"
in OP_AMP
  "&"
in OP_AMP_AMP
  "&&"
in OP_AMP_AMP_EQ
  "&&="
in OP_AMP_STAR
  "&*"
in OP_AMP_STAR_STAR
  "&**"
in OP_AMP_STAR_EQ
  "&*="
in OP_AMP_PLUS
  "&+"
in OP_AMP_PLUS_EQ
  "&+="
in OP_AMP_MINUS
  "&-"
in OP_AMP_MINUS_EQ
  "&-="
in OP_AMP_EQ
  "&="
in OP_PARENL
  "("
in OP_PARENR
  ")"
in OP_STAR
  "*"
in OP_STAR_STAR
  "**"
in OP_STAR_STAR_EQ
  "**="
in OP_STAR_EQ
  "*="
in OP_PLUS
  "+"
in OP_PLUS_EQ
  "+="
in OP_COMMA
  ","
in OP_MINUS
  "-"
in OP_MINUS_EQ
  "-="
in OP_MINUS_GT
  "->"
in OP_PERIOD
  "."
in OP_PERIOD_PERIOD
  ".."
in OP_PERIOD_PERIOD_PERIOD
  "..."
in OP_SLASH
  "/"
in OP_SLASH_SLASH
  "//"
in OP_SLASH_SLASH_EQ
  "//="
in OP_SLASH_EQ
  "/="
in OP_COLON
  ":"
in OP_COLON_COLON
  "::"
in OP_SEMICOLON
  ";"
in OP_LT
  "<"
in OP_LT_LT
  "<<"
in OP_LT_LT_EQ
  "<<="
in OP_LT_EQ
  "<="
in OP_LT_EQ_GT
  "<=>"
in OP_EQ
  "="
in OP_EQ_EQ
  "=="
in OP_EQ_EQ_EQ
  "==="
in OP_EQ_GT
  "=>"
in OP_EQ_TILDE
  "=~"
in OP_GT
  ">"
in OP_GT_EQ
  ">="
in OP_GT_GT
  ">>"
in OP_GT_GT_EQ
  ">>="
in OP_QUESTION
  "?"
in OP_AT_SQUAREL
  "@["
in OP_SQUAREL
  "["
in OP_SQUAREL_SQUARER
  "[]"
in OP_SQUAREL_SQUARER_EQ
  "[]="
in OP_SQUAREL_SQUARER_QUESTION
  "[]?"
in OP_SQUARER
  "]"
in OP_CARET
  "^"
in OP_CARET_EQ
  "^="
in OP_GRAVE
  "`"
in OP_CURLYL
  "{"
in OP_CURLYL_PERCENT
  "{%"
in OP_CURLYL_CURLYL
  "{{"
in OP_BAR
  "|"
in OP_BAR_EQ
  "|="
in OP_BAR_BAR
  "||"
in OP_BAR_BAR_EQ
  "||="
in OP_CURLYR
  "}"
in OP_TILDE
  "~"
end

src/compiler/crystal/syntax/token.cr Outdated Show resolved Hide resolved
Copy link
Member

@oprypin oprypin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am on board with this change. Only left small comments.

Just making sure to cc @asterite here: what is your opinion, does this make it less fun to hack on the compiler?

src/compiler/crystal/syntax/token.cr Outdated Show resolved Hide resolved
src/compiler/crystal/syntax/token.cr Outdated Show resolved Hide resolved
Comment on lines 139 to 153
{% for member in @type.constants %}
when {{ @type.constant(member) }}
{% if member.starts_with?("OP_") %}
{% parts = member.split("_") %}
{{ parts.map { |ch| operator1[ch] || "" }.join("") }}
{% elsif member.starts_with?("MAGIC_") %}
{{ "__#{member[6..-1].id}__" }}
{% else %}
{{ member.stringify }}
{% end %}
{% end %}
else
value.to_s
end
{% end %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like ¯\_(ツ)_/¯

The result is compile-time and can be inspected via {% debug %} if wanted.

I'd like more flexibility and to consider some improvements to operator names.

I even think that operator names shouldn't be "clever". This approach ensures that there is no ambiguity or bikeshedding (beyond the names of each character, anyway)

{% debug %}
case self
in EOF
  "EOF"
in SPACE
  "SPACE"
in NEWLINE
  "NEWLINE"
in IDENT
  "IDENT"
in CONST
  "CONST"
in INSTANCE_VAR
  "INSTANCE_VAR"
in CLASS_VAR
  "CLASS_VAR"
in CHAR
  "CHAR"
in STRING
  "STRING"
in SYMBOL
  "SYMBOL"
in NUMBER
  "NUMBER"
in UNDERSCORE
  "UNDERSCORE"
in COMMENT
  "COMMENT"
in DELIMITER_START
  "DELIMITER_START"
in DELIMITER_END
  "DELIMITER_END"
in STRING_ARRAY_START
  "STRING_ARRAY_START"
in INTERPOLATION_START
  "INTERPOLATION_START"
in SYMBOL_ARRAY_START
  "SYMBOL_ARRAY_START"
in STRING_ARRAY_END
  "STRING_ARRAY_END"
in GLOBAL
  "GLOBAL"
in GLOBAL_MATCH_DATA_INDEX
  "GLOBAL_MATCH_DATA_INDEX"
in MAGIC_DIR
  "__DIR__"
in MAGIC_END_LINE
  "__END_LINE__"
in MAGIC_FILE
  "__FILE__"
in MAGIC_LINE
  "__LINE__"
in MACRO_LITERAL
  "MACRO_LITERAL"
in MACRO_EXPRESSION_START
  "MACRO_EXPRESSION_START"
in MACRO_CONTROL_START
  "MACRO_CONTROL_START"
in MACRO_VAR
  "MACRO_VAR"
in MACRO_END
  "MACRO_END"
in OP_BANG
  "!"
in OP_BANG_EQ
  "!="
in OP_BANG_TILDE
  "!~"
in OP_DOLLAR_QUESTION
  "$?"
in OP_DOLLAR_TILDE
  "$~"
in OP_PERCENT
  "%"
in OP_PERCENT_EQ
  "%="
in OP_PERCENT_CURLYR
  "%}"
in OP_AMP
  "&"
in OP_AMP_AMP
  "&&"
in OP_AMP_AMP_EQ
  "&&="
in OP_AMP_STAR
  "&*"
in OP_AMP_STAR_STAR
  "&**"
in OP_AMP_STAR_EQ
  "&*="
in OP_AMP_PLUS
  "&+"
in OP_AMP_PLUS_EQ
  "&+="
in OP_AMP_MINUS
  "&-"
in OP_AMP_MINUS_EQ
  "&-="
in OP_AMP_EQ
  "&="
in OP_PARENL
  "("
in OP_PARENR
  ")"
in OP_STAR
  "*"
in OP_STAR_STAR
  "**"
in OP_STAR_STAR_EQ
  "**="
in OP_STAR_EQ
  "*="
in OP_PLUS
  "+"
in OP_PLUS_EQ
  "+="
in OP_COMMA
  ","
in OP_MINUS
  "-"
in OP_MINUS_EQ
  "-="
in OP_MINUS_GT
  "->"
in OP_PERIOD
  "."
in OP_PERIOD_PERIOD
  ".."
in OP_PERIOD_PERIOD_PERIOD
  "..."
in OP_SLASH
  "/"
in OP_SLASH_SLASH
  "//"
in OP_SLASH_SLASH_EQ
  "//="
in OP_SLASH_EQ
  "/="
in OP_COLON
  ":"
in OP_COLON_COLON
  "::"
in OP_SEMICOLON
  ";"
in OP_LT
  "<"
in OP_LT_LT
  "<<"
in OP_LT_LT_EQ
  "<<="
in OP_LT_EQ
  "<="
in OP_LT_EQ_GT
  "<=>"
in OP_EQ
  "="
in OP_EQ_EQ
  "=="
in OP_EQ_EQ_EQ
  "==="
in OP_EQ_GT
  "=>"
in OP_EQ_TILDE
  "=~"
in OP_GT
  ">"
in OP_GT_EQ
  ">="
in OP_GT_GT
  ">>"
in OP_GT_GT_EQ
  ">>="
in OP_QUESTION
  "?"
in OP_AT_SQUAREL
  "@["
in OP_SQUAREL
  "["
in OP_SQUAREL_SQUARER
  "[]"
in OP_SQUAREL_SQUARER_EQ
  "[]="
in OP_SQUAREL_SQUARER_QUESTION
  "[]?"
in OP_SQUARER
  "]"
in OP_CARET
  "^"
in OP_CARET_EQ
  "^="
in OP_GRAVE
  "`"
in OP_CURLYL
  "{"
in OP_CURLYL_PERCENT
  "{%"
in OP_CURLYL_CURLYL
  "{{"
in OP_BAR
  "|"
in OP_BAR_EQ
  "|="
in OP_BAR_BAR
  "||"
in OP_BAR_BAR_EQ
  "||="
in OP_CURLYR
  "}"
in OP_TILDE
  "~"
end

@asterite
Copy link
Member

asterite commented Feb 4, 2022

@oprypin Thanks for checking!

I'm all on board with replacing symbols with enums. Ideally Symbol won't exist in Crystal 2.0, but that's just my wish. Symbols have no use cases, one can use enums or strings in every case where a symbol could be used.

@oprypin oprypin added this to the 1.4.0 milestone Feb 26, 2022
@straight-shoota
Copy link
Member

Hm, for some reason this ends up including the playground server in std_spec on Windows... https://github.com/crystal-lang/crystal/runs/5376982202?check_suite_focus=true 🤔

@HertzDevil
Copy link
Contributor Author

That is introduced here: b0d3c7a

@straight-shoota straight-shoota merged commit 7ca9d5a into crystal-lang:master Mar 3, 2022
@HertzDevil HertzDevil deleted the refactor/token-kind branch March 5, 2022 07:14
@Sija
Copy link
Contributor

Sija commented Mar 13, 2022

There's SYMBOL_ARRAY_START but SYMBOL_ARRAY_END token type is missing somehow (& was already before this PR) 🤔
Is that intentional? /cc @asterite

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants