Disable P8SCII unescaping to fix mangling of emoji characters #106

Open: simonwulf wants to merge 3 commits into main


@simonwulf simonwulf commented Aug 10, 2022

TL;DR

This PR addresses an issue where any emoji in the input Lua script is replaced by a garbled sequence of characters. The proposed solution is to remove picotool's current handling of P8SCII escape sequences, which does not seem to function as intended.

The Details

I encountered an issue where any use of the 🅾️ emoji in my Lua script would be replaced by "ユか✽ゆヤま◆" after building a .p8 cart with picotool. The cause seems to be that P8SCII is treated as an encoding in itself. In practice, this treatment boils down to two steps:

  1. When parsing a string literal, the lexer replaces any numeric P8SCII escape sequence it encounters with a byte of the specified value, seemingly hoping that this results in a "pure" P8SCII string.
  2. Later, the P8 formatter calls lua.p8scii_to_unicode, which seems meant to convert all P8SCII characters in the passed string to their UTF-8 counterparts. At this point, the formatter assumes that the Lua script is P8SCII encoded. As a side note, this substitution routine runs on the entire script, not just on the string tokens whose escape sequences were converted by the lexer in step 1.
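To make step 1 concrete, here is a minimal sketch of what such an unescaping pass might look like. The helper `unescape_decimal` is hypothetical and is not picotool's actual lexer code; it only assumes that decimal escapes of the form `\ddd` are replaced by the single byte they name while the rest of the literal is left as it was read from disk.

```python
import re

def unescape_decimal(body: bytes) -> bytes:
    """Hypothetical stand-in for the lexer step: replace each decimal
    escape (\\ddd) with the single byte of that value.

    Values above 255 would raise ValueError here; they are not valid
    escapes anyway, so the sketch does not handle them.
    """
    return re.sub(rb"\\(\d{1,3})", lambda m: bytes([int(m.group(1))]), body)

# A string literal body containing one numeric escape and one literal emoji
# (U+1F17E followed by the variation selector U+FE0F), stored as UTF-8.
emoji = "\U0001F17E\uFE0F"
body = ('press \\142 or ' + emoji).encode("utf-8")

mixed = unescape_decimal(body)

# The escape became the raw byte 0x8E, but the emoji is still seven UTF-8 bytes.
print(mixed)
```

The resulting byte string contains both a raw byte produced from the escape and the untouched UTF-8 bytes of the emoji, so it no longer decodes cleanly as UTF-8 either.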

Both of the above steps have inherent issues:

  1. Replacing P8SCII escape sequences with their corresponding byte values does not turn the whole input string into a P8SCII-encoded string, because the majority of the string retains its original encoding (UTF-8). What we end up with instead is a mix of UTF-8 and P8SCII.
  2. The assumption that the passed string is P8SCII encoded is incorrect. It is, in fact, mostly UTF-8 with a few dashes of P8SCII-encoded characters as a result of step 1. When this conversion routine encounters the seven-byte UTF-8 sequence for 🅾️ (U+1F17E followed by the variation selector U+FE0F), it replaces each of the seven bytes with a new UTF-8 character, resulting in "ユか✽ゆヤま◆".
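The mangling can be reproduced in a few lines. The sketch below is an illustration only: `P8SCII_SAMPLE` is a hypothetical partial table containing just the seven byte values involved, inferred from the output reported in this PR, and `naive_p8scii_to_unicode` is a stand-in for the behaviour described, not picotool's implementation.

```python
# Hypothetical partial mapping, inferred from the garbled output reported
# in this PR. It is NOT picotool's real lookup table.
P8SCII_SAMPLE = {
    0xF0: "ユ", 0x9F: "か", 0x85: "✽", 0xBE: "ゆ",
    0xEF: "ヤ", 0xB8: "ま", 0x8F: "◆",
}

def naive_p8scii_to_unicode(data: bytes) -> str:
    """Treat every byte as one P8SCII character code.

    Bytes missing from the sample table fall back to chr(), which is
    correct for ASCII and only a placeholder for other high bytes.
    """
    return "".join(P8SCII_SAMPLE.get(b, chr(b)) for b in data)

# The emoji is two code points: U+1F17E plus the variation selector U+FE0F.
source = "\U0001F17E\uFE0F"
utf8_bytes = source.encode("utf-8")   # f0 9f 85 be ef b8 8f

mangled = naive_p8scii_to_unicode(utf8_bytes)
print(utf8_bytes.hex(" "), "->", mangled)
```

Each of the seven UTF-8 bytes is looked up as if it were a standalone P8SCII code, which is why one emoji becomes seven unrelated glyphs.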

Future Improvements

  • I would argue against treating P8SCII as a text encoding. Instead, I would treat it merely as a collection of escape sequences that hold a special meaning when passed to Pico-8's print function, and pass them through unchanged. If pre-interpreting these escape sequences is still a desired feature, I'd suggest doing it in one go when parsing or writing the string tokens instead of passing through an intermediate format.
  • There are probably additional code paths or data structures that are made dead by this change and could be removed.
