Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

echo does not print emoji given escape sequence \xf0\x9f\x98\x82 #6741

Closed
kkew3 opened this issue Sep 26, 2024 · 5 comments
Closed

echo does not print emoji given escape sequence \xf0\x9f\x98\x82 #6741

kkew3 opened this issue Sep 26, 2024 · 5 comments
Labels

Comments

@kkew3
Copy link

kkew3 commented Sep 26, 2024

How to reproduce

cargo run -p uu_echo -- -e '\xf0\x9f\x98\x82'

gives �.

Expected behavior

Under Ubuntu 22.04, with /bin/echo --version:

echo (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Brian Fox and Chet Ramey.

the output is 😂, given command /bin/echo -e '\xf0\x9f\x98\x82'.

@sylvestre sylvestre added the good first issue For newcomers! label Sep 26, 2024
@kkew3
Copy link
Author

kkew3 commented Sep 26, 2024

I'm glad to submit a PR.

Proposal

I identify the issue as returning char at parse_code function:

fn parse_code(input: &mut Peekable<Chars>, base: Base) -> Option<char> {

A u8 should be returned instead, since it occurs that multiple bytes constitute one Unicode char. A possible solution is to maintain a 4-byte buffer, and repeatedly check for valid utf-8 character from it upon every character read from input, using String::from_utf8:

/// A buffer used to interpret bytes as Unicode characters.
struct TryUnicodeBuffer {
    bytes: [u8; 4],
    len: usize,
}

impl TryUnicodeBuffer {
    /// Push and attempt to convert the buffer into Unicode characters, which
    /// are written to `output`. Panic if the buffer is already full, which
    /// shouldn't happen normally. After `push`, it's guaranteed that the
    /// remaining bytes do not make up a valid utf-8 character.
    fn push(&mut self, i: u8, mut output: impl Write) -> io::Result<()> {}

    /// Try to interpret the bytes started at position `start` as a Unicode
    /// character.
    fn to_char(&self, start: usize) -> Option<char> {}

    /// Clear the remaining (invalid) bytes and replace with the replacement
    /// characters if not empty.
    fn clear(&mut self, mut output: impl Write) -> io::Result<()> {}

    /// Clear and push something that can be interpreted as a Unicode
    /// character.
    fn clear_push(&mut self, i: impl Into<char>, mut output: impl Write) -> io::Result<()> {}
}

Only print_escaped function needs to be modified.

Test cases

  • The MRE of this issue: echo -e '\xf0\x9f\x98\x82' should yield 😂.
  • ASCII and emoji: echo -e '\x41\xf0\x9f\x98\x82\x42' should yield A😂B.
  • The emoji broken by an ASCII: echo -e '\xf0\x41\x9f\x98\x82' should yield �A���.
  • Tests involving letter escape character; e.g. echo -e '\x41\xf0\c\x9f\x98\x82' should yield A� (no newline).

andrewliebenow added a commit to andrewliebenow/coreutils that referenced this issue Oct 20, 2024
Bug was reported, with root cause analysis, by kkew3
Added tests were derived from test cases provided by kkew3
See uutils#6741
@andrewliebenow
Copy link
Contributor

#6803 should fix this. I went with a simpler fix. Since everything is being printed to stdout, which is obviously not restricted to UTF-8 data, the escape codes can just be printed out byte by byte, without trying to keep track of whether the output is valid UTF-8.

@kkew3
Copy link
Author

kkew3 commented Oct 21, 2024

Yeah, it's definitely a better fix.

It also comes to my mind that all of these, tested on ubuntu 22.04, /bin/echo -e '\xf0', /bin/echo -e '\xf0\x9f', /bin/echo -e '\xf0\x9f\x98', should yield the same (the unicode replacement character \u{FFFD}), which seems to break your code. But it follows immediately that /bin/echo -n -e '\xf0' | wc -c, /bin/echo -n -e '\xf0\x9f' | wc -c, /bin/echo -n -e '\xf0\x9f\x98' | wc -c prints 1, 2 and 3 respectively. In contrast, (zsh) builtin echo -n -e '\uFFFD' | wc -c prints 3. This shows that the printed could origin from the terminal's rendering, where in iTerm2.app the byte sequences are rendered as the , and in Terminal.app the ?.

Great! I'll close this issue.

@kkew3 kkew3 closed this as completed Oct 21, 2024
@andrewliebenow
Copy link
Contributor

A nice way to debug these issues is to use bat -A:

❯ echo -e '\xf0\x41\x9f\x98\x82' | bat -A  
───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ STDIN
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ \xF0A\x9F\x98\x82␊
───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

https://github.com/sharkdp/bat

Otherwise, yes, you can't really tell what's going on when you're dealing with weird/non-UTF-8 output.

Technically I don't think this issue should be closed until a PR resolving the bug has been merged, but I'll be checking in on my PR periodically until it's merged, so it shouldn't matter much.

@kkew3 kkew3 reopened this Oct 21, 2024
andrewliebenow added a commit to andrewliebenow/coreutils that referenced this issue Oct 21, 2024
Bug was reported, with root cause analysis, by kkew3
Added tests were derived from test cases provided by kkew3
See uutils#6741
cakebaker added a commit that referenced this issue Oct 22, 2024
* echo: handle multibyte escape sequences

Bug was reported, with root cause analysis, by kkew3
Added tests were derived from test cases provided by kkew3
See #6741

* Use concrete type

* Fix MSRV issue

* Fix non-UTF-8 argument handling

* Fix MSRV issue

* Fix Clippy violation

* Fix compiler warning

* Address PR comments

* Add MSRV TODO comments

* echo: use stdout_only_bytes instead of stdout_is_bytes

---------

Co-authored-by: Daniel Hofstetter <[email protected]>
@cakebaker
Copy link
Contributor

Fixed in #6803

@kkew3 thanks for reporting!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants