`echo` does not print emoji given escape sequence `\xf0\x9f\x98\x82` #6741

kkew3 · 2024-09-26T07:31:36Z

How to reproduce

cargo run -p uu_echo -- -e '\xf0\x9f\x98\x82'

gives ð��.

Expected behavior

Under Ubuntu 22.04, with /bin/echo --version:

echo (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Brian Fox and Chet Ramey.

the output is 😂, given command /bin/echo -e '\xf0\x9f\x98\x82'.


kkew3 · 2024-09-26T14:35:10Z

I'm glad to submit a PR.

Proposal

I have traced the issue to the parse_code function returning a char:

coreutils/src/uu/echo/src/echo.rs

Line 42 in a0d258d

fn parse_code(input: &mut Peekable<Chars>, base: Base) -> Option<char> {

A u8 should be returned instead, since several bytes can together make up one Unicode character. A possible solution is to maintain a 4-byte buffer and, each time a character is read from the input, check whether the buffer now holds a valid UTF-8 character, using String::from_utf8:

use std::io::{self, Write};

/// A buffer used to interpret bytes as Unicode characters.
struct TryUnicodeBuffer {
    bytes: [u8; 4],
    len: usize,
}

impl TryUnicodeBuffer {
    /// Push a byte and attempt to convert the buffer into Unicode characters,
    /// which are written to `output`. Panics if the buffer is already full,
    /// which shouldn't happen normally. After `push`, it's guaranteed that the
    /// remaining bytes do not make up a valid UTF-8 character.
    fn push(&mut self, i: u8, mut output: impl Write) -> io::Result<()> {
        todo!() // body intentionally left open in this proposal
    }

    /// Try to interpret the bytes starting at position `start` as a Unicode
    /// character.
    fn to_char(&self, start: usize) -> Option<char> {
        todo!() // body intentionally left open in this proposal
    }

    /// Clear the remaining (invalid) bytes, writing replacement characters
    /// for them if the buffer is not empty.
    fn clear(&mut self, mut output: impl Write) -> io::Result<()> {
        todo!() // body intentionally left open in this proposal
    }

    /// Clear the buffer, then push something that can be interpreted as a
    /// Unicode character.
    fn clear_push(&mut self, i: impl Into<char>, mut output: impl Write) -> io::Result<()> {
        todo!() // body intentionally left open in this proposal
    }
}

Only the print_escaped function needs to be modified.
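To illustrate the root cause, here is a minimal sketch (not the code from echo.rs). Turning each escaped byte into a char re-encodes every value above 0x7F as a two-byte UTF-8 sequence, whereas keeping the values as raw bytes preserves the emoji. The helper names as_chars and as_bytes are made up for this example.

```rust
/// Hypothetical helper: treat each byte as a Unicode code point, then encode
/// the resulting string as UTF-8 (the char-based approach).
fn as_chars(bytes: &[u8]) -> Vec<u8> {
    let s: String = bytes.iter().map(|&b| char::from(b)).collect();
    s.into_bytes()
}

/// Hypothetical helper: pass the bytes through untouched (the byte-based
/// approach).
fn as_bytes(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}

fn main() {
    let escaped = [0xf0u8, 0x9f, 0x98, 0x82];
    // Eight bytes: each input byte became its own two-byte character.
    println!("{:02x?}", as_chars(&escaped));
    // Four bytes: still the UTF-8 encoding of the emoji.
    println!("{:02x?}", as_bytes(&escaped));
    println!("{:?}", std::str::from_utf8(&as_bytes(&escaped)));
}
```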

Test cases

The MRE of this issue: echo -e '\xf0\x9f\x98\x82' should yield 😂.
ASCII and emoji: echo -e '\x41\xf0\x9f\x98\x82\x42' should yield A😂B.
The emoji broken by an ASCII: echo -e '\xf0\x41\x9f\x98\x82' should yield �A��.
Tests involving letter escape characters, e.g. echo -e '\x41\xf0\c\x9f\x98\x82' should yield A� (no newline).

Bug was reported, with root cause analysis, by kkew3 Added tests were derived from test cases provided by kkew3 See uutils#6741

andrewliebenow · 2024-10-20T16:42:38Z

#6803 should fix this. I went with a simpler fix. Since everything is being printed to stdout, which is obviously not restricted to UTF-8 data, the escape codes can just be printed out byte by byte, without trying to keep track of whether the output is valid UTF-8.
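A rough sketch of the byte-by-byte idea, not the code from #6803: the hypothetical write_escaped function below understands only `\xHH` escapes (one or two hex digits) and writes each decoded value straight to the output as a single byte, with no UTF-8 validation.

```rust
use std::io::{self, Write};

/// Hypothetical, simplified escape handler: decodes `\xHH` and copies every
/// other input byte through unchanged.
fn write_escaped(input: &str, mut out: impl Write) -> io::Result<()> {
    let bytes = input.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\\' && bytes.get(i + 1) == Some(&b'x') {
            // Collect up to two hex digits after `\x`.
            let mut value: Option<u8> = None;
            let mut j = i + 2;
            while j < bytes.len() && j < i + 4 {
                match (bytes[j] as char).to_digit(16) {
                    Some(d) => {
                        value = Some(value.unwrap_or(0) * 16 + d as u8);
                        j += 1;
                    }
                    None => break,
                }
            }
            if let Some(v) = value {
                // Emit the decoded value as one raw byte.
                out.write_all(&[v])?;
                i = j;
                continue;
            }
        }
        // Not a complete escape: emit the input byte as-is.
        out.write_all(&[bytes[i]])?;
        i += 1;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut stdout = io::stdout().lock();
    write_escaped(r"\x41\xf0\x9f\x98\x82\x42", &mut stdout)?;
    stdout.write_all(b"\n")
}
```

Because the writer only sees bytes, a sequence interrupted by an ASCII character such as `\xf0\x41\x9f\x98\x82` is passed through as five raw bytes, and how they are displayed is left to the terminal.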

kkew3 · 2024-10-21T03:41:11Z

Yeah, it's definitely a better fix.

It also occurred to me that all of the following, tested on Ubuntu 22.04, display the same � (the Unicode replacement character \u{FFFD}): /bin/echo -e '\xf0', /bin/echo -e '\xf0\x9f', and /bin/echo -e '\xf0\x9f\x98'. At first that seemed to break your code. However, /bin/echo -n -e '\xf0' | wc -c, /bin/echo -n -e '\xf0\x9f' | wc -c, and /bin/echo -n -e '\xf0\x9f\x98' | wc -c print 1, 2 and 3 respectively. In contrast, the zsh builtin echo -n -e '\uFFFD' | wc -c prints 3. This shows that the displayed � most likely originates from the terminal's rendering: iTerm2.app renders the invalid byte sequences as �, and Terminal.app renders them as ?.
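The same observation can be reproduced in Rust. This is only a sketch: a lossy UTF-8 decoder stands in for the terminal here (an assumption, since real terminals may substitute differently), and lossy_view is a made-up helper name.

```rust
/// Hypothetical helper: decode bytes the way a lossy renderer might,
/// substituting U+FFFD for invalid UTF-8.
fn lossy_view(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let prefixes: [&[u8]; 3] = [&[0xf0], &[0xf0, 0x9f], &[0xf0, 0x9f, 0x98]];
    for bytes in prefixes {
        // The raw length matches what `wc -c` counts; the replacement
        // character appears only in the decoded view.
        println!("{} raw byte(s), shown as {:?}", bytes.len(), lossy_view(bytes));
    }
    // A real U+FFFD in the output would itself occupy three bytes.
    println!("U+FFFD is {} bytes in UTF-8", '\u{FFFD}'.len_utf8());
}
```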

Great! I'll close this issue.

andrewliebenow · 2024-10-21T05:33:51Z

A nice way to debug these issues is to use bat -A:

❯ echo -e '\xf0\x41\x9f\x98\x82' | bat -A  
───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ STDIN
───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ \xF0A\x9F\x98\x82␊
───────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

https://github.com/sharkdp/bat

Otherwise, yes, you can't really tell what's going on when you're dealing with weird/non-UTF-8 output.
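Inside a Rust test or debugging session, a similar byte-level view is available without external tools. A small sketch, assuming Rust 1.60 or later for the standard library's escape_ascii method on byte slices; the helper name visible is made up.

```rust
/// Hypothetical helper: render a byte slice with every non-printable or
/// non-ASCII byte shown as an escape sequence.
fn visible(bytes: &[u8]) -> String {
    bytes.escape_ascii().to_string()
}

fn main() {
    // The same bytes as in the bat -A example, including the trailing newline.
    let output = b"\xf0A\x9f\x98\x82\n";
    println!("{}", visible(output));
}
```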

Technically I don't think this issue should be closed until a PR resolving the bug has been merged, but I'll be checking in on my PR periodically until it's merged, so it shouldn't matter much.


* echo: handle multibyte escape sequences

  Bug was reported, with root cause analysis, by kkew3
  Added tests were derived from test cases provided by kkew3
  See #6741
* Use concrete type
* Fix MSRV issue
* Fix non-UTF-8 argument handling
* Fix MSRV issue
* Fix Clippy violation
* Fix compiler warning
* Address PR comments
* Add MSRV TODO comments
* echo: use stdout_only_bytes instead of stdout_is_bytes

---------

Co-authored-by: Daniel Hofstetter <[email protected]>

cakebaker · 2024-10-22T09:24:13Z

Fixed in #6803

@kkew3 thanks for reporting!

cakebaker added the U - echo label Sep 26, 2024

sylvestre added the good first issue For newcomers! label Sep 26, 2024

andrewliebenow added a commit to andrewliebenow/coreutils that referenced this issue Oct 20, 2024

echo: handle multibyte escape sequences

9691afb

Bug was reported, with root cause analysis, by kkew3 Added tests were derived from test cases provided by kkew3 See uutils#6741

andrewliebenow mentioned this issue Oct 20, 2024

echo: handle multibyte escape sequences #6803

Merged

kkew3 closed this as completed Oct 21, 2024

kkew3 reopened this Oct 21, 2024

andrewliebenow added a commit to andrewliebenow/coreutils that referenced this issue Oct 21, 2024

echo: handle multibyte escape sequences

6d8a781

Bug was reported, with root cause analysis, by kkew3 Added tests were derived from test cases provided by kkew3 See uutils#6741

cakebaker closed this as completed Oct 22, 2024
