IFS: UTF-8 support is incomplete #1372

Open
McDutchie opened this issue Aug 4, 2019 · 1 comment

McDutchie commented Aug 4, 2019

Test script:

set -o noglob

# test 1
IFS='£'
set -- : :
v="${#},$*"
echo "$v"

# test 2
IFS='£'			# £ = C2 A3
v='abc§def ghi§jkl'	# § = C2 A7 (same initial byte)
set -- $v
v="${#},${1-},${2-},${3-}"
echo "$v"

Expected output (UTF-8 locale):

2,:£:
1,abc§def ghi§jkl,,

Output on ksh Version AJM 93u+ 2012-08-01 current development version:

2,:£:
1,abc?def ghi?jkl,,

In the second test, the § characters get mangled.

For reference, output on the latest release Version AJM 93u+ 2012-08-01:

2,:?:
1,abc?def ghi?jkl,,
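
The two characters in test 2 differ only in their second byte, which can be confirmed at the byte level. This is a minimal sketch, assuming a POSIX `printf`, `od` and `tr`. `hex_of` is a hypothetical helper name, not something from ksh. The octal escapes spell out the UTF-8 encodings so the check does not depend on how the script file itself is encoded:

```shell
# hex_of (hypothetical helper): print the bytes of "$1" as contiguous lowercase hex.
hex_of() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

pound=$(printf '\302\243')      # £ encoded as UTF-8: C2 A3
section=$(printf '\302\247')    # § encoded as UTF-8: C2 A7

hex_of "$pound"     # prints c2a3
hex_of "$section"   # prints c2a7
```

Both start with `c2`, so any code that compares only the first byte of a character against the first byte of IFS will treat § as if it were £.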
krader1961 added the bug label Aug 4, 2019

krader1961 commented Aug 4, 2019

@McDutchie This is definitely broken. I'm pretty sure I commented earlier that IFS only works in single-byte locales, but I can't find that comment now. However, I'm seeing slightly different behavior from what you documented. I suspect you made a copy/paste mistake and meant to say that the first output you showed was from the current source, i.e., 2017.0.0-devel-....

I took your script and modified it slightly to make understanding the behavior easier. Note that god on my macOS system is the GNU od command:

set -o noglob
IFS='£'  # 0xC2 0xA3
print -n "$IFS" | god -tx1z

set -- : :
v="${#},$*"
print -n "$v" | god -tx1z

v='ab§cd ef§gh'  # § = 0xC2 0xA7 (same initial byte)
set -- $v
v="${#},${1-},${2-},${3-}"
print -n "$v" | god -tx1z

Output from ksh93u+ included with macOS and ksh93v-:

0000000 c2 a3                                            >£<
0000002
0000000 32 2c 3a c2 3a                                   >2,::<
0000005
0000000 31 2c 61 62 a7 63 64 20 65 66 a7 67 68 2c 2c     >1,abcd efgh,,<
0000017

Output from ksh built from the current source:

0000000 c2 a3                                            >£<
0000002
0000000 32 2c 3a c2 a3 3a                                >2,:£:<
0000006
0000000 31 2c 61 62 a7 63 64 20 65 66 a7 67 68 2c 2c     >1,abcd efgh,,<
0000017

So it appears that somewhere along the line we fixed the first test, probably as a result of my replacing most of the AST locale code with the platform's locale support.

P.S. Whoever fixes this should be certain to add appropriate unit tests.
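
A check along those lines could compare field counts and raw bytes rather than rendered characters, so that a mangled byte cannot hide behind terminal output. This is only a sketch, not the project's actual test harness. `split_report` is a hypothetical helper, and a POSIX `od` and `tr` are assumed:

```shell
# split_report (hypothetical helper): split "$2" using IFS="$1" and print
# "<number of fields>,<hex bytes of the first field>".
# The function body is a subshell, so IFS, noglob and the positional
# parameters do not leak into the caller.
split_report() (
    set -o noglob
    IFS=$1
    set -- $2
    printf '%s,%s\n' "$#" "$(printf '%s' "${1-}" | od -An -tx1 | tr -d ' \n')"
)

pound=$(printf '\302\243')      # £ = C2 A3
section=$(printf '\302\247')    # § = C2 A7

# Sanity check with a single-byte separator: two fields, the first being "a" (0x61).
split_report ':' 'a:b'          # prints 2,61

# The case from this issue. In a UTF-8 locale, a shell that handles multibyte
# IFS correctly should report one field that still contains both bytes of §.
# The bug described above shows up as a first field holding a lone a7 byte.
split_report "$pound" "$section"
```

The second call deliberately has no fixed expected output here, because the result depends on the shell under test and on the active locale.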

JohnoKing added a commit to JohnoKing/ksh that referenced this issue Jul 25, 2020
This commit fixes BUG_MULTIBIFS, which had two bug reports in the ksh2020 branch.
The modernish regression test suite now only reports eight test failures.

src/cmd/ksh93/sh/macro.c:
- Backport Eric Scrivner's fix for multibyte IFS characters (slightly modified for
  compatibility with C89). Explanation from att#737:

  Previously, the varsub method used for the macro expansion of $param, ${param},
  and ${param op word} would incorrectly expand the internal field separator (IFS)
  if it was a multibyte character. This was due to truncation based on the
  incorrect assumption that the IFS would never be larger than a single byte.

  This change fixes this issue by carefully tracking the number of bytes that
  should be persisted in the IFS case and ensuring that all bytes are written
  during expansion and substitution.

  Bug report: att#13

- Fixed another bug that caused multibyte characters with the same initial byte
  to be treated as the same character by the IFS. This bug was occurring because
  the first byte of a multibyte character wasn't being written to the stack when
  the IFS delimiter had the same initial byte:

  $ IFS=£
  $ v='§'
  $ set -- $v
  $ v="${1-}"
  $ echo "$v" | hd # The first byte should be c2, but it isn't due to the bug
  00000000  a7 0a                                             |..|
  00000002

  Bug report: att#1372

src/cmd/ksh93/tests/variables.sh:
- Add (reworked) regression tests from ksh2020 for the multibyte IFS bugs.
- Add a regression test for att#1372 based on the reproducer.
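
The first of the two bugs (the separator being truncated to one byte when fields are joined) can be probed the same way as the hex dumps earlier in this thread. This is a minimal sketch, assuming a POSIX `od` and `tr`. `join_report` is a hypothetical helper and is not taken from `variables.sh`:

```shell
# join_report (hypothetical helper): set IFS="$1", join the remaining
# arguments with "$*", and print the joined string as contiguous hex bytes.
# The function body is a subshell, so the IFS change stays local.
join_report() (
    IFS=$1
    shift
    printf '%s' "$*" | od -An -tx1 | tr -d ' \n'
    echo
)

# Single-byte separator: "a,b" is 61 2c 62.
join_report ',' a b             # prints 612c62

# Multibyte separator (£ = C2 A3), as in test 1 of the report. Going by the
# dumps above, a fixed shell in a UTF-8 locale yields 3a c2 a3 3a, while the
# truncating behavior yields 3a c2 3a.
join_report "$(printf '\302\243')" : :
```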
McDutchie pushed a commit to ksh93/ksh that referenced this issue Jul 25, 2020
Add support for multibyte characters to $IFS
