Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not short-circuit decode_utf8 with utf8 flags #11

Merged
merged 1 commit into from
Aug 29, 2013

Conversation

miyagawa
Copy link
Contributor

The documentation about decode_utf8 says:

decode_utf8
    $string = decode_utf8($octets [, CHECK]);
Equivalent to "$string = decode("utf8", $octets [, CHECK])". 

which is not true, because it bypasses the whole decoding process if $octet already has utf-8 flags.

I think it's incorrect to bypass the decoding depending on the utf-8 flags, since a) it will emit unreliable, inconsistent results depending on how the byte string has been generated, and b) it's not consistent to decode("utf8", $octets) in any ways.

This patch eliminates the check, and passes all tests.

@miyagawa
Copy link
Contributor Author

# file encoding: utf-8
use strict;
use Encode qw(decode_utf8 FB_WARN);

my $t1 = "Léon";
my $t2 = "L\x{e9}on"; # latin-1

my $pad = "\x{30C9}";

for my $byte ($t1, $t2) {
    warn "Decoding $byte";
    my $orig = $byte;
    Encode::decode_utf8($orig, FB_WARN);

    warn "Decoding $byte";
    my $orig2 = $byte;
    Encode::decode('utf-8', $orig2, FB_WARN);

    chop(my $new = $byte . $pad);
    warn "Decoding $new after pad/chop";
    Encode::decode_utf8($new, FB_WARN);
}

From the user's point of view, because $t2 is latin-1 string encoded, we expect decode_utf8($t2) to give warnings all the time, since you're trying to decode latin-1 with UTF-8.

Before this patch:

perl ~/tmp/decode_latin1.t
Decoding Léon at /Users/miyagawa/tmp/decode_latin1.t line 11.
Decoding Léon at /Users/miyagawa/tmp/decode_latin1.t line 15.
Decoding Léon after pad/chop at /Users/miyagawa/tmp/decode_latin1.t line 20.
Decoding L�on at /Users/miyagawa/tmp/decode_latin1.t line 11.
utf8 "\xE9" does not map to Unicode at /Users/miyagawa/.plenv/versions/5.18.1/lib/perl5/site_perl/5.18.1/darwin-2level/Encode.pm line 217.
Decoding L�on at /Users/miyagawa/tmp/decode_latin1.t line 15.
utf8 "\xE9" does not map to Unicode at /Users/miyagawa/.plenv/versions/5.18.1/lib/perl5/site_perl/5.18.1/darwin-2level/Encode.pm line 176.
Decoding L�on after pad/chop at /Users/miyagawa/tmp/decode_latin1.t line 20.

The last version, decode_utf8($t2) with pad/chop (i.e. implicit upgrade) doesn't give a warning, and returns the original string as a result. This is not just wrong, but also makes decode_utf8 unreliable and unusable because you, a user of Encode.pm, always have to make sure if the data has a utf-8 flag in it.

After this patch, all the calls to decode_utf8, will give warnings and reliable result, as decode("utf-8") does.

@miyagawa
Copy link
Contributor Author

I checked
https://rt.cpan.org/Ticket/Display.html?id=87267
https://rt.cpan.org/Ticket/Display.html?id=61671

decode_utf8() is now ISOMORPHICALLY IDENTICAL to decode('utf8' …), which wasn't till 2.40.

While I'm not sure what other issues have been fixed in 2.40, i'm confident decode_utf8 and decode("utf-8") isn't equivalent in the current version.

Let me know if I should better make that above test as an actual unit test. I'm happy to.

dankogai added a commit that referenced this pull request Aug 29, 2013
Do not short-circuit decode_utf8 with utf8 flags
@dankogai dankogai merged commit 8005a82 into dankogai:master Aug 29, 2013
@vsespb
Copy link
Contributor

vsespb commented Aug 29, 2013

Contradict #10 which is merged too

@miyagawa
Copy link
Contributor Author

Good catch. I would vote for reverting #10 then since it is not necessary?

Sent from Mailbox for iPhone

On Thu, Aug 29, 2013 at 8:05 AM, Victor Efimov [email protected]
wrote:

Contradict #10 which is merged too

Reply to this email directly or view it on GitHub:
#11 (comment)

@vsespb
Copy link
Contributor

vsespb commented Aug 29, 2013

I would vote for reverting my patch ( #10 ) too. However @dankogai rejected patches like #11 since Y2010, that's why I sent #10. Also there is testsuites for #11 in RT#87267

@miyagawa
Copy link
Contributor Author

@vsespb can't see a test suite (.t) on the RT ticket. I will make one up.

@vsespb
Copy link
Contributor

vsespb commented Aug 29, 2013

@miyagawa It's plaintext (not attachment) in first message https://rt.cpan.org/Public/Bug/Display.html?id=87267

@miyagawa
Copy link
Contributor Author

done #12

dankogai added a commit that referenced this pull request Aug 29, 2013
Unit test for decoding behavior change in #11
jperkin pushed a commit to TritonDataCenter/pkgsrc-legacy that referenced this pull request Dec 9, 2013
$Revision: 2.54 $ $Date: 2013/08/29 16:47:39 $
! Encode.xs
+ t/cow.t
  Addressed: COW breakage with _utf8_on()
  https://rt.cpan.org/Ticket/Display.html?id=88230
! Encode.pm
  Reverted the document accordingly to #11
  dankogai/p5-encode#10
+ t/decode.t
  Unit test for decoding behavior change in #11
  dankogai/p5-encode#12

2.53 2013/08/29 15:20:31
! Encode.pm
  Merged: Do not short-circuit decode_utf8 with utf8 flags
  dankogai/p5-encode#11
  Merged: document decode_utf8 behaviour more precise
  dankogai/p5-encode#10
! Makefile.PL
  Added repository cpan metadata
  dankogai/p5-encode#9

2.52 2013/08/14 02:29:54
! ucm/*.ucm
  Addressed:
    Unicode Mappping tables are missing Unicode Inc. license notification
    All files including "as long as this notice remains attached" now
    have that notice attached in the comment section.  (cp* and mac*
    do not since their source files do not include that notice)
  https://rt.cpan.org/Ticket/Display.html?id=87340
! lib/Encode/MIME/Header.pm
  t/mime-header.t
  Addressed: encoding "0" with MIME-Headers gets a blank string
  https://rt.cpan.org/Ticket/Display.html?id=87831
! Encode.pm
  Addressed: Documentation buglet
  https://rt.cpan.org/Ticket/Display.html?id=84992
! Byte/Makefile.PL CN/Makefile.PL EBCDIC/Makefile.PL
  Encode/Makefile_PL.e2x JP/Makefile.PL KR/Makefile.PL
  Symbol/Makefile.PL TW/Makefile.PL
  Applied: Patch to output #includes in deterministic order
  https://rt.cpan.org/Ticket/Display.html?id=86974

2.51 2013/04/29 22:19:11
! Encode.xs
  Addressed: Encode.xs doesn't compile with Microsoft C compiler
  https://rt.cpan.org/Public/Bug/Display.html?id=84920
! MANIFEST
  Addressed: t/taint.t missing
  https://rt.cpan.org/Public/Bug/Display.html?id=84919

2.50 2013/04/26 18:30:46
! Encode.xs Unicode/Unicode.xs
  lib/Encode/Unicode/UTF7.pm lib/CN/HZ.pm lib/Encode/GSM0338.pm
  t/taint.t
  Addressed: Encode::encode and Encode::decode
             gratuitously launders tainted data
  Taintedness now propagates as it should.
  https://rt.cpan.org/Ticket/Display.html?id=84879
! encoding.pm
  Addressed: 5.18 deprecation
  https://rt.cpan.org/Ticket/Display.html?id=84709
! bin/piconv
  Applied: Update piconv documentation
  https://rt.cpan.org/Ticket/Display.html?id=84695

2.49 2013/03/05 03:12:49
! Encode.xs
  Addressed: Encoding objects leak memory if decoding fails
  dankogai/p5-encode#8

2.48 2013/02/18 02:23:56
! encoding.pm
  t/Mod_EUCJP.pm t/enc_data.t t/enc_eucjp.t t/enc_module.t t/enc_utf8.t
  t/encoding.t t/jperl.t
  [PATCH] Deprecate encoding.pm
  https://rt.cpan.org/Ticket/Display.html?id=81255
! Encode/Supported.pod
  Fixed: Pod errors
  https://rt.cpan.org/Ticket/Display.html?id=81426
! Encode.pm t/Encode.t
  [PATCH] Fix for shared hash key scalars
  https://rt.cpan.org/Ticket/Display.html?id=80608
! Encode.pm
  Fixed: Uninitialized value warning from Encode->encodings()
  https://rt.cpan.org/Ticket/Display.html?id=80181
! Makefile.PL
  Install to 'site' instead of 'perl' when perl version is 5.11+
  https://rt.cpan.org/Ticket/Display.html?id=78917
! Encode/Makefile_PL.e2x
  find enc2xs.bat if it works on windows.
  dankogai/p5-encode#7
! t/piconv.t
  Fix finding piconv in t/piconv.t
  dankogai/p5-encode#6
RichiH pushed a commit to RichiH/perl that referenced this pull request Jan 11, 2014
  [DELTA]

$Revision: 2.54 $ $Date: 2013/08/29 16:47:39 $
! Encode.xs
+ t/cow.t
  Addressed: COW breakage with _utf8_on()
  https://rt.cpan.org/Ticket/Display.html?id=88230
! Encode.pm
  Reverted the document accordingly to #11
  dankogai/p5-encode#10
+ t/decode.t
  Unit test for decoding behavior change in #11
  dankogai/p5-encode#12

2.53 2013/08/29 15:20:31
! Encode.pm
  Merged: Do not short-circuit decode_utf8 with utf8 flags
  dankogai/p5-encode#11
  Merged: document decode_utf8 behaviour more precise
  dankogai/p5-encode#10
! Makefile.PL
  Added repository cpan metadata
  dankogai/p5-encode#9
jperkin pushed a commit to TritonDataCenter/pkgsrc-legacy that referenced this pull request Jan 21, 2014
$Revision: 2.54 $ $Date: 2013/08/29 16:47:39 $
! Encode.xs
+ t/cow.t
  Addressed: COW breakage with _utf8_on()
  https://rt.cpan.org/Ticket/Display.html?id=88230
! Encode.pm
  Reverted the document accordingly to #11
  dankogai/p5-encode#10
+ t/decode.t
  Unit test for decoding behavior change in #11
  dankogai/p5-encode#12

2.53 2013/08/29 15:20:31
! Encode.pm
  Merged: Do not short-circuit decode_utf8 with utf8 flags
  dankogai/p5-encode#11
  Merged: document decode_utf8 behaviour more precise
  dankogai/p5-encode#10
! Makefile.PL
  Added repository cpan metadata
  dankogai/p5-encode#9

2.52 2013/08/14 02:29:54
! ucm/*.ucm
  Addressed:
    Unicode Mappping tables are missing Unicode Inc. license notification
    All files including "as long as this notice remains attached" now
    have that notice attached in the comment section.  (cp* and mac*
    do not since their source files do not include that notice)
  https://rt.cpan.org/Ticket/Display.html?id=87340
! lib/Encode/MIME/Header.pm
  t/mime-header.t
  Addressed: encoding "0" with MIME-Headers gets a blank string
  https://rt.cpan.org/Ticket/Display.html?id=87831
! Encode.pm
  Addressed: Documentation buglet
  https://rt.cpan.org/Ticket/Display.html?id=84992
! Byte/Makefile.PL CN/Makefile.PL EBCDIC/Makefile.PL
  Encode/Makefile_PL.e2x JP/Makefile.PL KR/Makefile.PL
  Symbol/Makefile.PL TW/Makefile.PL
  Applied: Patch to output #includes in deterministic order
  https://rt.cpan.org/Ticket/Display.html?id=86974

2.51 2013/04/29 22:19:11
! Encode.xs
  Addressed: Encode.xs doesn't compile with Microsoft C compiler
  https://rt.cpan.org/Public/Bug/Display.html?id=84920
! MANIFEST
  Addressed: t/taint.t missing
  https://rt.cpan.org/Public/Bug/Display.html?id=84919

2.50 2013/04/26 18:30:46
! Encode.xs Unicode/Unicode.xs
  lib/Encode/Unicode/UTF7.pm lib/CN/HZ.pm lib/Encode/GSM0338.pm
  t/taint.t
  Addressed: Encode::encode and Encode::decode
             gratuitously launders tainted data
  Taintedness now propagates as it should.
  https://rt.cpan.org/Ticket/Display.html?id=84879
! encoding.pm
  Addressed: 5.18 deprecation
  https://rt.cpan.org/Ticket/Display.html?id=84709
! bin/piconv
  Applied: Update piconv documentation
  https://rt.cpan.org/Ticket/Display.html?id=84695

2.49 2013/03/05 03:12:49
! Encode.xs
  Addressed: Encoding objects leak memory if decoding fails
  dankogai/p5-encode#8

2.48 2013/02/18 02:23:56
! encoding.pm
  t/Mod_EUCJP.pm t/enc_data.t t/enc_eucjp.t t/enc_module.t t/enc_utf8.t
  t/encoding.t t/jperl.t
  [PATCH] Deprecate encoding.pm
  https://rt.cpan.org/Ticket/Display.html?id=81255
! Encode/Supported.pod
  Fixed: Pod errors
  https://rt.cpan.org/Ticket/Display.html?id=81426
! Encode.pm t/Encode.t
  [PATCH] Fix for shared hash key scalars
  https://rt.cpan.org/Ticket/Display.html?id=80608
! Encode.pm
  Fixed: Uninitialized value warning from Encode->encodings()
  https://rt.cpan.org/Ticket/Display.html?id=80181
! Makefile.PL
  Install to 'site' instead of 'perl' when perl version is 5.11+
  https://rt.cpan.org/Ticket/Display.html?id=78917
! Encode/Makefile_PL.e2x
  find enc2xs.bat if it works on windows.
  dankogai/p5-encode#7
! t/piconv.t
  Fix finding piconv in t/piconv.t
  dankogai/p5-encode#6
jperkin pushed a commit to TritonDataCenter/pkgsrc-legacy that referenced this pull request Mar 14, 2014
$Revision: 2.54 $ $Date: 2013/08/29 16:47:39 $
! Encode.xs
+ t/cow.t
  Addressed: COW breakage with _utf8_on()
  https://rt.cpan.org/Ticket/Display.html?id=88230
! Encode.pm
  Reverted the document accordingly to #11
  dankogai/p5-encode#10
+ t/decode.t
  Unit test for decoding behavior change in #11
  dankogai/p5-encode#12

2.53 2013/08/29 15:20:31
! Encode.pm
  Merged: Do not short-circuit decode_utf8 with utf8 flags
  dankogai/p5-encode#11
  Merged: document decode_utf8 behaviour more precise
  dankogai/p5-encode#10
! Makefile.PL
  Added repository cpan metadata
  dankogai/p5-encode#9

2.52 2013/08/14 02:29:54
! ucm/*.ucm
  Addressed:
    Unicode Mappping tables are missing Unicode Inc. license notification
    All files including "as long as this notice remains attached" now
    have that notice attached in the comment section.  (cp* and mac*
    do not since their source files do not include that notice)
  https://rt.cpan.org/Ticket/Display.html?id=87340
! lib/Encode/MIME/Header.pm
  t/mime-header.t
  Addressed: encoding "0" with MIME-Headers gets a blank string
  https://rt.cpan.org/Ticket/Display.html?id=87831
! Encode.pm
  Addressed: Documentation buglet
  https://rt.cpan.org/Ticket/Display.html?id=84992
! Byte/Makefile.PL CN/Makefile.PL EBCDIC/Makefile.PL
  Encode/Makefile_PL.e2x JP/Makefile.PL KR/Makefile.PL
  Symbol/Makefile.PL TW/Makefile.PL
  Applied: Patch to output #includes in deterministic order
  https://rt.cpan.org/Ticket/Display.html?id=86974

2.51 2013/04/29 22:19:11
! Encode.xs
  Addressed: Encode.xs doesn't compile with Microsoft C compiler
  https://rt.cpan.org/Public/Bug/Display.html?id=84920
! MANIFEST
  Addressed: t/taint.t missing
  https://rt.cpan.org/Public/Bug/Display.html?id=84919

2.50 2013/04/26 18:30:46
! Encode.xs Unicode/Unicode.xs
  lib/Encode/Unicode/UTF7.pm lib/CN/HZ.pm lib/Encode/GSM0338.pm
  t/taint.t
  Addressed: Encode::encode and Encode::decode
             gratuitously launders tainted data
  Taintedness now propagates as it should.
  https://rt.cpan.org/Ticket/Display.html?id=84879
! encoding.pm
  Addressed: 5.18 deprecation
  https://rt.cpan.org/Ticket/Display.html?id=84709
! bin/piconv
  Applied: Update piconv documentation
  https://rt.cpan.org/Ticket/Display.html?id=84695

2.49 2013/03/05 03:12:49
! Encode.xs
  Addressed: Encoding objects leak memory if decoding fails
  dankogai/p5-encode#8

2.48 2013/02/18 02:23:56
! encoding.pm
  t/Mod_EUCJP.pm t/enc_data.t t/enc_eucjp.t t/enc_module.t t/enc_utf8.t
  t/encoding.t t/jperl.t
  [PATCH] Deprecate encoding.pm
  https://rt.cpan.org/Ticket/Display.html?id=81255
! Encode/Supported.pod
  Fixed: Pod errors
  https://rt.cpan.org/Ticket/Display.html?id=81426
! Encode.pm t/Encode.t
  [PATCH] Fix for shared hash key scalars
  https://rt.cpan.org/Ticket/Display.html?id=80608
! Encode.pm
  Fixed: Uninitialized value warning from Encode->encodings()
  https://rt.cpan.org/Ticket/Display.html?id=80181
! Makefile.PL
  Install to 'site' instead of 'perl' when perl version is 5.11+
  https://rt.cpan.org/Ticket/Display.html?id=78917
! Encode/Makefile_PL.e2x
  find enc2xs.bat if it works on windows.
  dankogai/p5-encode#7
! t/piconv.t
  Fix finding piconv in t/piconv.t
  dankogai/p5-encode#6
omron93 pushed a commit to omron93/rpmgirll that referenced this pull request May 5, 2014
Somewhere between 2.51-1 and 2.54-1, perl-Encode changed from
leaving ñ (n-tilde) untouched to converting it to ñ
(ampersand-pound-241-semicolon). I think this is why:
    dankogai/p5-encode#11
...and I think the new behavior is actually correct: invoking
decode_utf8() with known good UTF-8 is a bad idea.

Solution: test before calling decode_utf8(). If our input
string is valid UTF-8, use it as-is.
@anarcat
Copy link

anarcat commented Sep 10, 2014

for the record, this totally broke unicode support in at least one Perl application I use (ikiwiki), see http://ikiwiki.info/bugs/garbled_non-ascii_characters_in_body_in_web_interface/ for the fun stuff there...

maybe keeping a "safe_decode_utf8" function for backwards compatibility could be useful for all those tools out there that made the mistake of surviving with the hack?

@miyagawa
Copy link
Contributor Author

@anarcat sorry to hear that, but I'm sure that was due to the broken model of string handlings in the application, where it was accidentally running due to the bug in Encode.pm. If you really liked the previous behavior of Encode#decode_utf8, you can do so by providing a wrapper function that checks Encode::is_utf8($str) first.

@anarcat
Copy link

anarcat commented Sep 10, 2014

i know it was probably a "bug" in the application, but the reality of this is that this is a major change in the way this function actually works. Perl 5.20 now ships with this and this will break probably hundreds of perl programs that worked because of this handler that is now removed.

i also know how to make my own wrapper, but it seems to me this should have been part of the perldelta for 5.20, at the very least, and ideally provide backwards-compatible wrappers for people.

i am not saying this change is wrong, mind you, I am just saying that things that used to work are now broken and that porting old application to this new paradigm is a significant amount of work that seems to have been overlooked here.

@vsespb
Copy link
Contributor

vsespb commented Sep 10, 2014

I am just saying that things that used to work are now broken

that's sad. sorry for your trouble.

will break probably hundreds of perl programs that worked

I hope no. I hope people use things not just because "that works", but only when it's documented (or, at least sane) feature that they use.
But I could be wrong.

this should have been part of the perldelta for 5.20

probably yes, you're right.

jsonn pushed a commit to jsonn/pkgsrc that referenced this pull request Oct 11, 2014
$Revision: 2.54 $ $Date: 2013/08/29 16:47:39 $
! Encode.xs
+ t/cow.t
  Addressed: COW breakage with _utf8_on()
  https://rt.cpan.org/Ticket/Display.html?id=88230
! Encode.pm
  Reverted the document accordingly to #11
  dankogai/p5-encode#10
+ t/decode.t
  Unit test for decoding behavior change in #11
  dankogai/p5-encode#12

2.53 2013/08/29 15:20:31
! Encode.pm
  Merged: Do not short-circuit decode_utf8 with utf8 flags
  dankogai/p5-encode#11
  Merged: document decode_utf8 behaviour more precise
  dankogai/p5-encode#10
! Makefile.PL
  Added repository cpan metadata
  dankogai/p5-encode#9

2.52 2013/08/14 02:29:54
! ucm/*.ucm
  Addressed:
    Unicode Mappping tables are missing Unicode Inc. license notification
    All files including "as long as this notice remains attached" now
    have that notice attached in the comment section.  (cp* and mac*
    do not since their source files do not include that notice)
  https://rt.cpan.org/Ticket/Display.html?id=87340
! lib/Encode/MIME/Header.pm
  t/mime-header.t
  Addressed: encoding "0" with MIME-Headers gets a blank string
  https://rt.cpan.org/Ticket/Display.html?id=87831
! Encode.pm
  Addressed: Documentation buglet
  https://rt.cpan.org/Ticket/Display.html?id=84992
! Byte/Makefile.PL CN/Makefile.PL EBCDIC/Makefile.PL
  Encode/Makefile_PL.e2x JP/Makefile.PL KR/Makefile.PL
  Symbol/Makefile.PL TW/Makefile.PL
  Applied: Patch to output #includes in deterministic order
  https://rt.cpan.org/Ticket/Display.html?id=86974

2.51 2013/04/29 22:19:11
! Encode.xs
  Addressed: Encode.xs doesn't compile with Microsoft C compiler
  https://rt.cpan.org/Public/Bug/Display.html?id=84920
! MANIFEST
  Addressed: t/taint.t missing
  https://rt.cpan.org/Public/Bug/Display.html?id=84919

2.50 2013/04/26 18:30:46
! Encode.xs Unicode/Unicode.xs
  lib/Encode/Unicode/UTF7.pm lib/CN/HZ.pm lib/Encode/GSM0338.pm
  t/taint.t
  Addressed: Encode::encode and Encode::decode
             gratuitously launders tainted data
  Taintedness now propagates as it should.
  https://rt.cpan.org/Ticket/Display.html?id=84879
! encoding.pm
  Addressed: 5.18 deprecation
  https://rt.cpan.org/Ticket/Display.html?id=84709
! bin/piconv
  Applied: Update piconv documentation
  https://rt.cpan.org/Ticket/Display.html?id=84695

2.49 2013/03/05 03:12:49
! Encode.xs
  Addressed: Encoding objects leak memory if decoding fails
  dankogai/p5-encode#8

2.48 2013/02/18 02:23:56
! encoding.pm
  t/Mod_EUCJP.pm t/enc_data.t t/enc_eucjp.t t/enc_module.t t/enc_utf8.t
  t/encoding.t t/jperl.t
  [PATCH] Deprecate encoding.pm
  https://rt.cpan.org/Ticket/Display.html?id=81255
! Encode/Supported.pod
  Fixed: Pod errors
  https://rt.cpan.org/Ticket/Display.html?id=81426
! Encode.pm t/Encode.t
  [PATCH] Fix for shared hash key scalars
  https://rt.cpan.org/Ticket/Display.html?id=80608
! Encode.pm
  Fixed: Uninitialized value warning from Encode->encodings()
  https://rt.cpan.org/Ticket/Display.html?id=80181
! Makefile.PL
  Install to 'site' instead of 'perl' when perl version is 5.11+
  https://rt.cpan.org/Ticket/Display.html?id=78917
! Encode/Makefile_PL.e2x
  find enc2xs.bat if it works on windows.
  dankogai/p5-encode#7
! t/piconv.t
  Fix finding piconv in t/piconv.t
  dankogai/p5-encode#6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants