[decoder_byte_count] returns incorrect count after decoding normalized CRLF #17

max-heller · 2022-08-12T15:19:37Z

The docs say that

(* [decoder_byte_count d] is the number of bytes already decoded on
   [d] (including malformed ones). This is the last {!decode}'s
   end byte offset counting from the beginning of the stream. *)

but this contract does not hold for normalized CRLFs:

let%expect_test "CRLF" =
  (* Print out the [decoder_byte_count] after each decoded character *)
  let print_offsets s =
    (* Normalize ASCII newlines to '\n' *)
    let nln = Uchar.of_char '\n' in
    let dec = Uutf.decoder ~nln:(`ASCII nln) ~encoding:`UTF_8 (`String s) in
    let rec loop () =
      match Uutf.decode dec with
      | `Uchar _ ->
        print_int (Uutf.decoder_byte_count dec); print_char ' ';
        loop ()
      | `End -> ()
      | _ -> assert false
    in
    loop ()
  in
  print_offsets "\na";
  (* Correctly outputs "1 2" because the end byte offset of '\n' is 2
     and the end byte offset of 'a' is 2. *)
  [%expect "1 2"];

  print_offsets "\r\na";
  (* Should output "2 3" because the end byte offset of "\r\n" (normalized to '\n') is 2
     and the end byte offset of 'a' is 3, but actually outputs "1 3" because the normalized
     [`Uchar '\n'] corresponding to "\r\n" is returned early: after the decoder decodes the '\r'
     and before it decodes the '\n'. *)
  [%expect "2 3"];
;;

dbuenzli · 2022-08-12T16:18:27Z

Thanks for the report.

Basically this happens here.

I didn't look at this code for a very long time and just had a quick look but I'm not sure there's an easy fix.

Basically since `ASCII normalizes U+000D (CR), U+000A (LF) and the sequence <U+000D, U+000A> (CRLF). The easiest is to always report the normalisation on either U+000D or U+000A and then suppress the output on U+000A if the last character was a CR.

Changing this would require to introduce a lookahead in the implementation and introducing one gets you into non-blocking ``Await`` business handling. Not saying it can be done, just not obvious in the 10mins I spent on this and not sure it's worth fixing now that we have UTF decoding in the stdlib since 4.14.

Just to make sure, are you here because of this code ? Maybe I'd rather help you to get rid of Uutf in favor of the new 4.14 in the stdlib than fix this bug :-) In particular you don't seem to use the non-blocking stuff so that should be easy.

dbuenzli added the bug label Aug 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[decoder_byte_count] returns incorrect count after decoding normalized CRLF #17

[decoder_byte_count] returns incorrect count after decoding normalized CRLF #17

max-heller commented Aug 12, 2022

dbuenzli commented Aug 12, 2022

[decoder_byte_count] returns incorrect count after decoding normalized CRLF #17

[decoder_byte_count] returns incorrect count after decoding normalized CRLF #17

Comments

max-heller commented Aug 12, 2022

dbuenzli commented Aug 12, 2022