This repository has been archived by the owner on Oct 17, 2024. It is now read-only.

Update the ISO-8859 encodings to match the current standards #92

Closed



brianquinlan commented Nov 6, 2023

I generated the mappings using this script:

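import unicodedata

# The C1 controls (0x80-0x9f) plus NBSP (0xa0), kept as literal escape text
# (raw strings) so it can be matched against the generated output below.
top_controls = (r'\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c'
    r'\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e'
    r'\x9f\xa0')

def get_char_mapping(encoding, r=None):
    """Return string-literal text for the bytes in `r` decoded with `encoding`."""
    s = ""
    if r is None:
        r = range(256)
    for i in r:
        try:
            ch = bytes([i]).decode(encoding)
        except UnicodeDecodeError:
            # The byte has no mapping in this encoding: emit a U+FFFD escape.
            s += '\\ufffd'
        else:
            if ord(ch) > 0xffff:
                raise ValueError('not in BMP')
            # Escape control/format (C*) and separator (Z*) characters so
            # they stay visible in the generated source.
            if unicodedata.category(ch).startswith(('C', 'Z')):
                if ord(ch) < 256:
                    s += f'\\x{ord(ch):02x}'
                else:
                    s += f'\\u{ord(ch):04x}'
            else:
                s += ch
    return s

# Generate the top half (bytes 128-255) and factor out the shared controls.
char_map = get_char_mapping('iso8859-3', range(128, 256))
print(char_map.replace(top_controls, '$_topIsoControls'))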

  • I’ve reviewed the contributor guide and applied the relevant portions to this PR.

Note that many Dart repos have a weekly cadence for reviewing PRs - please allow for some latency before initial review feedback.

@brianquinlan brianquinlan requested a review from lrhn November 6, 2023 17:52
// ignore: missing_whitespace_between_adjacent_strings
const _ascii = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e'
'\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20'
r"""!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcd"""

Use ''' instead of """, to (better) match the other '-quoted strings.


- Require Dart 3.0
- Add chunked decoding support (`startChunkedConversion`) for `CodePage`
encodings.
- Update the ISO-8859 mappings to the latest version published by the Unicode
consortium.

Not just "new": this should say that we now actually use the official Unicode mappings, which we didn't before.
(Mainly because I was not aware they existed and based these tables directly on the ISO standards.)
So, suggested phrasing:

- Update the ISO-8859 mappings to use the Unicode
  consortium's recommended mappings between Unicode
  text and one-byte encodings.

@@ -146,17 +147,21 @@ const _top8859_16 = '\xa0ĄąŁ€„Š§š©Ș«Ź\xadźŻ°±ČłŽ”¶·žč
'ÀÁÂĂÄĆÆÇÈÉÊËÌÍÎÏĐŃÒÓÔŐÖŚŰÙÚÛÜĘȚß'
'àáâăäćæçèéêëìíîïđńòóôőöśűùúûüęțÿ';

const _top8859Controls = '\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c'
'\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e'

There should be room for 16 hex-escapes per line (64 chars + start/end quote and indent, easily within 70 chars),
so split in multiples of 16, for ease of reading:

const _top8859Controls =
  '\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f'
  '\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f'
  '\xa0';
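
The 16-per-line layout suggested above could also be produced mechanically rather than by hand. The following is a minimal Python sketch, not part of the PR; the helper name `format_escapes` and its parameters are hypothetical, and it only covers a contiguous range of `\xNN` escapes.

```python
def format_escapes(start, stop, per_line=16, indent="  "):
    """Return adjacent '-quoted literals of \\xNN escapes for start..stop (inclusive).

    Hypothetical helper: emits `per_line` escapes per source line and
    terminates the last line with a semicolon.
    """
    codes = list(range(start, stop + 1))
    lines = []
    for i in range(0, len(codes), per_line):
        chunk = codes[i:i + per_line]
        body = "".join(f"\\x{c:02x}" for c in chunk)
        lines.append(f"{indent}'{body}'")
    return "\n".join(lines) + ";"


# Bytes 0x80-0xa0: two full rows of 16 escapes, then '\xa0' on its own row.
print("const _top8859Controls =")
print(format_escapes(0x80, 0xA0))
```

With two spaces of indent, each full row is 68 characters wide (2 + 1 + 64 + 1), which stays inside the 70-character budget mentioned in the comment.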

CodePage._bmp('latin-3', '$_ascii$_noControls$_top8859_3');
///
/// See https://unicode.org/Public/MAPPINGS/ISO8859/8859-3.TXT
final CodePage latin3 = CodePage._bmp('latin-3', '$_ascii$_top8859_3');

/// The ISO-8859-4/Latin-4 (North European) code page.
final CodePage latin4 =

Let's do the rest too.
(I'd write a script to fetch and parse the Unicode tables, rather than doing it manually.)

... OK, so I did that.
#93
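
For illustration only, parsing one of the Unicode mapping tables might look roughly like the Python sketch below. It is not the script from #93. It assumes the table layout used by files such as 8859-3.TXT (a byte value and a code point in hex, separated by whitespace, with `#` starting a comment); the function name `parse_mapping` and the inline sample are made up, and fetching the file is left out.

```python
def parse_mapping(text):
    """Parse a mapping table into a 256-entry list of code points.

    Assumed line format: "0xA1<TAB>0x0126<TAB># NAME". Bytes that the table
    does not map are left as U+FFFD (an illustrative choice).
    """
    table = [0xFFFD] * 256
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # byte listed without a Unicode value
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# A three-line sample in the style of 8859-3.TXT.
sample = """\
# sample rows
0xA0\t0x00A0\t#\tNO-BREAK SPACE
0xA1\t0x0126\t#\tLATIN CAPITAL LETTER H WITH STROKE
0xA2\t0x02D8\t#\tBREVE
"""
table = parse_mapping(sample)
top = "".join(chr(cp) for cp in table[0xA0:0xA3])
print(ascii(top))
```

The resulting list can then be turned into a 256-character string, with the same escaping rules as the generator script in the PR description applied to control and separator characters.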


mosuem commented Oct 16, 2024

Closing as the dart-lang/convert repository is merged into the dart-lang/core monorepo. Please re-open this PR there!

@mosuem mosuem closed this Oct 16, 2024