Use UTF-8 internally for strings. #684
Comments
We will also need the length in bytes, which adds another word:

typedef struct {
    PyObject_HEAD
    uintptr_t interned: 2;
    uintptr_t ascii: 1;
    uintptr_t valid_utf8: 1;
    uintptr_t length: (WORD_SIZE-4); /* Number of code points in the string */
    uintptr_t byte_count;            /* Number of bytes in the string */
    Py_hash_t hash;                  /* Hash value; -1 if not set */
    PyUnicodeIndex *index;           /* NULL unless needed */
    uint8_t data[1];
} PyUnicodeObject;

On 64 bit machines that's 48 bytes of header, compared to the current 40 bytes, unfortunately. Given that most strings are small, we could split the object into a shared base plus small and large variants:

typedef struct {
    PyObject_HEAD
    uint32_t is_small: 1;
    uint32_t is_ascii: 1;
    uint32_t interned: 2;
    uint32_t ascii: 1;
    uint32_t valid_utf8: 1;
    uint32_t small_length: 24;  /* Number of code points in the string */
    uint32_t small_count;       /* Number of bytes in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
} PyUnicodeObjectBase;

typedef struct {
    PyUnicodeObjectBase base;
    uint8_t data[1];
} PySmallAsciiObject;

typedef struct {
    PyUnicodeObjectBase base;
    PyUnicodeIndex *index;  /* NULL unless needed */
    uint8_t data[1];
} PySmallUnicodeObject;

typedef struct {
    PyUnicodeObjectBase base;
    PyUnicodeIndex *index;  /* NULL unless needed */
    uint64_t large_length;  /* Number of code points in the string */
    uint64_t large_count;   /* Number of bytes in the string */
    uint8_t data[1];
} PyLargeUnicodeObject;

typedef union {
    PySmallAsciiObject ascii;
    PySmallUnicodeObject small;
    PyLargeUnicodeObject large;
} PyUnicodeObject;

This makes the code more complex, but still considerably simpler than what we have now. |
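To illustrate how code might use such a union (a sketch only; the helper names are made up and not part of the proposal above), accessors could dispatch on the is_small/is_ascii bits of the shared base, which is a common initial member of all three variants:

static inline const uint8_t *
unicode_data(const PyUnicodeObject *u)
{
    const PyUnicodeObjectBase *base = &u->ascii.base;  /* base is common to all variants */
    if (!base->is_small)
        return u->large.data;
    return base->is_ascii ? u->ascii.data : u->small.data;
}

static inline uint64_t
unicode_byte_count(const PyUnicodeObject *u)
{
    const PyUnicodeObjectBase *base = &u->ascii.base;
    return base->is_small ? base->small_count : u->large.large_count;
}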
It does seem that the world at large has come around to UTF-8, with fixed-size encodings feeling more like legacy at this point. See UTF-8 Everywhere. PEP 393 provides background for the current state of Python strings. The effects on the C API and the runtime/memory semantic changes seem big enough that this should also have a PEP, IMHO. |
PyPy has been using this approach for a number of years and we are indeed very happy with it (we even have zero-copy utf-8 decoding of bytes, at the cost of an extra indirection from our equivalent of the unicode object). One of the most annoying parts of this approach is the bookkeeping needed to convert between byte offsets and code point indices around the string search code. I've been vaguely considering rewriting the string find algorithm to also count the code points it skips over, but haven't gotten around to that yet. (The JIT actually helps us here sometimes, because it knows about the functions involved.) |
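As a rough illustration of that counting idea (a sketch, not PyPy's actual code): while scanning utf-8 data, a byte starts a new code point exactly when it is not a continuation byte, so a search loop can maintain a code point count almost for free.

/* Count the code points in a utf-8 buffer while scanning it.
   In valid utf-8 a byte starts a code point iff it is not of the form 10xxxxxx. */
static size_t
count_codepoints(const uint8_t *data, size_t nbytes)
{
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((data[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}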
Can you elaborate on this more? This scheme seems clever, but after staring at it a while I still have no idea how it is supposed to work. I'm assuming that you meant something slightly different from what is written. Either way, it seems like there are offsets that are impossible to encode... for example, in a string with 64 one-byte characters followed by a bunch of two-byte characters, I think? |
I think the struct for an index entry may have been intended to be:

struct index_entry {
    uintptr_t base_offset;
    uint8_t additional_offsets[64];
};

and then lookup looks like

entry = index_entries[index/64];
offset = entry.base_offset + entry.additional_offsets[index%64];

This works because the UTF-8 encoding of a code point is at most 4 bytes, so the largest value one ever needs to store in additional_offsets is 63 * 4 = 252, which fits in a uint8_t.

A possible variation here, trading a bit of computation at lookup time for further reducing the size of the index, would be to have a sparse list of offsets (like only every 2 or 4 or 8 code points), and then partially decode the UTF-8 to skip past a few code points as necessary. Like this:

// or this could be 2 or 4, or 16
#define INDEX_SKIP_FACTOR 8

struct index_entry {
    uintptr_t base_offset;
    uint8_t additional_offsets[64 / INDEX_SKIP_FACTOR];
};

// ...
entry = index_entries[index/64];
offset = entry.base_offset + entry.additional_offsets[(index%64) / INDEX_SKIP_FACTOR];
for (index = index % INDEX_SKIP_FACTOR; index > 0; index--) {
    offset += utf8_codepoint_length_from_first_byte(data[offset]);
}

where utf8_codepoint_length_from_first_byte returns the length, 1 to 4 bytes, of the code point whose encoding starts with the given byte.

With the first scheme above (my interpretation of what Mark had in mind in the OP) the index is fairly large: about 112% the size of the string itself if the string is long but mostly ASCII (8 + 64 = 72 bytes of index per block of 64 code points, which is roughly 64 bytes of text), so this sort of further compression might be worthwhile. With an INDEX_SKIP_FACTOR of 8 the index shrinks to 8 + 8 = 16 bytes per block, or 25% the size of the string itself in that case. For short strings, the relative size of the index can be even greater: a single 72-byte entry, reduced to 16 bytes using a skip factor of 8. |
Yes, thanks. I've corrected the original. |
My thinking was that either you don't need an index (all ASCII strings and the majority of strings where indexing isn't used), or you do. If you do need indexing you'll want it to be fast, and having to parse the utf-8 might be too slow.
The nice thing about this is that we can experiment with various schemes, to find the sweet spot. |
Yes, exactly. Thanks for the clarification. |
Reasonable! And yeah, this should lend itself very well to experimentation. |
My take: I've seen papers about fast utf-8 indexing, and one of them proposed an index entry for every 1000 or so characters. I am sorry but I don't have the reference to share now. This would be a great enhancement proposal. In Fluent Python I wrote that the current scheme means that concatenating a single 🐀 (U+1F400) onto "pure" Latin-1 text produces a str that uses 4 bytes per character. |
Python's current scheme uses 1 byte per character for Latin-1 strings, not just ASCII. Latin-1 text is ASCII letters with some accents (e.g. "façade") plus curly braces, m-dashes, and other common symbols used in typeset text. In utf-8, non-ASCII Latin-1 takes 2 bytes per character instead of 1 in the current scheme (e.g. "façade" is 6 bytes in Latin-1 but 7 bytes in utf-8). I believe migrating to utf-8 is a good idea, but just beware of the impact on text in Western European languages that use Latin-1, which is mostly ASCII with accents here and there. |
I don't think that we can simply remove functions without providing better replacements; I added the new PyUnicodeWriter API with that in mind. We should get rid of functions writing directly into Python strings (such as PyUnicode_WriteChar()). Allowing an immutable str to be modified is a nightmare for caches: how can you make sure that a string is ASCII if it can be modified, for example? I'm also working on PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions, which is somehow an abstraction for the "kind" and "data" things of the PEP 393 C APIs. |
Can we afford to spend 2 bits on these members, so that strings can be created with them uninitialized and they are only computed on demand? Checking whether a string is ASCII, or whether it is valid UTF-8, is an O(n) operation. |
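A minimal sketch of that idea, assuming the single-struct layout from the first comment with the ascii field widened to 2 bits (the enum and function names here are illustrative, not CPython API):

enum { FLAG_UNKNOWN = 0, FLAG_FALSE = 1, FLAG_TRUE = 2 };

static int
unicode_is_ascii(PyUnicodeObject *u)
{
    if (u->ascii == FLAG_UNKNOWN) {
        int all_ascii = 1;
        for (uintptr_t i = 0; i < u->byte_count; i++) {
            if (u->data[i] & 0x80) {
                all_ascii = 0;
                break;
            }
        }
        /* The O(n) scan happens at most once; the result is cached in the flag. */
        u->ascii = all_ascii ? FLAG_TRUE : FLAG_FALSE;
    }
    return u->ascii == FLAG_TRUE;
}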
During the deprecation period, we need to support PyUnicode_New() + PyUnicode_WriteChar() in some way. I removed the "legacy" Unicode object when I removed the deprecated wstr representation. Or we can over-allocate memory and convert in place, so that a contiguous memory layout is always used: PyUnicode_READY() would do the conversion, as we did for the legacy Unicode object. |
@vstinner The Mojo🔥 implementation also looks very promising in general: modularml/mojo#3401 (this is in fact merged to the internal OpenAI branch). It looks even faster than simdutf. |
On Wed, Oct 9, 2024 at 1:34 PM Vasily Ryabov wrote:

> We can implement method is_ascii() extremely quickly: return byte_count == length.

Nice trick, with a catch. If byte_count != length we are 100% sure it is not ASCII. But if byte_count == length, the string may contain bytes with values >= 127 which are not ASCII. So I'd say your trick is an excellent first step for a fast implementation of is_ascii().

Cheers,
Luciano
|
@ramalho thanks. According to this table https://www.ascii-code.com/ 127 is still an ASCII symbol, so the trick should be necessary and sufficient. Only code points of 128 and above take more than one byte in utf-8. |
I'm sure this has been discussed elsewhere and I don't know when, or if, we'll have time to implement it, but I think it's worth adding here.
Currently Python strs (PyUnicodeObject in C) are implemented as arrays of 1, 2 or 4 byte unicode code points, plus a header. I propose that we implement them as a utf-8 encoded array of bytes, plus a header.
The advantages are numerous. Among other things, there would be only one version of each str method and each PyUnicode_ C function, saving considerable code size and simplifying the code: we would no longer need separate code paths for uint8_t, uint16_t and uint32_t for character data.
However there are two problems:
Keeping indexing O(1)
To maintain this property we will have to lazily create an offset table for larger, non-ASCII strings.
This won't be a problem in practice because most strings are ASCII and need no index at all, and the majority of the remaining strings are never indexed.
The C API
We will have to deprecate, and then remove, the C API that exposes the implementation details.
We should probably deprecate in 3.14, so that we can implement utf-8 strings in 3.16, allowing a proper deprecation period.
Implementation
We will probably want to embed the string data directly into the object, so the struct will look something like this:
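Something along these lines, reusing the field names from the first comment above (a sketch of the intent, not a final layout):

typedef struct {
    PyObject_HEAD
    uintptr_t interned: 2;
    uintptr_t ascii: 1;
    uintptr_t valid_utf8: 1;
    uintptr_t length: (WORD_SIZE-4);  /* Number of code points in the string */
    Py_hash_t hash;                   /* Hash value; -1 if not set */
    PyUnicodeIndex *index;            /* Offset table; NULL unless needed */
    uint8_t data[1];                  /* utf-8 encoded data, embedded in the object */
} PyUnicodeObject;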
The valid_utf8 bit helps fast encoding. It is false if the string contains half surrogate pairs or any other code point not allowed in legal utf-8.
Indexing operations
getitem(self, index) would be implemented something like this:
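A rough sketch, assuming PyUnicodeIndex holds an array of the index_entry structs shown below (build_index and decode_codepoint_at are illustrative helper names, not part of the proposal):

Py_UCS4
unicode_getitem(PyUnicodeObject *self, Py_ssize_t index)
{
    if (self->ascii) {
        /* One byte per code point, so the byte offset equals the index. */
        return self->data[index];
    }
    if (self->index == NULL) {
        self->index = build_index(self);   /* lazily created offset table */
    }
    struct index_entry entry = self->index->entries[index / 64];
    size_t offset = entry.base_offset + entry.additional_offsets[index % 64];
    return decode_codepoint_at(self->data, offset);
}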
The index table would be composed of len(s)/64 entries, each entry being:
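Per the corrected entry layout discussed in the comments above:

struct index_entry {
    uintptr_t base_offset;           /* byte offset of the first code point in this block of 64 */
    uint8_t additional_offsets[64];  /* extra bytes from base_offset to each code point in the block */
};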
With the offset being computed as base_offset[index/64] + additional_offset[index%64].
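For completeness, a sketch of how the lazy index could be built (alloc_index is an assumed allocator; utf8_codepoint_length_from_first_byte is as described in the comments above):

static PyUnicodeIndex *
build_index(const PyUnicodeObject *self)
{
    size_t nblocks = (self->length + 63) / 64;
    PyUnicodeIndex *idx = alloc_index(nblocks);
    size_t offset = 0, cp = 0;
    for (size_t b = 0; b < nblocks; b++) {
        idx->entries[b].base_offset = offset;
        for (size_t i = 0; i < 64 && cp < self->length; i++, cp++) {
            /* Delta from the block's base to the start of this code point. */
            idx->entries[b].additional_offsets[i] = (uint8_t)(offset - idx->entries[b].base_offset);
            offset += utf8_codepoint_length_from_first_byte(self->data[offset]);
        }
    }
    return idx;
}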