MySQL-compatible Collations #8991

vmg · 2021-10-13T13:38:24Z

Description

This is a project I'm working on with the help of our student intern @king-11. The goal of the project is to bring support for collations to Vitess that behave identically to the ones in MySQL, so that they can be used in the Vitess evaluation engine to perform multi-node queries that involve collated operations (e.g. aggregated joins). Right now, these kinds of queries are performed by introducing an extra artificial column, WeightString, which calls MySQL's WEIGHT_STRING function on the column we're trying to collate, and uses this column when comparing the values. We want to move away from this approach because it's not particularly performant, it introduces a lot of overhead over the wire, and it uses an API (WEIGHT_STRING) which is not supposed to be used outside of debugging MySQL instances.

The approach

As you're probably aware, the collations implementation in MySQL is part of the MySQL codebase (duh!) and hence it's licensed as GPLv2, like the rest of the database. Some parts of it are actually LGPLv2, but that's not relevant: neither license is compatible with Vitess' Apache v2 licensing.

In order to have a Go implementation for collations that matches the behavior of MySQL's, we had to write the actual collation logic from scratch (cumbersome, but not particularly hard) and then compare it extensively with MySQL's output to ensure it behaves identically.

Here's the catch: a collation algorithm (any of them, really) assigns a given weight or weights to each codepoint in a string. It's not complicated to perform the actual collation, but in order to fully mimic the behavior of MySQL, we need to know the actual values for the weights that MySQL assigns by default. We discussed several options to accomplish this in our team meeting, and after involving some friendly lawyers, we came up with these 3 possible approaches (ordered from "more expensive" to "less expensive").

1. When Vitess connects to mysqld for the first time, it can perform thousands of queries in the shape of SELECT WEIGHT_STRING(x) ..., and then decompose those values to generate a map of all the weights for all the codepoints in the Unicode standard, for each collation that MySQL supports.
2. We can do the same thing, but offline. I.e. run the queries locally only once, and then store all the dumped weights in an efficient format for Vitess.
3. We can do the same thing, but instead of going through the SQL interface, linking a small program to mysqld so that when it starts up, it dumps the weights to a file on disk.

After a lot of discussion, we've agreed that these three approaches are equivalent, as the output of the queries we're performing is always the same, and the actual data (i.e. the values for the weights) we're dumping is not "an algorithm". Hence, we've opted for the 3rd option (because it's the most efficient one), which was implemented in this mysql-server fork by @king-11:

https://github.com/king-11/mysql-server/tree/vitess

The code in that repository links against mysqld and dumps a .json file for each encoding the MySQL server knows about. The JSON file contains metadata about the encoding, such as the weights for the codepoints, its name, etc. The JSON output of the tool has been checked in on this PR, and a tool named makemysqldata has been introduced that parses the tables in the JSON and outputs native Go code (mysqldata.go) so that our algorithm's implementation can use the same weights as MySQL.
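As a rough sketch of the JSON-to-Go step, the snippet below parses one dump and renders a Go variable declaration. The `collationDump` struct, its `name` and `weights` fields, and the `generate` function are hypothetical: the real dump format and the real makemysqldata tool are more involved and are not reproduced here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// collationDump is a HYPOTHETICAL shape for one dumped file. The real JSON
// schema produced by the mysqld dumper is not shown in this PR description.
type collationDump struct {
	Name    string   `json:"name"`
	Weights []uint16 `json:"weights"`
}

// generate parses one JSON dump and renders Go source declaring its weights.
func generate(raw []byte) (string, error) {
	var dump collationDump
	if err := json.Unmarshal(raw, &dump); err != nil {
		return "", fmt.Errorf("parsing dump: %w", err)
	}
	var sb strings.Builder
	fmt.Fprintf(&sb, "var weights_%s = []uint16{", dump.Name)
	for i, w := range dump.Weights {
		if i > 0 {
			sb.WriteString(", ")
		}
		fmt.Fprintf(&sb, "0x%04x", w)
	}
	sb.WriteString("}\n")
	return sb.String(), nil
}

func main() {
	// Invented input, only to exercise the generator.
	raw := []byte(`{"name": "toy_ci", "weights": [65, 66, 67]}`)
	src, err := generate(raw)
	if err != nil {
		panic(err)
	}
	fmt.Print(src) // var weights_toy_ci = []uint16{0x0041, 0x0042, 0x0043}
}
```

Generating Go source ahead of time means the tables are compiled into the binary, so nothing has to be parsed or queried at startup.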

Compatibility

We've worked very hard to ensure that we support as many collations from MySQL as is feasible. Right now, the output of makemysqldata is as follows:

2021/10/13 11:41:28 unhandled implementation "any_uca": utf8_unicode_ci, utf8_icelandic_ci, utf8_latvian_ci, utf8_romanian_ci, utf8_slovenian_ci, utf8_polish_ci, utf8_estonian_ci, utf8_spanish_ci, utf8_swedish_ci, utf8_turkish_ci, utf8_czech_ci, utf8_danish_ci, utf8_lithuanian_ci, utf8_slovak_ci, utf8_spanish2_ci, utf8_roman_ci, utf8_persian_ci, utf8_esperanto_ci, utf8_hungarian_ci, utf8_sinhala_ci, utf8_german2_ci, utf8_croatian_ci, utf8_unicode_520_ci, utf8_vietnamese_ci, utf8mb4_unicode_ci, utf8mb4_icelandic_ci, utf8mb4_latvian_ci, utf8mb4_romanian_ci, utf8mb4_slovenian_ci, utf8mb4_polish_ci, utf8mb4_estonian_ci, utf8mb4_spanish_ci, utf8mb4_swedish_ci, utf8mb4_turkish_ci, utf8mb4_czech_ci, utf8mb4_danish_ci, utf8mb4_lithuanian_ci, utf8mb4_slovak_ci, utf8mb4_spanish2_ci, utf8mb4_roman_ci, utf8mb4_persian_ci, utf8mb4_esperanto_ci, utf8mb4_hungarian_ci, utf8mb4_sinhala_ci, utf8mb4_german2_ci, utf8mb4_croatian_ci, utf8mb4_unicode_520_ci, utf8mb4_vietnamese_ci
2021/10/13 11:41:28 unhandled implementation "": big5_chinese_ci, latin2_czech_cs, ujis_japanese_ci, sjis_japanese_ci, tis620_thai_ci, euckr_korean_ci, gb2312_chinese_ci, gbk_chinese_ci, latin1_german2_ci, utf8_general_ci, cp1250_czech_cs, ucs2_general_ci, utf16_general_ci, utf16_bin, utf16le_general_ci, utf32_general_ci, utf32_bin, utf16le_bin, utf8_tolower_ci, utf8_bin, big5_bin, euckr_bin, gb2312_bin, gbk_bin, sjis_bin, ucs2_bin, ujis_bin, cp932_japanese_ci, cp932_bin, eucjpms_japanese_ci, eucjpms_bin, ucs2_general_mysql500_ci, utf8_general_mysql500_ci, gb18030_chinese_ci, gb18030_bin, gb18030_unicode_520_ci
2021/10/13 11:41:28 unhandled implementation "utf16_uca": utf16_unicode_ci, utf16_icelandic_ci, utf16_latvian_ci, utf16_romanian_ci, utf16_slovenian_ci, utf16_polish_ci, utf16_estonian_ci, utf16_spanish_ci, utf16_swedish_ci, utf16_turkish_ci, utf16_czech_ci, utf16_danish_ci, utf16_lithuanian_ci, utf16_slovak_ci, utf16_spanish2_ci, utf16_roman_ci, utf16_persian_ci, utf16_esperanto_ci, utf16_hungarian_ci, utf16_sinhala_ci, utf16_german2_ci, utf16_croatian_ci, utf16_unicode_520_ci, utf16_vietnamese_ci
2021/10/13 11:41:28 unhandled implementation "ucs2_uca": ucs2_unicode_ci, ucs2_icelandic_ci, ucs2_latvian_ci, ucs2_romanian_ci, ucs2_slovenian_ci, ucs2_polish_ci, ucs2_estonian_ci, ucs2_spanish_ci, ucs2_swedish_ci, ucs2_turkish_ci, ucs2_czech_ci, ucs2_danish_ci, ucs2_lithuanian_ci, ucs2_slovak_ci, ucs2_spanish2_ci, ucs2_roman_ci, ucs2_persian_ci, ucs2_esperanto_ci, ucs2_hungarian_ci, ucs2_sinhala_ci, ucs2_german2_ci, ucs2_croatian_ci, ucs2_unicode_520_ci, ucs2_vietnamese_ci
2021/10/13 11:41:28 unhandled implementation "utf32_uca": utf32_unicode_ci, utf32_icelandic_ci, utf32_latvian_ci, utf32_romanian_ci, utf32_slovenian_ci, utf32_polish_ci, utf32_estonian_ci, utf32_spanish_ci, utf32_swedish_ci, utf32_turkish_ci, utf32_czech_ci, utf32_danish_ci, utf32_lithuanian_ci, utf32_slovak_ci, utf32_spanish2_ci, utf32_roman_ci, utf32_persian_ci, utf32_esperanto_ci, utf32_hungarian_ci, utf32_sinhala_ci, utf32_german2_ci, utf32_croatian_ci, utf32_unicode_520_ci, utf32_vietnamese_ci
2021/10/13 11:41:28 written "mysqldata.go" - 9964265 bytes, 116/272 collations (42.65% handled)

As you can see, we support 42.65% of all the collations that MySQL supports. The ones we don't support, in broad terms, are:

The any_uca collations: these collations use the UCA500 standard, and can be divided into utf8 and utf8mb4 charsets. There's no good reason to support the utf8 charsets (which are actually utf8mb3 -- aka deprecated, insane), but we could support the utf8mb4 variants down the road.
The utf16_uca collations: these are the UCA500 standard with utf16 as the charset. These could be supported down the road; they're not unreasonable.
The ucs2_uca collations: likewise, UCA500 with ucs2 as the charset. There doesn't seem to be any point in supporting these, since utf16 is the sane superset of UCS2. We should support utf16 instead.
The utf32_uca collations: same as utf16_uca: these could be reasonably supported.
Other arbitrary collations: these are weird off-page encodings which we shouldn't rush to implement.
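For context on the utf8mb3 and ucs2 remarks above: utf8mb3 stores at most 3 bytes per character and ucs2 a single 16-bit unit, so neither can represent codepoints above U+FFFF, while utf8mb4 and utf16 can. A small sketch using only the Go standard library (the `fitsLegacyCharsets` helper is a name made up for this example):

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// fitsLegacyCharsets reports whether r can be stored in both utf8mb3 and
// ucs2, i.e. it needs at most 3 bytes of UTF-8 and one 16-bit UTF-16 unit.
func fitsLegacyCharsets(r rune) bool {
	n := utf8.RuneLen(r) // -1 for values that are not valid codepoints
	return n >= 1 && n <= 3 && len(utf16.Encode([]rune{r})) == 1
}

func main() {
	// U+00E9 and U+20AC are inside the BMP, U+1F600 is outside of it.
	for _, r := range []rune{0x00E9, 0x20AC, 0x1F600} {
		fmt.Printf("%U: utf8=%d bytes, utf16=%d units, legacy=%v\n",
			r, utf8.RuneLen(r), len(utf16.Encode([]rune{r})), fitsLegacyCharsets(r))
	}
}
```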

Testing

To ensure full compatibility with MySQL, we're using three kinds of tests here:

Unit tests with WEIGHT_STRING: these are sample strings that have been extracted from mysql-server's test suite, then sent through WEIGHT_STRING in a live mysqld instance, with the resulting values stored for our unit tests.
Weight table tests: these tests rebuild the weight tables for all known encodings in memory, as Vitess will now do when booting, and compare the results against the full dumped data from MySQL to ensure they're identical.
Full integration tests w/ SQL: these run a battery of tests against a live mysqld instance (i.e. we spawn the instance when running the tests). The tests come from mysql-server's SQL-based compliance suite: they create a massive table with all the codepoints for each encoding (one per row) and collate them in mysqld, and then we query the live table to ensure our collation matches the server's. These are slow tests but very comprehensive, and give us a high degree of confidence in our collation algorithms. 👌
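The first kind of test boils down to comparing locally computed weight strings against values captured from a live server. Here is a minimal sketch of that pattern; `goldenCase`, `toyWeightString` and the sample values are invented for illustration and do not come from MySQL or from the test suite in this PR.

```go
package main

import (
	"bytes"
	"fmt"
)

// goldenCase pairs an input string with the weight string a live server
// returned for it. The values used in main are invented for this sketch.
type goldenCase struct {
	input string
	want  []byte
}

// toyWeightString stands in for a local collation implementation: it folds
// ASCII lowercase to uppercase and leaves every other byte untouched.
func toyWeightString(s string) []byte {
	out := []byte(s)
	for i, b := range out {
		if b >= 'a' && b <= 'z' {
			out[i] = b - ('a' - 'A')
		}
	}
	return out
}

// mismatches runs every case through impl and describes each difference.
func mismatches(cases []goldenCase, impl func(string) []byte) []string {
	var failed []string
	for _, c := range cases {
		if got := impl(c.input); !bytes.Equal(got, c.want) {
			failed = append(failed, fmt.Sprintf("%q: got %x, want %x", c.input, got, c.want))
		}
	}
	return failed
}

func main() {
	cases := []goldenCase{
		{"abc", []byte("ABC")},
		{"Vitess", []byte("VITESS")},
	}
	fmt.Println(len(mismatches(cases, toyWeightString)), "mismatches")
}
```

In a real suite the same loop would live in a Go test and report each mismatch through the testing package instead of returning strings.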

About this PR

This first version of the PR just introduces the mysql/collations package to the Vitess codebase. The goal is to get this merged without any integrations so I can start working on performance for the implemented collations while @king-11 integrates this with the evaluation engine.

cc @systay @harshit-gangal @deepthi

Related Issue(s)

Collation and Character set support #8606 (comment)

Checklist

Should this PR be backported?
Tests were added or are not required
Documentation was added or is not required

Deployment Notes

systay · 2021-10-14T11:31:33Z

This is really great! So excited to finally be able to do this on the vtgate.

General comment: public structs and methods should have comments, according to our linting rules.

go/mysql/collations/tools/maketables2/maketables.go


vmg · 2021-10-15T16:15:00Z

I did, huh, some work today. Dumped the data and implemented the legacy UCA collations from MySQL. Here's the output of makemysqldata:

2021/10/15 18:10:23 unhandled implementation "": big5_chinese_ci, latin2_czech_cs, ujis_japanese_ci, sjis_japanese_ci, tis620_thai_ci, euckr_korean_ci, gb2312_chinese_ci, gbk_chinese_ci, latin1_german2_ci, utf8_general_ci, cp1250_czech_cs, ucs2_general_ci, utf16_general_ci, utf16_bin, utf16le_general_ci, utf32_general_ci, utf32_bin, utf16le_bin, utf8_tolower_ci, utf8_bin, big5_bin, euckr_bin, gb2312_bin, gbk_bin, sjis_bin, ucs2_bin, ujis_bin, cp932_japanese_ci, cp932_bin, eucjpms_japanese_ci, eucjpms_bin, ucs2_general_mysql500_ci, utf8_general_mysql500_ci, gb18030_chinese_ci, gb18030_bin, gb18030_unicode_520_ci
2021/10/15 18:10:23 written "mysqldata.go" - 409875 bytes, 236/272 collations (86.76% handled)

That's 86% of all collations in MySQL 8.0+ supported. Imma treat myself to some candy this weekend.

One thing I noticed is that the size of the .json dumps was getting out of control (more than 500 MB!!!, none of which is actually needed to run/develop the collations package), so I took the drastic approach of rewriting our history to remove them all from it. I don't want to grow the size of Vitess' clones by 500 MB.

I'll finish documenting the implementation on Monday and start working on some performance tuning, which still hasn't happened.


systay · 2021-10-18T09:39:23Z

I think you accidentally:

> make build
Mon Oct 18 11:38:10 AM CEST 2021: Building source tree
# vitess.io/vitess/go/mysql/collations
go/mysql/collations/utf8.go:80:59: undefined: encoding.CodepointIterator
go/mysql/collations/utf8.go:158:36: undefined: encoding.CodepointIterator


vmg · 2021-10-18T17:22:50Z

@deepthi: GH Actions is particularly flaky today, I'll merge this tomorrow morning once the Cluster tests become less flaky.


vmg requested review from harshit-gangal and systay as code owners October 13, 2021 13:38

vmg added Component: Query Serving Type: Feature release notes none labels Oct 13, 2021

GuptaManan100 approved these changes Oct 14, 2021

collations: MySQL-compatible text collations

7c75c96

Co-authored-by: Lakshya Singh <[email protected]> Signed-off-by: Vicent Marti <[email protected]> Signed-off-by: Lakshya Singh <[email protected]>

vmg force-pushed the collations branch from 637bac4 to 7c75c96 Compare October 15, 2021 16:07

vmg added 3 commits October 16, 2021 21:45

collations: use embedding for the UCA tables

387c818

Signed-off-by: Vicent Marti <[email protected]>

collations: add support for gb18030

027183f

Signed-off-by: Vicent Marti <[email protected]>

collations: cleanup & document public APIs

8a05448

Signed-off-by: Vicent Marti <[email protected]>

vmg added 6 commits October 18, 2021 12:21

collations: more comprehensive layout tests

ca44089

Signed-off-by: Vicent Marti <[email protected]>

collation: faster, smaller weight tables

2d6e997

Signed-off-by: Vicent Marti <[email protected]>

collation: fix golden tests

5e328c6

Signed-off-by: Vicent Marti <[email protected]>

collations: change WeightString behavior to match MySQL's

44920b6

Signed-off-by: Vicent Marti <[email protected]>

collation: add tests for space weights

fd6e7b7

Signed-off-by: Vicent Marti <[email protected]>

collations: fix uca_legacy padding

70b6837

Signed-off-by: Vicent Marti <[email protected]>

king-11 mentioned this pull request Oct 19, 2021

Collations Module Integration #9018

Merged

3 tasks

collations: add support for remote collations

09d731f

Signed-off-by: Vicent Marti <[email protected]>

vmg merged commit 0f1ee35 into vitessio:main Oct 19, 2021

This was referenced Nov 6, 2021

Collations Integration Ordering #9155

Merged

Explicit Collation Error #9205

Merged
