MySQL-compatible Collations #8991

Merged
merged 11 commits into from
Oct 19, 2021


vmg
Collaborator

@vmg vmg commented Oct 13, 2021

Description

This is a project I'm working on with the help of our student intern @king-11. The goal of the project is to bring support for collations into Vitess that behave identically to the ones in MySQL, so that they can be used in the Vitess evaluation engine to perform multi-node queries that involve collated operations (e.g. aggregated joins). Right now, these kinds of queries are performed by introducing an extra artificial column, WeightString, which calls MySQL's WEIGHT_STRING function on the column we're trying to collate, and using this column when comparing the values. We want to move away from this approach because it's not particularly performant, it introduces a lot of overhead over the wire, and it uses an API (WEIGHT_STRING) which is not supposed to be used outside of debugging MySQL instances.

The approach

As you're probably aware, the collations implementation in MySQL is part of the MySQL codebase (duh!) and hence it's licensed as GPLv2, like the rest of the database. Some parts of it are actually LGPLv2, but that's not relevant: neither license is compatible with Vitess' Apache v2 licensing.

In order to have a Go implementation for collations that matches the behavior of MySQL's, we had to write the actual collation logic from scratch (cumbersome, but not particularly hard) and then compare it extensively with MySQL's output to ensure it behaves identically.

Here's the catch: a collation algorithm (any of them, really) assigns a given weight or weights to each codepoint in a string. It's not complicated to perform the actual collation, but in order to fully mimic the behavior of MySQL, we need to know the actual values for the weights that MySQL assigns by default. We discussed several options to accomplish this in our team meeting, and after involving some friendly lawyers, we came up with these 3 possible approaches (ordered from "more expensive" to "less expensive").

  • When Vitess connects to mysqld for the first time, it can perform thousands of queries in the shape of SELECT WEIGHT_STRING(x) ..., and then decompose those values to generate a map of all the weights for all the codepoints in the Unicode standard, for each collation that MySQL supports.
  • We can do the same thing, but offline. I.e. run the queries locally only once, and then store all the dumped weights in an efficient format for Vitess.
  • We can do the same thing, but instead of going through the SQL interface, linking a small program to mysqld so that when it starts up, it dumps the weights to a file on disk.

After a lot of discussion, we've agreed that these three approaches are equivalent, as the output of the queries we're performing is always the same, and the actual data (i.e. the values for the weights) we're dumping is not "an algorithm". Hence, we've opted for the 3rd option (because it's the most efficient one), which was implemented in this mysql-server fork by @king-11:

https://github.com/king-11/mysql-server/tree/vitess

The code in that repository links against mysqld and dumps a .json file for each encoding the MySQL server knows about. The JSON file contains metadata about the encoding, such as the weights for the codepoints, its name, etc. The JSON output of the tool has been checked in on this PR, and a tool named makemysqldata has been introduced that parses the tables in the JSON and outputs native Go code (mysqldata.go) so that our algorithm's implementation can use the same weights as MySQL.
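
As a rough illustration of the JSON-to-Go step, the sketch below parses one dump and renders it as Go source. The `collationDump` shape, its field names and the generated variable name are assumptions made for this example; the real dump format and the real makemysqldata output are not shown in this PR description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// collationDump is a HYPOTHETICAL shape for one dumped file.
type collationDump struct {
	Name    string   `json:"name"`
	Weights []uint16 `json:"weights"`
}

// generate parses one JSON dump and renders its weights as a Go
// variable declaration that could be written to a generated file.
func generate(raw []byte) (string, error) {
	var dump collationDump
	if err := json.Unmarshal(raw, &dump); err != nil {
		return "", fmt.Errorf("parsing dump: %w", err)
	}
	var sb strings.Builder
	fmt.Fprintf(&sb, "var weights_%s = []uint16{", dump.Name)
	for i, w := range dump.Weights {
		if i > 0 {
			sb.WriteString(", ")
		}
		fmt.Fprintf(&sb, "0x%04x", w)
	}
	sb.WriteString("}\n")
	return sb.String(), nil
}

func main() {
	// Toy input with invented weights.
	src, err := generate([]byte(`{"name": "toy_ci", "weights": [65, 65, 66]}`))
	if err != nil {
		panic(err)
	}
	fmt.Print(src) // var weights_toy_ci = []uint16{0x0041, 0x0041, 0x0042}
}
```

Generating Go source rather than loading JSON at runtime means the tables are compiled into the binary and the large dumps are not needed at runtime.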

Compatibility

We've worked very hard to ensure that we support as many collations from MySQL as is feasible. Right now, the output of makemysqldata is as follows:

2021/10/13 11:41:28 unhandled implementation "any_uca": utf8_unicode_ci, utf8_icelandic_ci, utf8_latvian_ci, utf8_romanian_ci, utf8_slovenian_ci, utf8_polish_ci, utf8_estonian_ci, utf8_spanish_ci, utf8_swedish_ci, utf8_turkish_ci, utf8_czech_ci, utf8_danish_ci, utf8_lithuanian_ci, utf8_slovak_ci, utf8_spanish2_ci, utf8_roman_ci, utf8_persian_ci, utf8_esperanto_ci, utf8_hungarian_ci, utf8_sinhala_ci, utf8_german2_ci, utf8_croatian_ci, utf8_unicode_520_ci, utf8_vietnamese_ci, utf8mb4_unicode_ci, utf8mb4_icelandic_ci, utf8mb4_latvian_ci, utf8mb4_romanian_ci, utf8mb4_slovenian_ci, utf8mb4_polish_ci, utf8mb4_estonian_ci, utf8mb4_spanish_ci, utf8mb4_swedish_ci, utf8mb4_turkish_ci, utf8mb4_czech_ci, utf8mb4_danish_ci, utf8mb4_lithuanian_ci, utf8mb4_slovak_ci, utf8mb4_spanish2_ci, utf8mb4_roman_ci, utf8mb4_persian_ci, utf8mb4_esperanto_ci, utf8mb4_hungarian_ci, utf8mb4_sinhala_ci, utf8mb4_german2_ci, utf8mb4_croatian_ci, utf8mb4_unicode_520_ci, utf8mb4_vietnamese_ci
2021/10/13 11:41:28 unhandled implementation "": big5_chinese_ci, latin2_czech_cs, ujis_japanese_ci, sjis_japanese_ci, tis620_thai_ci, euckr_korean_ci, gb2312_chinese_ci, gbk_chinese_ci, latin1_german2_ci, utf8_general_ci, cp1250_czech_cs, ucs2_general_ci, utf16_general_ci, utf16_bin, utf16le_general_ci, utf32_general_ci, utf32_bin, utf16le_bin, utf8_tolower_ci, utf8_bin, big5_bin, euckr_bin, gb2312_bin, gbk_bin, sjis_bin, ucs2_bin, ujis_bin, cp932_japanese_ci, cp932_bin, eucjpms_japanese_ci, eucjpms_bin, ucs2_general_mysql500_ci, utf8_general_mysql500_ci, gb18030_chinese_ci, gb18030_bin, gb18030_unicode_520_ci
2021/10/13 11:41:28 unhandled implementation "utf16_uca": utf16_unicode_ci, utf16_icelandic_ci, utf16_latvian_ci, utf16_romanian_ci, utf16_slovenian_ci, utf16_polish_ci, utf16_estonian_ci, utf16_spanish_ci, utf16_swedish_ci, utf16_turkish_ci, utf16_czech_ci, utf16_danish_ci, utf16_lithuanian_ci, utf16_slovak_ci, utf16_spanish2_ci, utf16_roman_ci, utf16_persian_ci, utf16_esperanto_ci, utf16_hungarian_ci, utf16_sinhala_ci, utf16_german2_ci, utf16_croatian_ci, utf16_unicode_520_ci, utf16_vietnamese_ci
2021/10/13 11:41:28 unhandled implementation "ucs2_uca": ucs2_unicode_ci, ucs2_icelandic_ci, ucs2_latvian_ci, ucs2_romanian_ci, ucs2_slovenian_ci, ucs2_polish_ci, ucs2_estonian_ci, ucs2_spanish_ci, ucs2_swedish_ci, ucs2_turkish_ci, ucs2_czech_ci, ucs2_danish_ci, ucs2_lithuanian_ci, ucs2_slovak_ci, ucs2_spanish2_ci, ucs2_roman_ci, ucs2_persian_ci, ucs2_esperanto_ci, ucs2_hungarian_ci, ucs2_sinhala_ci, ucs2_german2_ci, ucs2_croatian_ci, ucs2_unicode_520_ci, ucs2_vietnamese_ci
2021/10/13 11:41:28 unhandled implementation "utf32_uca": utf32_unicode_ci, utf32_icelandic_ci, utf32_latvian_ci, utf32_romanian_ci, utf32_slovenian_ci, utf32_polish_ci, utf32_estonian_ci, utf32_spanish_ci, utf32_swedish_ci, utf32_turkish_ci, utf32_czech_ci, utf32_danish_ci, utf32_lithuanian_ci, utf32_slovak_ci, utf32_spanish2_ci, utf32_roman_ci, utf32_persian_ci, utf32_esperanto_ci, utf32_hungarian_ci, utf32_sinhala_ci, utf32_german2_ci, utf32_croatian_ci, utf32_unicode_520_ci, utf32_vietnamese_ci
2021/10/13 11:41:28 written "mysqldata.go" - 9964265 bytes, 116/272 collations (42.65% handled)

As you can see, we support 42.65% of all the collations that MySQL supports. The ones we don't support, in broad terms, are:

  • The any_uca collations: these collations use the UCA500 standard, and can be divided into utf8 and utf8mb4 charsets. There's no good reason to support the utf8 charsets (which are actually utf8mb3 -- aka deprecated, insane), but we could support the utf8mb4 variants down the road.
  • The utf16_uca collations: these are UCA500 standard with utf16 as its charset. These could be supported down the road, they're not unreasonable.
  • The ucs2_uca collations: likewise, UCA500 with ucs2 as the charset. There doesn't seem to be any point in supporting these, since utf16 is the sane superset of UCS2. We should support utf16 instead.
  • The utf32_uca collations: same as utf16_uca: these could be reasonably supported.
  • Other arbitrary collations: these are weird off-page encodings which we shouldn't rush to implement.

Testing

To ensure full compatibility with MySQL, we're using three kinds of tests here:

  1. Unit tests with WEIGHT_STRING: these are sample strings that have been extracted from mysql-server's test suite, then sent through WEIGHT_STRING in a live mysqld instance, with the resulting values stored for our unit tests.
  2. Weight table tests: these tests try to rebuild the weight tables for all known encodings in memory, as Vitess will now do when booting, and compare the results against the full dumped data from MySQL to ensure they're identical.
  3. Full integration tests w/ SQL: these tests run a battery of tests against a live mysqld instance (i.e. we spawn the instance when running the tests). The tests are from mysql-server's SQL-based compliance suite, and basically create a massive table with all the codepoints for each encoding (one per row), then collate them in mysqld, and then we query the live table to ensure our collation matches the server's. These are slow tests but very comprehensive, and give us a high degree of confidence in our collation algorithms. 👌
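
The first kind of test can be sketched as a golden-value check: stored weight strings are compared against what the Go implementation produces. Everything named here (`Collation`, `upperASCII`, `goldenCase`, `verify`) is hypothetical, and the golden values are made up for the toy collation rather than captured from mysqld.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
)

// Collation is a hypothetical interface for this sketch, not the
// interface defined by the package in this PR.
type Collation interface {
	WeightString(src []byte) []byte
}

// upperASCII is a toy collation whose weight string is the upper-cased input.
type upperASCII struct{}

func (upperASCII) WeightString(src []byte) []byte { return bytes.ToUpper(src) }

// goldenCase pairs an input with the hex-encoded weight string that was
// recorded ahead of time (in the real tests, from a live mysqld).
type goldenCase struct {
	Input    string
	Expected string
}

// verify returns one message per case whose weight string does not
// match the recorded value.
func verify(c Collation, cases []goldenCase) []string {
	var failures []string
	for _, tc := range cases {
		want, err := hex.DecodeString(tc.Expected)
		if err != nil {
			failures = append(failures, fmt.Sprintf("%q: bad golden value: %v", tc.Input, err))
			continue
		}
		if got := c.WeightString([]byte(tc.Input)); !bytes.Equal(got, want) {
			failures = append(failures, fmt.Sprintf("%q: got %x, want %x", tc.Input, got, want))
		}
	}
	return failures
}

func main() {
	cases := []goldenCase{
		{Input: "abc", Expected: "414243"},
		{Input: "ABC", Expected: "414243"},
	}
	for _, f := range verify(upperASCII{}, cases) {
		fmt.Println("mismatch:", f)
	}
}
```

Because the expected values are recorded once and stored, tests of this kind do not need a running mysqld, unlike the integration tests in item 3.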

About this PR

This first version of the PR just introduces the mysql/collations package to the Vitess codebase. The goal is to get this merged without any integrations so I can start working on performance for the implemented collations while @king-11 integrates this with the evaluation engine.

cc @systay @harshit-gangal @deepthi

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@systay
Collaborator

systay commented Oct 14, 2021

This is really great! So excited to finally be able to do this on the vtgate.

General comment: public structs and methods should have comments, according to our linting rules.

Co-authored-by: Lakshya Singh <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Lakshya Singh <[email protected]>
@vmg
Collaborator Author

vmg commented Oct 15, 2021

I did, huh, some work today. Dumped data and implemented the collations for the legacy UCA collations in MySQL. Here's the output of makemysqldata:

2021/10/15 18:10:23 unhandled implementation "": big5_chinese_ci, latin2_czech_cs, ujis_japanese_ci, sjis_japanese_ci, tis620_thai_ci, euckr_korean_ci, gb2312_chinese_ci, gbk_chinese_ci, latin1_german2_ci, utf8_general_ci, cp1250_czech_cs, ucs2_general_ci, utf16_general_ci, utf16_bin, utf16le_general_ci, utf32_general_ci, utf32_bin, utf16le_bin, utf8_tolower_ci, utf8_bin, big5_bin, euckr_bin, gb2312_bin, gbk_bin, sjis_bin, ucs2_bin, ujis_bin, cp932_japanese_ci, cp932_bin, eucjpms_japanese_ci, eucjpms_bin, ucs2_general_mysql500_ci, utf8_general_mysql500_ci, gb18030_chinese_ci, gb18030_bin, gb18030_unicode_520_ci
2021/10/15 18:10:23 written "mysqldata.go" - 409875 bytes, 236/272 collations (86.76% handled)

That's 86% of all collations in MySQL 8.0+ supported. Imma treat myself to some candy this weekend.

One thing I noticed is that the size of the .json dumps was getting out of control (more than 500 MB!!!, none of which is actually needed to run/develop the collations package), so I took the drastic approach of rewriting our history to remove them all from it. I don't want to grow the size of Vitess' clones by 500 MB.

I'll finish documenting the implementation on Monday and start working on some performance tuning, which still hasn't happened.

@systay
Collaborator

systay commented Oct 18, 2021

I think you accidentally:

> make build
Mon Oct 18 11:38:10 AM CEST 2021: Building source tree
# vitess.io/vitess/go/mysql/collations
go/mysql/collations/utf8.go:80:59: undefined: encoding.CodepointIterator
go/mysql/collations/utf8.go:158:36: undefined: encoding.CodepointIterator

@vmg
Collaborator Author

vmg commented Oct 18, 2021

@deepthi: GH Actions is particularly flaky today, I'll merge this tomorrow morning once the Cluster tests become less flaky.

@king-11 king-11 mentioned this pull request Oct 19, 2021
3 tasks
@vmg vmg merged commit 0f1ee35 into vitessio:main Oct 19, 2021