MySQL-compatible Collations #8991
This is really great! So excited to finally be able to do this on the vtgate. General comment: public structs and methods should have comments, according to our linting rules.
Co-authored-by: Lakshya Singh <[email protected]> Signed-off-by: Vicent Marti <[email protected]> Signed-off-by: Lakshya Singh <[email protected]>
I did, huh, some work today. Dumped data and implemented the collations for the legacy UCA collations in MySQL. Here's the output of
That's 86% of all collations in MySQL 8.0+ supported. Imma treat myself to some candy this weekend. One thing I noticed is that the size of the I'll finish documenting the implementation on Monday and start working on some performance tuning, which still hasn't happened.
Signed-off-by: Vicent Marti <[email protected]>
I think you accidentally:
@deepthi: GH Actions is particularly flaky today, I'll merge this tomorrow morning once the Cluster tests become less flaky.
Description
This is a project I'm working on with the help of our student intern @king-11. The goal of the project is to bring support for collations to Vitess that behave identically to the ones in MySQL, so that they can be used in the Vitess evaluation engine to perform multi-node queries that involve collated operations (e.g. aggregated joins). Right now, these kinds of queries are performed by introducing an extra artificial column, `WeightString`, which calls MySQL's `WEIGHT_STRING` function on the column we're trying to collate, and using this column when comparing the values. We want to move away from this approach because it's not particularly performant, it introduces a lot of overhead over the wire, and it uses an API (`WEIGHT_STRING`) which is not supposed to be used outside of debugging MySQL instances.
The approach
As you're probably aware, the collations implementation in MySQL is part of the MySQL codebase (duh!) and hence it's licensed as GPLv2, like the rest of the database. Some parts of it are actually LGPLv2, but that's not relevant: neither license is compatible with Vitess' Apache v2 licensing.
In order to have a Go implementation for collations that matches the behavior of MySQL's, we had to write the actual collation logic from scratch (cumbersome, but not particularly hard) and then compare it extensively with MySQL's output to ensure it behaves identically.
Here's the catch: a collation algorithm (any of them, really) assigns a given weight or weights to each codepoint in a string. It's not complicated to perform the actual collation, but in order to fully mimic the behavior of MySQL, we need to know the actual values for the weights that MySQL assigns by default. We discussed several options to accomplish this in our team meeting, and after involving some friendly lawyers, we came up with these 3 possible approaches (ordered from "more expensive" to "less expensive").
`mysqld` for the first time, it can perform thousands of queries in the shape of `SELECT WEIGHT_STRING(x) ...`, and then decompose those values to generate a map of all the weights for all the codepoints in the Unicode standard, for each collation that MySQL supports.
`mysqld` so that when it starts up, it dumps the weights to a file on disk.
After a lot of discussion, we've agreed that these three approaches are equivalent, as the output of the queries we're performing is always the same, and the actual data (i.e. the values for the weights) we're dumping is not "an algorithm". Hence, we've opted for the 3rd option (because it's the most efficient one), which was implemented in this `mysql-server` fork by @king-11: https://github.com/king-11/mysql-server/tree/vitess
The code in that repository links against `mysqld` and dumps a `.json` file for each encoding the MySQL server knows about. The JSON file contains metadata about the encoding, such as the weights for the codepoints, its name, etc. The JSON output of the tool has been checked in on this PR, and a tool named `makemysqldata` has been introduced that parses the tables in the JSON and outputs native Go code (`mysqldata.go`) so that our algorithm's implementation can use the same weights as MySQL.
Compatibility
We've worked very hard to ensure that we support as many collations from MySQL as is feasible. Right now, the output of `makemysqldata` is as follows:
As you can see, we support 42.65% of all the collations that MySQL supports. The ones we don't support, in broad terms, are:
`any_uca` collations: these collations use the UCA500 standard, and can be divided into `utf8` and `utf8mb4` charsets. There's no good reason to support the `utf8` charsets (which are actually `utf8mb3` -- aka deprecated, insane), but we could support the `utf8mb4` variants down the road.
`utf16_uca` collations: these are the UCA500 standard with `utf16` as the charset. These could be supported down the road, they're not unreasonable.
`ucs2_uca` collations: likewise, UCA500 with `ucs2` as the charset. There doesn't seem to be any point in supporting these, since `utf16` is the sane superset of UCS2. We should support `utf16` instead.
`utf32_uca` collations: same as `utf16_uca`: these could be reasonably supported.
Testing
To ensure full compatibility with MySQL, we're using three kinds of tests here:
`WEIGHT_STRING`: these are sample strings that have been extracted from `mysql-server`'s test suite, then sent through `WEIGHT_STRING` in a live `mysqld` instance, with the resulting values stored for our unit tests.
`mysqld` instance (i.e. we spawn the instance when running the tests). The tests are from `mysql-server`'s SQL-based compliance suite, and basically create a massive table with all the codepoints for each encoding (one per row), then collate them in `mysqld`, and then we query the live table to ensure our collation matches the server's. These are slow tests but very comprehensive, and give us a high degree of confidence in our collation algorithms. 👌
About this PR
This first version of the PR just introduces the `mysql/collations` package to the Vitess codebase. The goal is to get this merged without any integrations so I can start working on performance for the implemented collations while @king-11 integrates this with the evaluation engine.
cc @systay @harshit-gangal @deepthi
Related Issue(s)
Checklist
Deployment Notes