Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MySQL case sensitivty fix #28651

Closed
wants to merge 6 commits into from
Closed

MySQL case sensitivty fix #28651

wants to merge 6 commits into from

Conversation

algernon
Copy link
Contributor

This is a... complicated pull request, and does a number of things:

  • It adds setting.Database.DefaultCharset and setting.Database.DefaultCollation, and removes setting.Database.MysqlCharset.
  • Refactors db.ConvertUtf8ToUtf8mb4() into db.ConvertCharsetAndCollation(), a function that can convert a database to the desired charset + collation, rather than hardcoding either of them.
  • Adds db.findCaseSensitiveCollation() to auto-detect the most fitting collate, and db.GetDesiredCharsetAndCollation() to return either the charset/collation set in the config, or (if either is empty) the auto-detected default.
  • With the above changes, gitea doctor convert will convert the db to the desired charset + collation combo, rather than hardcoded values.
  • When starting the web server, Gitea will now perform a sanity check on the database: this currently checks the charset & collation on the database and each table, and warns if there's a mismatch (it does not stop the process, merely warns)
  • In InitDBWithMigrations, if it sees an empty MySQL database, it will ALTER DATABASE it to have the correct charset & collate.
  • Documentation is updated to mention the case sensitive collations, and recommend using gitea doctor convert more strongly.

The reason for the setting.Database.MysqlCharset removal, and the InitDBWithMigrations change is to let CREATE TABLE calls inherit the charset and the collate function from the database. If we don't remove MysqlCharset from the connection string, xorm will add CHARSET <foo> to each CREATE TABLE call, which in turn will implicitly set the collate function to the charset's default, rather than the database's. If we remove the charset from the connection string, xorm will not add CHARSET <foo> to the CREATE TABLE calls, and they will correctly inherit both charset & collate.

A better solution for this would be to teach xorm to let us set a database-level default collate, and add that to each CREATE TABLE call. If it did that, we could remove the ALTER TABLE from InitDBWithMigration. This is something that is worth doing longer term, and I'm more than willing to do that and submit a PR against xorm, and once that's updated - if such a change is accepted - submit an update to Gitea aswell. Meanwhile, I think this is the best we can do.

By the way, the ALTER TABLE thing (and the xorm-based fix later on) will both help new installations when the database was automatically created with the wrong collate, such as when using a dockerized MySQL/MariaDB. Existing installs will have to use gitea doctor convert.

This PR does not address the problem for MSSQL, however. I don't have the resources to make a fix for that, but I'm reasonably sure something similar could be accomplished there, too.

Remove the (undocumented) `Database.MysqlCharset` setting, and
introduce `Database.DefaultCharset` and `Database.DefaultCollation`
settings instead.

The reason for the `MysqlCharset` removal is that in a later patch,
we're going to adjust the engine initialization so that it achieves
similar results, in a different way.

The new settings will remain undocumented for similar reasons
`MysqlCharset` was undocumented: the defaults work for the vast majority
of cases, and it's too easy to break things if changing them.

Signed-off-by: Gergely Nagy <[email protected]>
@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Dec 29, 2023
@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 29, 2023
@github-actions github-actions bot added modifies/cli PR changes something on the CLI, i.e. gitea doctor or gitea admin modifies/docs labels Dec 29, 2023
@algernon
Copy link
Contributor Author

algernon commented Dec 29, 2023

@wxiaoguang I invited you as a collaborator to my repo, so you can edit this PR as you want, as promised in #28633. Feel free to add new commits, rebase, change, whatever. If I disagree, I'll comment, and we'll figure out how to make everyone happy.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Dec 29, 2023

Thanks. I will try in a few days. And keep this comment as dummy (to record changes & screenshots)

(added "refactoring" label, maybe this is the most suitable one in the list, it mainly refactors the database tables ....)

Brief ideas (for reference)

  1. Add an config option
    • It defaults to empty, means use default collation.
    • Default for MySQL means "utf8mb4_0900_as_cs"
    • Default for MariaDB means "uca1400_as_cs"
    • Default for MSSQL means "Latin1_General_CS_AS"
  2. Add a startup check, check the collations database and tables
    • If there are mismatched collations, show a startup warning, and show an admin warning
  3. Refactor gitea doctor convert, to use the collation config to convert the tables.
  4. Add some code to help to initialize docker database.
  5. MSSQL might not be covered this time, because it's difficult for MSSQL to ALTER COLUMN collation when the field is in a unique index (leave the TODO to the future, or just document it, or find some other approahes)

Changes (TODO)

...

Screenshots (TODO)

...


Update: for reference, I would not be able to work if there are negative voices from Forgejo.

https://codeberg.org/forgejo/forgejo/pulls/2050

image

@wxiaoguang wxiaoguang self-requested a review December 29, 2023 16:29
@wxiaoguang wxiaoguang marked this pull request as draft December 29, 2023 16:29
@wxiaoguang wxiaoguang added the type/refactoring Existing code has been cleaned up. There should be no new functionality. label Dec 29, 2023
@wxiaoguang wxiaoguang added this to the 1.22.0 milestone Dec 29, 2023
@wxiaoguang
Copy link
Contributor

wxiaoguang commented Dec 30, 2023

The following update:

The Forgejo team said no (deleting my comment) and would like to keep express negative voices to Gitea, so I won't touch it.

Meanwhile, thanks to algernon for your efforts on this.

Before

image

After

image

@wxiaoguang wxiaoguang removed their request for review December 30, 2023 05:13
@lunny
Copy link
Member

lunny commented Dec 30, 2023

By Dialect().Version, you can get to know whether it's mysql or mariadb.

@algernon
Copy link
Contributor Author

algernon commented Dec 30, 2023

By Dialect().Version, you can get to know whether it's mysql or mariadb.

Sweet, that means I can get rid of the auto-detection. Will update the PR accordingly in a bit. Thanks!

EDIT: Updated, db.findCaseSensitiveCollation() now looks at Dialect().Version: if it is "MariaDB", it sets collation to uca1400_as_cs, otherwise to utf8mb4_0900_as_cs, and utf8mb4_bin is thus no longer a fallback.

Replace `db.ConvertUtf8ToUtf8mb4()` with
`db.ConvertCharsetAndCollation()`, which does essentially the same
thing, but the charset and the collation is controlled by the caller.

Also introduce `db.findCaseSensitiveCollation()` which attempts to
auto-detect the best supported case-sensitive collation for MySQL and
MariaDB databases.

With these two functions, `gitea doctor check` can be adapted to use the
charset and collation set in the configuration (or auto-detect them, if
unset), and this patch does that too.

Signed-off-by: Gergely Nagy <[email protected]>
@delvh
Copy link
Member

delvh commented Dec 30, 2023

What would need to be done to remove the WIP status of this PR?

@algernon algernon marked this pull request as ready for review December 30, 2023 12:51
@algernon
Copy link
Contributor Author

What would need to be done to remove the WIP status of this PR?

Pressing a button (done). It was only set to WIP so @wxiaoguang can do a review, but since that's not going to happen, and I consider this ready for a review, I've flipped it back.

@wxiaoguang
Copy link
Contributor

What would need to be done to remove the WIP status of this PR?

Pressing a button (done). It was only set to WIP so @wxiaoguang can do a review, but since that's not going to happen, and I consider this ready for a review, I've flipped it back.

I will write a new one from scratch, because I believe it needs to address all the problems in #28651 (comment)

@algernon
Copy link
Contributor Author

What would need to be done to remove the WIP status of this PR?

Pressing a button (done). It was only set to WIP so @wxiaoguang can do a review, but since that's not going to happen, and I consider this ready for a review, I've flipped it back.

I will write a new one from scratch, because I believe it needs to address all the problems in #28651 (comment)

It already addresses those. Well, most of them. An empty DEFAULT_COLLATION in my version means "use the most appropriate one", the same way the former (undocumented) MYSQL_CHARSET was enforcing "utf8mb4". Everything else is covered here, so I honestly don't understand why you want to duplicate work I have already done. It would be so much easier for everyone involved if you'd work with me, on this PR. I'm happy to address all concerns you may have, and I am sure we can come to a solution that is acceptable to both of us.

I understand you have a beef with Forgejo people, but I am not Forgejo people. For all practical purposes, I'm just a random guy who happens to contribute there too. (And the reason I do that, and why this change was submitted to Forgejo first, is because my personal policy is that I do not contribute to projects with a CLA, unless I am paid to do so - which in this case, I am.)

@wxiaoguang
Copy link
Contributor

It already addresses those. ......

I will submit my proposal, from scratch means there might be a quite different approach. If you feel it looks better, feel free to use and/or improve it. If it doesn't look good, you could still use your approach in the forks.

I understand you have a beef with Forgejo people, but I am not Forgejo people. For all practical purposes, I'm just a random guy who happens to contribute there too. .....

This my also personal decision, personally I would like to keep a distance from Forgejo (under and related to Codeberg) because there is a long history.


Again, disclaimer, that's just my personal decision and option, it doesn't mean the Gitea community's and personally I don't quite know Gitea community's.


And off-topic: I really appreciate your work and I think you have shown very experienced programming skills and open social attitudes 🙏

@lunny
Copy link
Member

lunny commented Dec 30, 2023

What would need to be done to remove the WIP status of this PR?

Pressing a button (done). It was only set to WIP so @wxiaoguang can do a review, but since that's not going to happen, and I consider this ready for a review, I've flipped it back.

I will write a new one from scratch, because I believe it needs to address all the problems in #28651 (comment)

It already addresses those. Well, most of them. An empty DEFAULT_COLLATION in my version means "use the most appropriate one", the same way the former (undocumented) MYSQL_CHARSET was enforcing "utf8mb4". Everything else is covered here, so I honestly don't understand why you want to duplicate work I have already done. It would be so much easier for everyone involved if you'd work with me, on this PR. I'm happy to address all concerns you may have, and I am sure we can come to a solution that is acceptable to both of us.

I understand you have a beef with Forgejo people, but I am not Forgejo people. For all practical purposes, I'm just a random guy who happens to contribute there too. (And the reason I do that, and why this change was submitted to Forgejo first, is because my personal policy is that I do not contribute to projects with a CLA, unless I am paid to do so - which in this case, I am.)

There is no CLA, we use DCO. https://github.com/go-gitea/gitea/blob/main/DCO

@algernon
Copy link
Contributor Author

There is no CLA, we use DCO. https://github.com/go-gitea/gitea/blob/main/DCO

You do require copyright assignment. Effectively the same problem I have with CLAs. But this is not relevant to this PR anymore, I only mentioned it as an explanation, and that side-discussion has concluded.

When starting the web server, perform a sanity check on the database.
The sanity check currently runs for MySQL/MariaDB only, and verifies
that the charset and collation are correctly set on both the database,
and on all tables in it.

If there is a discrepancy, it does not error out, but prints a warning
only.

Signed-off-by: Gergely Nagy <[email protected]>
When initializing an empty MySQL/MariaDB database, ensure that the
character set and collation is set to the desired values.

With these set, creating a table will inherit these settings, unless
table creation specifies a different charset or collate function. This
is the reason why `setting.Database.MysqlCharset` was removed: that
forced table creation to explicitly set a charset, rather than inherit
the database's default, and in doing so, also changed the collate
function to the charset's default (which may - and usually is -
different from the one we want).

Signed-off-by: Gergely Nagy <[email protected]>
For MySQL, suggest not setting a collate function, to let Gitea deal
with it. But also mention some of the options, would one want to set it
up in advance and not rely on Gitea.

Signed-off-by: Gergely Nagy <[email protected]>
@wxiaoguang
Copy link
Contributor

-> Recommend/convert to use case-sensitive collation for MySQL/MSSQL #28662

And the case-sensitive branch test passes all databases including MSSQL.

@algernon algernon closed this Jan 10, 2024
@GiteaBot GiteaBot removed this from the 1.22.0 milestone Jan 10, 2024
@algernon algernon deleted the for-gitea/mysql-collation-case-sensitity-fix branch January 10, 2024 06:48
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. modifies/cli PR changes something on the CLI, i.e. gitea doctor or gitea admin modifies/docs size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/refactoring Existing code has been cleaned up. There should be no new functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants