Use POSIX collation in the Postgres registry #765

theory · 2023-06-25T20:15:42Z

Applies also to Yugabyte, and separately add for Cockroach. Using the POSIX collation ensures that index ordering never changes when the database is upgraded, which is a particular problem with glibc collations, but since POSIX/C collation is strictly byte-ordered, it should be fine. Of course, any of use of ORDER BY on such columns will return unexpected results when users are used to other locales, but since Sqitch itself only ever orders by timestamp, it should not be an issue in its own use.

Closes #763.

sjstoelting · 2023-06-29T17:53:53Z

As I would always have to reindex all indexes in a database with reindexdb, I don't see any advantage.
In addition outside of certain countries (US, CA, UK) email addresses with characters outside of the POSIX character set are possible and in fact used.

theory · 2023-06-29T19:29:18Z

As I would always have to reindex all indexes in a database with reindexdb, I don't see any advantage.

That's what I was wondering about.

In addition outside of certain countries (US, CA, UK) email addresses with characters outside of the POSIX character set are possible and in fact used.

I believe you can still use UTF-8 bytes, it's just that things are indexed purely in byte order, not a character encoding order. I think that's right, but I'm not positive, admittedly.

theory · 2023-07-29T21:11:05Z

The more I think about it, the more convinced I am that this isn't the right thing to do. We shouldn't be making collation decisions for the people using the database. It might even be surprising that the registry tables use a different collation than the rest of the database. And as @sjstoelting points out, one ought to expect to reindex all indexes in a database where the collation has changed; and including Sqitch's indexes should be pretty obvious (and generally low-overhead, it's not usually much data).

What do you think, @datafoo?

datafoo · 2023-08-07T07:11:27Z

The more I think about it, the more convinced I am that this isn't the right thing to do. We shouldn't be making collation decisions for the people using the database. It might even be surprising that the registry tables use a different collation than the rest of the database.

Sqitch should use whatever collation is best for itself. As far as users are concerned, this is an implementation detail so, as I see it, Sqitch would not be "making decisions for the people using the database".

And as @sjstoelting points out, one ought to expect to reindex all indexes in a database where the collation has changed; and including Sqitch's indexes should be pretty obvious (and generally low-overhead, it's not usually much data).

It might be obvious for you because you knows Sqitch inside out but I can tell you it is was not for me. All I had was "Rebuild all objects affected by this collation" when faced with the error:

WARNING: collation "xx-x-icu" has version mismatch
DETAIL: The collation in the database was created using version 1.2.3.4, but the operating system provides version 2.3.4.5.
HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog."xx-x-icu" REFRESH VERSION, or build PostgreSQL with the right library version.

I think we should do what's right. PostgreSQL indicates:

For this reason use locales only if you actually need them.

Is there any reason to use locales?

theory · 2023-08-12T23:56:40Z

Is there any reason to use locales?

The thing is, we don't. We just default to whatever the DBA has configured things for. It might be more surprising to them if it wasn't the same.

datafoo · 2023-08-14T06:00:55Z

The thing is, we don't. We just default to whatever the DBA has configured things for.

Then I believe we should follow PostgreSQL recommendation.

It might be more surprising to them if it wasn't the same.

Again, this is an implementation detail so it should not impact the rest of the user's database. Can you think of anything that might be problematic and/or more difficult for a DBA if Sqitch were to use POSIX collation?

sjstoelting · 2023-08-14T14:17:54Z

And again, there's a world outside the USA which do need and use different collations. If you don't take the characters used in other countries seriously, I boycott your products wherever possible. Leave it to the users and don't make decisions, that don't fit elsewhere. And again, I even have email addresses with domains containing umlauts.

theory · 2023-08-15T01:44:22Z

Yeah, that's the thing, I think we have no business choosing a collation.

datafoo · 2023-08-15T09:52:26Z

And again, there's a world outside the USA which do need and use different collations. If you don't take the characters used in other countries seriously, I boycott your products wherever possible. Leave it to the users and don't make decisions, that don't fit elsewhere. And again, I even have email addresses with domains containing umlauts.

I believe there is a misunderstanding. In issue #763, I am suggesting for Sqitch to explicitly use a collation not a character set, not a character encoding.

If you run

initdb --auth=trust --pgdata="./mydata" --encoding=UTF-8 --no-locale --username="postgres"

... you end up with UTF8 encoding and C collation:

postgres=# \l
                                            List of databases
   Name    |  Owner   | Encoding | Collate | Ctype | ICU Locale | Locale Provider |   Access privileges   
-----------+----------+----------+---------+-------+------------+-----------------+-----------------------
 postgres  | postgres | UTF8     | C       | C     |            | libc            | 
 template0 | postgres | UTF8     | C       | C     |            | libc            | =c/postgres          +
           |          |          |         |       |            |                 | postgres=CTc/postgres
 template1 | postgres | UTF8     | C       | C     |            | libc            | =c/postgres          +
           |          |          |         |       |            |                 | postgres=CTc/postgres
(3 rows)

UTF8 being an encoding for unicode, you should be able to use any character you like.

sjstoelting · 2023-08-15T11:50:53Z

What is the goal about having data stored as UTF-8 and sorted by something that doesn't contain most of the characters that are available? You can change your own stuff to whatever you like, if you love POSIX, do it in your own stuff. But dictating that to everyone is still something I believe to be a decision by it's users. Am 15. August 2023 11:52:37 MESZ schrieb datafoo ***@***.***>:

…

> And again, there's a world outside the USA which do need and use different collations. If you don't take the characters used in other countries seriously, I boycott your products wherever possible. Leave it to the users and don't make decisions, that don't fit elsewhere. And again, I even have email addresses with domains containing umlauts. I believe there is a misunderstanding. In issue #763, I am suggesting for Sqitch to explicitly use a _collation_ not a _character set_, not a _character encoding_. If you run ```shell initdb --auth=trust --pgdata="./mydata" --encoding=UTF-8 --no-locale --username="postgres" ``` ... you end up with UTF8 encoding and C collation: ```shell postgres=# \l List of databases Name | Owner | Encoding | Collate | Ctype | ICU Locale | Locale Provider | Access privileges -----------+----------+----------+---------+-------+------------+-----------------+----------------------- postgres | postgres | UTF8 | C | C | | libc | template0 | postgres | UTF8 | C | C | | libc | =c/postgres + | | | | | | | postgres=CTc/postgres template1 | postgres | UTF8 | C | C | | libc | =c/postgres + | | | | | | | postgres=CTc/postgres (3 rows) ``` UTF8 being an encoding for unicode, you should be able to use any character you like. -- Reply to this email directly or view it on GitHub: #765 (comment) You are receiving this because you were mentioned. Message ID: ***@***.***>

datafoo · 2023-08-15T12:05:41Z

What is the goal about having data stored as UTF-8 and sorted by something that doesn't contain most of the characters that are available?

As a user of Sqitch, I do not interact with Sqitch using SQL, I interact with sqitch using the sqitch command. So, I do not care about how sqitch sort things as long as it is working for everyone.

You can change your own stuff to whatever you like, if you love POSIX, do it in your own stuff. But dictating that to everyone is still something I believe to be a decision by it's users.

I am not dictating anything to anyone, I am interacting on this issue to make sqitch even better than it already is. If you share this goal, you could perhaps try to explain how this proposal is creating a problem.

theory · 2023-08-16T22:15:48Z

I'm beginning to come back around to @datafoo's point of view, because using C/POSIX collation has no effect on the data you can store: it's all UTF-8. What it affects is the sort order:

david=# CREATE TABLE try (name TEXT COLLATE "POSIX", email TEXT COLLATE "POSIX");
CREATE TABLE
david=# INSERT INTO try VALUES('Stölting', 'stö[email protected]'), ('Stõlting', 'stõ[email protected]'), ('Stolting', '[email protected]');
INSERT 0 3
david=# select * from try order by name;
   name   |        email         
----------+----------------------
 Stolting | [email protected]
 Stõlting | stõ[email protected]
 Stölting | stö[email protected]
(3 rows)

All the character encodings remain, although the sort ordering might not be expected for some locale or another: it's simply sorted in lexicographic order — that is, by the UTF-8 byte orders.

coveralls · 2025-01-04T16:50:16Z

Pull Request Test Coverage Report for Build 12613051425

Details

3 of 5 (60.0%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.04%) to 99.891%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
lib/App/Sqitch/Engine/pg.pm	3	5	60.0%

Totals
Change from base Build 12613046808:	-0.04%
Covered Lines:	4576
Relevant Lines:	4581

💛 - Coveralls

autarch

LGTM % a few questions and one typo.

autarch · 2025-01-04T16:50:43Z

Changes

+     - Set all text column collations to "POSIX" on Postgres 9.1 and higher.
+       and Yugabyte 2.9 and higher. Does not apply to existing registries,
+       only new ones. Thanks to @datafoo for the suggestion (#763)!


Doesn't this also apply when you upgrade the registry, at least for some columns?

It would if we added more upgrades, that's true.

autarch · 2025-01-04T16:51:21Z

t/pg.t

    is_deeply \@done, [['SET search_path = ?', undef, $registry]],
        'The registry should have been added to the search path again';

    # Make sure the file was changed to remove SCHEMA IF NOT EXISTS.
    file_contents_like $tmp_fh, qr/\QCREATE SCHEMA :"registry";/,
        'Should have removed IF NOT EXISTS from CREATE SCHEMA';

+    # Make sure we havne't removed collation.


Suggested change

# Make sure we havne't removed collation.

# Make sure we haven't removed collation.

Applies to Postgres 9.1 and higher and Yugabyte 2.9 and higher. Using the POSIX collation ensures that index ordering never changes when the database is upgraded, which is a particular problem with glibc collations, but since POSIX/C collation is strictly byte-ordered, it should be fine. Of course, any of use of `ORDER BY` on such columns will return unexpected results when users are used to other locales, but since Sqitch itself only ever orders by timestamp, it should not be an issue in its own use. Closes #763.

theory · 2025-01-05T22:32:32Z

I was close to merging this PR, but then three things happened:

@autarch pointed out the upgrade challenges in a comment
@ardentperf pointed out via Slack that stuff like initcap() does not work as expected with POSIX collation:
```
postgres=# select initcap('stõlting' collate "C.UTF-8");
initcap
----------
StõLting
(1 row)
```
I realized that the same issue exists for the other database engines, so shouldn't they get the same treatment?

To this last point, I spent a few hours experimenting with just setting the URI and SHA hash hex string columns to a relevant encoding or collation for all the engines. The results are in #857. It's not a super promising start. More notes there, but I think it'd make sense to pick up the work there, rather than here, should there be energy for or demand to let users determine the character set or encoding to use for text columns.

In the meantime, I'll update #763 with a workaround for Postgres.

datafoo · 2025-01-06T08:45:16Z

Thank you for having considered it and for the associated work!

theory added the engine label Jun 25, 2023

theory requested a review from autarch June 25, 2023 20:15

theory self-assigned this Jun 25, 2023

theory force-pushed the postgres-collate branch from 51a532d to 496a597 Compare June 25, 2023 21:02

theory force-pushed the postgres-collate branch from 496a597 to ee1af6c Compare July 1, 2023 13:12

theory force-pushed the postgres-collate branch from ee1af6c to 23a6974 Compare July 29, 2023 21:07

theory force-pushed the postgres-collate branch from 23a6974 to 05968ce Compare September 2, 2023 23:05

theory force-pushed the postgres-collate branch 2 times, most recently from 63ace35 to dd6a9bc Compare February 5, 2024 14:02

theory force-pushed the postgres-collate branch 2 times, most recently from 1a9b853 to 3064c9d Compare January 4, 2025 16:35

autarch previously approved these changes Jan 4, 2025

View reviewed changes

theory dismissed autarch’s stale review via dd9ba0f January 4, 2025 19:13

theory force-pushed the postgres-collate branch from 3064c9d to dd9ba0f Compare January 4, 2025 19:13

theory closed this Jan 5, 2025

theory mentioned this pull request Jan 5, 2025

Provide option to specify the charset or collation of text columns in the registry tables #763

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use POSIX collation in the Postgres registry #765

Use POSIX collation in the Postgres registry #765

theory commented Jun 25, 2023

sjstoelting commented Jun 29, 2023

theory commented Jun 29, 2023

theory commented Jul 29, 2023

datafoo commented Aug 7, 2023

theory commented Aug 12, 2023

datafoo commented Aug 14, 2023

sjstoelting commented Aug 14, 2023 via email

theory commented Aug 15, 2023

datafoo commented Aug 15, 2023

sjstoelting commented Aug 15, 2023 via email

datafoo commented Aug 15, 2023

theory commented Aug 16, 2023

coveralls commented Jan 4, 2025 •

edited

Loading

autarch left a comment

autarch Jan 4, 2025

theory Jan 5, 2025

autarch Jan 4, 2025

theory commented Jan 5, 2025 •

edited

Loading

datafoo commented Jan 6, 2025

	# Make sure we havne't removed collation.
	# Make sure we haven't removed collation.

Use POSIX collation in the Postgres registry #765

Use POSIX collation in the Postgres registry #765

Conversation

theory commented Jun 25, 2023

sjstoelting commented Jun 29, 2023

theory commented Jun 29, 2023

theory commented Jul 29, 2023

datafoo commented Aug 7, 2023

theory commented Aug 12, 2023

datafoo commented Aug 14, 2023

sjstoelting commented Aug 14, 2023 via email

theory commented Aug 15, 2023

datafoo commented Aug 15, 2023

sjstoelting commented Aug 15, 2023 via email

datafoo commented Aug 15, 2023

theory commented Aug 16, 2023

coveralls commented Jan 4, 2025 • edited Loading

Pull Request Test Coverage Report for Build 12613051425

Details

💛 - Coveralls

autarch left a comment

Choose a reason for hiding this comment

autarch Jan 4, 2025

Choose a reason for hiding this comment

theory Jan 5, 2025

Choose a reason for hiding this comment

autarch Jan 4, 2025

Choose a reason for hiding this comment

theory commented Jan 5, 2025 • edited Loading

datafoo commented Jan 6, 2025

coveralls commented Jan 4, 2025 •

edited

Loading

theory commented Jan 5, 2025 •

edited

Loading