The goal of verifying schema equality is to have the same query yield exactly the same results in two databases whose schemas match and data is the same (except for non-determinism such as no guarantee of order unless ORDER BY
is specified and other similar things).
This is not currently the case, may not be entirely possible and some decisions made already involve tradeoffs. The rationale behind some decisions isn't always obvious either, so this section discusses some of those.
What follows is an incomplete list of what currently is verified, but be aware there may be caveats:
- Tables, columns, CHECK constraints, FKs, indexes, and other constraints
- Indexes
- Sequences (although their RESTART value is not currently checked)
- Functions, operators, and VIEWs
- Triggers
- Row Level Security Policies
- Extensions, through their objects, but not directly their names
- Collations (with important caveats)
- Roles, including their config attributes such as
search_path
, which other roles they belong to and database-related permissions - Composite, domain, enum and range types
- Database ownership, encoding and its
default_transaction_*
settings - Extended statistics
In contrast, an incomplete list of things that may be not currently verified:
- Full-text dictionaries
- Base and shell types
- Foreign Servers
The final word on what codd checks can be found in the files generated by codd write-schema
.
Verifying every possible object is a top priority of codd, so if something's missing or not behaving as intended, please file a bug report.
In order to remain useful in a world of Cloud databases, docker images and distribution-dependent postgres installations, codd tries to strike a balance between correctness/strictness and usability/looseness. We want codd to match schemas across different postgres installations if they have the same DB schema, but in some cases that is just too hard to do right for everyone.
So codd currently provides 3 settings you can enable that will make your algorithm more strict in some cases or more lax in others. They're customizable through the CODD_SCHEMA_ALGO
environment variable, which is a space separated list containg any of these values:
- strict-collations: codd will include even the libc and icu native libraries' versions postgres was compiled with in its representations, which it does not do by default.
- strict-range-ctor-privs: codd will include ownership of constructor functions of custom range types in its representations and grantors of their privileges. Different installations of postgres seem to have different opinions on who the owner of such functions is, so codd does not set this by default.
- ignore-column-order: By default codd includes column order in its representations. This means possibly a lot of files touched when columns are removed from tables with lots of columns. For shops with enough discipline to not rely on column order, this more lax setting can be helpful.
So if you want codd in the strictest possible setting, for example, you can set CODD_SCHEMA_ALGO=strict-collations strict-range-ctor-privs
.
These first two settings, along other interesting details, are described in more detail in the sections that follow.
The system catalog is a namespace called pg_catalog
that comes with internal functions, operators, collations, and other database objects native to postgres.
It is always in the search path regardless of the search_path
setting and its objects take precedence over homonymous objects in other schemas. When you write SELECT 1+2
, this is effectively the same as using the +
operator from pg_catalog
, and could be written as SELECT 1 OPERATOR(pg_catalog.+) 2
.
The problem is that while this is fine for most things, some objects in the System Catalog are system-dependent. Collations are one such example.
According to https://www.postgresql.org/docs/13/collation.html, "initdb populates the system catalog pg_collation with collations based on all the locales it finds in the operating system at the time".
For an example look at psql's \dOS+
, pick a collation from pg_catalog
and run the following query:
-- Use a collation you have in your system if this one (Cherokee) doesn't work
SELECT 'abc' < 'abd' COLLATE "chr-x-icu";
The collation might exist in the development database but might not in the production server, which can make schemas match but queries fail in one server and work in another, since typically you don't want to check the pg_catalog
namespace.
One could think including pg_catalog
in the list of namespaces to be checked would fix this, but even if development databases contain a subset of production's collations, codd only does equality checks at the moment (and subset checks have other problems).
Our current recommendation is to create any collations you rely on in your app's schema and always use them namespace-qualified.
Important: codd's default decision is to relax collation representations, which could mean collations behave differently even when schemas match. The reason for that is that collations are affected by the specific version of native libraries used to compile postgres, which are really hard to match across different machines, particularly if you're using a cloud provider.
The following paragraphs describe strict mode, which you can enable by setting the environment variable CODD_SCHEMA_ALGO="strict-collations"
to make codd check even the versions of system libraries collations are implemented with. Even minor version automatic upgrades of postgres can make schemas differ in this mode.
Collations are implemented by calling into system libraries such as libc or icu. This means different binaries of the same PostgreSQL version can be linked to different versions of such libraries, which can lead to different collation behavior. It is possible to detect when versioned icu or libc provided collations are running in a postgres server linked to different versions of these libraries than the one collations were created with, because postgres will emit warnings like this one:
WARNING: collation "xx-x-icu" has version mismatch
This means that the version of the icu or libc library a collation was created with does not always determine a collation's behavior; what does is the system library the server is linked to. You can see both by querying:
-- Check the versions of system libraries collations were created with and the versions the postgres server is linked to
select oid, collname, collprovider, collversion AS created_with_version, pg_collation_actual_version(oid) AS system_library_version from pg_catalog.pg_collation;
The decision made in codd with strict-collations enabled is to verify both the created-with version and the current-library version for each collation. Verifying only current-library versions could trigger version mismatch warnings on one server but not on the other even though schemas match, and verifying only created-with versions could even lead to different behavior although schemas match.
When range types are created, postgres automatically creates two homonymous functions that build values of the newly created type, and two more to build multiranges when using at least Postgres v14. For example:
$ CREATE TYPE floatrange AS RANGE (subtype = float8,subtype_diff = float8mi);
$ SELECT floatrange(0.0, 10.0); -- Creates a range closed on both ends
floatrange
------------
[0,10)
(1 row)
$ SELECT floatrange(0.0, 10.0, '()'); -- Creates a range open on both ends
floatrange
------------
[0,10)
(1 row)
For some unknown reason - and I couldn't find this documented anywhere - the owner of these constructor functions is not the role their types were created with - on my machine I get postgres
(my local superuser) as the owner of these functions -, and on AWS RDS I get rdsadmin
, the superuser there.
So by default codd does not verify ownership of range and multirange constructors (but does check owners of any other functions). Grantors of privileges for such constructor functions are also ignored because they suffer from the same issue. You can still make codd check these in every case by adding strict-range-ctor-privs
to the CODD_SCHEMA_ALGO
environment variable.
This section contains decisions that probably behave the way you would expect, yet might surprise you a little if you pay attention to details. It is not a necessary read but might answer some of your questions.
For every user defined type, postgres creates an array type derived from it. This means if you create a type called complex
, postgres will create one named complex[]
(curiously the internal name that you find in the catalog is actually _complex
). You can read about this in https://www.postgresql.org/docs/10/sql-createtype.html.
Because postgres always does this we've decided in codd not to include array types in its representations. Existence of a type implies existence of an array type derived from it, and since the former is verified any differences will be detected. Array types cannot be dropped because of dependencies, so any two postgres clusters with a type sometype
that are the same must contain a type sometype[]
that is also the same.
For every table and view, a composite/row type is created automatically by postgres with a shape that represents the created relation. Because codd verifies tables and views, verifying these types created in their image would be redundant, and would imply two files being changed for every change to a table or view, which is less than ideal.