Sourcegraph stores most data in a PostgreSQL database. Git repositories, uploaded user content (e.g., image attachments in issues) are stored on the filesystem.
Our minimum supported version is 9.6
which means you must use that for development to ensure we don't use Postgres features introduced after that version.
For Ubuntu 18.04, you will need to add a repository source. Use the PostgreSQL.org official repo and instructions.
Sourcegraph assumes it has a dedicated PostgreSQL server, or at least that you can make global configuration changes, such as changing the timezone. If you need to use other settings for other databases, use a separate PostgreSQL instance.
After installing PostgreSQL, set up up a sourcegraph
user and database:
sudo su - postgres # this line only needed for Linux
createdb
createuser --superuser sourcegraph
psql -c "ALTER USER sourcegraph WITH PASSWORD 'sourcegraph';"
createdb --owner=sourcegraph --encoding=UTF8 --template=template0 sourcegraph
Then update your postgresql.conf
default timezone to UTC. Determine the location
of your postgresql.conf
by running psql -c 'show config_file;'
. Update the line beginning
with timezone =
to the following:
timezone = 'UTC'
Finally, restart your database server (mac: brew services restart postgresql
, recent linux
probably service postgresql restart
)
The Sourcegraph server reads PostgreSQL connection configuration from
the
PG*
environment variables;
for example, in your ~/.bashrc
:
export PGPORT=5432
export PGHOST=localhost
export PGUSER=sourcegraph
export PGPASSWORD=sourcegraph
export PGDATABASE=sourcegraph
export PGSSLMODE=disable
You can also use a tool like [envdir
][s] or a .dotenv
file to
source these env vars on demand when you start the server.
To test the environment's credentials, run psql
(the PostgreSQL CLI
client) with the PG*
environment variables set. If you see a
database prompt, then the environment's credentials are valid.
If you get an error message about "peer authentication", you are
probably connecting over the Unix domain socket, rather than over TCP.
Make sure you've set PGHOST
. (Postgres can do peer authentication
on local sockets, which provides reliable identification but must
be specially configured to authenticate you as a user with a name
different from your account name.)
Migrations get applied automatically at application startup - you shouldn't need to run anything by hand. For full documentation see migrations/README.md
Here is the preferred style going forward. Existing tables may be inconsistent with this style.
Use a NOT NULL
constraint whenever possible to enforce having a value on every column. NULL
values can easily introduce errors when not handled correctly, and for many fields it makes sense to always have a value anyways.
For example, a "revision" text
column can use NULL
to represent no revision, or instead ""
(empty string). On the other hand, it makes sense to represent a "deleted_at" timestamp
field as NULL
, meaning "this row has not been deleted".
When NULL fields are necessary, remember to use Null*
types in Go when querying this data. Otherwise row.Scan
will error after encountering a NULL
value.
var s sql.NullString
// Column name can be NULL
err := db.QueryRow("SELECT name FROM foo WHERE id=?", id).Scan(&s)
...
if s.Valid {
// use s.String
} else {
// NULL value
}
id
auto increment primary key.created_at
not null defaultnow()
set when a row is first inserted and never updated after that.updated_at
not null defaultnow()
set when a row is first inserted and updated on every update.deleted_at
set to a not null timestamp to indicate the row is deleted (called soft deleting). This is preferred over hard deleting data from our db (see discussion section below).- When querying the db, rows with a non-null
deleted_at
should be excluded.
- When querying the db, rows with a non-null
The timestamps are useful for forensics if something goes wrong, they do not necessarily need to be used or exposed by our graphql APIs. There is no harm in exposing them though.
Example:
CREATE TABLE "widgets" (
"id" bigserial NOT NULL PRIMARY KEY,
"created_at" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
"updated_at" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
"deleted_at" TIMESTAMP WITH TIME ZONE,
);
Definitions:
- A "hard" delete is when rows are deleted using
DELETE FROM table WHERE ...
- A "soft" delete is when rows are deleted using
UPDATE table SET deleted_at = now() WHERE ...
Hard deletes are hard to recover from if something goes wrong (application bug, bad migration, manual query, etc.). This usually involves restoring from a backup and it is hard to target only the data affected by the bad delete.
Soft deletes are easier to recover from once you determine what happened. You can simply find the affected rows and UPDATE table SET deleted_at = null WHERE ...
.
Soft deleting data has implications for unique constraints.
Consider a hypothetical schema:
CREATE TABLE "orgs" (
"id" serial NOT NULL PRIMARY KEY
);
CREATE TABLE "users" (
"id" serial NOT NULL PRIMARY KEY
);
CREATE TABLE "users_orgs" (
"id" serial NOT NULL PRIMARY KEY,
"user_id" integer NOT NULL,
"org_id" integer NOT NULL,
CONSTRAINT user_orgs_references_orgs
FOREIGN KEY (org_id)
REFERENCES orgs (id) ON DELETE RESTRICT,
CONSTRAINT users_references_users
FOREIGN KEY (user_id)
REFERENCES users (id) ON DELETE RESTRICT,
UNIQUE (user_id, org_id)
);
Removing a user from an org deletes the row from user_orgs
.
Adding a user inserts a row to user_orgs
. If the user is already a user of the org, the insert fails.
If we wanted to keep a record of membership, it would need to be in a separate audit log table.
Removing a user from an org sets a non-null timestamp on the deleted_at
column for the row.
Adding a user to an org sets deleted_at = null
if there is already an existing record for that combination of user_id
and org_id
, else a new record is inserted.
Alternatively, we could remove the unique constraint on user_id
and org_id
and always insert in the add user case (after checking to see if the user is in the org). This would then function as an audit log table.
The decision here can be made on a table by table basis.
If you have a column that references another column in the database, add a foreign key constraint.
There are reasons to not use foreign keys at scale, but we are not at scale and we can drop these in the future if they become a problem.
Foreign key constraints should not cascade deletes for a few reasons:
- We don't want to accidentally delete a lot of data (either from our application, or from a manual query in prod).
- If we ever add new tables that depend on other tables via foreign key, it is not necessarily the case that cascading the delete is correct for the new table. Explicit application code is better here.
- If we ever get to the point of sharding the db, we will probably need to drop all foreign key constraints so it would be great if we did not make our code depend on cascading delete behavior.
Instead of cascading deletes, applications should explicitly delete the rows that would otherwise get deleted if cascading deletes were enabled.
Tables are plural (e.g. repositories, users, comments, etc.).
Join tables should be named based on the two tables being joined (e.g. foo_bar
joins foo
and bar
).
To the extent that certain fields require validation (e.g. username) we should perform that validation in client AND EITHER the database when possible, OR the graphql api. This results in the best experience for the client, and protects us from corrupt data.
Because a trigger resides in the database and anyone who has the required privilege can use it, a trigger lets you write a set of SQL statements that multiple applications can use. It lets you avoid redundant code when multiple programs need to perform the same database operation.
Triggers are usually a good tool for:
- Computing derived column values automatically.
- Enforcing complex integrity constraints (e.g. when a faster CHECK constraint isn't powerful enough).
- Maintaining derived tables (e.g creating an audit trail of activity in the database).
Triggers are often not a good tool for:
- Implementing complex application logic.
Whatever you end up doing, ensure that the affected code paths are tested appropriately. For instance, if you created a trigger that populates a column, you should test that a record that is written can be read back with the affected column set to what you expect.
A test like this ought to exercise your data access module, regardless of the design pattern you use to implement it (Store
, Repo
, DAO
, DB
, etc). If you refactor your code to move some functionality from the trigger to the application code, or vice-versa, your tests should not need to change.
Here's an example of how you could structure such tests.
If you're uncertain about using triggers as part of your work, do some research before committing to a solution and don't hesitate to discuss it with your peers.