Protect Your Data with PostgreSQL Constraints

sql postgresql
Posted on: 2016-01-08

What should validate your data - your application code, or your database?

If you said "both", your answer is correct and you are entitled to feel smug for 1.5 seconds. And... stop.

While application code can generally produce friendlier user errors, only the database knows what data it has right now - even if the application code ran a SELECT a few milliseconds ago, there may have been an UPDATE since.

(You may think, "oh, that won't happen because I'll run them both in a transaction". OK, but you're still getting the database to figure it out, and it still won't work unless 1) you use serializable transactions (you probably don't) and 2) the database implements those correctly (PostgreSQL has a bug). See this paper for lots more.)

Ahem. So any attempts to validate data in relation to other data can only be reliably done by the database itself at the moment you insert or update data. That includes things like "does any user already have this username?" or "is this room already reserved?" or "is this account balance still sufficient to cover this charge?"

All this is is why I'm excited about some of PostgreSQL's correctness-enforcing features, like "exclusion constraints".

"What are those", you say? I'll tell you! But let's work our way up to that. Here's a quick run through of the techniques you can use in PostgreSQL to guarantee that your data is correct.

Level 0: Types

These are simple. PostgreSQL makes you define a type for every column, and you can't store "Jessica Jones" in a date column, even if you did take her to prom.

Less obviously, PostgreSQL will yell if you try to store a 50-byte string in a VARCHAR(30), the alternative being to throw away the rest of the string, which some databases may do. 😱

This is fine, but your application code can reliably format dates, so it's not very exciting.

Level 1: NOT NULL

You've seen this: declare a column as NOT NULL, and you won't be able to insert or update rows without supplying a value. (Unless you're using MySQL without "strict" mode and a transactional storage engine, in which case, whoops.😱 )

Again, your application code knows whether it has a real value to insert, so this is not exciting. Moving on.

Level 2: Check constraints (single row)

Now it's getting interesting. Check constraints let you say "what data is valid for a single row?" The database will reject any inserts or updates that violate the rule.

For instance, suppose we are building a reservation system for vacation properties. We have these tables:

#!sql
CREATE TABLE properties(
  id   serial  PRIMARY KEY NOT NULL,
  name varchar             NOT NULL
);

CREATE TABLE users(
  id       serial  PRIMARY KEY NOT NULL,
  email    varchar             NOT NULL,
  username varchar             NOT NULL
);

CREATE TABLE reservations (
  id            serial PRIMARY KEY NOT NULL,
  property_id   integer            NOT NULL,
  user_id       integer            NOT NULL,
  checkin_time  TIMESTAMP          NOT NULL,
  checkout_time TIMESTAMP          NOT NULL,
  status        varchar            NOT NULL default 'tentative'
);

Besides the NOT NULLs, we might ask the database to ensure that nobody can check out before they check in. We can use a check constraint for that:

#!sql
ALTER TABLE reservations ADD CONSTRAINT positive_duration
CHECK (checkout_time > checkin_time);

If we try to violate it:

#!sql
INSERT INTO reservations
  (property_id, user_id, checkin_time,          checkout_time)
VALUES
  (2,           1,       '2015-01-08 14:00:00', '2015-01-07 08:00:00');

PostgreSQL responds with an error:

ERROR:  23514: new row for relation "reservations" violates
  check constraint "positive_duration"
DETAIL:  Failing row contains
  (3, 2, 1, 2015-01-08 14:00:00, 2015-01-07 08:00:00, tentative).
SCHEMA NAME:  public
TABLE NAME:  reservations
CONSTRAINT NAME:  positive_duration
LOCATION:  ExecConstraints, execMain.c:1758

Other common-sense checks might include things like "price must be greater than 0", or "birth date can't be in the future".

In some cases, application code could verify these things just as well as the database. For example, if we want to say "name must be at least three characters long", we could do that with a CHECK constraint, but application code could handle it more easily.

But suppose (to borrow an example from Kevin Burke) we have multiple application processes updating the balance on an account. We don't want them both to simultaneously check the balance, find that it's $100, subtract $20, and issue an UPDATE to set balance = 80. In that case, we just lost $20! A better approach would be to have them both issue an UPDATE balance = balance - 20. That will result in two subtractions and the correct final amount. And a CHECK constraint could reject the second UPDATE if resulted in a negative balance.

Level 3: Exclusion constraints (row vs rest of table)

This is the really cool one. Unlike check constraints, exclusion constraints let you say "what data is valid for a row, considering the rows that already exist in this table?"

You've probably seen UNIQUE constraints. They say: "in what ways can this row not be the same as another row?" Conceptually, that's an exclusion constraint that tests for equality.

So these two constraints are equivalent (though an explicit index declaration may be the best approach):

#!sql
ALTER TABLE users ADD CONSTRAINT unique_username
UNIQUE (username);
-- OR
ALTER TABLE users ADD CONSTRAINT unique_username
EXCLUDE USING gist (username WITH =);

In the exclusion constraint version, we say the same thing, but in more general terms: "compare the new/updated row with the others, considering only the username values using a = comparison, and exclude the new/updated row if the comparison returns true - that is, if the username values are equal." The gist part is there because exclusion constraints are implemented using indexes.

Note that before trying any of these exclusion constraint examples, you'll need to enable the btree_gist extension, like this:

#!sql
CREATE EXTENSION btree_gist

Again, the exclusion constraint above works exactly the same as a UNIQUE constraint. But we can do a lot more with exclusion constraints. Exclusion constraints tell the database (conceptually) "When you're given a new or updated row, compare it against each existing row. Here's a list of comparisons to use. If all the comparisons between the new data and an existing row return true, reject the new data."

(It doesn't actually compare the rows one-by-one because constraints are indexes - more on that later.)

And we can do comparisons other than equality.

For instance, we might say "don't let multiple users rent the same property at the same time." In other words, if a new reservation has the same property_id as an existing reservation, AND their dates ranges overlap, reject the new reservation.

In a PostgreSQL query, we can build a range of two timestamps with tsrange, and check for overlap using the && operator. (The [] part is there to make the range inclusive.)

#!sql
ALTER TABLE reservations ADD CONSTRAINT no_overlapping_rentals
EXCLUDE USING gist (property_id WITH =,
tsrange("checkin_time", "checkout_time", '[]') WITH &&);

This time we have two conditions in the exclusion constraint: the new reservation row is invalid if it has a property_id that's = to that of an existing row, AND it has a date range that overlaps (&&) with the date range of an existing reservation. (Remember: the exclusion constraint fails if only if all the conditions it checks are true. It's OK to rent the same property the following week, or a different property for an overlapping period.)

Now let's make an initial reservation:

#!sql
INSERT INTO reservations
  (property_id, user_id, checkin_time,          checkout_time)
VALUES
  (1,           1,       '2015-01-08 14:00:00', '2015-01-09 10:00:00');

User 1 plans to check out at 10am on Friday. If we try to make another reservation for the same room and check in at 9am on Friday, before User 1 checks out:

#!sql
INSERT INTO reservations
  (property_id, user_id, checkin_time,          checkout_time)
VALUES
  (1,           2,       '2015-01-09 09:00:00', '2015-01-10 09:00:00');

...we get an error:

ERROR:  23P01: conflicting key value violates exclusion
constraint "no_overlapping_rentals"
...

However, we can reserve a non-overlapping time (by checking in at 11:00am):

#!sql
INSERT INTO reservations
  (property_id, user_id, checkin_time,          checkout_time)
VALUES
  (1,           2,       '2015-01-09 11:00:00', '2015-01-10 11:00:00'); -- OK

...or we can check into a different, unreserved property at 9am:

#!sql
INSERT INTO reservations
  (property_id, user_id, checkin_time,          checkout_time)
VALUES
 (2,            2,       '2015-01-09 09:00:00', '2015-01-10 09:00:00'); -- OK

We could try to go even further and say, for example, "we need 2 hours between one rental and the next to clean the property." But that's business logic that's likely to change, and database constraints are not as easy to change as code. Personally, I'd have the database constraints enforce only inviolable rules for the data, and do any context-dependent validations in the application code.

And be careful what you consider "inviolable." I'm fairly sure you don't want one multiple renters trying to occupy the same cabin at once. If you don't use a database constraint to prevent overlapping rentals, you'll need some other way to detect conflicts after they're created and do the messy cleanup - both with your database, and with your users' expectations.

But if your company has some internal calendaring software and somebody books herself for conflicting events, that's not a stop-everything situation. Maybe she plans to duck out early from one and peek in on the other. A warning in the UI is probably the best solution.

Can you believe it: Even More About Exclusion Constraints!

Constraints with WHERE

There are other ways to customize our constraints. Exclusion constraints are based on indexes, and PostgreSQL supports "partial indexes", meaning indexes with a WHERE clause. A typical use case might be to only include unpaid invoices in an index because you expect to search them more often than paid ones; this results in a smaller index and faster searches for the data you care about.

If we add a WHERE to an exclusion constraint's index, we can say that the constraint only applies to some of the rows. We could say, for example, "only one user can be the manager of a given store location":

#!sql
CREATE TABLE assignments(
 id serial PRIMARY KEY NOT NULL,
 employee_id integer,
 store_id integer,
 role varchar(128)
);

ALTER TABLE assignments
ADD CONSTRAINT one_manager_per_store
EXCLUDE USING gist (store_id with =) WHERE (role = 'manager');

Here, the WHERE means that the constraint ignores rows with role worker, but prevents multiple rows from being inserted that represent a manager for the same store_id.

Or, going back to our cabin reservation system, we might say "we can't have two overlapping rentals for the same property, but we can ignore cancelled reservations when enforcing this rule."

#!sql
-- remove the previous version of this constraint
ALTER TABLE reservations DROP CONSTRAINT no_overlapping_rentals;

ALTER TABLE reservations ADD CONSTRAINT no_overlapping_rentals
EXCLUDE USING gist (property_id WITH =,
tsrange("checkin_time", "checkout_time", '[]') WITH &&)
WHERE (status != 'cancelled');

Now if somebody cancels their reservation, we're free to make another one in its place.

Constraints can be deferred

Constraints can be set to DEFERRABLE, meaning "check them at the end of the transaction". This could be useful if, for example, you need to reassign employees; a store could have two managers momentarily, as long as one of them is reassigned by the end of the database transaction.

#!sql
-- Let's redo this to make it deferrable
ALTER TABLE assignments DROP CONSTRAINT one_manager_per_store;

ALTER TABLE assignments
ADD CONSTRAINT one_manager_per_store
EXCLUDE USING gist (store_id with =) WHERE (role = 'manager')
DEFERRABLE INITIALLY DEFERRED;

DEFERRABLE means "it's OK to wait until the end of the transaction to check this one", and INITIALLY DEFERRED means "actually do that by default; within a given transaction someone can SET CONSTRAINT one_manager_per_store IMMEDIATE to make it check immediately if they'd rather."

So with the deferrable constraint, you could do this:

#!sql
--- start a transaction
BEGIN;
-- These two rows are fine
INSERT INTO assignments (store_id, employee_id, role) VALUES (1, 1, 'worker');
INSERT INTO assignments (store_id, employee_id, role) VALUES (1, 2, 'worker');

INSERT INTO assignments (store_id, employee_id, role) VALUES (1, 3, 'manager');
-- This violates the "one_manager_per_store", but won't be rejected yet...
INSERT INTO assignments (store_id, employee_id, role) VALUES (1, 4, 'manager');

-- When we try to commit the transaction, since we haven't removed one of
-- those managers, we'll get a constraint violation
COMMIT;

Constraints can use other comparisons

So far, I've shown exclusion constraints using = (equality) and && (overlap). Overlap is pretty interesting: it could be used to compare ranges of weights, dates or times, or to check for overlapping areas on a map (eg, to say that sales territories may not overlap).

But you can use any commutative comparison in exclusion constraints - that is, any comparison where the order of the things being compared doesn't matter. So > isn't allowed, because 5 > 4 is different from 4 > 5, and we have no way to specify the order of the rows in the comparison. However, <> ("not equal") is allowed. This would make the constraint "un-unique" for a given column: all values must be the same.

To explain, I'll borrow an example from a presentation by Jeff Davis, who implemented exclusion constraints for PostgreSQL 9.0). Jeff said that in a zoo management system, you could express the idea "all animals assigned to the same enclosure must be the same species" by saying "don't allow new assignments rows where the enclosure_id is equal (=) to another assignment's AND the species_id is unequal (<>) to the other assignment's".

Whew! Given all these options, I'm sure there are neat applications I haven't thought of yet. Send me an email if you know of others.

Back to less exotic features...

Level 4: Foreign key constraints (row vs another table)

You can't rent a property that doesn't exist, so every row in rentals must have a property_id that matches the id of a row in properties.

#!sql
ALTER TABLE reservations ADD FOREIGN KEY(property_id)
REFERENCES properties(id);

Besides guaranteeing that nobody reserves a property that never existed, or was deleted a moment before, this allows things like ON DELETE RESTRICT, meaning you can't delete any property that has reservations, or ON DELETE CASCADE meaning that if you delete a post, its comments should also be deleted.

You can't put a WHERE on a foreign key constraint, but you can make it deferrable:

#!sql
ALTER TABLE reservations
ADD CONSTRAINT reservations_property_id_fk
FOREIGN KEY(property_id) REFERENCES properties(id)
DEFERRABLE INITIALLY DEFERRED;

Conclusion

Some of these checks can and should be done by application code before it sends data to the database - for instance, "user must provide an email address".

But some must be done by the database, or else the application may corrupt your data.

And rather than "do a SELECT before my INSERT and try really hard to use the right kind of transactions for that", it's simpler to take an optimistic approach: try to make the change, let the database do the check and raise an error if the change fails. That's the approach Elixir's Ecto library takes - let the database enforce these constraints, and if they fail, wrap the error in a user-friendly message like any other validation problem.

I think that's a very interesting approach, and one that has a good chance of letting my data live a long and healthy life.


Thanks to Kevin Burke, whose excellent Weird Tricks to Write Faster, More Correct Database Queries got me started learning about some of these things.