What should validate your data - your application code, or your database?
If you said "both", your answer is correct and you are entitled to feel smug for 1.5 seconds. And... stop.
While application code can generally produce friendlier user errors, only the database knows what data it has right now - even if the application code ran a
SELECT a few milliseconds ago, there may have been an
(You may think, "oh, that won't happen because I'll run them both in a transaction". OK, but you're still getting the database to figure it out, and it still won't work unless 1) you use serializable transactions (you probably don't) and 2) the database implements those correctly (PostgreSQL has a bug). See this paper for lots more.)
Ahem. So any attempts to validate data in relation to other data can only be reliably done by the database itself at the moment you insert or update data. That includes things like "does any user already have this username?" or "is this room already reserved?" or "is this account balance still sufficient to cover this charge?"
All this is is why I'm excited about some of PostgreSQL's correctness-enforcing features, like "exclusion constraints".
"What are those", you say? I'll tell you! But let's work our way up to that. Here's a quick run through of the techniques you can use in PostgreSQL to guarantee that your data is correct.
Level 0: Types
These are simple. PostgreSQL makes you define a type for every column, and you can't store "Jessica Jones" in a date column, even if you did take her to prom.
Less obviously, PostgreSQL will yell if you try to store a 50-byte string in a
VARCHAR(30), the alternative being to throw away the rest of the string, which some databases may do. 😱
This is fine, but your application code can reliably format dates, so it's not very exciting.
You've seen this: declare a column as
NOT NULL, and you won't be able to insert or update rows without supplying a value. (Unless you're using MySQL without "strict" mode and a transactional storage engine, in which case, whoops.😱 )
Again, your application code knows whether it has a real value to insert, so this is not exciting. Moving on.
Level 2: Check constraints (single row)
Now it's getting interesting. Check constraints let you say "what data is valid for a single row?" The database will reject any inserts or updates that violate the rule.
For instance, suppose we are building a reservation system for vacation properties. We have these tables:
CREATE TABLE properties( id serial PRIMARY KEY NOT NULL, name varchar NOT NULL ); CREATE TABLE users( id serial PRIMARY KEY NOT NULL, email varchar NOT NULL, username varchar NOT NULL ); CREATE TABLE reservations ( id serial PRIMARY KEY NOT NULL, property_id integer NOT NULL, user_id integer NOT NULL, checkin_time TIMESTAMP NOT NULL, checkout_time TIMESTAMP NOT NULL, status varchar NOT NULL default 'tentative' );
NOT NULLs, we might ask the database to ensure that nobody can check out before they check in. We can use a check constraint for that:
ALTER TABLE reservations ADD CONSTRAINT positive_duration CHECK (checkout_time > checkin_time);
If we try to violate it:
INSERT INTO reservations (property_id, user_id, checkin_time, checkout_time) VALUES (2, 1, '2015-01-08 14:00:00', '2015-01-07 08:00:00');
PostgreSQL responds with an error:
ERROR: 23514: new row for relation "reservations" violates check constraint "positive_duration" DETAIL: Failing row contains (3, 2, 1, 2015-01-08 14:00:00, 2015-01-07 08:00:00, tentative). SCHEMA NAME: public TABLE NAME: reservations CONSTRAINT NAME: positive_duration LOCATION: ExecConstraints, execMain.c:1758
Other common-sense checks might include things like "price must be greater than 0", or "birth date can't be in the future".
In some cases, application code could verify these things just as well as the database. For example, if we want to say "name must be at least three characters long", we could do that with a
CHECK constraint, but application code could handle it more easily.
But suppose (to borrow an example from Kevin Burke) we have multiple application processes updating the
balance on an account. We don't want them both to simultaneously check the balance, find that it's $100, subtract $20, and issue an
UPDATE to set
balance = 80. In that case, we just lost $20! A better approach would be to have them both issue an
UPDATE balance = balance - 20. That will result in two subtractions and the correct final amount. And a
CHECK constraint could reject the second
UPDATE if resulted in a negative balance.
Level 3: Exclusion constraints (row vs rest of table)
This is the really cool one. Unlike check constraints, exclusion constraints let you say "what data is valid for a row, considering the rows that already exist in this table?"
You've probably seen
UNIQUE constraints. They say: "in what ways can this row not be the same as another row?" Conceptually, that's an exclusion constraint that tests for equality.
So these two constraints are equivalent (though an explicit index declaration may be the best approach):
ALTER TABLE users ADD CONSTRAINT unique_username UNIQUE (username); -- OR ALTER TABLE users ADD CONSTRAINT unique_username EXCLUDE USING gist (username WITH =);
In the exclusion constraint version, we say the same thing, but in more general terms: "compare the new/updated row with the others, considering only the
username values using a
= comparison, and exclude the new/updated row if the comparison returns true - that is, if the
username values are equal." The
gist part is there because exclusion constraints are implemented using indexes.
Note that before trying any of these exclusion constraint examples, you'll need to enable the
btree_gist extension, like this:
CREATE EXTENSION btree_gist
Again, the exclusion constraint above works exactly the same as a
UNIQUE constraint. But we can do a lot more with exclusion constraints. Exclusion constraints tell the database "here's a list of comparisons to make between a new or updated row and all the existing rows in the table. If all the comparisons return true, reject the change."
And we can do comparisons other than equality.
For instance, we might say "don't let multiple users rent the same property at the same time." In other words, if a new reservation has the same
property_id as an existing reservation, AND their dates ranges overlap, reject the new reservation.
In a PostgreSQL query, we can build a range of two timestamps with
tsrange, and check for overlap using the
&& operator. (The
 part is there to make the range inclusive.)
ALTER TABLE reservations ADD CONSTRAINT no_overlapping_rentals EXCLUDE USING gist (property_id WITH =, tsrange("checkin_time", "checkout_time", '') WITH &&);
This time we have two conditions in the exclusion constraint: the new reservation row is invalid if it has a
= to that of another row, AND it has a date range that overlaps (
&&) with the date range of another reservation. (An exclusion constraint fails if ALL the conditions it checks are true.)
Now let's make an initial reservation:
INSERT INTO reservations (property_id, user_id, checkin_time, checkout_time) VALUES (1, 1, '2015-01-08 14:00:00', '2015-01-09 10:00:00');
User 1 plans to check out at 10am on Friday. If we try to make another reservation for the same room and check in at 9am on Friday, before User 1 checks out:
INSERT INTO reservations (property_id, user_id, checkin_time, checkout_time) VALUES (1, 2, '2015-01-09 09:00:00', '2015-01-10 09:00:00');
...we get an error:
ERROR: 23P01: conflicting key value violates exclusion constraint "no_overlapping_rentals" ...
However, we can reserve a non-overlapping time (by checking in at 11:00am):
INSERT INTO reservations (property_id, user_id, checkin_time, checkout_time) VALUES (1, 2, '2015-01-09 11:00:00', '2015-01-10 11:00:00'); -- OK
...or we can check into a different, unreserved property at 9am:
INSERT INTO reservations (property_id, user_id, checkin_time, checkout_time) VALUES (2, 2, '2015-01-09 09:00:00', '2015-01-10 09:00:00'); -- OK
We could try to go even further and say, for example, "we need 2 hours between one rental and the next to clean the property." But that's business logic that's likely to change, and database constraints are not as easy to change as code. Personally, I'd have the database constraints enforce only inviolable rules for the data, and do any context-dependent validations in the application code.
And be careful what you consider "inviolable." I'm fairly sure you don't want one multiple renters trying to occupy the same cabin at once. If you don't use a database constraint to prevent overlapping rentals, you'll need some other way to detect conflicts after they're created and do the messy cleanup - both with your database, and with your users' expectations.
But if your company has some internal calendaring software and somebody books herself for conflicting events, that's not a stop-everything situation. Maybe she plans to duck out early from one and peek in on the other. A warning in the UI is probably the best solution.
Can you believe it: Even More About Exclusion Constraints!
There are other ways to customize our constraints. Exclusion constraints are based on indexes, and PostgreSQL supports "partial indexes", meaning indexes with a
WHERE clause. A typical use case might be to only include unpaid invoices in an index because you expect to search them more often than paid ones; this results in a smaller index and faster searches for the data you care about.
If we add a
WHERE to an exclusion constraint's index, we can say that the constraint only applies to some of the rows. We could say, for example, "only one user can be the manager of a given store location":
CREATE TABLE assignments( id serial PRIMARY KEY NOT NULL, employee_id integer, store_id integer, role varchar(128) ); ALTER TABLE assignments ADD CONSTRAINT one_manager_per_store EXCLUDE USING gist (store_id with =) WHERE (role = 'manager');
WHERE means that the constraint ignores rows with role
worker, but prevents multiple rows from being inserted that represent a
manager for the same
Or, going back to our cabin reservation system, we might say "we can't have two overlapping rentals for the same property, but we can ignore cancelled reservations when enforcing this rule."
-- remove the previous version of this constraint ALTER TABLE reservations DROP CONSTRAINT no_overlapping_rentals; ALTER TABLE reservations ADD CONSTRAINT no_overlapping_rentals EXCLUDE USING gist (property_id WITH =, tsrange("checkin_time", "checkout_time", '') WITH &&) WHERE (status != 'cancelled');
Now if somebody cancels their reservation, we're free to make another one in its place.
Constraints can be deferred
Constraints can be set to
DEFERRABLE, meaning "check them at the end of the transaction". This could be useful if, for example, you need to reassign employees; a store could have two managers momentarily, as long as one of them is reassigned by the end of the database transaction.
-- Let's redo this to make it deferrable ALTER TABLE assignments DROP CONSTRAINT one_manager_per_store; ALTER TABLE assignments ADD CONSTRAINT one_manager_per_store EXCLUDE USING gist (store_id with =) WHERE (role = 'manager') DEFERRABLE INITIALLY DEFERRED;
DEFERRABLE means "it's OK to wait until the end of the transaction to check this one", and
INITIALLY DEFERRED means "actually do that by default; within a given transaction someone can
SET CONSTRAINT one_manager_per_store IMMEDIATE to make it check immediately if they'd rather."
So with the deferrable constraint, you could do this:
--- start a transaction BEGIN; -- These two rows are fine INSERT INTO assignments (store_id, employee_id, role) VALUES (1, 1, 'worker'); INSERT INTO assignments (store_id, employee_id, role) VALUES (1, 2, 'worker'); INSERT INTO assignments (store_id, employee_id, role) VALUES (1, 3, 'manager'); -- This violates the "one_manager_per_store", but won't be rejected yet... INSERT INTO assignments (store_id, employee_id, role) VALUES (1, 4, 'manager'); -- When we try to commit the transaction, since we haven't removed one of -- those managers, we'll get a constraint violation COMMIT;
Constraints can use other comparisons
So far, I've shown exclusion constraints using
= (equality) and
&& (overlap). Overlap is pretty interesting: it could be used to compare ranges of weights, dates or times, or to check for overlapping areas on a map (eg, to say that sales territories may not overlap).
But you can use any commutative comparison in exclusion constraints - that is, any comparison where the order of the things being compared doesn't matter. So
> isn't allowed, because
5 > 4 is different from
4 > 5, and we have no way to specify the order of the rows in the comparison. However,
<> ("not equal") is allowed. This would make the constraint "un-unique" for a given column: all values must be the same.
To explain, I'll borrow an example from a presentation by Jeff Davis, who implemented exclusion constraints for PostgreSQL 9.0). Jeff said that in a zoo management system, you could express the idea "all animals assigned to the same enclosure must be the same species" by saying "don't allow new
assignments rows where the
enclosure_id is equal (
=) to another assignment's AND the
species_id is unequal (
<>) to the other assignment's".
Whew! Given all these options, I'm sure there are neat applications I haven't thought of yet. Send me an email if you know of others.
Back to less exotic features...
Level 4: Foreign key constraints (row vs another table)
You can't rent a property that doesn't exist, so every row in
rentals must have a
property_id that matches the
id of a row in
ALTER TABLE reservations ADD FOREIGN KEY(property_id) REFERENCES properties(id);
Besides guaranteeing that nobody reserves a property that never existed, or was deleted a moment before, this allows things like
ON DELETE RESTRICT, meaning you can't delete any property that has reservations, or
ON DELETE CASCADE meaning that if you delete a post, its comments should also be deleted.
You can't put a
WHERE on a foreign key constraint, but you can make it deferrable:
ALTER TABLE reservations ADD CONSTRAINT reservations_property_id_fk FOREIGN KEY(property_id) REFERENCES properties(id) DEFERRABLE INITIALLY DEFERRED;
Some of these checks can and should be done by application code before it sends data to the database - for instance, "user must provide an email address".
But some must be done by the database, or else the application may corrupt your data.
And rather than "do a
SELECT before my
INSERT and try really hard to use the right kind of transactions for that", it's simpler to take an optimistic approach: try to make the change, let the database do the check and raise an error if the change fails. That's the approach Elixir's Ecto library takes - let the database enforce these constraints, and if they fail, wrap the error in a user-friendly message like any other validation problem.
I think that's a very interesting approach, and one that has a good chance of letting my data live a long and healthy life.
Thanks to Kevin Burke, whose excellent Weird Tricks to Write Faster, More Correct Database Queries got me started learning about some of these things.