This post was originally published on the Big Nerd Ranch blog.
If your data validation doesn't involve your database, you're asking for conflicts.
Many choices in programming are matters of opinion: which framework is better, which code is clearer, etc. But this one is different.
Application code alone cannot prevent conflicting data. To explain why, let's start with a familiar example.
## Here Come the Conflicts
Say you're writing a web app where every user needs a unique username. How do you prevent duplicates?
If you're using Active Record or a Rails-inspired ORM, you might write something like `validates :username, uniqueness: true`. But you probably know that's not enough.
Your app probably has a way to process more than one request at a time, using multiple processes, threads, servers, or some combination of the three.
If so, multiple simultaneous requests could claim the same username.
You'll have to add a unique index on the `username` column to guarantee there are no duplicates.
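In PostgreSQL, that's a one-liner. Here's a sketch, assuming a `users` table like the one in this post (in a Rails migration, `add_index :users, :username, unique: true` generates the equivalent):

```sql
-- The database will now reject any INSERT or UPDATE that would
-- produce a second row with the same username.
CREATE UNIQUE INDEX index_users_on_username ON users (username);
```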
This is standard advice. But there's an underlying principle that you may have missed: any validation that depends on what's currently in the database is unreliable without a guarantee from the database itself.
Such validations include:
- Does any user have this username right now?
- Does this blog post still exist for me to add a comment right now?
- Does this user's account have enough money to cover this purchase right now?
- Does this rental property have an existing reservation whose date range overlaps the one I'm requesting right now?
- Is there an employee with the job title "manager" assigned to this store right now?
In each of these cases, your application code can check what's in the database before writing. But in each case, there's a race condition: things may change between the time it checks and the time it writes. Other requests are being handled simultaneously, and your multiple application processes do not coordinate their work. Only the database is involved in every modification of data, so only it can prevent conflicts.
There are multiple ways you can enlist your database's help here, but the simplest and most performant option is to use database constraints.
## Why Transactions Aren't Enough
Before I explain constraints, let's look at some of the alternatives.
One alternative is to use database transactions.
That's what `ActiveRecord` is doing when I call `user.save` with a uniqueness validation in place, according to `log/development.log`:
```sql
BEGIN
SELECT 1 AS one FROM "users"
  WHERE "users"."username" = $1
  LIMIT $2  [["username", "doofenshmirtz"], ["LIMIT", 1]]
-- application code checks the result of the SELECT
-- and decides to proceed
INSERT INTO "users" ("username")
  VALUES ($1)
  RETURNING "id"  [["username", "doofenshmirtz"]]
COMMIT
```
A transaction guarantees that either all of its statements take effect or none of them do. And in this case, if Active Record finds an existing user with this username, it aborts the transaction. So why isn't it enough?
Well, it could be - depending on your database settings.
Transactions run at varying isolation levels, which answer the question "how much can concurrent transactions affect this one?"
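If you're not sure what level your app is getting, you can ask the database. In PostgreSQL:

```sql
-- PostgreSQL defaults to "read committed" unless you've configured otherwise.
SHOW transaction_isolation;
```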
The good news is that if you use "serializable isolation" for your transactions, the database will guarantee that no other users are inserted between the `SELECT` and `INSERT` above.
If two concurrent transactions try to insert a user with the username `doofenshmirtz`, the second one to complete will be rolled back, because the database knows that the second `SELECT` would be affected by the first `INSERT`.
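You have to opt into that behavior per transaction (or in your connection settings). Here's a minimal sketch of the same flow in plain SQL, assuming the `users` table from earlier:

```sql
BEGIN ISOLATION LEVEL SERIALIZABLE;

SELECT 1 AS one FROM users
  WHERE username = 'doofenshmirtz' LIMIT 1;
-- application code checks the result and decides to proceed
INSERT INTO users (username) VALUES ('doofenshmirtz');

-- If a concurrent serializable transaction inserted the same username,
-- PostgreSQL aborts one of the two with a serialization failure (SQLSTATE 40001).
COMMIT;
```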
The bad news is that you're probably not using serializable isolation. And the worse news is that (at least in the case of PostgreSQL 9.6.1) using serializable isolation can produce false positives - rolling back your transaction when a different username was inserted. The PostgreSQL error message acknowledges this by saying "HINT: The transaction might succeed if retried."
Writing application code to retry insertions sounds far too messy to me. And if concurrent requests have to wait in line to add a user, that's a serious performance bottleneck - especially if the same request has to return to the back of the "retry line" repeatedly.
## Lock It Up
Another alternative is to use database locking. Essentially, you say "nobody else can use this data until I've finished." There are various levels of locking - prohibiting reads or only writes, and locking a single row or an entire table.
While locking a row may not be a problem, the "no duplicate usernames" case would require locking the entire table for writes while your application code fetches and examines the results of the `SELECT`.
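That might look like this sketch (PostgreSQL syntax; the lock is held until the transaction ends):

```sql
BEGIN;
-- EXCLUSIVE mode still allows concurrent reads,
-- but blocks all concurrent writes to the table.
LOCK TABLE users IN EXCLUSIVE MODE;

SELECT 1 AS one FROM users
  WHERE username = 'doofenshmirtz' LIMIT 1;
-- application code examines the result, then inserts or rolls back
INSERT INTO users (username) VALUES ('doofenshmirtz');
COMMIT;
```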
Like serializable isolation, that creates a performance bottleneck.
And if you ever use multiple locks, you have to be careful that you don't create a deadlock.
## Using Constraints
Constraints are a much more targeted tool. Constraints are rules you set in the database about what's valid for a row, a table, or the relationships between them. The database will reject any data that violates those rules.
We've already seen unique indexes (which are a kind of uniqueness constraint). They reliably solve the "no duplicates" problem.
You may also know about foreign key constraints, which say "we can't have a row in the `comments` table with a `post_id` of `5` unless the `posts` table has a row with an `id` of `5`".
This guarantees that even if a post is deleted while a comment is being submitted, you won't create an orphan comment.
(It also forces you to decide whether to allow someone to delete a post that already has comments, and if so, what should happen to those comments.
Without foreign keys, you'll likely be cleaning up that mess later.)
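In SQL, the constraint and that decision are declared together. Here's a sketch, assuming the `comments` and `posts` tables above, where deleting a post also deletes its comments:

```sql
ALTER TABLE comments
  ADD CONSTRAINT comments_post_id_fkey
  FOREIGN KEY (post_id) REFERENCES posts (id)
  ON DELETE CASCADE;  -- or ON DELETE RESTRICT to forbid deleting a post with comments
```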
But if you're using PostgreSQL, there are other kinds of constraints available.
## Check Constraints
For example, a check constraint could be used to say "an account balance may never be less than zero".
```sql
ALTER TABLE accounts
  ADD CONSTRAINT positive_balance CHECK (balance >= 0);
```
This rule applies to each row in the `accounts` table.
If you don't see a need for this, you may have another race condition you haven't thought about.
Suppose a business makes a purchase of $20.
Does your code read the current $100 balance, subtract $20, and `UPDATE` the balance to $80?
If so, there's a race condition between the read and write.
If two users make simultaneous $20 purchases, the second request to finish processing would set the balance to $80 when it should be $60.
You could prevent that by locking the row (a reasonable solution in this case), or you could ask your database to calculate the balance:
```sql
UPDATE accounts
   SET balance = balance - 20
 WHERE id = 1;
```
With this approach, both purchases would subtract $20 correctly, and the `CHECK` constraint would ensure that if there aren't enough funds in the account, the purchase fails[1].
## Exclusion Constraints
Exclusion constraints are less widely known, but they could save your bacon.
Imagine a rental reservation system with `properties` and `reservations`. Each reservation has a `property_id`, a `start_date`, and an `end_date`.
Clearly, you can't allow conflicting reservations if you want happy customers. If Alice has rented a cabin from June 15-20, Bob must not be allowed to rent it from June 12-21 or any other overlapping period. But how do you guarantee this doesn't happen?
You can check for overlaps in your application code, but if both rentals are submitted simultaneously, you might accidentally book them both. You could lock the table for writes while performing this check, but that's not very performant. You could go ahead and book both, but have a cron job to check for messes and a customer service team to apologize...
The cleanest solution would be to set this rule in the database: "no two `reservations` for the same `property_id` can have overlapping date ranges."
```sql
ALTER TABLE reservations
  ADD CONSTRAINT no_overlapping_rentals
  EXCLUDE USING gist (
    property_id WITH =,
    daterange("start_date", "end_date", '[]') WITH &&
  );
```
This says "if the `property_id` of an existing reservation is `=` to mine, and its inclusive (`[]`) date range overlaps mine (`&&`), my reservation is invalid".
(Unique and exclusion constraints are implemented via indexes, and this one uses a `gist` index.)
With this in place, we're guaranteed not to get conflicting reservations.
You can do even fancier things, like restricting this rule to reservations `WHERE status = 'active'`, so that cancelled reservations don't prevent new ones.
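Here's a sketch of that variant, assuming a `status` column on `reservations` (note that using `=` on a plain scalar column like `property_id` in a gist index also requires the `btree_gist` extension):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE reservations
  ADD CONSTRAINT no_overlapping_active_rentals
  EXCLUDE USING gist (
    property_id WITH =,
    daterange("start_date", "end_date", '[]') WITH &&
  )
  WHERE (status = 'active');  -- cancelled reservations are ignored
```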
If you want more detail about exclusion constraints, see my write-up here. But before we wrap up, let's talk about why you might not be using constraints already.
## Logic in the Database?
One objection to techniques like these is that they put "business logic" in the database.
Way back in 2005, David Heinemeier Hansson, creator of Ruby on Rails, wrote a post called "Choose a Single Layer of Cleverness", in which he dismissed "stored procedures and constraints" as destroying the "coherence" of a system.
> I want a single layer of cleverness: My domain model. Object-orientation is all about encapsulating clever. Letting it sieve half ways through to the database is a terrible violation of those fine intentions.
I don't know what systems DHH was reacting to, but I can imagine being frustrated to find that my code only told half the story about how the application behaves, or that the logic in SQL wasn't deployed in lockstep with my application code.
I also don't know to what extent DHH has changed his mind. Rails appears to have supported unique indexes since version 1.0, and finally added foreign key support in Rails 4.2 in December 2014.
But Rails, like the frameworks that sprang up after it in other languages, still doesn't support database constraints nearly as well as it supports validations.
This may be partly due to its early "database agnostic" approach; MySQL, for example, doesn't even have `CHECK` or `EXCLUDE` constraints.
But even if you use supported constraints (like unique indexes and foreign keys), violations will result in an `ActiveRecord` exception; it's up to you to rescue it and provide meaningful user feedback.
And DHH's view is still echoed in The Rails Guides:
> Database constraints and/or stored procedures make the validation mechanisms database-dependent and can make testing and maintenance more difficult. However... database-level validations can safely handle some things (such as uniqueness in heavily-used tables) that can be difficult to implement otherwise.... [but] it's the opinion of the Rails team that model-level validations are the most appropriate in most circumstances.
They have a point. Clearly, validations for things like "products must have a price" or "email addresses must contain an '@'" can be safely handled by application code. And even when validations are insufficient to guarantee correctness, your application might not get heavy enough traffic for you to notice.
As the authors of the paper "Feral Concurrency Control" wrote:
> Empirically, even under worst-case workloads, these validations [for things like uniqueness] result in order-of-magnitude reductions in inconsistency. Under less pathological workloads, they may eliminate it. It is possible that, in fact, the degree of concurrency and data contention within Rails-backed applications simply does not lead to these concurrency races - that, in some sense, validations are "good enough" for many applications.
The reason validations help so much is this: if you're trying to reserve a username that was already taken, it's much more likely that it was claimed in a request before yours than in a request nearly simultaneous to yours. And if it was taken, say, yesterday, the validation will prevent your duplicate.
## Don't Leave It To Chance
But fundamentally, your multiple Rails application instances do not coordinate their actions, so they can't prevent inconsistent data completely without leaning on the database via locks, constraints, or serializable transactions. And in my view, constraints are the most straightforward of these options: they don't require retry loops like serializable transactions do, they can't create deadlocks, and they have very little impact on performance.
Bottom line: I'm not happy with leaving data integrity to chance, especially when - as in the case of a doubly-booked property - an error would lead to very unhappy customers.
In my next post, I'll give some thoughts on when not to use constraints. I'll also suggest how to provide a good user and developer experience when you do use them.
[1] Example borrowed from Kevin Burke's post, "[Weird Tricks to Write Faster, More Correct Database Queries](https://kev.inburke.com/kevin/faster-correct-database-queries/)"