I recently had to combine two requirements that I'd never encountered together:
- getting records in a random order
- paginating those records
The project was a recipe site. Showing the recipes in a random order meant that users saw new things on every visit.
But part of the site featured an "infinite scroll" component, where scrolling the page would load more and more recipes. If they were ordered randomly, this wouldn't work: users who scrolled down would be bound to see some of the same recipes repeated.
After some thought, I came up with a solution that worked well for us.
Use a shuffled (not random) ordering, based on a prime integer parameter.
SELECT whatever FROM table ORDER BY large_integer_field % prime_integer_parameter -- modulo to shuffle LIMIT 10 OFFSET 0;
OFFSET 10 for subsequent page requests.
This technique would be slow with large record sets. Our record set was small.
You probably know how to do pagination, but here's a quick refresher.
For the "first page", you fetch the data with a
SELECT id FROM recipes ORDER BY id LIMIT 2 OFFSET 0; id ---- 1 2
Subsequent pages can be fetched by adding
OFFSET [number of records from previous page(s)], which skips that many records.
SELECT id FROM recipes ORDER BY id LIMIT 2 OFFSET 2; -- skip the 2 records from page 1 id ---- 3 4 SELECT id FROM recipes ORDER BY id LIMIT 2 OFFSET 4; -- skip the 4 records from pages 1 and 2 id ---- 5 6
Ordering records randomly isn't hard, either.
A commonly recommended way is
ORDER BY random().
SELECT id FROM recipes ORDER BY random() DESC LIMIT 2; id ---- 52 79
For this query, PostgreSQL executes
random() once per row.
SELECT random(); random ------------------- 0.810276769567281
Whatever it returns is used as that row's sort value.
PostgreSQL also has a more efficient and explicit way to get random records;
SELECT id FROM users TABLESAMPLE SYSTEM (2) returns very approximately 2 records.
But neither of these queries supports pagination.
For pagination, we need a stable sort order that we can use with
ORDER BY RANDOM() won't work because it's not repeatable; subsequent queries that use an
OFFSET will be skip some records that weren't returned the first time and return some that were returned before.
TABLESAMPLE can take a random seed to make it repeatable, but
OFFSET just reduces the number of records returned.
The key to my solution was realizing that we didn't need a truly random order, just a shuffled one. We were trying to surprise our users, not encrypt sensitive data. And if we merely needed to shuffle the data, we could use a parameter to specify how it to shuffle it.
For this, modulo works quite well:
SELECT id FROM recipes ORDER BY id % 7, id -- the % is the crucial part LIMIT 3; id ------ 588 595 602
Records whose id divides evenly by
7 are returned before records where the division gives a remainder of
Humans aren't naturally good at spotting patterns like this, so it seems "random enough".
To ensure the order is stable, we add a secondary sort by
id; among records with the same remainder, the smallest
This kind of shuffling is stable: running the query again gets the same results.
OFFSET works fine:
SELECT id FROM recipes ORDER BY id % 7, id LIMIT 3 OFFSET 2; id ----- 602 -- last record from previous query; `OFFSET 2` created overlap 609 616
And changing the modulus gives us a very different result:
SELECT id FROM recipes ORDER BY id % 11, id -- mod 11 instead of 7 LIMIT 3 OFFSET 2; id ------ 605 616 649
What column? What modulus?
The key to the shuffling is the modulus (in the examples above,
How should we pick one?
- Prime numbers create the most "random-looking" results. (Many records that divide evenly by
2will also divide evenly by
4, so those two numbers yield similar shufflings; not so with
- Since we're calling
id % modulus, we'd like
modulusto be smaller than all
idvalues. That's because
1 % 7is
2 % 7is
2, etc. The modulus has no effect on any number smaller than itself, which means those small-value rows will be stick at the top of our sort.
In the case I dealt with, the
id values didn't start at
But if they do, we we can't meet the second criteria using
We could do better by converting a timestamp to an integer to get a large number, then applying the modulus.
SELECT id FROM recipes ORDER BY -- convert this timestamp to an integer CAST(extract(epoch from inserted_at) as integer) % 7 LIMIT 2; id ------ 1280 1320
Now the modulus will always be applied to a number larger than itself, so we won't have any rows stuck at the top.
What About Stability?
I've said this gives us a stable sort order. That's true, but only as long as no records are inserted or deleted during pagination.
If someone requests the first page, then a record is inserted, then they request the second page, it could be that the new record should have appeared on the first page. That will bump the last record of the first page onto the second page, where it will appear as a duplicate.
To prevent this, you could record the time that the first page was requested, and ensure that subsequent pages are queried with
WHERE inserted_at > first_page_query_time.
In that case, users will only see records that existed when they started paginating.
Deletes are harder: if a record from the first page is deleted, it will bump a second-page record back to the first page, and anyone who already requested the first page will miss it.
This problem nothing to do with shuffled order; we have the same problem with any ordered pagination.
If it really matters to you, you could use timestamped soft deletes (
SET deleted_at = NOW()) and include records that existed when pagination started.
On my project, inserts were rare, deletions were rarer, and if a record was skipped or duplicated, it was no big deal. So I ignored all this.
What about speed?
OFFSET as I've showed so far is inherently slow; all rows matching the
WHERE must be loaded and sorted using the
ORDER BY before some off them are skipped.
If you're sorting by an indexed column, like
id, you can avoid
OFFSET by doing something like
WHERE id > last_id_seen LIMIT 100; then only the needed records have to be loaded into memory. But this isn't possible when sorting by a calculated value like
id % 7.
One performant way to get a randomized, paginatable set of records would be to pre-calculate one or more randomized lists of ids, each list broken into page-length chunks. Eg, "in ordering 1, page 1 is the following ids...". Each time a page is requested, you load just the records
WHERE id IN (ids_for_this_page). This would be fast, but it would fall over if you needed to add any other
WHERE, which could eliminate some or all of the records on any given page.
The most robust strategy I can think of is fairly annoying:
- To enable shuffling records 3 different ways, add indexed columns like
- For each new row, populate each of those columns with a random, unique integer value.
- Query for records shuffled in order 1 like
ORDER BY order_1 WHERE order_1 > last_seen_order_1_value, and do likewise for the other possible orderings.
This would scale well to a large number of records, be paginatable, and work well with other
WHERE conditions. But it's annoying to set up and maintain.
In our scenario, the number of records was small and would grow slowly, so again, I ignored all this.
Putting it All Together
With all of this in mind, the strategy I landed on is as follows:
- Find a column that contains (or can be converted to) large integers.
- Make a list of the first few prime numbers. The longer the list, the more ways you can shuffle the records.
- When a "shuffled records" request comes in without a modulus value specified, choose one of from your list at random.
- Use the modulus to shuffle the records, using
ORDER BY integer_column % modulus
- When sending back the page 1 response, tell the requester "by the way, your shuffle value is
11; tell me that when you request the second page".
Voila! A shuffled, paginatable order.
This sorting operation is slightly more expensive because PostgreSQL has to do a bit more calculation per row.
But it's not much more, according to
I thought that adding an index per shuffle val like
CREATE INDEX inserted_at_as_integer_mod_7 ON recipes (mod(CAST(extract(epoch from inserted_at) as integer), 7)); would help, but it doesn't appear to.