I recently had to combine two requirements that I'd never encountered together:
- getting records in a random order
- paginating those records
The project was a recipe site. Showing the recipes in a random order meant that users saw new things on every visit.
But part of the site featured an "infinite scroll" component, where scrolling the page would load more and more recipes. If they were ordered randomly, this wouldn't work: users who scrolled down would be bound to see some of the same recipes repeated.
After some thought, I came up with a solution that worked well for us.
Use a shuffled (not random) ordering, based on a prime integer parameter.
```sql
SELECT whatever
FROM table
ORDER BY large_integer_field % prime_integer_parameter -- modulo to shuffle
LIMIT 10 OFFSET 0;
```
Increase the `OFFSET` (10, 20, and so on) for subsequent page requests.
This technique would be slow with large record sets. Our record set was small.
You probably know how to do pagination, but here's a quick refresher.
For the "first page", you fetch the data with a
```sql
SELECT id FROM recipes ORDER BY id LIMIT 2 OFFSET 0;

 id
----
  1
  2
```
Subsequent pages can be fetched by adding `OFFSET [number of records from previous page(s)]`, which skips that many records.
```sql
SELECT id FROM recipes ORDER BY id LIMIT 2 OFFSET 2; -- skip the 2 records from page 1

 id
----
  3
  4

SELECT id FROM recipes ORDER BY id LIMIT 2 OFFSET 4; -- skip the 4 records from pages 1 and 2

 id
----
  5
  6
```
Ordering records randomly isn't hard, either.
A commonly recommended way is `ORDER BY random()`:
```sql
SELECT id FROM recipes ORDER BY random() DESC LIMIT 2;

 id
----
 52
 79
```
For this query, PostgreSQL executes `random()` once per row.
```sql
SELECT random();

      random
-------------------
 0.810276769567281
```
Whatever it returns is used as that row's sort value.
PostgreSQL also has a more efficient and explicit way to get random records: `SELECT id FROM recipes TABLESAMPLE SYSTEM (2)` returns a random sample of roughly 2 percent of the table's rows.
But neither of these queries supports pagination.
For pagination, we need a stable sort order that we can use with `OFFSET`.
`ORDER BY random()` won't work because it's not repeatable; subsequent queries that use an `OFFSET` will skip some records that weren't returned the first time and return some that were returned before.
`TABLESAMPLE` can take a seed (via `REPEATABLE`) to make it repeatable, but adding an `OFFSET` just reduces the number of records returned from the same sample.
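For illustration, here's what a repeatable sample looks like; the seed `42` is an arbitrary choice:

```sql
-- Same seed, same sample: this returns the same rows each run
-- (as long as the table doesn't change).
SELECT id FROM recipes TABLESAMPLE SYSTEM (2) REPEATABLE (42);

-- OFFSET just discards rows from that same small sample;
-- it never reaches the rows the sample excluded.
SELECT id FROM recipes TABLESAMPLE SYSTEM (2) REPEATABLE (42) OFFSET 10;
```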
The key to my solution was realizing that we didn't need a truly random order, just a shuffled one. We were trying to surprise our users, not encrypt sensitive data. And if we merely needed to shuffle the data, we could use a parameter to specify how to shuffle it.
For this, modulo works quite well:
```sql
SELECT id FROM recipes
ORDER BY id % 7, id -- the % is the crucial part
LIMIT 3;

  id
------
  588
  595
  602
```
Records whose `id` divides evenly by `7` are returned before records where the division gives a remainder of `1`, then those with a remainder of `2`, and so on.
Humans aren't naturally good at spotting patterns like this, so it seems "random enough".
To ensure the order is stable, we add a secondary sort by `id`; among records with the same remainder, the smallest `id` comes first.
This kind of shuffling is stable: running the query again gets the same results.
OFFSET works fine:
```sql
SELECT id FROM recipes
ORDER BY id % 7, id
LIMIT 3 OFFSET 2;

 id
-----
 602 -- last record from previous query; `OFFSET 2` created overlap
 609
 616
```
And changing the modulus gives us a very different result:
```sql
SELECT id FROM recipes
ORDER BY id % 11, id -- mod 11 instead of 7
LIMIT 3 OFFSET 2;

  id
------
  605
  616
  649
```
What column? What modulus?
The key to the shuffling is the modulus (in the examples above, `7` and `11`). How should we pick one?
- Prime numbers create the most "random-looking" results. (Many records that divide evenly by `2` will also divide evenly by `4`, so those two moduli yield similar shufflings; not so with primes.)
- Since we're calculating `id % modulus`, we'd like `modulus` to be smaller than all `id` values. That's because `1 % 7` is `1`, `2 % 7` is `2`, etc. The modulus has no effect on any number smaller than itself, which means those small-value rows will be stuck at the top of our sort.
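You can see that second point directly; numbers smaller than the modulus come through unchanged:

```sql
SELECT 1 % 7 AS a, 2 % 7 AS b, 6 % 7 AS c, 7 % 7 AS d, 8 % 7 AS e;

 a | b | c | d | e
---+---+---+---+---
 1 | 2 | 6 | 0 | 1
```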
In the case I dealt with, the `id` values didn't start at `1`. But if they do, we can't meet the second criterion using `id`.
We could do better by converting a timestamp to an integer to get a large number, then applying the modulus.
```sql
SELECT id FROM recipes
ORDER BY
  -- convert this timestamp to an integer
  CAST(extract(epoch from inserted_at) as integer) % 7
LIMIT 2;

  id
------
 1280
 1320
```
Now the modulus will always be applied to a number larger than itself, so we won't have any rows stuck at the top.
What About Stability?
I've said this gives us a stable sort order. That's true, but only as long as no records are inserted or deleted during pagination.
If someone requests the first page, then a record is inserted, then they request the second page, it could be that the new record should have appeared on the first page. That will bump the last record of the first page onto the second page, where it will appear as a duplicate.
To prevent this, you could record the time that the first page was requested and ensure that subsequent pages are queried with `WHERE inserted_at <= first_page_query_time`.
That way, users will only see records that existed when they started paginating.
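Sketched as a query, using a hypothetical `:first_page_query_time` parameter for the captured timestamp:

```sql
-- Page 2, restricted to records that existed when page 1 was served.
SELECT id FROM recipes
WHERE inserted_at <= :first_page_query_time
ORDER BY id % 7, id
LIMIT 3 OFFSET 3;
```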
Deletes are harder: if a record from the first page is deleted, it will bump a second-page record back to the first page, and anyone who already requested the first page will miss it. I don't know a solution for this, but it's got nothing to do with shuffled order; we have the same problem with any ordered pagination.
On my project, inserts were rare, deletions were rarer, and if a record was skipped or duplicated, it was no big deal. So I ignored all this.
Putting it All Together
With all of this in place, the strategy is as follows:
- Find a column that contains (or can be converted to) large integers.
- Make a list of the first few prime numbers. The longer the list, the more ways you can shuffle the records.
- When a "shuffled records" request comes in without a modulus value specified, choose one of from your list at random.
- Use that modulus to shuffle the records, via `ORDER BY integer_column % modulus` (sketched below).
- When sending back the page 1 response, tell the requester "by the way, your shuffle value is `11`; tell me that when you request the second page".
Voila! A shuffled, paginatable order.
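Here's a minimal sketch of those steps in SQL. The list of primes and the page size of 10 are arbitrary choices, and in practice the application would pick the prime and echo it back to the client, but the picking can be illustrated in SQL too:

```sql
-- Pick a shuffle value at random from a small list of primes
-- (the application would remember this and send it to the client).
SELECT (ARRAY[5, 7, 11, 13, 17])[1 + floor(random() * 5)::int] AS shuffle_value;

-- Page 1, supposing the chosen value was 11.
SELECT id FROM recipes
ORDER BY CAST(extract(epoch from inserted_at) as integer) % 11, id
LIMIT 10 OFFSET 0;

-- Page 2: the client sent shuffle value 11 back, so the order is identical.
SELECT id FROM recipes
ORDER BY CAST(extract(epoch from inserted_at) as integer) % 11, id
LIMIT 10 OFFSET 10;
```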
This sorting operation is slightly more expensive, because PostgreSQL has to do a bit more calculation per row, but not by much.
I thought that adding an index per shuffle value, like this one, would help, but it doesn't appear to:

```sql
CREATE INDEX inserted_at_as_integer_mod_7
ON recipes (mod(CAST(extract(epoch from inserted_at) as integer), 7));
```
If it really matters to you, you could use timestamped soft deletes (`SET deleted_at = NOW()`) and include only records that existed when pagination started.
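A sketch of what that could look like, reusing the hypothetical `:first_page_query_time` parameter from above (the `id` here is arbitrary):

```sql
-- Soft-delete a record instead of removing its row.
UPDATE recipes SET deleted_at = NOW() WHERE id = 42;

-- Paginate over records that existed, and weren't yet deleted,
-- when the first page was requested.
SELECT id FROM recipes
WHERE inserted_at <= :first_page_query_time
  AND (deleted_at IS NULL OR deleted_at > :first_page_query_time)
ORDER BY id % 7, id
LIMIT 3 OFFSET 3;
```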