Fast Full-text Search with Ecto and PostgreSQL

elixir ecto postgresql
Posted on: 2018-01-08

I recently built a full-text recipe search feature using Ecto and PostgreSQL for Mealthy.com. You can try it out there, or check out this quick demo video.

I thought this was interesting enough to write up (with Mealthy's permission).

This search feature replaced a simpler one, and needed to:

  1. Support substring matches. For example, "bu" should find recipes like "Vegan Butternut Squash Soup". This was because search could be used globally, or to filter a user's saved recipes as they typed a search term. It would be confusing if typing "bu" filtered out everything, but continuing to type "butternut" made some recipes reappear.
  2. Understand plurals. For example, searches for "strawberries" and "strawberry" should return the same results.
  3. Ignore accents. For example, a search for "sauté" vs "saute" should return the same results.
  4. Search tags. For example, searches for "gluten-free" should return recipes tagged as gluten free, whether or not the recipe title includes those words.
  5. Support ranking. For example, when searching for "strawberry", a recipe with "strawberry" in the title should be shown before a recipe that's merely tagged "strawberries" because that's one of its ingredients.
  6. Honor user preferences. For example, a vegetarian user searching for "burger" should have meaty results filtered out, leaving only matches like "Vegan Black Bean Burgers".
  7. Perform searches quickly. Everybody likes speed, but it was especially important here for the "search after every keystroke" use case. A performance hit when adding or updating recipes was OK, as this was done infrequently, and always by staff.

Substring matches were easy: a simple ILIKE query on recipe titles would do.

Accent-agnostic, pluralization-agnostic, multi-table search (recipes + tags) was slightly harder, and we discussed using something like Solr. But we were already using PostgreSQL as our primary data store, and its full-text search capabilities were enough for this task, which kept us from having to add another dependency.

If you don't already understand functions like to_tsvector, you should really read Postgres full-text search is Good Enough!. It's a fantastic post, both as a tutorial and a reference. For this post, I'll assume you know the relevant PostgreSQL functions.

Honoring user preferences meant, in our context, performing the full-text search as a chainable Ecto query, so that other query conditions like "must be vegetarian" could also be applied. In other words, the full-text search function takes a %Ecto.Query{} and returns a modified one.

After getting search working, I noticed that it was rather slow, so to speed it up I added a materialized view and triggers to update that view as necessary. There may be a better way; I'll cover that at the end.

Elixir Code

First off, let's take a look at the Elixir code behind recipe search. Searching for "thai chicken" on the site ends up calling RecipeSearch.run(query, "thai chicken"), where query is a base %Ecto.Query{} that needs narrowing to match the term. I'll comment on the code below.

defmodule RecipeSearch do
  import Ecto.Query

  def run(query, search_string) do
    _run(query, normalize(search_string))
  end

  defmacro matching_recipe_ids_and_ranks(search_string) do
    quote do
      fragment(
        """
        SELECT recipe_search.id AS id,
        ts_rank(
          recipe_search.document, plainto_tsquery(unaccent(?))
        ) AS rank
        FROM recipe_search
        WHERE recipe_search.document @@ plainto_tsquery(unaccent(?))
        OR recipe_search.title ILIKE ?
        """,
        ^unquote(search_string),
        ^unquote(search_string),
        ^"%#{unquote(search_string)}%"
      )
    end
  end

  defp _run(query, ""), do: query
  defp _run(query, search_string) do
    from recipe in query,
      join: id_and_rank in matching_recipe_ids_and_ranks(search_string),
      on: id_and_rank.id == recipe.id,
      order_by: [desc: id_and_rank.rank]
  end

  defp normalize(search_string) do
    search_string
    |> String.downcase
    |> String.replace("gluten free", "gluten-free")
    |> String.replace("dairy free", "dairy-free")
    |> String.replace("sugar free", "sugar-free")
    |> String.replace("meatless", "no-meat")
    |> String.replace("vegetarian", "no-meat")
    |> String.replace(~r/\n/, " ")
    |> String.replace(~r/\t/, " ")
    |> String.replace(~r/\s{2,}/, " ")
    |> String.trim
  end

end

The run/2 function normalizes the search string (which involves typical things like discarding extra white space, as well as domain-specific things like treating "gluten free" as "gluten-free"), then joins the subquery.

The macro, matching_recipe_ids_and_ranks/1, builds a subquery to search the recipe_search view for matching recipes and return their recipe ids and ranks. It takes care of unaccenting the term and looking for a full-text match against the view's document column, as well as running an alternative ILIKE search on the title for substring matches. It builds the query with plainto_tsquery, which means that special characters like @ are ignored and a string like "thai chicken becomes thai & chicken - that is, all words must match.

Initially, I had the subquery return only the ids of matching recipes and did something like WHERE id IN subquery, but this didn't let me make use of the subquery's rank. Joining the subquery allowed me to ORDER BY the subquery's rank, so that "better" matches for a term could appear first.

The Database View

The subquery selects FROM recipe_search. This is a view that pulls together information from recipes and tags. As I'm about to show, it's a materialized view because that proved to be faster, but for all this module knows, it could be a regular view which simply encapsulates building the document we're searching.

I created recipe_search with migration code like this:

execute(
  """
  CREATE EXTENSION IF NOT EXISTS unaccent
  """
)

execute(
  """
  CREATE MATERIALIZED VIEW recipe_search AS
  SELECT
    recipes.id AS id,
    recipes.title AS title,
    (
    setweight(to_tsvector(unaccent(recipes.title)), 'A') ||
    setweight(to_tsvector(unaccent(coalesce(string_agg(tags.value_slug, ' '), ' '))), 'B')
    ) AS document
  FROM recipes
  LEFT JOIN recipes_tags ON recipes_tags.recipe_id = recipes.id
  LEFT JOIN tags ON tags.id = recipes_tags.tag_id
  GROUP BY recipes.id
  """
)
# to support full-text searches
create index("recipe_search", ["document"], using: :gin)

# to support substring title matches with ILIKE
execute("CREATE INDEX recipe_search_title_trgm_index ON recipe_search USING gin (title gin_trgm_ops)")

# to support updating CONCURRENTLY
create unique_index("recipe_search", [:id])

Essentially, this says:

  • Create a materialized view with one row for each row in recipes
  • For each row, join all the tags for that recipe into one big string
  • Squish the title and tags together into one document, weighting title as more important than tags for search purposes
  • Add some indexes to speed up searches

Using a materialized view means doing work in advance that would otherwise have to be done at query time, and in my testing, it sped up searches from about 60ms to 6ms; I would expect a larger difference as the number of recipes increases.

However, materialized views have a downside: they must be explicitly refreshed, or else searches will return outdated results.

Refreshing the Materialized View

We debated refreshing recipe_search on a timer, but that seemed wrong in a couple of ways.

First, because there will often be no reason to refresh; nothing has changed. That's wasted effort.

Second, because when something does change, we'd like the refresh to happen immediately, not whenever the timer goes off again.

After some thought, I set this up using triggers. These tell the database that whenever recipes, or tags, or recipes_tags changes, recipes_search needs to be refreshed.

Here's the Ecto migration code:

execute(
  """
  CREATE OR REPLACE FUNCTION refresh_recipe_search()
  RETURNS TRIGGER LANGUAGE plpgsql
  AS $$
  BEGIN
  REFRESH MATERIALIZED VIEW CONCURRENTLY recipe_search;
  RETURN NULL;
  END $$;
  """
)

execute(
  """
  CREATE TRIGGER refresh_recipe_search
  AFTER INSERT OR UPDATE OR DELETE OR TRUNCATE
  ON recipes
  FOR EACH STATEMENT
  EXECUTE PROCEDURE refresh_recipe_search();
  """
)

execute(
  """
  CREATE TRIGGER refresh_recipe_search
  AFTER INSERT OR UPDATE OR DELETE OR TRUNCATE
  ON recipes_tags
  FOR EACH STATEMENT
  EXECUTE PROCEDURE refresh_recipe_search();
  """
)

execute(
  """
  CREATE TRIGGER refresh_recipe_search
  AFTER INSERT OR UPDATE OR DELETE OR TRUNCATE
  ON tags
  FOR EACH STATEMENT
  EXECUTE PROCEDURE refresh_recipe_search();
  """
)

A Possible Improvement

The code here seems to work quite well. The remaining downside is that recalculating the entire materialized view when only one row needs to change is a bit wasteful.

This doesn't seem likely to become a problem anytime soon. However, Rob Sullivan (by way of John Mosesman) suggested a way to fix it if it does: instead of a materialized view, make recipe_search be a regular table. And instead of rebuilding the whole thing when recipes or tags change, just update the recipe_search rows for specific recipes that need it.

I like this idea and will keep it in mind for the next time I build a feature like this.

Update 2020-07-08

I just found a post which references this one and makes some improvements on my approach.