Speed up Multitenant Migrations with Database Schema Snapshots

DEPLOY ECTO POSTGRESQL

Ecto Migrations are par for the course when building a database structure in an Elixir/Phoenix application. Managing single-tenant migrations is straightforward: they run once in production for all customers.

Multitenant solutions are a bit more tricky to handle. Onboarding a tenant requires all migrations to be run on demand. An extensive suite of migrations can incur an undesirable performance hiccup just when you seek to make a solid first impression.
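
For context, the conventional onboarding path replays the full migration history into the new tenant's schema. A baseline sketch, assuming schema-per-tenant prefixes and hypothetical names for the repo and tenant:

# Baseline onboarding: run every migration from scratch for the new tenant.
# With a long migration history, this is the slow step we want to avoid.
Ecto.Migrator.run(MyApp.Repo, :up, all: true, prefix: "tenant_acme")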

Today I will showcase how to speed up tenant migrations using database snapshots and explain how the solution was almost derailed at the eleventh hour, a fact best echoed by this quote:

Sometimes you get the bear. Other times, the bear gets you! -- Star Trek: The Next Generation

The Approach

We will take a snapshot of the database structure from a schema dump and load it to onboard each new tenant.

A quick inventory of our tool chest shows that Mix can readily dump and load a database schema.

mix ecto.dump
mix ecto.load

At first glance, this seems to be all that is needed.

Not so fast!

The goal is to create a production-caliber solution that supports Elixir Releases. As we will see shortly, this requirement disqualifies mix ecto.load. But have no fear. We will show how you can maneuver around this restriction.

Dump

Let’s first dump the database structure for a given tenant by running mix ecto.dump in our development environment.

Mix uses pg_dump under the covers to export the schema structure from PostgreSQL. This command can be run directly with additional options if you desire.

pg_dump -d <database> --schema=<schema> -s > structure.sql

Next, we need to search the dump and replace every concrete schema reference with the placeholder #{prefix}. The reason for this step will become clear shortly.
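
If you prefer to script the substitution, a one-off replace will do. A minimal sketch, assuming the dump was taken from a hypothetical schema named tenant_template:

# Swap the concrete schema name for the #{prefix} placeholder.
# "tenant_template" is a stand-in for whatever schema you dumped.
dump = File.read!("structure.sql")
File.write!("structure.sql", String.replace(dump, "tenant_template", "\#{prefix}"))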

An abbreviated view of the tenant structure is provided.

--
-- PostgreSQL database dump
--

--
-- Name: #{prefix}; Type: SCHEMA; Schema: -
--

CREATE SCHEMA #{prefix};

--
-- Name: users; Type: TABLE; Schema: #{prefix}
--

CREATE TABLE #{prefix}.users (
    id bigint NOT NULL,
    username character varying(50) NOT NULL,
    password character varying(255) NOT NULL,
    inserted_at timestamp(0) without time zone NOT NULL,
    updated_at timestamp(0) without time zone NOT NULL
);

--
-- PostgreSQL database dump complete
--

Readability is unnecessary for programmatic workflows, so let us minify the script by running a custom function.

def minify() do
  {:ok, file} = File.open("structure.sql", [:read])

  content =
    file
    |> read_line()
    |> process_line(file, [])
    |> Enum.join(" ")

  file |> File.close()

  File.write("structure.min.sql", content)
end

defp read_line(file), do: file |> IO.read(:line)

defp process_line(:eof, _, content), do: content

defp process_line({:error, _} = failure, _, _),
  do: raise("Encountered error reading file: #{inspect(failure)}")

defp process_line("--" <> _, file, content),
  do: file |> read_line() |> process_line(file, content)

defp process_line("\n" <> _, file, content),
  do: file |> read_line() |> process_line(file, content)

defp process_line(line, file, content) when is_bitstring(line) do
  minified_line =
    line
    |> String.replace_trailing("\n", "")
    |> String.replace("  ", " ")
    |> String.trim_trailing()

  file
  |> read_line()
  |> process_line(file, content ++ [minified_line])
end

This function removes all comments, empty lines, and double spaces from the file and outputs the minified result to a new file.

Load

As stated earlier, we cannot use Mix in production to run the load step. However, we can peek under the covers and replicate the essence of the Mix command.

Mix’s ecto.load uses the System module in Elixir to do its work. In particular, it uses the cmd function to execute a psql script to load the schema.

First, create a module with a function to hold the contents of the minified script. We take this step to avoid reading the templated structure from the file system each time.

defmodule Migrator do
  # The minified snapshot acts as a template; the tenant's schema name is
  # interpolated for #{prefix} when the returned function is invoked.
  defp snapshot do
    fn prefix ->
      """
      CREATE SCHEMA #{prefix}; CREATE TABLE #{prefix}.users (
       id bigint NOT NULL, username character varying(50) NOT NULL,
       password character varying(255) NOT NULL, inserted_at timestamp(0)
       without time zone NOT NULL, updated_at timestamp(0)
       without time zone NOT NULL);
      """
    end
  end
end

Next, add a function to create a schema version for a specific tenant and persist the result to a temporary file.

def copy(prefix) do
  schema =
    snapshot()
    |> apply([prefix])

  path = "/tmp/#{prefix}.sql"

  {File.write(path, schema), path}
end

Lastly, add the means to execute the load operation via psql.

def create_schema_from_snapshot!(prefix) do
  with {:ok, path} <- copy(prefix),
       {_, 0} <- run_psql_cmd(path) do
    :ok
  else
    error ->
      raise inspect(error)
  end
end

defp run_psql_cmd(path) do
  config = Repo.config()

  args = [
    "--quiet",
    "--file",
    path,
    "-vON_ERROR_STOP=1",
    "--single-transaction"
  ]

  env = [
    {"PGHOST", config[:hostname]},
    {"PGPORT", get_port(config[:port])},
    {"PGDATABASE", config[:database]},
    {"PGUSER", config[:username]},
    {"PGPASSWORD", config[:password]},
    {"PGSSLMODE", config[:sslmode]}
  ]

  System.cmd("psql", args, env: env, stderr_to_stdout: true)
end

defp get_port(value) do
  case value do
    value when is_integer(value) -> Integer.to_string(value)
    value when is_bitstring(value) -> value
    _ -> ""
  end
end

What is significant here is that psql opens its own database connection, using the configuration of our main application Repo, for each tenant that is onboarded.

The port must be converted to a string to avoid the "all arguments for System.cmd/3 must be binaries" argument error.
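
With these pieces in place, onboarding boils down to a single call. A usage sketch with a hypothetical tenant name:

# Create and populate the tenant's schema from the snapshot in one shot.
:ok = Migrator.create_schema_from_snapshot!("tenant_acme")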

Now you may ask: why not use a psql variable to set the prefix at execution time and bypass the need to write a file to disk for each tenant?

Good question indeed, as we wondered the same!

While you may set this argument as "-vPREFIX=first_tenant", only top-level references are substituted. Prefix placeholders embedded inside the body of blocks, such as join conditions inside a function or view, remain unchanged.

CREATE MATERIALIZED VIEW first_tenant.view1 AS
 SELECT ...
 FROM PREFIX.table1
 JOIN PREFIX.table2 ON ...
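
For completeness, the rejected variant would have passed the prefix as a psql variable, with the script referencing it as :PREFIX. A sketch rather than our final code:

# Rejected approach: hand the prefix to psql as a variable instead of
# templating the file; placeholders inside block bodies are not substituted.
args = [
  "--quiet",
  "--file",
  "structure.min.sql",
  "-vPREFIX=first_tenant",
  "-vON_ERROR_STOP=1"
]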

Production Database SSL Considerations

So should you go this route in your application?

My answer is a resounding yes, provided that you can invoke psql with System.cmd in production.

I was confident proceeding with this approach as all lights were green in my development and preview environments.

Who knew that a nasty surprise was about to rear its head?

My production database cluster uses SSL, and it turns out that System.cmd does not play well with this configuration.

Ecto’s ability to seamlessly connect to the database with a single sslmode=require setting fostered a false sense of security. A direct connection to the database via psql in the terminal further reinforced my confidence.

However, all attempts by System.cmd to shell out to psql from Elixir failed miserably in this single mode configuration.

Adding the server’s root CA and a client-side cert and key did not improve the situation.

env = env ++ [
    {"PGSSLROOTCERT", "root.crt"},
    {"PGSSLCERT", "client-cert.pem"},
    {"PGSSLKEY", "client-key.pem"}
]

System.cmd uses the Port module to start operating system processes external to the Erlang VM. This separation appears to be rife with certificate permission issues, unknown CA failures, and a total disregard for the SSL mode.

Typical Failures

* FATAL:  SSL required
* Could not open certificate file ~/.postgresql/postgresql.crt: Permission denied
* SSL error: tlsv1 alert unknown ca

The purpose of the chosen SSL mode is to instruct the TLS handshake to skip certificate verification, yet every attempt through System.cmd triggered this validation.

After spinning my wheels for several hectic days and engaging Digital Ocean support and ElixirForum, I was ready to concede the fight and fall back to migrations.

But then, voila!

In one of a thousand attempts, I switched the specification of the root CA to the PGSSLCERT key and accidentally forgot to type the .crt suffix.

Yes, Jimmy, the bear did not get me on this day.

  env = [
    {"PGHOST", config[:hostname]},
    {"PGPORT", get_port(config[:port])},
    {"PGDATABASE", config[:database]},
    {"PGUSER", config[:username]},
    {"PGPASSWORD", config[:password]},
    {"PGSSLCERT", "root"}
  ]

I wish to claim an understanding of why this configuration worked, but your guess is as good as mine, and honestly, I am just thrilled that the challenge is behind me.

I still believe that the three credentials are the correct approach and would have worked had I been able to access the server key to create a proper signing request for the client cert.

It would be great if Digital Ocean provided a self-service option to generate the client certificate and key on demand, as does Google Cloud.

Summary

We have demonstrated how to leverage psql commands to dump the database structure and quickly load it for each tenant that onboards, in a manner that lets us use this workflow in production with Elixir Releases.

New migrations are applied on top of the snapshot to keep each tenant updated. As these migrations accumulate, all that is needed is to update the snapshot and repeat the process.
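
A sketch of that catch-up step, assuming a standard Ecto setup, a hypothetical repo name, and that the snapshot's migration versions are already recorded in schema_migrations:

# After loading the snapshot, apply only the migrations added since it was taken.
Ecto.Migrator.run(MyApp.Repo, :up, all: true, prefix: prefix)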

The one wrinkle in this picture was the monumental effort needed to open the lines of communication between System.cmd and psql in the presence of SSL.

Your feedback is welcomed as always.

Michael

