
Durable Execution Explained — The Pattern That Makes Your SaaS Actually Work

Durable execution is the difference between a SaaS that silently drops data during a restart and one that picks up exactly where it left off. Here's what it is, why most implementations get it wrong, and how to build it without overengineering.

06 Jun 2026 · 9 min read · Ankur

Every SaaS application has background work. Send a welcome email. Generate an invoice PDF. Run a weekly report. Sync data to an external API. These jobs take seconds to minutes, sometimes hours. And they fail — the database restarts midway, the API returns a 503, the server runs out of memory.

Durable execution is the guarantee that your background work completes correctly even when things fail. Not "retry from scratch and hope for the best." Not "check the logs and manually fix the half-processed data." But: the system remembers exactly where it was, resumes from that point, and finishes the job. No duplicates, no dropped work, no manual cleanup.

Most SaaS teams implement this badly. They build something that works 99% of the time and spend their on-call hours cleaning up the 1%. Here's what durable execution actually requires, the patterns that work, and the ones that don't.

The Four Properties of Durable Execution

A durable execution system provides four guarantees. Miss any one of them, and you don't have durable execution — you have a wish.

  • Exactly-once semantics: Each step runs to completion at least once, but its effects appear exactly once. No duplicate emails, no double-charged invoices.
  • Progress checkpointing: After each step completes, its result is persisted. A crash doesn't mean restarting from the beginning.
  • Automatic retry: Transient failures (network blips, API timeouts) trigger retries without human intervention. Permanent failures surface cleanly.
  • Observability: You can answer "what's running right now?" and "why did job #8472 fail?" from a query, not from grep'ing logs across five machines.

These aren't nice-to-haves. If your payment processing pipeline doesn't have exactly-once semantics, you will double-charge someone someday. If your data export pipeline doesn't checkpoint, a crash at 95% completion means restarting an 8-hour job. These are production incidents waiting to happen.
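The automatic-retry property can be sketched in a few lines. This is a minimal illustration, not a production retry library: the `TransientError` and `PermanentError` classes, the `run_with_retry` helper, and the `flaky_sync` step are all hypothetical names invented for this example.

```python
import random
import time


class TransientError(Exception):
    """Failure worth retrying, e.g. a timeout or an HTTP 503."""


class PermanentError(Exception):
    """Failure that retrying cannot fix, e.g. a malformed payload."""


def run_with_retry(step, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call step() until it succeeds, retrying only transient failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure
            # Exponential backoff with jitter: wait 0..base_delay * 2^(attempt-1) seconds.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
        # PermanentError (and anything else) is not caught, so it propagates at once.


calls = {"n": 0}


def flaky_sync():
    """Hypothetical step that fails twice with a 503, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503 from upstream")
    return "synced"


# Pass a no-op sleep so the demo does not actually wait.
result = run_with_retry(flaky_sync, sleep=lambda _: None)
```

The point of the two exception types is that the caller decides up front which failures are worth retrying. Anything not explicitly marked transient fails fast instead of burning retry attempts.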

Pattern 1: The Jobs Table (What Most Teams Build)

The instinctive approach: a background_jobs table with status, attempts, last_error, and a worker process that polls it.

CREATE TABLE background_jobs (
  id UUID PRIMARY KEY,
  job_type TEXT NOT NULL,
  payload JSONB,
  status TEXT DEFAULT 'pending',  -- pending, running, done, failed
  attempts INTEGER DEFAULT 0,
  last_error TEXT,
  created_at TIMESTAMPTZ DEFAULT now()
);
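A polling worker for this table might look roughly like the sketch below. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs anywhere, with a trimmed copy of the schema. The `claim_next_job` and `finish_job` helpers are hypothetical names. On PostgreSQL the claim is usually written with `SELECT ... FOR UPDATE SKIP LOCKED` instead of the guarded `UPDATE` shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE background_jobs (
        id TEXT PRIMARY KEY,
        job_type TEXT NOT NULL,
        payload TEXT,
        status TEXT DEFAULT 'pending',
        attempts INTEGER DEFAULT 0,
        last_error TEXT
    )
""")
conn.execute(
    "INSERT INTO background_jobs (id, job_type) VALUES ('job-1', 'welcome_email')"
)
conn.commit()


def claim_next_job(conn):
    """Move one pending job to 'running'; return its id, or None if there is none."""
    row = conn.execute(
        "SELECT id FROM background_jobs WHERE status = 'pending' "
        "ORDER BY rowid LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The status guard means the claim fails cleanly if another worker won the race.
    cur = conn.execute(
        "UPDATE background_jobs SET status = 'running', attempts = attempts + 1 "
        "WHERE id = ? AND status = 'pending'",
        (row[0],),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None


def finish_job(conn, job_id, error=None):
    """Record the outcome: 'done' on success, 'failed' plus the error text otherwise."""
    conn.execute(
        "UPDATE background_jobs SET status = ?, last_error = ? WHERE id = ?",
        ("failed" if error else "done", error, job_id),
    )
    conn.commit()


job_id = claim_next_job(conn)        # claims 'job-1'
finish_job(conn, job_id)
second_claim = claim_next_job(conn)  # nothing left to claim
```

Note that the claim is committed before the work starts, so no transaction stays open while the job runs.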

This gets you 80% of the way there. It handles retries. It gives you visibility. But it fails on the hard cases:

  • Mid-execution crashes. If the job's writes are not wrapped in a single transaction and your worker dies between updating two rows, you've committed half the work. Restarting means figuring out which rows were touched. Most teams don't build this recovery logic; they accept the risk and deal with it manually.

  • Exactly-once semantics. If your worker sends an email, crashes before updating the job status, restarts, and sends the email again — congratulations, your customer got two welcome emails. Idempotency keys help, but most teams don't implement them until after the first duplicate.

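  • Long-running jobs blocking connections. If the worker keeps a transaction open while it works (for example, to hold a row lock on the job), a job that takes 30 minutes holds a database connection for 30 minutes. Connection pool exhaustion during a batch is a classic production incident.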
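The duplicate-email case above can be illustrated with an idempotency key. This sketch assumes the downstream service deduplicates on a key supplied by the caller. `FakeEmailProvider` and `welcome_email_job` are hypothetical stand-ins invented for the example, not a real email API.

```python
class FakeEmailProvider:
    """Hypothetical stand-in for an email API that honours idempotency keys."""

    def __init__(self):
        self.delivered = {}  # idempotency key -> recipient

    def send(self, idempotency_key, to):
        # A repeated key is acknowledged but not delivered a second time.
        if idempotency_key in self.delivered:
            return "duplicate-ignored"
        self.delivered[idempotency_key] = to
        return "sent"


def welcome_email_job(provider, job_id, to, crash_before_commit=False):
    """Send the email, then (in a real worker) mark the job as done."""
    # The key is derived from the job id, so every retry reuses the same key.
    outcome = provider.send(f"{job_id}:welcome_email", to)
    if crash_before_commit:
        raise RuntimeError("worker died before updating job status")
    return outcome


provider = FakeEmailProvider()

# First attempt: the email goes out, then the worker crashes.
try:
    welcome_email_job(provider, "job-1", "a@example.com", crash_before_commit=True)
except RuntimeError:
    pass  # the job row still says 'running', so it will be retried

# Retry: same key, so the provider does not deliver a second email.
retry_outcome = welcome_email_job(provider, "job-1", "a@example.com")
```

The key has to be stable across retries. Deriving it from the job id works; generating a fresh random key on each attempt would defeat the purpose.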

💡 Key Insight: The jobs table pattern isn't wrong; it's incomplete. The missing piece is checkpointing inside the job, not just before and after. Without checkpoints, a crash anywhere in the job body means the whole job reruns. With checkpoints, it picks up where it left off.
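Checkpointing inside the job can be sketched as follows. This is a minimal illustration under assumptions: sqlite3 stands in for PostgreSQL, the `job_progress` table and `run_job` helper are hypothetical, and progress is stored as JSON text where PostgreSQL would use a JSONB column.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_progress (job_id TEXT PRIMARY KEY, progress TEXT NOT NULL)"
)

executed = []  # records which steps actually ran, for the demo only


def run_job(conn, job_id, steps, crash_after=None):
    """Run named steps in order, persisting each result before moving on."""
    row = conn.execute(
        "SELECT progress FROM job_progress WHERE job_id = ?", (job_id,)
    ).fetchone()
    progress = json.loads(row[0]) if row else {}
    for name, fn in steps:
        if name in progress:
            continue  # checkpoint exists: skip the step and reuse its result
        progress[name] = fn(progress)
        executed.append(name)
        conn.execute(
            "INSERT OR REPLACE INTO job_progress (job_id, progress) VALUES (?, ?)",
            (job_id, json.dumps(progress)),
        )
        conn.commit()
        if name == crash_after:
            raise RuntimeError(f"simulated crash after {name}")
    return progress


steps = [
    ("fetch", lambda p: [1, 2, 3]),
    ("total", lambda p: sum(p["fetch"])),
    ("report", lambda p: f"total={p['total']}"),
]

# First run crashes right after the 'fetch' checkpoint is written.
try:
    run_job(conn, "export-1", steps, crash_after="fetch")
except RuntimeError:
    pass

# Second run loads the checkpoint and resumes at 'total'; 'fetch' is not repeated.
final = run_job(conn, "export-1", steps)
```

A crash between a step finishing and its checkpoint being committed still reruns that one step, so individual steps should remain idempotent even with checkpoints in place.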

Pattern 2: External Orchestrators (Temporal, Airflow, Step Functions)

The opposite extreme: hand your workflows to a purpose-built orchestrator. Temporal workers execute workflows written in Go, TypeScript, Java, Python, and several other languages. Airflow runs Python DAGs on a scheduler backed by a metadata database. AWS Step Functions gives you a managed state machine defined in JSON (Amazon States Language).

These systems are powerful. Temporal in particular has solved the hard distributed systems problems: workflow state that survives crashes, retry with backoff, workflow versioning, and signals for external events. Its activities can still run more than once after a failure, so side effects need to be idempotent. It's the gold standard for durable execution at scale.

The cost: operational complexity.

| Orchestrator | What You Get | What You Pay |
| --- | --- | --- |
| Temporal | General-purpose durable execution, SDKs in 4+ languages, workflow versioning, external event signals | Temporal server cluster, worker deployment, SDK integration, operational monitoring for the orchestrator itself |
| Apache Airflow | Python DAGs, rich scheduling, extensive provider ecosystem, web UI | Scheduler, web server, metadata database, worker pool, DAG versioning headaches, notorious for operational drift |
| AWS Step Functions | Fully managed, JSON state machine, native AWS integrations | Vendor lock-in, JSON-only workflow definitions, AWS-specific knowledge, cost at scale |
| pg_durable (new) | SQL-native workflows, automatic checkpointing, zero extra infrastructure | PostgreSQL extension required, SQL-shaped workflows (arbitrary code goes behind HTTP endpoints or SQL functions), v0.x maturity |

The decision framework is simple: if your workflows are mostly SQL with some HTTP calls, and your state lives in PostgreSQL, you don't need an external orchestrator. You need checkpointing inside the database. pg_durable or a well-implemented jobs table with explicit checkpointing gets you there.

If your workflows span multiple services, involve long-running human approvals, or need polyglot SDK support, Temporal is the right call. Accept the operational cost and move on.

Pattern 3: Event Sourcing (The Nuclear Option)

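Event sourcing stores every state change as an immutable event. Replaying events reconstructs current state. Recovery comes almost for free because every completed step is an event: if the process crashes, replay the events to rebuild state and continue from the last one recorded. External side effects still need idempotency, since a replay must not repeat them.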
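The replay idea fits in a few lines. This is a toy sketch with a hypothetical account-balance domain. The `Event` type and the `apply_event` and `replay` functions are invented for illustration; a real system would read events from a durable, append-only store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Immutable record of one state change (amounts in minor currency units)."""
    kind: str    # 'deposited' or 'withdrawn'
    amount: int


def apply_event(balance, event):
    """Pure function: previous state + one event -> next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events):
    """Rebuild current state by folding over the full event log."""
    balance = 0
    for event in events:
        balance = apply_event(balance, event)
    return balance


log = [
    Event("deposited", 1000),
    Event("withdrawn", 250),
    Event("deposited", 400),
]

balance = replay(log)          # state after all three events
as_of_second = replay(log[:2])  # state as it was after the first two events
```

Because state is derived rather than stored, the same log also answers historical questions, which is where the audit-trail value comes from.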

This is overkill for 95% of SaaS backends. It's the right pattern for financial ledgers, audit systems, and domains where every state transition must be traceable and reproducible. For sending welcome emails, it's architecture astronaut territory.

When event sourcing makes sense: Your domain is accounting, payments, compliance, or inventory — anything where the audit trail is the product. The operational cost of maintaining event streams and projections is justified by the business requirement.

When it doesn't: Everything else. A jobs table with checkpointing gives these workloads the durability they need at a fraction of the complexity.

Pattern 4: Database-Native Durable Execution (The New Option)

pg_durable represents a fourth pattern: durable execution as a database feature. Define your workflow in SQL, let the extension handle checkpointing, retries, and recovery.

The advantages are significant:

  • Zero additional infrastructure. You're already running PostgreSQL. No new services to deploy, monitor, or pay for.
  • Same operational model. Backups, auth, monitoring — it's all PostgreSQL. Your existing tooling covers pg_durable workflows.
  • Colocated with data. No network hops between orchestrator and database. Workflow state and application data live in the same place.
  • SQL visibility. SELECT * FROM df.instances WHERE status = 'running' — that's your workflow dashboard.

The constraints are equally real:

  • SQL-shaped workflows. If a step needs arbitrary application code, it must live behind an HTTP endpoint or a SQL function. This is fine for data pipelines and API orchestration. It's limiting for complex business logic.
  • PostgreSQL extension dependency. Not available on managed PostgreSQL services that don't support extensions. AWS RDS supports extensions, but only from its own approved list, and pg_durable also needs a background worker. Check your provider's extension policy.
  • v0.x maturity. The extension is new. Production adoption means accepting some rough edges and contributing fixes.

What We Recommend

For the typical SaaS backend — an Indian startup with 2-5 engineers, PostgreSQL as the primary database, and background jobs for emails, reports, data syncs, and scheduled maintenance — here's the escalation path:

  1. Start with a jobs table. It's simple, it works, and it teaches you where durability breaks. Add idempotency keys from day one.

  2. When jobs start failing mid-execution, add explicit checkpointing. A checkpoints table or a progress JSONB column. This is the line between "we handle retries" and "we have durable execution."

  3. When checkpointing becomes a pattern you reimplement every time, adopt pg_durable or Temporal. The decision: pg_durable if your workflows are SQL-shaped, Temporal if they're not.

  4. When you need human-in-the-loop approvals or multi-day workflows, Temporal. pg_durable's sweet spot is minutes-to-hours; Temporal handles days-to-weeks with signals and timers.

The mistake most teams make is jumping to step 4 before step 2 breaks. You don't need Temporal to send onboarding emails. You need a jobs table with idempotency keys and basic checkpointing. Build the simple thing first, let it break in production, and only then reach for the heavier tool.

Durable execution is a solved problem. The question isn't whether to implement it — it's whether you implement it with infrastructure you already have or infrastructure you need to add. For most teams in 2026, the answer starts with PostgreSQL.
