Durable Execution Explained — The Pattern That Makes Your SaaS Actually Work
Durable execution is the difference between a SaaS that silently drops data during a restart and one that picks up exactly where it left off. Here's what it is, why most implementations get it wrong, and how to build it without overengineering.
Every SaaS application has background work. Send a welcome email. Generate an invoice PDF. Run a weekly report. Sync data to an external API. These jobs take seconds to minutes, sometimes hours. And they fail — the database restarts midway, the API returns a 503, the server runs out of memory.
Durable execution is the guarantee that your background work completes correctly even when things fail. Not "retry from scratch and hope for the best." Not "check the logs and manually fix the half-processed data." But: the system remembers exactly where it was, resumes from that point, and finishes the job. No duplicates, no dropped work, no manual cleanup.
Most SaaS teams implement this badly. They build something that works 99% of the time and spend their on-call hours cleaning up the 1%. Here's what durable execution actually requires, the patterns that work, and the ones that don't.
The Four Properties of Durable Execution
A durable execution system provides four guarantees. Miss any one of them, and you don't have durable execution — you have a wish.
These aren't nice-to-haves. If your payment processing pipeline doesn't have exactly-once semantics, you will double-charge someone someday. If your data export pipeline doesn't checkpoint, a crash at 95% completion means restarting an 8-hour job. These are production incidents waiting to happen.
Pattern 1: The Jobs Table (What Most Teams Build)
The instinctive approach: a background_jobs table with status, attempts, last_error, and a worker process that polls it.
    CREATE TABLE background_jobs (
        id UUID PRIMARY KEY,
        job_type TEXT NOT NULL,
        payload JSONB,
        status TEXT DEFAULT 'pending', -- pending, running, done, failed
        attempts INTEGER DEFAULT 0,
        last_error TEXT,
        created_at TIMESTAMPTZ DEFAULT now()
    );
This gets you 80% of the way there. It handles retries. It gives you visibility. But it fails on the hard cases:
Mid-execution crashes. If your worker dies between two separately committed updates, half the work is already committed. Restarting means figuring out which rows were touched. Most teams never build that bookkeeping; they accept the risk and clean up manually.
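When both writes live in the same database, one way to shrink this failure mode is to put them in a single transaction, so a crash rolls back everything instead of leaving half the work behind. Here is a minimal sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL. The `invoices` table and the `finalize_invoices` function are hypothetical names invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO invoices (id, status) VALUES (?, ?)",
    [(1, "draft"), (2, "draft")],
)
conn.commit()


def finalize_invoices(conn, crash_midway=False):
    """Update both rows atomically (hypothetical job body)."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("UPDATE invoices SET status = 'final' WHERE id = 1")
        if crash_midway:
            raise RuntimeError("simulated worker crash")
        conn.execute("UPDATE invoices SET status = 'final' WHERE id = 2")


def read_statuses(conn):
    return [row[0] for row in conn.execute("SELECT status FROM invoices ORDER BY id")]


try:
    finalize_invoices(conn, crash_midway=True)
except RuntimeError:
    pass
statuses_after_crash = read_statuses(conn)  # first update was rolled back

finalize_invoices(conn)  # the retry starts from a clean state
statuses_after_retry = read_statuses(conn)
```

This only helps when the whole job fits in one transaction. Jobs that call external APIs or run for minutes need per-step progress tracking instead.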
Exactly-once semantics. If your worker sends an email, crashes before updating the job status, restarts, and sends the email again — congratulations, your customer got two welcome emails. Idempotency keys help, but most teams don't implement them until after the first duplicate.
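A rough sketch of an idempotency key, again using Python's sqlite3 as a stand-in for PostgreSQL. The `sent_messages` table, the `send_once` function, and the `welcome:user-42` key format are all assumptions made up for this example. The `outbox` list stands in for a real email provider.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sent_messages (idempotency_key TEXT PRIMARY KEY)")
conn.commit()

outbox = []  # stand-in for a real email provider


def send_once(conn, idempotency_key, recipient, deliver):
    """Deliver only if this key has not been recorded; return True if delivered."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO sent_messages (idempotency_key) VALUES (?)",
            (idempotency_key,),
        )
        if cur.rowcount == 0:
            return False  # key already recorded: a previous attempt delivered
        # If deliver() raises, the transaction rolls back and the key is
        # released, so a later retry can try again.
        deliver(recipient)
    return True


first_attempt = send_once(conn, "welcome:user-42", "a@example.com", outbox.append)
second_attempt = send_once(conn, "welcome:user-42", "a@example.com", outbox.append)
```

This narrows the duplicate window but does not close it: a crash after delivery and before the commit still loses the key. Passing the same key to a provider that supports idempotency keys is the usual way to cover that last gap.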
Long-running jobs blocking connections. A job that takes 30 minutes holds a database connection for 30 minutes. Connection pool exhaustion during a batch is a classic production incident.
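One common mitigation is to touch the database only in short transactions: claim the job, do the slow work without an open transaction, then record the result. The sketch below assumes a simplified version of the jobs table and uses Python's sqlite3 as a stand-in for PostgreSQL. `claim_next_job` and `finish_job` are hypothetical helper names. On PostgreSQL the claim step is often written with `SELECT ... FOR UPDATE SKIP LOCKED` instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE background_jobs (
        id INTEGER PRIMARY KEY,
        job_type TEXT NOT NULL,
        status TEXT DEFAULT 'pending',
        attempts INTEGER DEFAULT 0,
        last_error TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO background_jobs (job_type) VALUES (?)",
    [("welcome_email",), ("weekly_report",)],
)
conn.commit()


def claim_next_job(conn):
    """Mark the oldest pending job as running; return (id, job_type) or None."""
    with conn:
        row = conn.execute(
            "SELECT id, job_type FROM background_jobs "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the claim safe if two workers race:
        # only one UPDATE can flip the row from pending to running.
        claimed = conn.execute(
            "UPDATE background_jobs SET status = 'running', attempts = attempts + 1 "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        return row if claimed.rowcount == 1 else None


def finish_job(conn, job_id, error=None):
    """Record the outcome in a second short transaction."""
    with conn:
        conn.execute(
            "UPDATE background_jobs SET status = ?, last_error = ? WHERE id = ?",
            ("failed" if error else "done", error, job_id),
        )


first = claim_next_job(conn)
second = claim_next_job(conn)
third = claim_next_job(conn)  # nothing left to claim

# The slow work would run here, between the claim and the finish.
finish_job(conn, first[0])
finish_job(conn, second[0], error="503 from API")
final_statuses = [r[0] for r in conn.execute("SELECT status FROM background_jobs ORDER BY id")]
```

A real worker would also need a timeout that returns stuck `running` jobs to `pending`, which this sketch leaves out.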
Pattern 2: External Orchestrators (Temporal, Airflow, Step Functions)
The opposite extreme: hand your workflows to a purpose-built orchestrator. Temporal workers execute workflows in Go, TypeScript, Java, or Python. Airflow runs Python DAGs on a scheduler with a metadata database. AWS Step Functions gives you a JSON state machine.
These systems are powerful. Temporal in particular has solved the hard distributed systems problems: durable workflow state that survives crashes, retry with backoff, workflow versioning, and signals for external events. One caveat: activities can be retried, so their side effects should still be idempotent. It's the gold standard for durable execution at scale.
The cost: operational complexity.
| Orchestrator | What You Get | What You Pay |
|---|---|---|
| Temporal | General-purpose durable execution, SDKs in 4+ languages, workflow versioning, external event signals | Temporal server cluster, worker deployment, SDK integration, operational monitoring for the orchestrator itself |
| Apache Airflow | Python DAGs, rich scheduling, extensive provider ecosystem, web UI | Scheduler, web server, metadata database, worker pool, DAG versioning headaches, notorious for operational drift |
| AWS Step Functions | Fully managed, JSON state machine, native AWS integrations | Vendor lock-in, JSON-only workflow definitions, AWS-specific knowledge, cost at scale |
| pg_durable (new) | SQL-native workflows, automatic checkpointing, zero extra infrastructure | PostgreSQL extension required, SQL-shaped — arbitrary code goes behind HTTP endpoints or SQL functions, v0.x maturity |
The decision framework is simple: if your workflows are mostly SQL with some HTTP calls, and your state lives in PostgreSQL, you don't need an external orchestrator. You need checkpointing inside the database. pg_durable or a well-implemented jobs table with explicit checkpointing gets you there.
If your workflows span multiple services, involve long-running human approvals, or need polyglot SDK support, Temporal is the right call. Accept the operational cost and move on.
Pattern 3: Event Sourcing (The Nuclear Option)
Event sourcing stores every state change as an immutable event. Replaying events reconstructs current state. Durable execution comes for free because every step is an event — if the process crashes, replay the events.
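The replay idea fits in a few lines. This is an illustrative sketch only: the `Event` type, the two event kinds, and the account-balance domain are assumptions chosen for the example, not part of any particular event-sourcing library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events are immutable once recorded
class Event:
    kind: str    # "deposited" or "withdrawn"
    amount: int  # minor units, e.g. paise or cents


def apply_event(balance, event):
    """Pure function: previous state plus one event gives the next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events):
    """Rebuild current state by folding over the full event log."""
    balance = 0
    for event in events:
        balance = apply_event(balance, event)
    return balance


log = [
    Event("deposited", 10_000),
    Event("withdrawn", 2_500),
    Event("deposited", 500),
]
current_balance = replay(log)
balance_before_last_event = replay(log[:2])
```

Replay rebuilds state, not side effects. An email sent during a step is not undone or deduplicated by replaying the log, so external calls still need idempotency.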
This is overkill for 95% of SaaS backends. It's the right pattern for financial ledgers, audit systems, and domains where every state transition must be traceable and reproducible. For sending welcome emails, it's architecture astronaut territory.
When event sourcing makes sense: Your domain is accounting, payments, compliance, or inventory — anything where the audit trail is the product. The operational cost of maintaining event streams and projections is justified by the business requirement.
When it doesn't: Everything else. A jobs table with checkpointing achieves the same durability guarantees with 10% of the complexity.
Pattern 4: Database-Native Durable Execution (The New Option)
pg_durable represents a fourth pattern: durable execution as a database feature. Define your workflow in SQL, let the extension handle checkpointing, retries, and recovery.
The advantages are significant:
- Zero additional infrastructure. You're already running PostgreSQL. No new services to deploy, monitor, or pay for.
- Same operational model. Backups, auth, monitoring — it's all PostgreSQL. Your existing tooling covers pg_durable workflows.
- Colocated with data. No network hops between orchestrator and database. Workflow state and application data live in the same place.
- SQL visibility. `SELECT * FROM df.instances WHERE status = 'running'` is your workflow dashboard.
The constraints are equally real:
- SQL-shaped workflows. If a step needs arbitrary application code, it must live behind an HTTP endpoint or a SQL function. This is fine for data pipelines and API orchestration. It's limiting for complex business logic.
- PostgreSQL extension dependency. Managed PostgreSQL services only allow the extensions they have approved, and pg_durable also needs a background worker. AWS RDS, for example, supports a fixed list of extensions, so check your provider's extension policy before you commit.
- v0.x maturity. The extension is new. Production adoption means accepting some rough edges and contributing fixes.
What We Recommend
For the typical SaaS backend — an Indian startup with 2-5 engineers, PostgreSQL as the primary database, and background jobs for emails, reports, data syncs, and scheduled maintenance — here's the escalation path:
1. Start with a jobs table. It's simple, it works, and it teaches you where durability breaks. Add idempotency keys from day one.
2. When jobs start failing mid-execution, add explicit checkpointing: a `checkpoints` table or a `progress` JSONB column. This is the line between "we handle retries" and "we have durable execution."
3. When checkpointing becomes a pattern you reimplement every time, adopt pg_durable or Temporal. The decision: pg_durable if your workflows are SQL-shaped, Temporal if they're not.
4. When you need human-in-the-loop approvals or multi-day workflows, Temporal. pg_durable's sweet spot is minutes-to-hours; Temporal handles days-to-weeks with signals and timers.
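Explicit checkpointing with a progress column could look roughly like this. The sketch uses Python's sqlite3 as a stand-in for PostgreSQL, with the JSON stored as text. The `export_jobs` table, the `next_index` field, and the `run_export` function are hypothetical names, and the `exported` list stands in for the job's real side effect.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE export_jobs (id INTEGER PRIMARY KEY, progress TEXT NOT NULL)")
conn.execute(
    "INSERT INTO export_jobs (id, progress) VALUES (1, ?)",
    (json.dumps({"next_index": 0}),),
)
conn.commit()

exported = []  # stand-in for the job's real side effect


def run_export(conn, job_id, items, crash_at=None):
    """Process items from the last checkpoint onward, saving progress per item."""
    row = conn.execute(
        "SELECT progress FROM export_jobs WHERE id = ?", (job_id,)
    ).fetchone()
    start = json.loads(row[0])["next_index"]
    for index in range(start, len(items)):
        if index == crash_at:
            raise RuntimeError("simulated crash")
        exported.append(items[index])  # do the step
        with conn:  # then durably record that it is done
            conn.execute(
                "UPDATE export_jobs SET progress = ? WHERE id = ?",
                (json.dumps({"next_index": index + 1}), job_id),
            )


items = ["a", "b", "c", "d"]
try:
    run_export(conn, 1, items, crash_at=2)  # dies after two items
except RuntimeError:
    pass
exported_before_resume = list(exported)

run_export(conn, 1, items)  # resumes at index 2 and does not redo "a" or "b"
```

The step and its checkpoint are not atomic here, so a crash between them repeats one step on resume. Each step should be idempotent, or should write to the same database inside the checkpoint's transaction.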
The mistake most teams make is jumping to step 4 before step 2 breaks. You don't need Temporal to send onboarding emails. You need a jobs table with idempotency keys and basic checkpointing. Build the simple thing first, let it break in production, and only then reach for the heavier tool.
Durable execution is a solved problem. The question isn't whether to implement it — it's whether you implement it with infrastructure you already have or infrastructure you need to add. For most teams in 2026, the answer starts with PostgreSQL.