RPC vs NATS: It's Not About Sync vs Async — It's About Who Owns Completion
The most common microservices mistake isn't picking the wrong transport. It's misreading who is responsible for knowing the work finished. A field guide to completion ownership in RPC, message bus, and event-driven systems.
A team I worked with once migrated an order-placement path from gRPC to NATS because “it’s decoupled and faster.” The old flow was simple: the web service called PlaceOrder via gRPC, got back an order ID, rendered success to the user. The new flow: web service publishes order.place to NATS, an order-service consumes it and processes asynchronously.
Within three weeks they had three kinds of incidents on rotation:
- Duplicate orders — retry on the publisher side meant the same order was placed twice when the first publish actually succeeded but the ack was slow.
- Lost orders — consumer crashed mid-process; no ack meant NATS redelivered, but the consumer had already partially committed state, so redelivery was rejected by a dedup check. The order just… disappeared from the user’s perspective.
- Dark-failure support tickets — users reported “I clicked buy and nothing happened.” From the publisher side, everything looked fine. From the consumer side, processing time had drifted from 50 ms to 45 seconds because a downstream DB had a slow query, and the web team had no telemetry on the consumer side.
The retro landed on a single sentence: we thought we were changing the transport; we actually changed who owned the completion of the work.
tl;dr — RPC and pub/sub messaging look like two points on a sync-vs-async spectrum. They aren’t. They’re two fundamentally different ownership contracts. In RPC, the caller owns knowing the work finished. In messaging, the receiver owns it. Swapping one for the other without inverting retry, idempotency, ack, and observability is how you turn a clean migration into a three-month incident parade.
The Sync-vs-Async Trap
The most common framing I see is this: RPC is synchronous, messaging is asynchronous, pick based on whether you need the answer immediately. That framing is almost useless in practice. It conflates two separate axes.
Axis 1: Does the caller wait? Sync vs async. This is a latency question.
Axis 2: Who is responsible for knowing the work completed? Caller or receiver. This is a contract question.
You can have synchronous messaging (request-reply over NATS with a reply subject — caller waits, but transport is pub/sub). You can have asynchronous RPC (fire-and-forget gRPC — stream.Send with no ack). What matters isn’t how long the caller waits. It’s who’s on the hook if the work doesn’t happen.
Two Clean Ownership Models
```mermaid
flowchart LR
    subgraph RPC["RPC — caller owns completion"]
        C1[Client] -->|"1. call · wait"| S1[Server]
        S1 -->|"2. did the thing · return result<br/>(or timeout → client decides)"| C1
    end
    subgraph Msg["Messaging — receiver owns completion"]
        P1[Publisher] -->|"1. send · fire and forget"| B1[[Message bus]]
        B1 -->|"2. eventually"| R1[Consumer]
        R1 -->|"3. ack (or NACK · redeliver)"| B1
    end
    classDef rpc fill:#e8f4f8,stroke:#2c5282
    classDef msg fill:#f0fff4,stroke:#2f855a
    class RPC rpc
    class Msg msg
```
Two models. Opposite error semantics. Opposite retry semantics. Opposite observability alignment. Swapping one for the other changes every downstream engineering assumption.
RPC: caller owns completion
When a client makes an RPC, they hold the socket open until a response comes back. That response is a statement by the server: I did the thing, here’s the result. If the call times out, the client assumes failure (possibly partial) and has to decide what to do about it.
What this means operationally:
- Retry is a caller decision. The caller knows whether the work is idempotent, how important it is, and how much budget is left.
- Errors propagate naturally. A gRPC status code goes right back up the call chain.
- Observability aligns. The caller’s span includes the work’s duration. If it’s slow, the caller sees it.
- Backpressure is immediate. Callers block on slow servers, limiting their own rate.
This is why RPC feels “simple” — the ownership contract is tight. The downside: the caller’s fate is coupled to the callee’s fate. A slow server propagates slowness back to every caller.
Messaging: receiver owns completion
When a publisher sends a message, the bus accepts it. The publisher’s job is done. Whether the work happens — when, in what order, how many times, whether at all — is now somebody else’s problem. Usually the consumer’s.
What this means operationally:
- Retry is a consumer decision. The bus may redeliver on no-ack; the consumer has to decide how to handle that (idempotency key, dedup table, upsert).
- Errors are silent on the publisher side. A failed consumer doesn’t tell the publisher. A dead-letter queue or out-of-band alerting has to be built.
- Observability splits in two. Publisher metrics say “I sent it.” Consumer metrics say “I processed it.” The gap between those — lag — is its own story.
- Backpressure is decoupled. Publishers can happily overwhelm consumers, which means you need consumer-side rate limits or bounded queues.
This is why pub/sub feels “flexible” — producers and consumers are independent. The downside: nothing is automatic. Every property that RPC gave you for free (retry policy, error propagation, aligned observability, flow control) is now a thing you have to design and build.
The Real Decision
Once you see it as an ownership question, the decision becomes clearer:
- Does the caller need the answer to decide what happens next? → RPC. Auth check. Balance read. Inventory reservation. Any synchronous business flow.
- Is the work a notification that something already happened? → Messaging. “Order was placed.” “User signed up.” Downstream consumers that don’t gate the primary flow.
- Can the work tolerate delay and be retried independently? → Messaging. Email send. Indexing. Analytics.
- Is the work idempotent by construction, or can it be made so cheaply? → Messaging works. If not, RPC’s caller-owned retry is simpler to reason about.
You can mix them. Most mature microservice stacks do. The mistake is picking messaging because “decoupled is better” without doing the consumer-side engineering that decoupling requires.
What Actually Has to Change in the Migration
Here’s the minimum checklist for every RPC → messaging migration. If any of these aren’t in place, the old code was better.
1. Idempotency keys, enforced at the consumer
Every message carries an operation ID. The consumer dedup-checks before committing. This is not optional. Without it, any redelivery (and there will be redeliveries) creates duplicate state.
```go
func (c *Consumer) Handle(msg Message) error {
	if alreadyProcessed(msg.OpID) {
		return nil // idempotent: we already did this, ack and move on
	}
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	// The work and the dedup marker commit or roll back TOGETHER.
	if err := doTheWork(tx, msg); err != nil {
		return err // no ack → the bus will redeliver
	}
	if err := markProcessed(tx, msg.OpID); err != nil {
		return err
	}
	return tx.Commit()
}
```
The markProcessed call has to be in the same transaction as the actual work, or you have a race where the work commits but the dedup record doesn’t. Then the next redelivery re-does it.
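The effect of the dedup check is easiest to see in a self-contained simulation. This sketch replaces the database transaction with an in-memory map (the `IdempotentConsumer` type and its fields are illustrative): an at-least-once bus delivers the same message three times, but the side effect happens exactly once.

```go
package main

import (
	"fmt"
	"sync"
)

// Msg carries the operation ID that makes redelivery safe.
type Msg struct {
	OpID    string
	Payload string
}

// IdempotentConsumer remembers processed OpIDs. In production this map
// would be a dedup table committed in the same transaction as the work.
type IdempotentConsumer struct {
	mu        sync.Mutex
	processed map[string]bool
	applied   int // side effects actually performed
}

func NewIdempotentConsumer() *IdempotentConsumer {
	return &IdempotentConsumer{processed: make(map[string]bool)}
}

func (c *IdempotentConsumer) Handle(m Msg) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.processed[m.OpID] {
		return // duplicate delivery: nothing to do, just ack
	}
	c.applied++ // stands in for doTheWork
	c.processed[m.OpID] = true
}

func main() {
	c := NewIdempotentConsumer()
	// At-least-once delivery: the same message arrives three times.
	m := Msg{OpID: "order-42", Payload: "place order"}
	for i := 0; i < 3; i++ {
		c.Handle(m)
	}
	fmt.Println(c.applied) // 1
}
```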
2. Explicit ack semantics
Know which delivery guarantee your bus actually gives you:

- at-most-once — send and forget; messages can be lost.
- at-least-once — redelivery on no-ack; duplicates possible.
- effectively-once — at-least-once plus receiver-side dedup.

Most production systems run on at-least-once with dedup. NATS core is at-most-once by default; NATS JetStream is at-least-once. Kafka is at-least-once with offset-based replay. RabbitMQ is configurable — check that both sides agree.
3. Dead-letter path
Messages that fail repeatedly have to go somewhere other than “redelivered forever.” A dead-letter queue (or topic, or subject) plus an alert when non-trivial traffic hits it. Without this, a poison message takes a consumer out of service.
4. Consumer-side observability
At minimum: consumer lag (messages in flight), processing time per message, error rate, redelivery rate. The publisher’s metrics tell you about the bus, not about the work. If you can’t see “how fast is the consumer chewing through the queue right now,” you’re flying blind during the next incident.
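The arithmetic behind these metrics is trivial, which is exactly why it's worth writing down — the hard part is remembering to collect both sides. A sketch with illustrative names (`LagSnapshot` is not from any metrics library):

```go
package main

import "fmt"

// LagSnapshot is the minimum consumer-side view: how far behind the
// consumer is, and how often the bus is having to redeliver.
type LagSnapshot struct {
	Published   uint64 // publisher-side counter: "I sent it"
	Acked       uint64 // consumer-side counter: "I processed it"
	Redelivered uint64 // deliveries beyond the first attempt
}

// Lag is the gap between the two stories — its own time series.
func (s LagSnapshot) Lag() uint64 { return s.Published - s.Acked }

func (s LagSnapshot) RedeliveryRate() float64 {
	if s.Acked == 0 {
		return 0
	}
	return float64(s.Redelivered) / float64(s.Acked)
}

func main() {
	s := LagSnapshot{Published: 10_000, Acked: 9_200, Redelivered: 92}
	fmt.Println(s.Lag())            // 800
	fmt.Println(s.RedeliveryRate()) // 0.01
}
```

A rising `Lag()` with a flat publisher error rate is the signature of the dark-failure incident from the intro: everything looks fine from where you're standing, and the queue is quietly drifting away.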
5. Replay and reprocessing
What happens when the consumer has a bug that corrupts data for a day, you fix the bug, and now you need to reprocess yesterday’s messages? In RPC, you’d re-run the caller. In messaging, you need the ability to replay from an offset or from a backup. If the bus doesn’t give you that (NATS core doesn’t, JetStream does), you need a separate event log.
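What "replay from an offset" means is easiest to see against a toy append-only log (the `EventLog` type here is illustrative, not JetStream's API): keep every event, remember the offset where the bad deploy started, and re-feed everything from there into the fixed consumer.

```go
package main

import "fmt"

// EventLog is an append-only log; retained events plus offsets are what
// make "reprocess yesterday's messages" possible at all.
type EventLog struct {
	events []string
}

// Append stores an event and returns its offset.
func (l *EventLog) Append(e string) int {
	l.events = append(l.events, e)
	return len(l.events) - 1
}

// Replay re-feeds every event from a known-good offset into a (now fixed)
// handler — the messaging-world equivalent of re-running the caller.
func (l *EventLog) Replay(from int, handle func(string)) {
	for _, e := range l.events[from:] {
		handle(e)
	}
}

func main() {
	var eventLog EventLog
	eventLog.Append("order-1")
	badStart := eventLog.Append("order-2") // buggy consumer deployed here
	eventLog.Append("order-3")

	var reprocessed []string
	eventLog.Replay(badStart, func(e string) { reprocessed = append(reprocessed, e) })
	fmt.Println(reprocessed) // [order-2 order-3]
}
```

Note that replay is only safe because of point 1: the consumer's idempotency keys mean the already-correct events in the replayed range are no-ops.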
A Specific Pattern I Like: The Request-Reply on a Bus
One thing that confuses the discussion: you can do synchronous-looking work on a message bus. NATS has request-reply built in (nc.Request(subject, payload, timeout)), where the publisher gets a correlated reply on a temporary subject. This gives you the RPC ergonomics while using the messaging infrastructure.
When is this useful?
- When you want the operational simplicity of RPC (caller waits, caller decides) but your service mesh is the message bus and adding a gRPC stack is overhead.
- When you want transparent failover — multiple consumers can listen, any can reply, and the bus handles the routing.
- When you want unified observability — both “notify” and “ask” flows go through the same substrate.
Request-reply over NATS gives you back caller-owned completion semantics on messaging infrastructure. It’s the “pick ownership model separately from transport” option. Many good designs use it.
The one that doesn’t work: request-reply where the reply is supposed to happen later, via a different message. At that point the caller has moved on, the completion is truly transferred, and you’re back in consumer-owned territory. Don’t pretend otherwise.
The Framing I Use in Design Reviews
When someone says “let’s use NATS/Kafka/RabbitMQ for this,” I ask exactly one question: who is responsible if the work doesn’t happen?
If the answer is “the caller will notice and retry,” they want RPC. If the answer is “the receiver will eventually catch up,” they want messaging. If the answer is “I don’t know,” the design isn’t ready.
Everything else — transport, framing, protocol — is implementation. The ownership contract is the architecture.
Related
- NATS vs Kafka vs MQTT: Same Category, Very Different Jobs — once you’ve decided messaging, how to pick among the three.
- Why Your “Fail-Fast” Strategy is Killing Your Distributed System — retry and resilience on the RPC side of the boundary.
- Go Context in Distributed Systems: What Actually Works in Production — cancellation propagation in caller-owned flows.