Designing a Resilient Order Fulfillment Architecture with Saga Pattern, RabbitMQ, Redis, and SQL Server
Handling retries, idempotency, dead-letter queues, and eventual consistency in distributed order processing systems.

Modern order processing systems rarely exist as a single application.
As systems evolve, order management, inventory, payment processing, fulfillment, shipping, and customer notifications are often separated into independent services.
While this improves scalability and team autonomy, it also introduces a new challenge:
How do you maintain a reliable workflow when multiple services participate in a single business process?
In this article, I'll walk through the architecture of an event-driven order fulfillment system built using .NET, RabbitMQ, Redis, SQL Server, Prometheus, and Grafana.
The focus is not on building another microservice demo.
The focus is on handling failures, duplicate messages, retries, observability, and eventual consistency in a distributed environment.
The Problem
Consider a simple order workflow:
Create Order
↓
Reserve Inventory
↓
Process Payment
↓
Create Fulfillment Request
↓
Generate Shipment
↓
Send Notification
The happy path is easy.
The real challenge begins when failures occur.
Questions such as these quickly appear:
What happens if payment succeeds but shipment creation fails?
What happens if a service becomes temporarily unavailable?
What happens if RabbitMQ delivers the same event multiple times?
How do we prevent partial workflow completion?
These are common problems in distributed systems and require architectural decisions beyond basic CRUD operations.
Architecture Overview
[Figure: High-level architecture showing event-driven communication between independent services.]
The architecture consists of the following services:
Order Service
Inventory Service
Payment Service
Fulfillment Service
Shipping Service
Notification Service
Supporting infrastructure:
RabbitMQ
Redis
SQL Server
Prometheus
Grafana
Each service owns a specific business capability and communicates through asynchronous events.
Rather than making direct service-to-service calls throughout the workflow, services react to business events and publish new events when processing completes.
This reduces coupling and allows services to evolve independently.
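To make "react to an event, then publish a new one" concrete, here is a minimal Python sketch (the project itself is .NET; Python is used here only for brevity). The `Event` envelope, its field names, and `handle_order_created` are hypothetical illustrations, not code from the repository.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical event envelope; the field names are illustrative."""

    event_type: str  # e.g. "OrderCreated"
    order_id: str    # correlates every step of one workflow
    payload: dict = field(default_factory=dict)
    # A unique id per event lets consumers detect duplicate deliveries.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(raw: str) -> "Event":
        return Event(**json.loads(raw))


def handle_order_created(event: Event) -> Event:
    """Inventory-side handler: do local work, then emit the next event."""
    # A real service would reserve stock in its own database here.
    return Event(
        event_type="InventoryReserved",
        order_id=event.order_id,
        payload={"items": event.payload.get("items", [])},
    )
```

The handler never calls another service directly. It only returns the event that the next service in the workflow will consume.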
Why Event-Driven Communication?
Direct service dependencies often create cascading failures.
For example:
Order Service
↓
Inventory Service
↓
Payment Service
↓
Shipping Service
With synchronous calls chained this way, an outage in any downstream service can block or fail every request that passes through it.
Using RabbitMQ allows services to communicate asynchronously.
Benefits include:
Reduced coupling
Better fault isolation
Independent scaling
Improved resilience
Easier workflow extension
A new service can subscribe to existing events without requiring modifications to existing services.
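The last point can be illustrated with a toy in-memory bus in Python. It is only a stand-in for the broker (with RabbitMQ, the equivalent step is binding one more queue to an existing exchange), and `InMemoryBus` is a hypothetical name used for illustration.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class InMemoryBus:
    """Toy stand-in for a broker exchange: one event type, many subscribers."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> int:
        """Deliver to every subscriber and return how many were notified."""
        handlers = self._subscribers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

Adding a second `subscribe("ShipmentGenerated", ...)` call changes who receives the event, while the publishing code stays exactly the same.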
Using the Saga Pattern
Maintaining consistency across multiple services is one of the most common distributed systems challenges.
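Traditional distributed transactions, such as two-phase commit, are often avoided because every participant must hold its locks until a coordinator decides the outcome, which adds complexity and operational overhead.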
Instead, this architecture uses the Saga Pattern.
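Each service performs its own local transaction and publishes an event that triggers the next stage of the workflow.
When a step fails, the services that already completed their steps run compensating actions to undo them.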
Successful flow:
OrderCreated
↓
InventoryReserved
↓
PaymentProcessed
↓
FulfillmentCreated
↓
ShipmentGenerated
Failure flow:
PaymentFailed
↓
InventoryReleased
↓
OrderMarkedFailed
The workflow converges on a consistent final state without requiring a distributed transaction coordinator.
[Figure: Order lifecycle coordinated through asynchronous business events.]
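The two flows above can be sketched as a small state machine in Python. This is a simplification under stated assumptions: in the real architecture each service reacts independently, whereas this sketch collapses the choreography into one function, and the `OrderStatus` values and the `run_saga` helper are hypothetical.

```python
from enum import Enum


class OrderStatus(Enum):
    PENDING = "Pending"
    INVENTORY_RESERVED = "InventoryReserved"
    PAID = "Paid"
    FULFILLMENT_CREATED = "FulfillmentCreated"
    SHIPPED = "Shipped"
    FAILED = "Failed"


# Event name -> order status after the event is applied.
TRANSITIONS = {
    "OrderCreated": OrderStatus.PENDING,
    "InventoryReserved": OrderStatus.INVENTORY_RESERVED,
    "PaymentProcessed": OrderStatus.PAID,
    "FulfillmentCreated": OrderStatus.FULFILLMENT_CREATED,
    "ShipmentGenerated": OrderStatus.SHIPPED,
    "OrderMarkedFailed": OrderStatus.FAILED,
}

# Failure event -> compensating events that undo earlier steps.
COMPENSATIONS = {
    "PaymentFailed": ["InventoryReleased", "OrderMarkedFailed"],
}


def run_saga(events: list[str]) -> tuple[OrderStatus, list[str]]:
    """Apply events in order and return the final status plus the full trail."""
    status = OrderStatus.PENDING
    trail: list[str] = []
    pending = list(events)
    while pending:
        event = pending.pop(0)
        trail.append(event)
        # Compensating events run before anything else still queued.
        pending = COMPENSATIONS.get(event, []) + pending
        status = TRANSITIONS.get(event, status)
        if status is OrderStatus.FAILED:
            break
    return status, trail
```

Feeding it `["OrderCreated", "InventoryReserved", "PaymentFailed"]` ends in `OrderStatus.FAILED`, with `InventoryReleased` and `OrderMarkedFailed` appended to the trail.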
Separating Business Failures from Technical Failures
One design decision that significantly simplified the architecture was separating business failures from technical failures.
Business Failure
Example:
Payment Declined
Result:
Inventory Released
Order Marked Failed
Technical Failure
Example:
RabbitMQ Consumer Exception
Result:
Retry Processing
Move Message To Dead-Letter Queue (DLQ) If Retries Are Exhausted
Operational Investigation
A declined payment is a valid business outcome.
A processing exception is not.
Treating them differently leads to simpler workflows and clearer operational processes.
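One way to express this split in a consumer is to give business failures their own exception type. The Python sketch below is illustrative only: `PaymentDeclined`, `handle_payment`, and the `"ack"`/`"retry"` action names are hypothetical, and `charge` stands in for whatever payment call the service makes.

```python
from typing import Callable, Optional


class PaymentDeclined(Exception):
    """Business failure: a valid outcome that the workflow must handle."""


def handle_payment(
    order_id: str, charge: Callable[[str], None]
) -> tuple[str, Optional[str]]:
    """Return (action, event_to_publish) for one message delivery.

    action is "ack" when the message is finished and "retry" when it
    should go through the retry / dead-letter path.
    """
    try:
        charge(order_id)
    except PaymentDeclined:
        # Business failure: the message was processed correctly, so it is
        # acknowledged and a failure event starts the compensation flow.
        return ("ack", "PaymentFailed")
    except Exception:
        # Technical failure: nothing is known about the payment, so no
        # business event is published and the message is retried.
        return ("retry", None)
    return ("ack", "PaymentProcessed")
```

Only the technical branch ever reaches the retry and DLQ machinery. A declined payment is acknowledged like any other successfully handled message.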
Handling Duplicate Messages with Redis
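When consumers use manual acknowledgements and publishers use confirms, RabbitMQ provides at-least-once delivery.
This means an accepted message is not lost, but it can be delivered more than once.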
Common causes include:
Consumer restarts
Retry processing
Network interruptions
Broker recovery
Without protection, duplicate deliveries can result in duplicate business actions.
Examples:
Duplicate shipment creation
Duplicate inventory updates
Duplicate payment processing
Redis is used as an idempotency store.
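Before processing an event, the consumer checks whether the event's identifier has already been recorded.
If the identifier exists, the event is skipped; otherwise the consumer records it and proceeds.
The check and the write should be a single atomic operation so that two concurrent deliveries cannot both pass the check.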
[Figure: Redis storing processed event identifiers used for idempotency checks.]
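A minimal Python sketch of such a check follows. It relies on the semantics of the Redis `SET` command with the `NX` (only set if absent) and `EX` (expire after N seconds) options, which make the check and the write a single atomic step. To keep the example runnable without a server, `InMemoryStore` is a hypothetical stand-in that mimics those semantics; with the redis-py client, the same `set(key, value, nx=True, ex=ttl)` call returns `None` when the key already exists. The `processed:` key prefix, the 24-hour TTL, and `process_once` are assumptions for illustration.

```python
import time
from typing import Callable


class InMemoryStore:
    """Stand-in mimicking the subset of Redis SET NX EX and DEL used below."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: dict = {}  # key -> (value, expires_at or None)
        self._clock = clock

    def set(self, name, value, nx=False, ex=None):
        now = self._clock()
        entry = self._data.get(name)
        alive = entry is not None and (entry[1] is None or entry[1] > now)
        if nx and alive:
            return None  # NX blocked the write because the key exists
        self._data[name] = (value, None if ex is None else now + ex)
        return True

    def delete(self, name) -> int:
        return 1 if self._data.pop(name, None) is not None else 0


def process_once(store, event_id: str, handler: Callable[[], None],
                 ttl_seconds: int = 24 * 60 * 60) -> bool:
    """Run handler only for the first delivery of event_id.

    Returns True if the handler ran, False if the event was a duplicate.
    """
    key = f"processed:{event_id}"
    # Atomic claim: only one delivery of the same event can set the key.
    if not store.set(key, "1", nx=True, ex=ttl_seconds):
        return False
    try:
        handler()
    except Exception:
        # Release the claim so that a later retry is not skipped.
        store.delete(key)
        raise
    return True
```

The TTL keeps the store from growing without bound. It should be longer than the longest period over which a redelivery can realistically occur.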
Retry Strategy and Dead-Letter Queues
Failures are expected.
The goal is not to eliminate failures.
The goal is to recover safely.
When event processing fails:
Retry processing
Continue if successful
Move to DLQ if retries are exhausted
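The three steps above can be sketched in Python as a bounded retry loop. The attempt limit, the exponential backoff, and the `dead_letters` list (standing in for a real dead-letter queue) are assumptions made for illustration; the article does not state which values or backoff policy the project uses.

```python
import time
from typing import Callable


def process_with_retry(
    message: dict,
    handler: Callable[[dict], None],
    dead_letters: list,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try handler up to max_attempts times, then dead-letter the message.

    Returns True if processing succeeded, False if it was dead-lettered.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True  # continue the workflow
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    # Retries exhausted: keep the message and the reason for investigation.
    dead_letters.append(
        {"message": message, "error": repr(last_error), "attempts": max_attempts}
    )
    return False
```

Passing `sleep` as a parameter keeps the loop testable without real delays. A production consumer would typically delegate the waiting to the broker or a retry library instead of blocking the consumer thread.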
Dead-letter queues provide:
Visibility
Replay capability
Operational recovery
Failure investigation
Failed messages should never disappear silently.
[Figure: RabbitMQ queues and dead-letter queues supporting resilient message processing.]
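On the broker side, RabbitMQ dead-letters a message when it is rejected or negatively acknowledged with requeue disabled, when its TTL expires, or when the queue length limit is exceeded, provided the queue is configured with a dead-letter exchange. The Python sketch below only builds a description of such a topology. The `x-dead-letter-exchange` and `x-dead-letter-routing-key` queue arguments are RabbitMQ's own, while the `.dlx`/`.dlq` naming scheme and the `dead_letter_topology` helper are hypothetical.

```python
def dead_letter_topology(queue: str) -> dict:
    """Describe a work queue plus the exchange and queue that catch its failures."""
    dlx = f"{queue}.dlx"  # dead-letter exchange (assumed naming scheme)
    dlq = f"{queue}.dlq"  # dead-letter queue (assumed naming scheme)
    return {
        "exchanges": [{"name": dlx, "type": "direct", "durable": True}],
        "queues": [
            {
                "name": queue,
                "durable": True,
                # Rejected or expired messages are re-published to dlx.
                "arguments": {
                    "x-dead-letter-exchange": dlx,
                    "x-dead-letter-routing-key": dlq,
                },
            },
            {"name": dlq, "durable": True, "arguments": {}},
        ],
        # The binding routes dead-lettered messages into the DLQ.
        "bindings": [{"exchange": dlx, "queue": dlq, "routing_key": dlq}],
    }
```

A client library would pass the `arguments` dictionary when declaring the work queue. Replaying a message then amounts to publishing it from the DLQ back to the original exchange.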
Auditability
As workflows become more distributed, understanding historical state transitions becomes increasingly important.
Common operational questions include:
Why was an order cancelled?
Which event triggered the failure?
What was the last successful processing step?
To support troubleshooting and auditing, important workflow transitions are stored in SQL Server.
[Figure: Audit records showing workflow state transitions across services.]
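A rough sketch of what such an audit table could look like, and how it answers the questions above, is shown here in Python with SQLite standing in for SQL Server so the example runs anywhere. The table name, columns, and helper functions are hypothetical, and the SQL Server column types would differ (for example `IDENTITY` and `DATETIME2`).

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS order_audit (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    order_id    TEXT NOT NULL,
    event_type  TEXT NOT NULL,
    from_status TEXT,
    to_status   TEXT NOT NULL,
    service     TEXT NOT NULL,
    recorded_at TEXT NOT NULL
)
"""


def record_transition(conn, order_id, event_type, from_status, to_status, service):
    """Append one state transition; audit rows are never updated in place."""
    conn.execute(
        "INSERT INTO order_audit "
        "(order_id, event_type, from_status, to_status, service, recorded_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (order_id, event_type, from_status, to_status, service,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def last_successful_step(conn, order_id):
    """Most recent status reached before any failure, or None."""
    row = conn.execute(
        "SELECT to_status FROM order_audit "
        "WHERE order_id = ? AND to_status <> 'Failed' "
        "ORDER BY id DESC LIMIT 1",
        (order_id,),
    ).fetchone()
    return row[0] if row else None


def failure_trigger(conn, order_id):
    """Event that first moved the order to 'Failed', or None."""
    row = conn.execute(
        "SELECT event_type FROM order_audit "
        "WHERE order_id = ? AND to_status = 'Failed' "
        "ORDER BY id LIMIT 1",
        (order_id,),
    ).fetchone()
    return row[0] if row else None
```

Because rows are only ever appended, the table doubles as a timeline: ordering by `id` reconstructs the exact sequence of transitions for any order.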
Observability with Prometheus and Grafana
Observability was treated as a first-class concern in the architecture.
Prometheus collects metrics exposed by participating services.
Grafana visualizes these metrics through dashboards.
Metrics include:
Request throughput
Retry activity
Service health
Failed event counts
Queue behavior
[Figure: Grafana dashboard displaying service-level metrics and HTTP request activity across the order fulfillment workflow.]
[Figure: Prometheus target health confirming that metrics are being collected successfully from all participating services.]
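To show what a scraped metric looks like, the Python sketch below hand-rolls a labeled counter and renders it in the Prometheus text exposition format. This is purely illustrative: a real service would use a Prometheus client library rather than this hypothetical `LabeledCounter` class, and the metric name `order_events_failed_total` is an assumption, not one taken from the project.

```python
from collections import defaultdict


class LabeledCounter:
    """Tiny counter that renders itself in Prometheus text exposition format."""

    def __init__(self, name: str, help_text: str, label_names: tuple) -> None:
        self.name = name
        self.help_text = help_text
        self.label_names = label_names
        self._values = defaultdict(float)  # label values tuple -> count

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        # Every declared label must be supplied on each increment.
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self._values):
            labels = ",".join(
                f'{name}="{value}"' for name, value in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{labels}}} {self._values[key]}")
        return "\n".join(lines) + "\n"
```

Labels such as `service` and `event_type` are what allow a Grafana panel to break failed-event counts down per service instead of showing a single total.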
Lessons Learned
A few recurring lessons emerged while building this architecture.
Duplicate Events Are Normal
Consumers should always assume events may arrive more than once.
Business Failures Are Not System Failures
A declined payment and a service outage require different responses.
Eventual Consistency Requires Planning
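Services may temporarily disagree about the current state of an order.
For example, the Order Service may still show an order as pending for a moment after the Payment Service has recorded the charge.
This is often an acceptable tradeoff for reduced coupling.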
Observability Should Be Built Early
Metrics, logs, and audit trails become critical once systems move beyond development environments.
Conclusion
Distributed systems become difficult when failures occur.
Retries, duplicate deliveries, unavailable services, and partial workflow completion are realities of production environments.
By combining RabbitMQ, Redis, SQL Server, and the Saga Pattern, it is possible to build workflows that remain reliable even when individual services fail.
The architecture discussed in this article follows a simple principle:
Failures should be expected, visible, and recoverable.
That principle often makes the difference between a system that works in development and one that remains reliable in production.
GitHub Repository
https://github.com/Rumman90/resilient-order-fulfillment-architecture




