Modern order processing systems rarely exist as a single application.

As systems evolve, order management, inventory, payment processing, fulfillment, shipping, and customer notifications are often separated into independent services.

While this improves scalability and team autonomy, it also introduces a new challenge:

How do you maintain a reliable workflow when multiple services participate in a single business process?

In this article, I'll walk through the architecture of an event-driven order fulfillment system built using .NET, RabbitMQ, Redis, SQL Server, Prometheus, and Grafana.

The focus is not on building another microservice demo.

The focus is on handling failures, duplicate messages, retries, observability, and eventual consistency in a distributed environment.

The Problem

Consider a simple order workflow:

Create Order
    ↓
Reserve Inventory
    ↓
Process Payment
    ↓
Create Fulfillment Request
    ↓
Generate Shipment
    ↓
Send Notification

The happy path is easy.

The real challenge begins when failures occur.

Questions such as these quickly appear:

What happens if payment succeeds but shipment creation fails?
What happens if a service becomes temporarily unavailable?
What happens if RabbitMQ delivers the same event multiple times?
How do we prevent partial workflow completion?

These are common problems in distributed systems and require architectural decisions beyond basic CRUD operations.

Architecture Overview

High-level architecture showing event-driven communication between independent services.

The architecture consists of the following services:

Order Service
Inventory Service
Payment Service
Fulfillment Service
Shipping Service
Notification Service

Supporting infrastructure:

RabbitMQ
Redis
SQL Server
Prometheus
Grafana

Each service owns a specific business capability and communicates through asynchronous events.

Rather than making direct service-to-service calls throughout the workflow, services react to business events and publish new events when processing completes.

This reduces coupling and allows services to evolve independently.

Why Event-Driven Communication?

Direct service dependencies often create cascading failures.

For example:

Order Service
    ↓
Inventory Service
    ↓
Payment Service
    ↓
Shipping Service

When services call each other synchronously in a chain like this, an outage in any downstream dependency can block or fail every request that passes through it.

Using RabbitMQ allows services to communicate asynchronously.

Benefits include:

Reduced coupling
Better fault isolation
Independent scaling
Improved resilience
Easier workflow extension

A new service can subscribe to existing events without requiring modifications to existing services.
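To make the subscription idea concrete, here is a minimal in-memory sketch. It is written in Python for brevity and is not the repository's .NET code. `InMemoryBus` is a hypothetical stand-in for a broker exchange, and the analytics subscriber is an invented example of a service added later.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class InMemoryBus:
    """Hypothetical stand-in for a broker exchange; this is not RabbitMQ's API."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Each subscriber receives its own copy of the event payload.
        for handler in list(self._handlers[event_type]):
            handler(dict(payload))


bus = InMemoryBus()
log: list[str] = []


# Existing subscriber: reacts to OrderCreated and publishes the next event.
def reserve_inventory(event: dict) -> None:
    log.append(f"inventory reserved for {event['order_id']}")
    bus.publish("InventoryReserved", event)


# Subscriber added later (invented for this example).
def record_analytics(event: dict) -> None:
    log.append(f"analytics recorded for {event['order_id']}")


bus.subscribe("OrderCreated", reserve_inventory)
bus.subscribe("OrderCreated", record_analytics)  # no existing code was edited

bus.publish("OrderCreated", {"order_id": "order-1"})
```

The publisher never learns how many subscribers exist, which is what allows the workflow to be extended without touching existing services.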

Using the Saga Pattern

Maintaining consistency across multiple services is one of the most common distributed systems challenges.

Traditional distributed transactions, such as two-phase commit, are often avoided because of their complexity and operational overhead.

Instead, this architecture uses the Saga Pattern.

Each service performs its own local transaction and publishes an event for the next stage of the workflow.

Successful flow:

OrderCreated
    ↓
InventoryReserved
    ↓
PaymentProcessed
    ↓
FulfillmentCreated
    ↓
ShipmentGenerated

Failure flow:

PaymentFailed
    ↓
InventoryReleased
    ↓
OrderMarkedFailed

The workflow remains eventually consistent without requiring a distributed transaction coordinator: when a step fails, compensating events undo the steps that already completed.

Order lifecycle coordinated through asynchronous business events.
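The two flows above can be sketched as a set of handlers that each perform a local change and return the next event. This is an illustrative Python sketch, not the repository's .NET implementation. The in-memory `stock` and `orders` dictionaries, the field names, and the rule that amounts above 100 are declined are all assumptions made for the example.

```python
from collections import deque

# Hypothetical in-memory state; each dict stands in for one service's own database.
stock = {"sku-1": 5}
orders = {}


def on_order_created(e):
    orders[e["order_id"]] = "Pending"
    stock[e["sku"]] -= e["qty"]  # local transaction: reserve stock
    return [("InventoryReserved", e)]


def on_inventory_reserved(e):
    # Assumption for illustration only: amounts above 100 are declined.
    if e["amount"] > 100:
        return [("PaymentFailed", e)]
    return [("PaymentProcessed", e)]


def on_payment_processed(e):
    orders[e["order_id"]] = "Paid"
    return []  # FulfillmentCreated and ShipmentGenerated would follow here


def on_payment_failed(e):
    stock[e["sku"]] += e["qty"]  # compensation: release the reservation
    return [("InventoryReleased", e)]


def on_inventory_released(e):
    orders[e["order_id"]] = "Failed"
    return [("OrderMarkedFailed", e)]


HANDLERS = {
    "OrderCreated": on_order_created,
    "InventoryReserved": on_inventory_reserved,
    "PaymentProcessed": on_payment_processed,
    "PaymentFailed": on_payment_failed,
    "InventoryReleased": on_inventory_released,
}


def run_saga(first_event):
    """Deliver events until none are left; return the event names in order."""
    queue, trace = deque([first_event]), []
    while queue:
        name, payload = queue.popleft()
        trace.append(name)
        handler = HANDLERS.get(name)
        if handler:
            queue.extend(handler(payload))
    return trace


ok = run_saga(("OrderCreated", {"order_id": "o-1", "sku": "sku-1", "qty": 2, "amount": 40}))
failed = run_saga(("OrderCreated", {"order_id": "o-2", "sku": "sku-1", "qty": 1, "amount": 500}))
```

After both runs, the stock reserved for the failed order has been returned, and only the successful order still holds its reservation.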

Separating Business Failures from Technical Failures

One design decision that significantly simplified the architecture was separating business failures from technical failures.

Business Failure

Example:

Payment Declined

Result:

Inventory Released
Order Marked Failed

Technical Failure

Example:

RabbitMQ Consumer Exception

Result:

Retry Processing
Move Message To DLQ If Retries Are Exhausted
Operational Investigation

A declined payment is a valid business outcome.

A processing exception is not.

Treating them differently leads to simpler workflows and clearer operational processes.
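One way to express this split in code is to return business outcomes as ordinary events and reserve exceptions for technical failures. The Python sketch below is illustrative only. `TransientError`, `charge`, `handle`, and the three gateway functions are hypothetical names, not part of the repository or of any payment library.

```python
class TransientError(Exception):
    """Technical failure: the attempt broke and may succeed if retried."""


def charge(order, gateway):
    """Return the business event to publish for this order.

    `gateway` is a hypothetical callable that returns True (approved) or
    False (declined), or raises TransientError when it cannot answer.
    """
    approved = gateway(order)
    return "PaymentProcessed" if approved else "PaymentFailed"


def handle(order, gateway):
    try:
        # A decline is a normal result, so it is published like any other event.
        return ("publish", charge(order, gateway))
    except TransientError:
        # No business decision was reached; leave the message to retry/DLQ handling.
        return ("retry", None)


def approving(order):
    return True


def declining(order):
    return False


def timing_out(order):
    raise TransientError("gateway timeout")


results = [handle({"order_id": "o-1"}, g) for g in (approving, declining, timing_out)]
```

Only the third case reaches the retry path. The decline flows through the saga as `PaymentFailed` and triggers compensation instead.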

Handling Duplicate Messages with Redis

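When consumers use manual acknowledgements and publishers use confirms, RabbitMQ provides at-least-once delivery.

This means a message is not lost once the broker has accepted it, but it may be delivered more than once.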

Common causes include:

Consumer restarts
Retry processing
Network interruptions
Broker recovery

Without protection, duplicate deliveries can result in duplicate business actions.

Examples:

Duplicate shipment creation
Duplicate inventory updates
Duplicate payment processing

Redis is used as an idempotency store.

Before processing an event, a consumer checks whether the event's identifier has already been recorded in Redis.

If the identifier exists, the event has already been handled and processing is skipped safely.

Redis storing processed event identifiers used for idempotency checks.
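A minimal sketch of the check follows, in Python with an in-memory stand-in for Redis. `FakeRedis` mimics only the behaviour of Redis `SET` with the `NX` and `EX` options (set if absent, with an expiry). The `processed:` key prefix, the 24-hour TTL, and the `event_id` field are assumptions for illustration. Claiming the key in a single atomic step avoids the race in which two consumers both check, both find nothing, and both process.

```python
import time


class FakeRedis:
    """In-memory stand-in that mimics only SET with NX and EX, plus DEL."""

    def __init__(self):
        self._data = {}

    def set_if_absent(self, key, value, ttl_seconds):
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # key exists and has not expired
        self._data[key] = (value, now + ttl_seconds)
        return True

    def delete(self, key):
        self._data.pop(key, None)


def handle_once(store, event, action, ttl_seconds=86400):
    """Run `action` at most once per event_id; return False for duplicates."""
    key = f"processed:{event['event_id']}"
    if not store.set_if_absent(key, "1", ttl_seconds):
        return False  # duplicate delivery: skip
    try:
        action(event)
    except Exception:
        store.delete(key)  # release the claim so a retry can run
        raise
    return True


store = FakeRedis()
shipments = []
event = {"event_id": "evt-42", "order_id": "o-1"}

first = handle_once(store, event, lambda e: shipments.append(e["order_id"]))
second = handle_once(store, event, lambda e: shipments.append(e["order_id"]))
```

Releasing the claim on failure keeps retries possible, at the cost of a small window in which a crash between the action and the acknowledgement could still produce a repeat. Actions with external side effects may need their own idempotency keys as well.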

Retry Strategy and Dead-Letter Queues

Failures are expected.

The goal is not to eliminate failures.

The goal is to recover safely.

When event processing fails:

Retry processing, up to a fixed number of attempts
Continue the workflow if a retry succeeds
Move the message to the DLQ if retries are exhausted

Dead-letter queues provide:

Visibility
Replay capability
Operational recovery
Failure investigation

Failed messages should never disappear silently.

RabbitMQ queues and dead-letter queues supporting resilient message processing.
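The retry-then-dead-letter sequence can be sketched as follows. This is an illustrative Python sketch, not the repository's code: `process_with_retry` is a hypothetical helper, the limit of three attempts is an assumption, and a plain list stands in for the dead-letter queue. A production consumer would usually also wait between attempts, which is omitted here.

```python
def process_with_retry(message, handler, dead_letters, max_attempts=3):
    """Return the attempt number that succeeded, or None if dead-lettered."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return attempt
        except Exception as exc:  # technical failure: try again
            last_error = exc
    # Retries exhausted: keep the message and the reason instead of dropping it.
    dead_letters.append(
        {"message": message, "error": repr(last_error), "attempts": max_attempts}
    )
    return None


dlq = []
calls = {"n": 0}


def flaky(message):
    # Fails on the first two calls, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary outage")


def always_down(message):
    raise RuntimeError("service down")


recovered_on = process_with_retry({"event_id": "evt-1"}, flaky, dlq)
gave_up = process_with_retry({"event_id": "evt-2"}, always_down, dlq)
```

The dead-lettered entry keeps the original message together with the last error, which is what makes later replay and investigation possible.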

Auditability

As workflows become more distributed, understanding historical state transitions becomes increasingly important.

Common operational questions include:

Why was an order cancelled?
Which event triggered the failure?
What was the last successful processing step?

To support troubleshooting and auditing, important workflow transitions are stored in SQL Server.

Audit records showing workflow state transitions across services.
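As a rough illustration of what such an audit trail might look like, the sketch below uses Python's built-in `sqlite3` module as a stand-in for SQL Server. The `order_audit` table, its columns, and the status values are assumptions for the example and are not taken from the repository's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for SQL Server in this sketch
conn.execute(
    """
    CREATE TABLE order_audit (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        order_id TEXT NOT NULL,
        event_type TEXT NOT NULL,
        from_status TEXT,
        to_status TEXT NOT NULL,
        service TEXT NOT NULL,
        recorded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def record_transition(order_id, event_type, from_status, to_status, service):
    """Append one row per state transition; rows are never updated."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO order_audit "
            "(order_id, event_type, from_status, to_status, service) "
            "VALUES (?, ?, ?, ?, ?)",
            (order_id, event_type, from_status, to_status, service),
        )


def history(order_id):
    """Return (event_type, to_status) pairs in the order they were recorded."""
    rows = conn.execute(
        "SELECT event_type, to_status FROM order_audit "
        "WHERE order_id = ? ORDER BY id",
        (order_id,),
    )
    return rows.fetchall()


record_transition("o-2", "OrderCreated", None, "Pending", "order")
record_transition("o-2", "PaymentFailed", "Pending", "PaymentDeclined", "payment")
record_transition("o-2", "InventoryReleased", "PaymentDeclined", "Failed", "inventory")

trail = history("o-2")
```

Reading the trail from top to bottom answers the questions listed earlier: which event triggered the failure, and which step was the last to succeed.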

Observability with Prometheus and Grafana

Observability was treated as a first-class concern in the architecture.

Prometheus collects metrics exposed by participating services.

Grafana visualizes these metrics through dashboards.

Metrics include:

Request throughput
Retry activity
Service health
Failed event counts
Queue behavior

Grafana dashboard displaying service-level metrics and HTTP request activity across the order fulfillment workflow.

Prometheus target health confirming that metrics are being collected successfully from all participating services.
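For a sense of what a service exposes, the sketch below builds a failed-event counter by hand and renders it in the Prometheus text exposition format. It is a Python illustration only. The metric name `failed_events_total` and its labels are assumptions, and a real service would normally use a Prometheus client library and serve this text from a `/metrics` endpoint.

```python
from collections import Counter

# (service, event_type) -> number of failed events seen so far
failed_events = Counter()


def record_failure(service, event_type):
    failed_events[(service, event_type)] += 1


def render_metrics():
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP failed_events_total Events that failed processing.",
        "# TYPE failed_events_total counter",
    ]
    for (service, event_type), value in sorted(failed_events.items()):
        lines.append(
            f'failed_events_total{{service="{service}",event_type="{event_type}"}} {value}'
        )
    return "\n".join(lines) + "\n"


record_failure("payment", "InventoryReserved")
record_failure("payment", "InventoryReserved")
record_failure("shipping", "FulfillmentCreated")

metrics_text = render_metrics()
```

Labelling by service and event type lets a dashboard break failed-event counts down per workflow stage.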

Lessons Learned

A few recurring lessons emerged while building this architecture.

Duplicate Events Are Normal

Consumers should always assume events may arrive more than once.

Business Failures Are Not System Failures

A declined payment and a service outage require different responses.

Eventual Consistency Requires Planning

Services may temporarily disagree about the current state of an order.

This is often an acceptable tradeoff for reduced coupling.

Observability Should Be Built Early

Metrics, logs, and audit trails become critical once systems move beyond development environments.

Conclusion

Distributed systems become difficult when failures occur.

Retries, duplicate deliveries, unavailable services, and partial workflow completion are realities of production environments.

By combining RabbitMQ, Redis, SQL Server, and the Saga Pattern, it is possible to build workflows that remain reliable even when individual services fail.

The architecture discussed in this article follows a simple principle:

Failures should be expected, visible, and recoverable.

That principle often makes the difference between a system that works in development and one that remains reliable in production.

GitHub Repository

https://github.com/Rumman90/resilient-order-fulfillment-architecture

Designing a Resilient Order Fulfillment Architecture with Saga Pattern, RabbitMQ, Redis, and SQL Server
