What is a distributed transaction?
Consider an application composed of microservices, where each microservice has its own database:

In this example:

All these operations have to happen as part of one transaction, and if one of the operations fails, all previous operations have to be rolled back.
A database transaction provides atomicity across multiple operations, but only within a single database. Here the operations span multiple databases.
So what we need here is a distributed transaction.
The separate databases do not provide a distributed transaction for us; we have to write our services in such a way that together they implement one.
What are the different patterns for distributed transactions?

What is the Saga Pattern?
The saga design pattern is a way to manage data consistency across microservices in distributed transaction scenarios. A saga is a sequence of transactions that updates each service and publishes a message or event to trigger the next transaction step. If a step fails, the saga executes compensating transactions that counteract the preceding transactions.
The first transaction in a saga is initiated by an external request corresponding to the system operation, and then each subsequent step is triggered by the completion of the previous one.
Subsequent services receive an event over the messaging bus telling them to perform their operation. If any service fails to perform its operation, it publishes a failure event. The services that have already performed their operations listen for that event and roll back their own work in response.
In a very high-level design a saga pattern implementation would look like the following:
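One way to make this concrete is a minimal in-process sketch in Python. It assumes each step pairs a local transaction with a compensating transaction. The `SagaStep` and `run_saga` names and the order, payment, and stock steps are hypothetical, and direct function calls stand in for the messages a real saga would exchange.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]      # the local transaction of one service
    compensate: Callable[[], None]  # the transaction that undoes it


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # The failed step made no change, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
    return True


log: List[str] = []


def reserve_stock() -> None:
    # Hypothetical failure used to trigger the compensations.
    raise RuntimeError("out of stock")


steps = [
    SagaStep("order", lambda: log.append("order created"),
             lambda: log.append("order cancelled")),
    SagaStep("payment", lambda: log.append("payment taken"),
             lambda: log.append("payment refunded")),
    SagaStep("stock", reserve_stock, lambda: log.append("stock released")),
]

succeeded = run_saga(steps)
print(succeeded, log)
```

Compensations run in reverse order so that the most recent change is undone first, which is a common convention rather than a requirement of the pattern.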

Why use Saga when we have two-phase commit?
Distributed transaction protocols such as two-phase commit (2PC) require all participants in a transaction to commit or roll back before the transaction can proceed. However, some participant implementations, such as NoSQL databases and message brokers, don’t support this model.
How to implement the Saga Pattern?
There are two approaches, both described below: Events/Choreography and Command/Orchestration.
What is the Events/Choreography approach for implementing the Saga Pattern?
In the Choreography approach, participants exchange events without a centralized point of control.
The first service executes a transaction and then publishes an event. One or more services listen for this event, execute their own local transactions, and may or may not publish new events.
The distributed transaction ends when the last service executes its local transaction and does not publish any events or the event published is not heard by any of the saga’s participants.
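The flow described above can be sketched with an in-memory event bus. This is only an illustration under stated assumptions: `EventBus` stands in for a real message broker, and the event names (`ORDER_CREATED_EVENT`, `ORDER_BILLED_EVENT`, `ORDER_PREPARED_EVENT`) and handler functions are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """In-memory stand-in for a message broker (assumption for illustration)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []  # every event type published, in order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        self.history.append(event_type)
        for handler in list(self._handlers.get(event_type, [])):
            handler(event)


bus = EventBus()
actions: List[str] = []


def create_order(order_id: str) -> None:
    # Order service: first local transaction, started by an external request.
    actions.append("order created")
    bus.publish("ORDER_CREATED_EVENT", {"order_id": order_id})


def on_order_created(event: Event) -> None:
    # Payment service: reacts to the event, then publishes its own.
    actions.append("payment charged")
    bus.publish("ORDER_BILLED_EVENT", event)


def on_order_billed(event: Event) -> None:
    # Stock service: last step; nobody subscribes to the event it publishes.
    actions.append("stock reserved")
    bus.publish("ORDER_PREPARED_EVENT", event)


bus.subscribe("ORDER_CREATED_EVENT", on_order_created)
bus.subscribe("ORDER_BILLED_EVENT", on_order_billed)

create_order("order-1")
print(bus.history)
```

No component knows the whole workflow: each service only knows which event it reacts to and which event it publishes, and the saga ends when the last event has no listeners.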
[private]

[/private]
How are rollbacks performed under the Events/Choreography approach for implementing the Saga Pattern?
Rolling back a distributed transaction does not come for free. For every operation that has already completed, you have to implement a compensating transaction that undoes it.
[private]
Suppose that Stock Service has failed during a transaction. Let’s see what the rollback would look like:

- Stock Service produces PRODUCT_OUT_OF_STOCK_EVENT.
- Both Order Service and Payment Service listen for that event:
  - Payment Service refunds the client.
  - Order Service sets the order state to failed.
Note that it is crucial to define a common shared ID for each transaction, so that whenever you publish an event, all listeners know right away which transaction it refers to.
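The rollback steps above can be sketched as follows. This is a simplified illustration: the `subscribers` dictionary stands in for a message broker, the state dictionaries stand in for each service's database, and the `tx-1`/`tx-2` IDs and function names are hypothetical. Two transactions are in flight to show why the shared ID matters.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Event type -> handlers; an in-memory stand-in for a message broker.
subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

orders: Dict[str, str] = {}    # Order Service state, keyed by transaction ID
payments: Dict[str, str] = {}  # Payment Service state, keyed by transaction ID


def publish(event_type: str, transaction_id: str) -> None:
    for handler in subscribers[event_type]:
        handler(transaction_id)


def refund_client(transaction_id: str) -> None:
    # Payment Service compensation: refund only if a charge was made.
    if payments.get(transaction_id) == "charged":
        payments[transaction_id] = "refunded"


def fail_order(transaction_id: str) -> None:
    # Order Service compensation: mark the order as failed.
    orders[transaction_id] = "failed"


subscribers["PRODUCT_OUT_OF_STOCK_EVENT"].append(refund_client)
subscribers["PRODUCT_OUT_OF_STOCK_EVENT"].append(fail_order)

# State left behind by the earlier steps of two concurrent transactions.
orders.update({"tx-1": "pending", "tx-2": "pending"})
payments.update({"tx-1": "charged", "tx-2": "charged"})

# Stock Service fails for tx-1 only; tx-2 must stay untouched.
publish("PRODUCT_OUT_OF_STOCK_EVENT", "tx-1")
print(orders, payments)
```

Because every handler receives the transaction ID, each service compensates only the transaction that failed.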
[/private]
What are the benefits and drawbacks of using Saga’s Event/Choreography design?
Benefits:
- Good for simple workflows that require few participants and don’t need a coordination logic.
- Doesn’t require additional service implementation and maintenance.
- Doesn’t introduce a single point of failure, since the responsibilities are distributed across the saga participants.
Drawbacks:
- Workflow can become confusing when adding new steps, as it’s difficult to track which saga participants listen to which commands.
- There’s a risk of cyclic dependency between saga participants because they have to consume each other’s commands.
- Integration testing is difficult because all services must be running to simulate a transaction.
[private]
What is the Command/Orchestration approach for implementing the Saga Pattern?
Orchestration is a way to coordinate sagas where a centralized controller tells the saga participants what local transactions to execute. The saga orchestrator handles all the transactions and tells the participants which operation to perform based on events.
The orchestrator executes saga requests, stores and interprets the states of each task, and handles failure recovery with compensating transactions.

How are rollbacks performed under the Command/Orchestration approach for implementing the Saga Pattern?

- Stock Service replies to the orchestrator (OSO) with an Out-Of-Stock message.
- OSO recognizes that the transaction has failed and starts the rollback.
- In this case, only a single operation was executed successfully before the failure, so OSO sends a Refund Client command to Payment Service and sets the order state to failed.
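A minimal sketch of this rollback, under some assumptions: synchronous method calls stand in for the command and reply messages, and the class names and every command or reply name other than Refund Client and Out-Of-Stock (for example `EXECUTE_PAYMENT` and `PREPARE_ORDER`) are hypothetical.

```python
from typing import Dict, List, Tuple


class PaymentService:
    def handle(self, command: str, tx_id: str) -> str:
        if command == "EXECUTE_PAYMENT":
            return "PAYMENT_EXECUTED"
        if command == "REFUND_CLIENT":
            return "CLIENT_REFUNDED"
        raise ValueError(f"unknown command: {command}")


class StockService:
    def __init__(self, in_stock: bool) -> None:
        self.in_stock = in_stock

    def handle(self, command: str, tx_id: str) -> str:
        if command == "PREPARE_ORDER":
            return "ORDER_PREPARED" if self.in_stock else "OUT_OF_STOCK"
        raise ValueError(f"unknown command: {command}")


class SagaOrchestrator:
    """Central controller: sends commands and decides what to do with replies."""

    def __init__(self, payment: PaymentService, stock: StockService) -> None:
        self.payment = payment
        self.stock = stock
        self.order_state: Dict[str, str] = {}
        self.sent: List[Tuple[str, str]] = []  # (participant, command) log

    def _send(self, name: str, participant, command: str, tx_id: str) -> str:
        self.sent.append((name, command))
        return participant.handle(command, tx_id)

    def run(self, tx_id: str) -> str:
        self.order_state[tx_id] = "pending"
        self._send("payment", self.payment, "EXECUTE_PAYMENT", tx_id)
        reply = self._send("stock", self.stock, "PREPARE_ORDER", tx_id)
        if reply == "OUT_OF_STOCK":
            # Only the payment succeeded, so it is the only step to compensate.
            self._send("payment", self.payment, "REFUND_CLIENT", tx_id)
            self.order_state[tx_id] = "failed"
        else:
            self.order_state[tx_id] = "completed"
        return self.order_state[tx_id]


oso = SagaOrchestrator(PaymentService(), StockService(in_stock=False))
result = oso.run("tx-1")
print(result, oso.sent)
```

The participants never talk to each other here; all workflow and rollback logic sits in the orchestrator, which is what makes it both easier to follow and an additional component to keep running.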
What are the benefits and drawbacks of using Saga’s Command/Orchestration design?
Benefits:
- Good for complex workflows involving many participants or new participants added over time.
- Doesn’t introduce cyclical dependencies, because the orchestrator unilaterally depends on the saga participants.
Drawbacks:
- Additional design complexity, because the coordination logic has to be implemented.
- There’s an additional point of failure, because the orchestrator manages the complete workflow.
What are the crucial points to keep in mind while implementing Saga Pattern?
- The saga pattern is particularly hard to debug, and the complexity grows as participants increase.
- Create a Unique Id per Transaction: Having a unique identifier for each transaction is a common technique for traceability, but it also helps participants to have a standard way to request data from each other.
- For the Orchestrator pattern, add the reply address within the command: Instead of designing your participants to reply to a fixed address, consider sending the reply address within the message. This way your participants can reply to multiple orchestrators.
- Idempotent Operations: If you are using queues for communication between services (like SQS, Kafka, RabbitMQ, etc.), many of them can deliver the same message more than once, so each operation should be safe to repeat. Idempotency also increases the fault tolerance of your service: quite often a bug in a client triggers or replays unwanted messages that would otherwise corrupt your database.
- Avoiding Synchronous Communications: As the transaction progresses, include in each message all the data needed for the next operation to be executed. The whole goal is to avoid synchronous calls between the services just to request more data. This lets your services execute their local transactions even when other services are offline. The downside is that your orchestrator will be slightly more complex, as it needs to manipulate the requests and responses of each step, so be aware of the tradeoffs.
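Several of the points above can be combined in one small sketch of a participant. It is an illustration under assumptions, not a definitive design: the message fields (`message_id`, `transaction_id`, `reply_to`, `client_id`, `amount`), the `PaymentHandler` class, and the address string are all hypothetical, and an in-memory set stands in for a durable record of processed messages.

```python
from typing import Dict, Set


class PaymentHandler:
    def __init__(self) -> None:
        self.balances: Dict[str, int] = {"client-1": 100}
        # In a real service this record would be stored durably, in the same
        # local transaction as the balance update.
        self._processed: Set[str] = set()

    def handle_charge(self, message: dict) -> dict:
        """Apply the charge at most once, then reply to the given address."""
        if message["message_id"] not in self._processed:
            # The message carries all data needed, so no call to another
            # service is required to look up the client or the amount.
            self.balances[message["client_id"]] -= message["amount"]
            self._processed.add(message["message_id"])
        return {
            "to": message["reply_to"],  # reply address supplied by the sender
            "transaction_id": message["transaction_id"],
            "status": "CHARGED",
        }


handler = PaymentHandler()
message = {
    "message_id": "m-1",
    "transaction_id": "tx-1",
    "reply_to": "orchestrator-a.replies",
    "client_id": "client-1",
    "amount": 30,
}

first_reply = handler.handle_charge(message)
second_reply = handler.handle_charge(message)  # simulated duplicate delivery
print(handler.balances, first_reply)
```

The duplicate delivery produces the same reply but does not charge the client a second time, which is the property that makes redelivery by the queue harmless.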
[/private]