What is distributed transaction?
Consider the following Microservices architecture for e-commerce website. Each of the service shown below has its own database.
In this example, when an order is placed:
- Order is saved by order service in its DB
- Customer payment data is updated by Payment service and saved in Payment DB
- Stock is checked by stock serve and updated in its own DB
- And delivery details are updated by delivery service and updated its own DB.
All these operations have to happen as part of one transaction, and if one of the operations fails, all previous operations have to rollback.
Transactions provide atomicity of multiple operations but on a single database. Here, this is not a single database, but multiple databases.
So what we need here is Distributed Transaction.
Distributed Transaction is not about some mechanism present in DB so they can manage transactions across multiple databases. But it is about how we can write our services so they take care of the transaction feature we need across multiple services/databases.
How transaction behaves in monolithic architecture?
- In monolithic transaction is not distributed.
- If something fails, say while updating order, all previous actions will be rollback.
What is 2 Phase commit?
Two-phase commit protocol is used for distributed transaction. It breaks a database commits into two phases.
In this pattern, to take care of distributed transactions we need a separate service called Coordinator Service (a separate service). This service manages the commits and rollbacks to all microservices involved in a distributed transactions.
Can you explain the two phases of 2-phase commit protocol?
- Transaction starts with the coordinator sending a “PREPARE” message to all slaves (micro-services)
- Coordinator generally sends all the nodes suggested value/change proposed by the client and solicit their response. The participants execute the transaction up to the point where they will be asked to commit.
- Consider records being locked at this point of time on which transactions need to be committed.
- No other request can modify the same row.
- Each slave (micro-service) responds by sending a “READY” message back to the coordinator.
- If a slave responds with a “NOT READY” message or does not respond at all, then the coordinator sends a global “ABORT” message to all the other slaves.
- Only upon receiving an acknowledgment from all the slaves that the transaction has been aborted does the coordinator consider the entire transaction aborted.
- Once the transaction coordinator has received the “READY” message from all the slaves, it sends a “COMMIT” message to all of them, which contains the details of the transaction that needs to be stored in the databases.
- Each slave applies the transaction and returns a “DONE” acknowledgment message back to the coordinator.
- The coordinator considers the entire transaction to be completed once it receives a “DONE” message from all the slaves.
What are Pros and Cons of 2 Phase commit?
- The protocol makes the data consistent and available, either all the databases get an update or none do.
- The greatest disadvantage of the two-phase commit protocol is that it is a blocking protocol, the failure of a single node blocks progress until the node recovers:
- Consider one of the Service-A fails and not able to reply (in either Prepare or Commit Phase), the coordinator will keep waiting for Service-A to respond.
- Consider after Prepare phase is over, the coordinator fails before the commit phase. Now nodes have locked the data and are waiting for commit or rollback command from coordination, but they will never receive it.
- Timeouts should be in place for nodes to check till what time they will keep the data locked or keep waiting for the coordinator to send commit message, and vice versa coordinator should have the timeout to keep a check on time till what it will wait for a reply to come from nodes.
- If a participant node fails during the commit phase, the system is left to lurch in the dark because the coordinator doesn’t know whether the participant failed after committing or before committing.
- The protocol’s latency depends on the slowest node. Since it waits for all the nodes to send acknowledgment messages, a single slow node will slow down the entire transaction.
- Coordination is over HTTP, which is slow.
Even if bring timeouts as a solution to any of issue, there is still latency getting in. Resources are getting blocked and no one else can use them.
Can you give example, why a Slave will send Not Ready message back?
This may happen when node has conflicting concurrent transaction or Node have some internal issue going on.