This project implements a fault-tolerant distributed transaction processing system designed to support a simple banking application.
The system incorporates key distributed systems concepts, including sharding, replication, the Paxos consensus algorithm, and the two-phase commit (2PC) protocol, to achieve scalability, fault tolerance, and correctness in transaction handling.
Data is partitioned across multiple shards, and each shard is replicated on the servers within its cluster, so data remains available even if one server in a cluster fails.
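As a rough illustration of this layout, the sketch below maps data items to shards by contiguous ranges and groups replica servers by shard. The constants (3 shards of 3 replicas matching the 9-node deployment, a 9000-item keyspace) and the function names are assumptions for illustration, not the project's actual configuration.

```go
package main

import "fmt"

const (
	numShards        = 3    // assumed: 9 nodes = 3 shards x 3 replicas
	replicasPerShard = 3    // assumed replication factor
	numItems         = 9000 // hypothetical keyspace of data items
)

// shardFor maps a data item ID to its shard by contiguous range
// partitioning (items 1..3000 -> shard 0, 3001..6000 -> shard 1, ...).
func shardFor(item int) int {
	return (item - 1) / (numItems / numShards)
}

// replicasOf returns the server IDs holding a shard's replicas,
// assuming servers 0..8 are grouped consecutively by shard.
func replicasOf(shard int) []int {
	ids := make([]int, replicasPerShard)
	for i := range ids {
		ids[i] = shard*replicasPerShard + i
	}
	return ids
}

func main() {
	fmt.Println(shardFor(100), replicasOf(shardFor(100)))   // 0 [0 1 2]
	fmt.Println(shardFor(3500), replicasOf(shardFor(3500))) // 1 [3 4 5]
}
```

With this scheme, any server can route a request for an item to the correct replica group using only local arithmetic, with no lookup service required.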
![Screenshot 2025-01-19 at 10 50 25 PM](https://private-user-images.githubusercontent.com/32031518/404681822-1ea73657-ad37-4ae8-a378-97b6e457d86a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyOTAxODYsIm5iZiI6MTczOTI4OTg4NiwicGF0aCI6Ii8zMjAzMTUxOC80MDQ2ODE4MjItMWVhNzM2NTctYWQzNy00YWU4LWEzNzgtOTdiNmU0NTdkODZhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDE2MDQ0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTBiNjkyNGZhYTBiNjJkOWQ1MzgyYzI0OGU0NjgzMjU5NzkxNjQxYmM3ZjNjZDY0YWFkM2E1MzJkNDlmOGU0MjQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.Cc9W0qNzSiTKmMMm0Yl402HZU5-vXPhkZdHY_kXBHkA)
![Screenshot 2025-01-19 at 10 51 39 PM](https://private-user-images.githubusercontent.com/32031518/404681886-6ac177e2-c496-4fed-9e21-43f7d2cc3f9b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyOTAxODYsIm5iZiI6MTczOTI4OTg4NiwicGF0aCI6Ii8zMjAzMTUxOC80MDQ2ODE4ODYtNmFjMTc3ZTItYzQ5Ni00ZmVkLTllMjEtNDNmN2QyY2MzZjliLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDE2MDQ0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWY0OGRiYjMxMzAxMWJjNGM2ZDcyOGI5NDkzYTYzNTIzNTAxMTkzMzA0NDMzZmQxYWFhZDRhZTIxOWI1ZTk3MWImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.ecmCjnsND0BpqbUE_btppDXIcRa_WMxMSEhONgbwq38)
- Intra-Shard Transactions: Access data within a single shard and are processed using a modified Paxos protocol.
- Cross-Shard Transactions: Access data across multiple shards and are handled using the two-phase commit (2PC) protocol, with Paxos providing intra-shard consensus.
- Modified Paxos Protocol: Ensures fault-tolerant ordering and execution of transactions within a shard.
- Two-Phase Commit Protocol: Coordinates cross-shard transactions to maintain atomicity and consistency across clusters.
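The split between the two protocols can be sketched as a routing decision: if both items touched by a transfer live in the same shard, a single Paxos instance orders it; otherwise 2PC coordinates across shards, with each shard's Paxos group deciding that shard's prepare vote. The partitioning constant and function names below are illustrative assumptions.

```go
package main

import "fmt"

const itemsPerShard = 3000 // assumed contiguous range partitioning

func shardOf(item int) int { return (item - 1) / itemsPerShard }

// route picks the protocol for a two-item transfer: a single
// Paxos instance when both items live in one shard, 2PC across
// the involved shards otherwise (each shard's Paxos group
// decides that shard's prepare vote).
func route(sender, receiver int) string {
	if shardOf(sender) == shardOf(receiver) {
		return "intra-shard"
	}
	return "cross-shard"
}

func main() {
	fmt.Println(route(100, 200))  // both in shard 0 -> intra-shard
	fmt.Println(route(100, 4000)) // shards 0 and 1 -> cross-shard
}
```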
The system handles fail-stop failures and uses locking to control concurrent access. Transactions abort on insufficient balance, lock contention, or failure to reach a quorum during consensus.
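One common way to realize the abort-on-contention behavior is a non-blocking lock table: a transaction tries to acquire locks on all items it touches and aborts immediately if any is held, rather than waiting (which also sidesteps deadlock). The sketch below is a hypothetical version of such a table, not the project's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// LockTable is a hypothetical per-item lock table: a transaction
// must acquire locks on every item it touches before executing,
// and aborts immediately on contention instead of blocking.
type LockTable struct {
	mu     sync.Mutex
	locked map[int]bool
}

var ErrAbort = errors.New("abort: lock contention")

// TryAcquire atomically locks all items or none of them.
func (lt *LockTable) TryAcquire(items ...int) error {
	lt.mu.Lock()
	defer lt.mu.Unlock()
	for _, it := range items {
		if lt.locked[it] {
			return ErrAbort // another transaction holds this item
		}
	}
	for _, it := range items {
		lt.locked[it] = true
	}
	return nil
}

// Release frees the given items at commit or abort time.
func (lt *LockTable) Release(items ...int) {
	lt.mu.Lock()
	defer lt.mu.Unlock()
	for _, it := range items {
		delete(lt.locked, it)
	}
}

func main() {
	lt := &LockTable{locked: map[int]bool{}}
	fmt.Println(lt.TryAcquire(1, 2)) // <nil>
	fmt.Println(lt.TryAcquire(2, 3)) // abort: lock contention
	lt.Release(1, 2)
	fmt.Println(lt.TryAcquire(2, 3)) // <nil>
}
```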
Measures throughput (transactions per second) and latency (time to process a transaction).
- Consensus Protocols: Paxos and Multi-Paxos for intra-shard consistency.
- Atomicity and Consistency: Achieved using two-phase commit for cross-shard transactions.
- Fault Tolerance: Replication within clusters and recovery mechanisms like write-ahead logs (WAL).
- Concurrency Control: Two-phase locking to manage concurrent transactions and prevent conflicts.
- Scalability: Partitioning and replication ensure the system can handle large-scale datasets and distributed workloads.
- Distributed systems design: sharding, replication, consensus protocols.
- Implementation of fault-tolerant protocols like Paxos and 2PC.
- Concurrency handling with locks and conflict resolution.
- RPC Communications using net/rpc package.
The system also provides utility operations:
- Prints the balance of any data item across all servers.
- Logs committed transactions for debugging and auditing.
- Reports performance metrics for each node in the system.
The following output is from a tmux session running 9 nodes as separate processes on different ports.
The full specification document for this project can be found here.