This repository contains implementations of all the Labs from MIT's 6.824: Distributed Systems course. Each lab progressively deepens understanding of building fault-tolerant, parallel, and replicated systems using the Go programming language.
Goal: Build a simplified distributed MapReduce system that runs user-defined map and reduce tasks in parallel.
- Implemented Master–Worker coordination via Go RPC, handling dynamic task allocation and worker crashes.
- Supported fault recovery by reassigning tasks after timeout detection (a minimal sketch follows this list).
- Generated intermediate files using JSON encoding for deterministic reduce-phase aggregation.
- Achieved 100% pass rate on the parallelism, crash recovery, and correctness tests.
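A minimal sketch of the timeout-based reassignment idea, assuming a coordinator that records when each task was handed out. The `Task`, `TaskStatus`, and `Coordinator` shapes, the one-second poll, and the method names are illustrative, not this repository's exact code:

```go
package mr

import (
	"sync"
	"time"
)

// Illustrative bookkeeping only; type and field names here are
// hypothetical, not necessarily those used in this repository.
type TaskStatus int

const (
	Idle TaskStatus = iota
	InProgress
	Completed
)

type Task struct {
	ID        int
	Status    TaskStatus
	StartTime time.Time
}

type Coordinator struct {
	mu    sync.Mutex
	tasks []Task
}

// reapStragglers runs in a background goroutine and returns timed-out
// in-progress tasks to Idle so the next idle worker can pick them up.
func (c *Coordinator) reapStragglers(timeout time.Duration) {
	for {
		time.Sleep(time.Second)
		c.mu.Lock()
		for i := range c.tasks {
			t := &c.tasks[i]
			if t.Status == InProgress && time.Since(t.StartTime) > timeout {
				t.Status = Idle // assume the worker died; allow reassignment
			}
		}
		c.mu.Unlock()
	}
}
```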
- Designing distributed task scheduling under failure conditions.
- Managing concurrency with Go goroutines and synchronization primitives.
- Applying atomic file operations (`os.Rename`) to ensure crash-safe writes (see the sketch below).
- Gaining deep insight into the MapReduce paper through practical re-implementation.
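A sketch of the crash-safe intermediate-file write described above: encode the bucket as JSON into a temporary file, then `os.Rename` it into its final name. The helper name and the `mr-X-Y` naming scheme are illustrative assumptions:

```go
package mr

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"os"
)

// KeyValue mirrors the pair emitted by a map task.
type KeyValue struct {
	Key   string
	Value string
}

// writeIntermediate writes one reduce bucket crash-safely: encode to a
// temporary file first, then os.Rename it into place, so a reducer can
// never observe a partially written file.
func writeIntermediate(mapID, reduceID int, kvs []KeyValue) error {
	tmp, err := ioutil.TempFile(".", "mr-tmp-*")
	if err != nil {
		return err
	}
	enc := json.NewEncoder(tmp)
	for _, kv := range kvs {
		if err := enc.Encode(&kv); err != nil {
			tmp.Close()
			return err
		}
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename within the same directory is atomic, which is what makes
	// the write crash-safe.
	return os.Rename(tmp.Name(), fmt.Sprintf("mr-%d-%d", mapID, reduceID))
}
```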
Goal: Implement the Raft consensus protocol to maintain replicated logs and ensure consistent state across unreliable networks.
- Built leader election, log replication, and persistence mechanisms across simulated servers.
- Implemented all three parts of the lab:
- 2A: Leader election and heartbeat mechanism.
- 2B: Log replication and follower consistency.
- 2C: State persistence and recovery after crash or reboot.
- Verified correctness with 100% passing scores on all test suites (2A, 2B, 2C).
- Optimized election timeouts and RPC scheduling for predictable recovery and efficient consensus (a minimal election-ticker sketch follows this list).
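A minimal sketch of a randomized election ticker in the usual 6.824 style of a background goroutine per peer. The timeout range, field names, and string-based state are illustrative, not this repository's exact values:

```go
package raft

import (
	"math/rand"
	"sync"
	"time"
)

// Illustrative subset of Raft state; field names are hypothetical.
type Raft struct {
	mu            sync.Mutex
	state         string // "follower", "candidate", or "leader"
	lastHeartbeat time.Time
}

// electionTicker sleeps for a randomized timeout and starts an election
// if no heartbeat arrived in the meantime. Randomizing the timeout is
// what keeps peers from repeatedly splitting the vote.
func (rf *Raft) electionTicker() {
	for {
		timeout := 300*time.Millisecond +
			time.Duration(rand.Int63n(200))*time.Millisecond
		time.Sleep(timeout)

		rf.mu.Lock()
		expired := rf.state != "leader" && time.Since(rf.lastHeartbeat) >= timeout
		rf.mu.Unlock()

		if expired {
			rf.startElection()
		}
	}
}

func (rf *Raft) startElection() {
	// increment currentTerm, vote for self, reset lastHeartbeat, and
	// send RequestVote RPCs to all peers (omitted in this sketch)
}
```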
- Developed an in-depth understanding of distributed consensus and fault tolerance.
- Learned how to maintain replicated state machines that remain consistent under partial failure.
- Practiced lock management, concurrency control, and Go RPC message flow debugging.
- Experienced real-world reliability engineering: heartbeat intervals, election backoffs, and log compaction design trade-offs (the heartbeat loop is sketched below).
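A sketch of the corresponding leader-side heartbeat loop; the 100 ms interval and helper names are assumptions chosen only to illustrate the interval-versus-election-timeout trade-off:

```go
package raft

import (
	"sync"
	"time"
)

type Raft struct {
	mu       sync.Mutex
	state    string
	numPeers int
}

// heartbeatLoop is run while this peer is leader. The interval must sit
// well below the election timeout (here 100ms against 300-500ms) so
// healthy followers never time out, without flooding the test network.
func (rf *Raft) heartbeatLoop() {
	const interval = 100 * time.Millisecond
	for {
		rf.mu.Lock()
		isLeader := rf.state == "leader"
		n := rf.numPeers
		rf.mu.Unlock()

		if isLeader {
			for peer := 0; peer < n; peer++ {
				go rf.sendHeartbeat(peer)
			}
		}
		time.Sleep(interval)
	}
}

func (rf *Raft) sendHeartbeat(peer int) {
	// send an AppendEntries RPC with no new entries (omitted)
}
```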
Goal: Build a linearizable, fault-tolerant key/value storage service using Raft for replication, providing strong consistency guarantees.
- Implemented a replicated state machine architecture with KVServers backed by Raft consensus.
- Built two major components:
- 3A: Key/value service with linearizability and exactly-once semantics
- 3B: Log compaction via snapshotting to prevent unbounded memory growth
- Key features implemented:
- Client request deduplication using ClientID and sequence numbers for idempotency (see the apply-loop sketch after this list)
- Notification channels for efficient waiting on Raft commit confirmations
- Leader detection and retry logic with smart leader caching
- Snapshot installation with InstallSnapshot RPC for catching up lagging followers
- Conditional snapshot installation (`CondInstallSnapshot`) to prevent stale snapshot overwrites
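A minimal sketch of how deduplication and notification channels can fit together in the apply loop; the `Op`, `KVServer`, and channel-per-log-index shapes are illustrative assumptions rather than the lab's exact types:

```go
package kvraft

import "sync"

// Op and KVServer are illustrative shapes, not the lab's exact types.
type Op struct {
	ClientID int64
	SeqNum   int64
	Kind     string // "Get", "Put", or "Append"
	Key      string
	Value    string
}

type KVServer struct {
	mu       sync.Mutex
	db       map[string]string
	lastSeq  map[int64]int64 // highest sequence number applied per client
	notifyCh map[int]chan Op // log index -> channel the waiting RPC handler reads
}

// apply handles one committed Op coming out of Raft's apply channel.
// A duplicate request (same client, sequence number not newer than the
// last one applied) is skipped, which makes retried Put/Append calls
// exactly-once from the client's point of view.
func (kv *KVServer) apply(index int, op Op) {
	kv.mu.Lock()
	defer kv.mu.Unlock()

	if op.SeqNum > kv.lastSeq[op.ClientID] {
		switch op.Kind {
		case "Put":
			kv.db[op.Key] = op.Value
		case "Append":
			kv.db[op.Key] += op.Value
		}
		kv.lastSeq[op.ClientID] = op.SeqNum
	}

	// Wake the RPC handler (if any) waiting on this log index, without
	// blocking if it has already timed out and gone away.
	if ch, ok := kv.notifyCh[index]; ok {
		select {
		case ch <- op:
		default:
		}
	}
}
```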
- Linearizability: All operations (Get/Put/Append) appear to execute atomically at some point between their invocation and response
- Exactly-once semantics: Handled duplicate client requests through sequence number tracking
- Memory management: Implemented log compaction when Raft state approaches the `maxraftstate` threshold (a snapshot-trigger sketch follows this list)
- State persistence: Snapshot includes both the key-value database and deduplication state
- Fault tolerance: Service continues operating as long as a majority of servers are available
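A sketch of the snapshot trigger, assuming the server compares the persister's Raft state size against `maxraftstate` after each apply; the 90% threshold and the `snapshotFn` callback are illustrative stand-ins for the persister and Raft calls:

```go
package kvraft

import (
	"bytes"
	"encoding/gob"
	"sync"
)

type KVServer struct {
	mu           sync.Mutex
	db           map[string]string
	lastSeq      map[int64]int64
	maxraftstate int // -1 means snapshotting is disabled
}

// maybeSnapshot checks whether the persisted Raft state is approaching
// maxraftstate and, if so, hands Raft a snapshot of the database plus
// the dedup table so old log entries can be discarded.
func (kv *KVServer) maybeSnapshot(appliedIndex, raftStateSize int,
	snapshotFn func(index int, snapshot []byte)) {
	if kv.maxraftstate < 0 || raftStateSize < kv.maxraftstate*9/10 {
		return
	}
	kv.mu.Lock()
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	enc.Encode(kv.db)      // key-value database
	enc.Encode(kv.lastSeq) // dedup state must be snapshotted with it
	kv.mu.Unlock()
	snapshotFn(appliedIndex, buf.Bytes())
}
```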
- Mastered building applications on top of consensus protocols (Raft as a black box)
- Implemented linearizable distributed storage with strong consistency guarantees
- Designed efficient client-server interaction patterns for retry and leader discovery (see the Clerk sketch after this list)
- Learned snapshot-based log compaction strategies for long-running services
- Practiced cross-layer coordination between application (KVServer) and consensus (Raft) layers
- Understood the critical importance of idempotency in distributed systems
- Gained experience with state machine replication and deterministic execution
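A sketch of the client-side retry and leader-caching pattern; the `endpoint` interface and error strings are stand-ins for the lab's `labrpc` plumbing, not its real API:

```go
package kvraft

import "time"

type PutAppendArgs struct {
	Key, Value, Op string
	ClientID       int64
	SeqNum         int64
}

type PutAppendReply struct {
	Err string // "OK" or "ErrWrongLeader" in this sketch
}

// endpoint abstracts one KVServer; an interface keeps the sketch
// self-contained in place of labrpc ClientEnds.
type endpoint interface {
	Call(method string, args *PutAppendArgs, reply *PutAppendReply) bool
}

type Clerk struct {
	servers  []endpoint
	leaderID int // cached index of the last server that answered as leader
	clientID int64
	seqNum   int64
}

// PutAppend retries until some server acknowledges the request. It
// starts at the cached leader and rotates through peers on
// ErrWrongLeader or a dropped RPC; the sequence number is fixed before
// the loop, so server-side dedup makes the retries exactly-once.
func (ck *Clerk) PutAppend(key, value, op string) {
	ck.seqNum++
	args := PutAppendArgs{Key: key, Value: value, Op: op,
		ClientID: ck.clientID, SeqNum: ck.seqNum}
	for i := ck.leaderID; ; i = (i + 1) % len(ck.servers) {
		var reply PutAppendReply
		if ck.servers[i].Call("KVServer.PutAppend", &args, &reply) && reply.Err == "OK" {
			ck.leaderID = i // remember the leader for the next request
			return
		}
		time.Sleep(10 * time.Millisecond) // wrong leader or lost RPC: try the next server
	}
}
```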
- Language: Go (1.13+)
- Concurrency: goroutines, channels, mutexes, `sync.Cond`
- Persistence: Custom in-memory persister abstraction with snapshot support
- RPC Framework: Go net/rpc
- Encoding: GOB encoding for state serialization (snapshot decoding is sketched below)
- Testing: Comprehensive test suites including linearizability checkers
- Architecture: Layered design (Client → KVServer → Raft → Network)
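For illustration, a hedged sketch of GOB-based snapshot decoding on restart; it assumes the two-map layout written out in the snapshot sketch above:

```go
package kvraft

import (
	"bytes"
	"encoding/gob"
)

// restoreFromSnapshot rebuilds in-memory state from a GOB-encoded
// snapshot; fields must be decoded in the same order they were encoded.
// The two maps mirror what the snapshot sketch above writes out.
func restoreFromSnapshot(snapshot []byte) (map[string]string, map[int64]int64, error) {
	db := make(map[string]string)
	lastSeq := make(map[int64]int64)
	if len(snapshot) == 0 {
		return db, lastSeq, nil // fresh start, nothing to restore
	}
	dec := gob.NewDecoder(bytes.NewBuffer(snapshot))
	if err := dec.Decode(&db); err != nil {
		return nil, nil, err
	}
	if err := dec.Decode(&lastSeq); err != nil {
		return nil, nil, err
	}
	return db, lastSeq, nil
}
```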
- Built production-grade distributed systems patterns from scratch
- Achieved robust fault-tolerant computation and storage, with correctness validated by the course test suites
- Developed practical understanding of:
- CAP theorem trade-offs in distributed systems
- Consensus-based replication for high availability
- State machine replication for deterministic distributed computation
- Log-structured storage and compaction strategies
- Foundation for real-world systems like:
- Distributed databases (CockroachDB, TiDB)
- Coordination services (ZooKeeper, etcd, Consul)
- Replicated state stores in microservices architectures
- Race conditions: Careful mutex management across concurrent RPC handlers and background goroutines
- Deadlock prevention: Structured locking hierarchy between KVServer and Raft layers
- Network partitions: Robust handling of split-brain scenarios and leader changes
- Memory efficiency: Balancing log retention with snapshot frequency
- Duplicate detection: Maintaining deduplication state across crashes and snapshots
- Stale data prevention: Ensuring followers never install outdated snapshots (see the check sketched below)
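A minimal sketch of the stale-snapshot guard; the field names and exact comparison illustrate the `CondInstallSnapshot` idea under stated assumptions rather than a drop-in implementation:

```go
package raft

import "sync"

type Raft struct {
	mu                sync.Mutex
	commitIndex       int
	lastIncludedIndex int // index covered by the snapshot currently held
}

// shouldInstallSnapshot is the CondInstallSnapshot-style guard: a
// snapshot older than what this peer has already committed or already
// snapshotted is refused, so a delayed InstallSnapshot RPC can never
// roll the state machine backwards.
func (rf *Raft) shouldInstallSnapshot(lastIncludedIndex int) bool {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if lastIncludedIndex <= rf.commitIndex || lastIncludedIndex <= rf.lastIncludedIndex {
		return false // stale snapshot: ignore it
	}
	return true
}
```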
This project is for educational purposes as part of MIT's 6.824 course.