Systems Problems and Solutions
This chapter covers advanced systems problems where the real challenge is not syntax. The challenge is coordinating memory, disk, concurrency, and recovery while keeping the design explainable.
The goal is to recognize a few recurring high-level patterns:
- publish new state atomically
- separate read paths from write paths
- stage changes before commit
- use logs for durability and recovery
- use snapshots to avoid blocking readers
- compact history so recovery does not become unbounded
1. Durable Atomic Store With WAL
Problem shape
Build a store where a transaction is only considered committed if it is safely written to a write-ahead log first. If the process crashes, the state must be recoverable from the log.
Core design
A clean order is:
- serialize the transaction
- append it to the WAL
- call `sync_all()` so the log is durable on disk
- apply the change to an in-memory `Arc` snapshot
- publish the new snapshot atomically
The important rule is:
Disk before memory publication.
If the process crashes after the WAL is durable but before the new in-memory snapshot is published, recovery can replay the log. If memory were updated first, a crash could lose acknowledged data.
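The append, fsync, apply, publish order can be sketched with std types alone. `WalletDb`'s exact layout here, the `key=value` log line format, and the `commit` helper are illustrative assumptions (the outer `Arc` used for cross-thread sharing is dropped for brevity):

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::sync::{Arc, Mutex};

struct WalletDb {
    state: Mutex<Arc<HashMap<String, i32>>>, // published snapshot
    log_file: Mutex<File>,                   // append-only WAL
}

impl WalletDb {
    fn commit(&self, key: &str, value: i32) -> std::io::Result<()> {
        {
            // Steps 1 and 2: serialize, append, then fsync so the entry is durable.
            let mut log = self.log_file.lock().unwrap();
            writeln!(log, "{}={}", key, value)?;
            log.sync_all()?;
        }
        // Step 3: apply, by building the next snapshot from the current one.
        let mut guard = self.state.lock().unwrap();
        let mut next = (**guard).clone();
        next.insert(key.to_string(), value);
        // Step 4: publish the new snapshot atomically with a pointer swap.
        *guard = Arc::new(next);
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("wallet_demo.wal");
    let db = WalletDb {
        state: Mutex::new(Arc::new(HashMap::new())),
        log_file: Mutex::new(OpenOptions::new().create(true).append(true).open(&path)?),
    };
    db.commit("alice", 100)?;
    assert_eq!(db.state.lock().unwrap().get("alice"), Some(&100));
    Ok(())
}
```

Note that memory is touched only after `sync_all()` returns: a crash at any earlier point loses nothing that was acknowledged.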
Type shape
```rust
use std::collections::HashMap;
use std::fs::File;
use std::sync::{Arc, Mutex};

struct WalletDb {
    // Published in-memory snapshot; the outer Mutex guards pointer publication.
    state: Arc<Mutex<Arc<HashMap<String, i32>>>>,
    // Separate lock so WAL appends are serialized independently of reads.
    log_file: Mutex<File>,
}
```
Here:
- `Arc<HashMap<...>>` is the published in-memory snapshot
- the outer `Mutex` protects pointer publication
- the file handle is protected separately for serialized appends
Why interviewers like it
This problem forces you to explain:
- durability versus visibility
- why `sync_all()` is expensive but necessary
- crash points and recovery order
- why readers should stay on old snapshots while writers prepare a new one
Interview-safe summary
A WAL-based store makes the log the durable source of truth and the `Arc` snapshot the fast in-memory view. The commit order is append, fsync, apply, publish. That preserves durability and lets recovery rebuild the latest consistent state after a crash.
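Recovery is then a straight replay of the log in order. A minimal sketch, assuming an illustrative WAL format of one `key=value` line per committed write:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Rebuild the newest consistent snapshot by replaying WAL lines in order.
fn recover(wal_lines: &[&str]) -> Arc<HashMap<String, i32>> {
    let mut state = HashMap::new();
    for line in wal_lines {
        if let Some((key, value)) = line.split_once('=') {
            if let Ok(v) = value.parse::<i32>() {
                state.insert(key.to_string(), v); // later entries win
            }
        }
    }
    Arc::new(state)
}

fn main() {
    let snapshot = recover(&["alice=100", "bob=40", "alice=75"]);
    assert_eq!(snapshot.get("alice"), Some(&75));
    assert_eq!(snapshot.get("bob"), Some(&40));
}
```

Because replay is ordered, the rebuilt snapshot is exactly the last published state, even if the crash happened between fsync and publication.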
2. Sharded 2PC With Arc
Problem shape
Transfer data atomically across two shards without corrupting state if one side fails.
Core design
A good Arc-based two-phase-commit shape is:
- lock the involved shards in a fixed order
- clone the current shard snapshots
- use `Arc::make_mut` to stage private changes
- validate everything during prepare
- if validation succeeds, swap the master pointers to commit
- if validation fails, drop the staged copies
That gives a strong property:
rollback is implicit because unpublished staged copies are dropped.
Type shape
```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Each shard publishes an immutable snapshot behind its own lock.
type ShardData = HashMap<String, i32>;
type Shard = Arc<Mutex<Arc<ShardData>>>;
```
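The prepare/commit flow can be sketched end to end. This is a minimal single-process sketch: the `transfer` helper is illustrative, the alias drops the outer `Arc` for brevity, and callers are assumed to lock shards in a fixed global order:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type ShardData = HashMap<String, i32>;
// The outer Arc is omitted; references suffice for a single-process sketch.
type Shard = Mutex<Arc<ShardData>>;

fn transfer(shard_a: &Shard, shard_b: &Shard, from: &str, to: &str, amount: i32) -> bool {
    if amount <= 0 {
        return false;
    }
    // Lock both shards; callers must agree on a fixed global order.
    let mut a = shard_a.lock().unwrap();
    let mut b = shard_b.lock().unwrap();
    // Prepare: validate against the current snapshot before staging anything.
    if a.get(from).copied().unwrap_or(0) < amount {
        return false; // nothing was staged, so there is nothing to roll back
    }
    // Stage: make_mut clones the map because the published Arc still points at it.
    let mut staged_a = Arc::clone(&*a);
    let mut staged_b = Arc::clone(&*b);
    *Arc::make_mut(&mut staged_a).get_mut(from).unwrap() -= amount;
    *Arc::make_mut(&mut staged_b).entry(to.to_string()).or_insert(0) += amount;
    // Commit: publish both staged snapshots by pointer swap under the locks.
    *a = staged_a;
    *b = staged_b;
    true
}

fn main() {
    let a: Shard = Mutex::new(Arc::new(HashMap::from([("alice".to_string(), 50)])));
    let b: Shard = Mutex::new(Arc::new(HashMap::new()));
    assert!(transfer(&a, &b, "alice", "bob", 30));
    assert!(!transfer(&a, &b, "alice", "bob", 100)); // fails prepare
    assert_eq!(a.lock().unwrap().get("alice"), Some(&20));
    assert_eq!(b.lock().unwrap().get("bob"), Some(&30));
}
```

The failed second transfer never publishes, so its staged copies are dropped on return, which is the implicit rollback described above.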
Why this is advanced
This problem combines:
- lock ordering for deadlock prevention
- staging versus commit
- atomic pointer publication
- dirty-shard-only cloning
- readers continuing on old snapshots while writers prepare new state
Interview-safe summary
In a sharded `Arc` design, the prepare phase happens in private staged copies and the commit phase is a pointer swap. That makes rollback cheap and lets readers stay on stable snapshots while writers stage multi-shard updates.
3. Async 2PC With Tokio
Problem shape
Now add async logging or network I/O inside the commit path.
The system must hold transactional control across `.await` points without blocking the executor.
Core design
The async version keeps the same snapshot and staging logic, but replaces blocking synchronization with async-aware primitives.
```rust
use std::collections::HashMap;
use std::sync::Arc;
// Async-aware lock: waiting tasks yield instead of blocking the executor thread.
use tokio::sync::Mutex;

type ShardData = HashMap<String, i32>;
type Shard = Arc<Mutex<Arc<ShardData>>>;
```
Why `tokio::sync::Mutex` instead of `std::sync::Mutex`?
- `std::sync::Mutex` blocks the executor thread if held across waiting
- `tokio::sync::Mutex` lets the task yield while waiting for the lock
- this matters when the transaction must `.await` logging or RPC work before commit
Why this is a level up
Now the design must satisfy all the earlier Arc and 2PC constraints plus:
- async lock acquisition
- holding the commit gate across `.await`
- keeping tasks movable enough for Tokio scheduling
- avoiding executor-thread blocking during slow I/O
Interview-safe summary
Async 2PC keeps the same staging-and-publish model, but the commit gate becomes an async mutex because the transaction may need to await disk or network work before publication. The goal is still the same: stable reader snapshots, explicit writer staging, and atomic publication.
4. Distributed WAL With Quorum
Problem shape
Build a replicated key-value store across three nodes where a write is committed only after a quorum of nodes durably records it. Readers must always see the latest committed local snapshot without waiting on network operations.
Core design
A clean high-level flow is:
- create a log entry
- replicate it to peers
- wait for quorum success
- once quorum is reached, apply the entry to each node's in-memory snapshot
- let readers continue using their current `Arc` snapshots during replication
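The quorum decision itself is plain arithmetic; a minimal sketch with hypothetical helper names (`quorum_size`, `count_acks`, `is_committed`):

```rust
// Majority quorum: more than half of the cluster must durably ack a write.
fn quorum_size(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

// Count successful durable acknowledgements (including the leader's own write).
fn count_acks(responses: &[bool]) -> usize {
    responses.iter().filter(|&&ok| ok).count()
}

fn is_committed(acks: usize, cluster_size: usize) -> bool {
    acks >= quorum_size(cluster_size)
}

fn main() {
    // Three nodes: the leader's own durable write plus one peer ack commits.
    assert_eq!(quorum_size(3), 2);
    assert!(is_committed(count_acks(&[true, true, false]), 3));
    assert!(!is_committed(count_acks(&[true, false, false]), 3));
}
```

The interesting part is not the arithmetic but where it sits: commit status comes from replication, while visibility comes from local snapshot publication.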
Replica type shape
```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{Mutex, RwLock};

// One durable record in the replicated log (fields illustrative).
struct LogEntry {
    index: u64,
    key: String,
    value: i32,
}

struct Replica {
    // Published committed snapshot; RwLock keeps reads cheap.
    state: Arc<RwLock<Arc<HashMap<String, i32>>>>,
    // Local WAL: the durable history used for recovery and quorum agreement.
    wal: Mutex<Vec<LogEntry>>,
}
```
This separates concerns cleanly:
- WAL: durability history
- `Arc` snapshot: fast committed read view
- `RwLock`: cheap read access to the current published pointer
Why this is advanced
This problem forces you to coordinate:
- local durability
- quorum logic
- local publication order
- non-blocking reads during replication
- recovery from each node's WAL
Interview-safe summary
In a quorum store, the WAL is the durable agreement surface and the `Arc` snapshot is the local committed read view. Replication decides whether the write is committed; snapshot publication decides when that committed value becomes visible in memory.
5. Log Compaction
Problem shape
A WAL that grows forever eventually makes restart time and storage cost unacceptable. The system needs to compact history into a snapshot and discard old log entries.
Core design
A compacted recovery model is:
- capture the current `Arc` snapshot
- serialize it as a checkpoint file
- record the last included log index or version
- truncate or rotate old WAL segments before that point
- on recovery, load the snapshot first, then replay only the tail of the WAL
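The recovery half of this model can be sketched as a checkpoint load plus a filtered tail replay; `LogEntry` and the `recover` helper are illustrative shapes, not a fixed format:

```rust
use std::collections::HashMap;

struct LogEntry {
    index: u64,
    key: String,
    value: i32,
}

// Recovery loads the checkpoint, then replays only entries after its index.
fn recover(
    checkpoint: HashMap<String, i32>,
    checkpoint_index: u64,
    wal: &[LogEntry],
) -> HashMap<String, i32> {
    let mut state = checkpoint;
    for entry in wal.iter().filter(|e| e.index > checkpoint_index) {
        state.insert(entry.key.clone(), entry.value);
    }
    state
}

fn main() {
    let checkpoint = HashMap::from([("a".to_string(), 1)]);
    let wal = vec![
        LogEntry { index: 9, key: "a".into(), value: 0 },  // already in checkpoint
        LogEntry { index: 10, key: "b".into(), value: 2 }, // tail: must replay
    ];
    let state = recover(checkpoint, 9, &wal);
    assert_eq!(state.get("a"), Some(&1));
    assert_eq!(state.get("b"), Some(&2));
}
```

Restart cost is now proportional to the tail length, not to the full history.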
Why Arc helps
Because the snapshot can be cloned cheaply, compaction can work from a stable frozen view while the live system continues moving forward.
Interview-safe summary
Log compaction turns a long event history into a state checkpoint plus a short tail. The WAL gives immediate durability, the snapshot gives fast restart, and `Arc` lets checkpointing work from a stable view without stopping readers.
6. Incremental Snapshots
Problem shape
Full snapshots may be too expensive if the database is large and only a small fraction changes between checkpoint cycles.
Core design
Instead of rewriting the whole state every time:
- track dirty keys or dirty ranges
- atomically swap out the dirty set for a fresh empty one
- capture a current snapshot of only those changed keys
- persist a delta batch
- periodically merge many deltas into a new base snapshot
A high-yield optimization here is `std::mem::take` on the dirty set, because it minimizes lock hold time before slow disk work starts.
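A minimal sketch of that swap, with a hypothetical `drain_dirty` helper:

```rust
use std::collections::HashSet;
use std::mem;
use std::sync::Mutex;

// Swap the dirty set for a fresh empty one, holding the lock only briefly.
fn drain_dirty(dirty: &Mutex<HashSet<String>>) -> HashSet<String> {
    let mut guard = dirty.lock().unwrap();
    mem::take(&mut *guard) // leaves an empty set behind; the old set moves out
}

fn main() {
    let dirty = Mutex::new(HashSet::from(["k1".to_string(), "k2".to_string()]));
    let to_checkpoint = drain_dirty(&dirty);
    // The lock is already released; slow disk work can now use `to_checkpoint`
    // while writers record new dirty keys into the fresh empty set.
    assert_eq!(to_checkpoint.len(), 2);
    assert!(dirty.lock().unwrap().is_empty());
}
```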
Why this matters
This reduces:
- disk write volume
- checkpoint latency
- interference with the hot path
But it increases recovery complexity because the system now replays a base snapshot plus a chain of deltas.
Interview-safe summary
Incremental snapshots trade simpler recovery for much lower write volume. The system checkpoints only what changed, then periodically folds those deltas back into a new base snapshot so recovery does not require replaying an unbounded chain.
7. Hot-Join Distributed Recovery
Problem shape
Add a new empty follower to a live cluster while the leader is still processing writes. The leader must stream a large snapshot without pausing foreground traffic, and the follower must become active without missing any transactions that happened during the transfer.
This is harder than ordinary recovery because the node is not recovering from its own disk alone. It is catching up from a moving live source.
Core design
A strong high-level protocol is:
- capture a point-in-time `Arc` snapshot on the leader
- record the snapshot index or last included log sequence number
- stream that frozen snapshot to the follower in chunks
- buffer or stream all post-snapshot log entries separately
- once the snapshot is fully loaded, replay every entry after the snapshot index
- perform an atomic handover from syncing mode to active mode
The key invariant is:
the follower must know exactly where the snapshot stops and where live replay begins.
Without that boundary, the system cannot prove that it closed the gap between past state and present writes.
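That boundary can be sketched directly; `Follower`, `install_snapshot`, and the index-based deduplication here are illustrative assumptions:

```rust
use std::collections::HashMap;

struct LogEntry {
    index: u64,
    key: String,
    value: i32,
}

struct Follower {
    state: HashMap<String, i32>,
    applied_index: u64, // replay boundary: everything <= this is in `state`
}

impl Follower {
    // Install the streamed snapshot together with its last included index.
    fn install_snapshot(snapshot: HashMap<String, i32>, snapshot_index: u64) -> Self {
        Follower { state: snapshot, applied_index: snapshot_index }
    }

    // Replay buffered live entries, skipping anything the snapshot already covers.
    fn replay(&mut self, buffered: &[LogEntry]) {
        for e in buffered {
            if e.index > self.applied_index {
                self.state.insert(e.key.clone(), e.value);
                self.applied_index = e.index;
            }
        }
    }
}

fn main() {
    let snapshot = HashMap::from([("a".to_string(), 1)]);
    let mut follower = Follower::install_snapshot(snapshot, 5);
    follower.replay(&[
        LogEntry { index: 5, key: "a".into(), value: 9 }, // duplicate: ignored
        LogEntry { index: 6, key: "b".into(), value: 2 }, // gap-closing tail entry
    ]);
    assert_eq!(follower.state.get("a"), Some(&1));
    assert_eq!(follower.state.get("b"), Some(&2));
    assert_eq!(follower.applied_index, 6);
}
```

Tracking `applied_index` is what makes replay both gap-free and idempotent: entries at or below the boundary are dropped rather than applied twice.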
Why Arc helps
`Arc` is what makes the leader-side snapshot cheap enough to take under load.
- the leader briefly clones the current published `Arc` snapshot
- the long-running streaming task holds that frozen view
- foreground writers continue preparing and publishing newer versions independently
That means the leader does not have to stop writes during the transfer.
Type and protocol shape
A useful mental model is:
- published state: `Arc<RwLock<Arc<HashMap<K, V>>>>`
- live write stream: `tokio::sync::mpsc` or an indexed log subscription
- replay boundary: `AtomicU64` or another explicit monotonically increasing sequence number
- chunk transfer: async reader/writer with bounded buffering
What the follower must do
The follower has two jobs at once while syncing:
- receive and assemble the old snapshot
- collect all newer writes that happen during that transfer window
That is why a `tokio::select!`-style receive loop is a natural fit:
- one branch handles snapshot chunks
- one branch handles buffered live transactions
- completion happens only after the snapshot is installed and the buffered tail is replayed
More advanced interview angles
Once the basic answer is in place, these are the stronger follow-up angles:
- Snapshot index pinning: The leader should tag the snapshot with the exact last included index or term so the follower knows where replay starts.
- Idempotent replay: Replayed transactions should be safe to apply more than once, or the follower should deduplicate by sequence number.
- Bounded buffering: If the follower takes minutes to load the snapshot, an in-memory delta buffer may grow without bound. A stronger design spills the tail to a WAL or requires a resumable indexed log stream.
- Backpressure: Snapshot transfer must slow down when the follower or network is slow instead of allowing unbounded RAM growth.
- Chunk integrity: Chunks should carry checksums, sequence numbers, and finalization markers so corruption or missing chunks are detectable.
- Resumable transfer: If the follower disconnects halfway through 20GB, the protocol should resume from a known chunk or restart from a known snapshot generation.
- Membership epoch or term: The follower should reject stale transfers from an old leader term.
- Leader log retention: The leader must not compact away the tail that the follower still needs before catch-up is finished.
- Atomic cutover: The follower should switch from syncing mode to active serving only after snapshot installation and tail replay are both complete.
- Read policy while syncing: Decide whether the follower serves no reads, stale reads only, or reads only after cutover.
- Failure during handover: If the follower crashes after loading the snapshot but before finishing replay, recovery should resume from durable local metadata rather than restart from scratch.
Why this is a serious systems problem
This problem forces you to coordinate:
- stable published memory state
- a moving write stream
- network transfer of a large snapshot
- replay boundaries and sequence tracking
- mode transition from syncing to active service
It is not just a replication problem. It is a staging-and-publication problem stretched across machines.
Interview-safe summary
Hot-join recovery works by freezing a point-in-time snapshot, tagging it with a replay boundary, streaming it to the new follower, and then replaying every later write before cutover.
`Arc` makes the frozen leader view cheap to hold while foreground writes continue, and the real correctness issue is proving that the follower closed the gap between snapshot time and live time without missing or duplicating entries.
8. SQL In Systems Interviews
Short answer
For many top backend and systems-adjacent interviews, yes, SQL matters. Not always as a separate database round, but often as part of reasoning about storage, querying, transactions, consistency, and production behavior.
A good rule is:
- backend roles: usually yes
- storage, infra, and database-adjacent roles: definitely yes
- distributed systems roles: often yes, especially when metadata stores, job state, or operational tooling are involved
- low-level kernel or compiler roles: less central, but still useful background
What interviewers usually care about
In stronger interviews, SQL is usually not about obscure syntax trivia. It is about whether you can reason about how structured data is queried and maintained under real workloads.
The highest-yield areas are:
- `SELECT`, `WHERE`, `ORDER BY`, `LIMIT`
- `JOIN` reasoning
- `GROUP BY` and aggregation
- basic window-function awareness
- indexes and query shape
- transactions and atomicity
- isolation-level intuition
- schema design and key choice
- when SQL is the right tool and when it is not
Why this belongs in systems prep
SQL shows up in systems interviews because many system components eventually touch durable relational state:
- account and billing records
- metadata services
- job queues and schedulers
- audit logs and reporting tables
- control-plane state
- internal tooling and observability stores
Even if the hot path is not SQL-backed, the surrounding control plane often is.
Interview-safe topics to be able to explain
1. Query shape
You should be comfortable writing and reading queries like:
- fetch one row by key
- join two or three tables
- aggregate counts or sums by group
- find top-N rows by some metric
- filter by time range
- compute rolling or partitioned values with window functions at a basic level
2. Index intuition
A good compact explanation is:
An index is a trade-off: faster reads for certain query shapes in exchange for extra memory, disk, and write cost.
You should be able to explain:
- why a full table scan may be slow
- why an index helps only for matching query patterns
- why too many indexes hurt write-heavy systems
- why composite index order matters
3. Transactions
A minimal strong explanation is:
A transaction groups operations so they succeed or fail as one unit and are isolated from conflicting concurrent work according to the database's isolation model.
You should be able to connect that to:
- transfers between accounts
- race conditions
- partial failure
- retry behavior
- uniqueness and consistency guarantees
4. Isolation intuition
You do not need to be a database theorist, but you should know the general shape:
- weaker isolation improves concurrency but allows more anomalies
- stronger isolation prevents more anomalies but costs more coordination
- real systems often choose isolation based on workload and correctness requirements
5. Schema design
You should be able to discuss:
- primary keys
- foreign keys
- normalization versus denormalization
- append-only versus mutable tables
- how query patterns should influence schema shape
Interview-safe summary
In top backend and systems interviews, SQL matters because it is the language of structured durable state. The goal is not to memorize every clause. The goal is to understand query shape, indexes, transactions, and schema trade-offs well enough to reason about real production systems.
The Big Pattern
All of these problems are variations of the same deeper system shape:
- log or stage first
- publish atomically second
- let readers stay on stable snapshots
- let recovery rebuild state from durable history
That is why `Arc` shows up repeatedly in advanced systems work.
It is not only a shared pointer.
It is often the mechanism that makes stable published views possible while the next version is being prepared elsewhere.
Interview-Safe Closing
In advanced systems problems, the core design question is often how to separate staging from publication. Logs, locks, snapshots, and recovery all orbit that same boundary.
`Arc` is powerful in these designs because it gives a cheap, stable published view while writers prepare the next version privately.