Distributed Databases

IB Syllabus: A3.4.4: Describe the features of distributed databases.

HL Only. This page covers content assessed at HL level only.

Key Concepts

What a Distributed Database Is

A distributed database is a single logical database whose contents are stored across multiple physical machines, often in multiple data centres, sometimes on multiple continents. The user sees one database; the system underneath coordinates many.

Distributed databases exist because no single machine can serve a global user base, hold petabytes of data, or stay available through hardware failures. Spreading the data across many machines unlocks scale, resilience, and locality, at the cost of significant operational complexity.

Examples include Google Spanner, Amazon DynamoDB, Cassandra, CockroachDB, MongoDB sharded clusters, and almost every modern cloud-hosted relational database in its larger configurations.

Why Distribute?

Three pressures drive the move from one big machine to many smaller ones:

Volume. A dataset that no longer fits on the biggest available single machine must be split across many.
Throughput. Sites serving hundreds of thousands of writes per second cannot do that on one machine; the work has to be spread.
Geography. Users in Tokyo experience a database in Frankfurt as slow even on the best network, light takes time to cross continents. Replicas near each user region cut latency dramatically.

The trade-off: every operation that needed to touch several rows on one machine may now have to coordinate across machines, which is fundamentally harder.

The Eight Features You Need to Describe

Eight specific features matter when describing what a distributed database has to address. Each is a distinct concern with its own mechanism and trade-off.

1. Concurrency Control

Many transactions are running simultaneously across many machines. The DBMS still has to make them behave as if they ran one at a time (the Isolation property from Transactions and Views). Distributed concurrency control uses:

Distributed locks: a lock manager coordinates which transaction holds which row across the cluster.
Multi-version concurrency control (MVCC): each row keeps several versions; readers see a consistent snapshot, writers create new versions.
Timestamp ordering: transactions are assigned global timestamps and serialised in order.

Concurrency control is harder than in a single-machine database because messages between machines take time and can be reordered, lost, or duplicated. A naive lock manager becomes a bottleneck the moment many machines have to consult it.

2. Data Consistency

When the same logical fact is stored on multiple machines (replicas), they have to stay in agreement. A change to one replica must propagate to the others. The big question is when:

Strong consistency: every read sees the latest committed write, regardless of which replica it touches. Achieved by waiting for a quorum of replicas to acknowledge each write before the write is considered committed. Costs latency.
Eventual consistency: replicas may briefly disagree, but converge once writes finish propagating. Used by many large-scale systems where some staleness is acceptable.
Causal consistency and other in-between models also exist.

This is where the CAP theorem lives (covered in the “Beyond the Syllabus” section below): under a network partition, a distributed system can keep either consistency or availability, not both.

3. Partitioning (Sharding)

Partitioning splits the dataset across machines so that each one holds only a portion. Common strategies:

Hash partitioning: compute a hash of the row’s key, send the row to the machine responsible for that hash range. Spreads load evenly but makes range queries hard.
Range partitioning: split rows by ranges of the key (e.g. MemberID 0--999 on machine A, 1000--1999 on machine B). Range queries are fast; risk of “hot spots” if data is unevenly distributed.
Geographic partitioning: shard by user region (Asian users on Asian machines). Cuts latency dramatically; complicates cross-region queries.

A query that touches many partitions (a JOIN across two tables sharded differently) is fundamentally slower than the same query on one machine. Schema design in a distributed database often revolves around picking a partition key that keeps related rows on the same node.

4. Security

A distributed database has more attack surface than a single-machine one: network traffic between nodes can be intercepted; a compromised node can leak data; a misconfigured replica can be exposed to the public internet. Security features include:

Encryption in transit between nodes (TLS).
Encryption at rest on each disk.
Per-node authentication (each node proves identity to its peers).
Role-based access control consistent across all nodes.
Audit logging of every administrative action across the cluster.

A weak link anywhere in the cluster can compromise the whole system; security must be enforced at every node, not just at the edge.

5. Transparency

Users and applications should see the distributed system as if it were one database. Transparency is the set of features that hide the distribution complexity:

Type of transparency	What it hides
Location transparency	Which physical machine actually holds the row
Replication transparency	That the row exists on multiple machines
Partition transparency	That the table is split across machines
Failure transparency	That a node failed and the system recovered
Concurrency transparency	That many transactions are running at once

Done well, the application writes SELECT * FROM Member WHERE MemberID = 1001; and gets back the row, regardless of whether that row is on node 7 in Frankfurt or being replicated to nodes 3 and 14. Done badly, every developer has to know the cluster topology to write correct queries.

6. Fault Tolerance

In a cluster of 100 machines, the question is not whether a machine will fail; it is when, and how often. Fault tolerance is the property of continuing to operate correctly through failures:

Replication (next section) ensures the data survives the loss of a node.
Automatic failover: if the primary replica of a partition dies, a secondary is promoted to take over automatically, within seconds.
Quorum reads / writes: accepting a write once a majority of replicas have acknowledged means losing a minority of nodes does not lose data.
Self-healing: the cluster detects a missing node, redistributes its partitions, and re-replicates them to restore the target replication level.

Without fault tolerance, every hardware failure becomes a customer-facing outage. With it, the cluster routinely loses and replaces machines without users noticing.

7. Replication

Replication stores multiple copies of the same data on different machines. It is the foundation of both availability and fault tolerance:

Synchronous replication: writes are not committed until all (or a quorum of) replicas have stored them. Strong consistency, higher write latency.
Asynchronous replication: writes commit immediately on the primary; replicas catch up shortly after. Lower latency, but the most recent writes can be lost if the primary fails before the replicas receive them.
Read replicas: secondary copies used to serve read queries, taking load off the primary.

Replication factor (typically 3 in production) means each piece of data exists on at least three nodes; losing two simultaneously still leaves one copy intact.

8. Scalability

A distributed database scales by adding machines, not by replacing the existing one with a bigger one. This is the headline reason to choose a distributed system:

Horizontal scaling (scale out): add nodes; the cluster rebalances partitions and replicas to use the new capacity. Practically unlimited if the schema is shard-friendly.
Vertical scaling (scale up): replace a node with a bigger one. Has a hard ceiling at the largest available hardware.

The contrast with a traditional single-machine database is sharp: scaling up has a ceiling; scaling out does not. The cost is that the application has to be designed for the distributed model from the start, bolting it on later is painful.

Quick Summary Table

Feature	What it guarantees	Typical mechanism
Concurrency control	Transactions appear to run one at a time even when many run in parallel across nodes	Distributed locks, MVCC, timestamp ordering
Data consistency	Replicas agree on the value of every row (or converge on it)	Quorum writes, consensus protocols (Paxos, Raft)
Partitioning	Dataset is split across nodes so no one machine has to hold or process it all	Hash, range, or geographic sharding
Security	The whole cluster is protected, not just one node	TLS between nodes, encryption at rest, RBAC, audit logs
Transparency	Applications see one logical database	Routing layer that hides node identity, replicas, partitions
Fault tolerance	The cluster keeps running through node failures	Replication + automatic failover + self-healing
Replication	Multiple copies of every piece of data on different nodes	Sync or async replication; replication factor of 3+
Scalability	Capacity grows by adding nodes	Horizontal scaling with online rebalancing

Beyond the Syllabus: the CAP Theorem

(Useful context, not assessable.) The CAP theorem states that any distributed system can offer at most two of the following three guarantees:

Consistency: every read sees the latest write.
Availability: every request gets a response (not necessarily the most recent value).
Partition tolerance: the system continues to operate even when network links between nodes fail.

Because network partitions are inevitable in any large distributed system, the real choice is between consistency (CP systems, e.g. traditional ACID-style replicas: refuse the write if you cannot reach the majority) and availability (AP systems, e.g. many NoSQL stores: accept the write locally, reconcile later).

This is why some NoSQL databases use the BASE model (Basically Available, Soft state, Eventually consistent) rather than strict ACID: in a globally distributed system, eventual consistency is often the right trade-off.

What Makes Distributed Databases Hard

Two truths that explain the operational complexity:

Networks fail. Packets get lost, links go down, nodes become unreachable mid-transaction. Every distributed feature is, at heart, a strategy for working through network failures gracefully.
Time is not consistent across machines. Two machines’ clocks can drift by milliseconds even with NTP. “Which write happened first?” turns out to be a deeply hard question once you go beyond one machine. Google Spanner famously solved this with atomic clocks and GPS receivers in every data centre.

These two realities are why distributed databases are entire engineering specialisations, and why most teams start with a single machine and distribute only when forced to.

Worked Examples

Example 1: Designing a Sharding Strategy

A global ride-sharing app has 50 million users distributed roughly equally across 100 countries. Where should the User table live?

Option A, single machine. Simple, but cannot handle the write load (1000+ rides per second globally), and Tokyo users wait 200 ms to talk to Frankfurt.

Option B, hash-shard on UserID across 50 nodes. Spreads load evenly; any one user lives on exactly one node. But cross-user queries (e.g. matching a rider with a driver in the same city) may touch many shards.

Option C, geographic shard by user’s home region across 8 nodes (one per macro-region). Asian users are served from Asian nodes; cuts latency dramatically; matches the natural locality of the ride-matching workload (riders and drivers are normally in the same region). Cross-region queries (rare in this business) are slower.

For this workload, option C is the strongest fit. The schema decision (partition by region) is locked in by the workload; choosing wrong here would force a painful re-sharding later.

Example 2: Trading Consistency for Availability

A global messaging app has two failure modes to think about during a network partition between its US and EU data centres:

Strong consistency (CP). Refuse to accept messages from EU users until the partition heals, so the global message history stays consistent. EU users see “service unavailable”.

Eventual consistency (AP). Accept messages locally in both regions. After the partition heals, reconcile the two sets of writes. Users may briefly see slightly different views of a conversation; nothing is lost.

For a messaging app, availability is more important than consistency, users will tolerate a slightly delayed message arrival but will switch apps if the service refuses to accept their messages. The reasonable design is AP with eventual consistency.

For a bank’s transfer system, the answer flips: temporary unavailability is preferable to two regions agreeing to spend the same money. CP wins.

Example 3: Replication for Fault Tolerance

The Harbour Run national sports-club chain has grown from one location to 200. The customer-facing booking app cannot afford an outage. The team configures a replication factor of 3:

flowchart LR
    P[("<b>Primary</b><br/>Bookings<br/>Dublin")] -->|replicate| R1[("<b>Replica 1</b><br/>Bookings<br/>Frankfurt")]
    R1 -->|replicate| R2[("<b>Replica 2</b><br/>Bookings<br/>Singapore")]
    classDef primary fill:#3DAA8A22,stroke:#3DAA8A,stroke-width:2px,color:#1f5d4a
    classDef replica fill:#61B0DD20,stroke:#61B0DD,color:#1a4f6e
    class P primary
    class R1,R2 replica

Each booking is written to all three nodes before the user sees “confirmed”. If the Dublin primary fails, Frankfurt is promoted automatically within seconds; users see only a brief blip. If two of the three fail simultaneously (a very unlikely scenario for nodes on three continents), the system falls back to read-only mode until at least two replicas are reachable again. Total data loss requires all three to fail at once, which is a vanishingly rare event.

Quick Check

Q1. Which best describes a distributed database?

Q2. Which feature of a distributed database stores multiple copies of the same data on different nodes for fault tolerance?

Q3. A distributed database splits its 100-billion-row UserActivity table across 50 machines so each holds about 2 billion rows. Which feature describes this?

Q4. An application sends SELECT * FROM Member WHERE MemberID = 1001; to a distributed database without knowing which node actually holds that row. Which feature makes this possible?

Q5. A distributed database's main advantage over a single-machine database when traffic grows tenfold is:

Q6. A globally distributed database allows brief disagreements between replicas but guarantees they will converge to the same value within a few seconds of the last write. Which consistency model is this?

Match the Feature

For each scenario, type the feature it most directly demonstrates. Use one of: concurrency control, consistency, partitioning, security, transparency, fault tolerance, replication, scalability.

Scenario	Feature
A node fails at 03:17; another node is automatically promoted within 8 seconds and users notice nothing.
A dataset of 200 billion rows is spread across 80 machines so each holds about 2.5 billion rows.
An application reads from the cluster without knowing or caring which physical node holds each row.
Each row exists on three different nodes; losing one or two still leaves the data available.
Traffic triples after a marketing campaign; the operations team adds 20 new nodes and the cluster rebalances automatically.
All traffic between nodes is encrypted using TLS, and each node authenticates to its peers with a unique certificate.

Fill in the Blanks

Complete the description of distributed-database features.

FEATURES OF A DISTRIBUTED DATABASE
==================================
 splits the dataset across nodes so each one
holds only a portion. Sometimes called sharding.

 stores multiple copies of the same data on
different nodes, for fault tolerance and read scaling.

 hides the distribution complexity from
applications, they see one logical database.

 keeps the cluster operating through node
failures, using automatic failover and replication.

 means adding capacity by adding nodes,
rather than replacing machines with bigger ones.

Spot the Error

A student wrote revision notes about distributed databases. One line is wrong. Click the line with the error, then choose the correct fix.

1A distributed database stores its contents across many physical machines 2Replication splits different data across many machines so each holds only a portion 3Transparency hides distribution complexity from applications 4Fault tolerance uses automatic failover so node failures do not cause outages 5Horizontal scaling adds capacity by adding more nodes rather than bigger ones

Pick the correct fix for line 2:

Identify the Feature

"A distributed database loses one of its 30 nodes to a hardware failure. Within seconds, another node is automatically promoted to take over its responsibilities, and the cluster continues serving every user request without interruption."

Which feature is being described? (Type the feature name.)

"A global messaging platform splits its 100-billion-row Message table across 80 machines, with each machine responsible for a different range of user IDs."

Which feature is being described? (Type the feature name.)

Practice Exercises

Core (HL)

[Core] Distributed DBs (HL) [3 marks] Define a distributed database and outline one reason an organisation might choose to deploy one.
[Core] Distributed DBs (HL) [4 marks] State four features of a distributed database from the eight covered on this page, and write a one-sentence description of each.
[Core] Distributed DBs (HL) [3 marks] Distinguish between partitioning and replication in a distributed database.

Extension (HL)

[Extension] Distributed DBs (HL) [4 marks] Describe how a distributed database achieves fault tolerance. Refer to replication, automatic failover, and the practical effect on users.
[Extension] Distributed DBs (HL) [4 marks] Explain what is meant by transparency in a distributed database. Name at least three different kinds and what each hides from the application.
[Extension] Distributed DBs (HL) [5 marks] Discuss the trade-off between strong consistency and high availability in a globally distributed database. Refer to one scenario where consistency is more important and one where availability is.

Challenge (HL)

[Challenge] Distributed DBs (HL) [6 marks] A ride-sharing app has 50 million users across 100 countries. Suggest an appropriate partitioning strategy for the User table, and justify your choice. Refer to load distribution, query patterns, and latency.
[Challenge] Distributed DBs (HL) [8 marks] Evaluate the suitability of a distributed database for a national health-records system covering 60 million patients. Refer to at least five of the eight features (concurrency control, consistency, partitioning, security, transparency, fault tolerance, replication, scalability), and weigh up the benefits against the operational complexity.

Note for IB CS learners: A3.4.4 is new in the 2027 syllabus. The named feature list (concurrency control, data consistency, partitioning, security, transparency, fault tolerance, replication, scalability) is entirely new to the 2027 syllabus. Expect “Describe the features of distributed databases” questions that reward 1 mark per feature with an example. The CAP theorem is useful context but is not on the assessable syllabus.

Connections

Previous: Data Warehouses. Warehouses at very large scale are themselves distributed.
Related: Alternative Databases, NoSQL and cloud databases are usually distributed by default; the features on this page apply to most of them.
Related: Transactions and Views. The ACID guarantees are far harder to provide in a distributed setting; the BASE model is the alternative many distributed systems adopt.
Communication: Network Architecture. The network is the unreliable medium that distributed databases must work around.
Communication: Network Security. The security features in a distributed database build on TLS, encryption in transit, and access control.
Hardware: Cloud Computing. Modern distributed databases are almost always deployed on cloud infrastructure.