Data Warehouses

IB Syllabus: A3.4.2: Explain the primary objectives of data warehouses in data management and business intelligence. A3.4.3: Explain the role of OLAP and data mining for business intelligence.

HL Only. This entire page covers content assessed at HL level only.


Part 1: Data Warehouses (A3.4.2)

What a Data Warehouse Is

A data warehouse is a central repository that consolidates data from many operational sources, relational databases, application logs, NoSQL stores, external feeds, into a single, integrated, query-optimised store designed for analysis, not day-to-day transactions.

The canonical definition, from Bill Inmon (one of the pioneers of the field): a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making. The 2027 syllabus uses the same four properties (plus two more), discussed below.

OLTP vs OLAP

The fundamental contrast here is OLTP (Online Transaction Processing) versus OLAP (Online Analytical Processing):

Aspect OLTP (operational database) OLAP (data warehouse)
Workload Many small reads and writes, one row at a time Few large reads, often summing or aggregating millions of rows
Users End-users running the business (cashiers, bookers, customer-facing apps) Analysts, managers, decision-makers
Data shape Highly normalised (3NF) Deliberately denormalised (star or snowflake schema)
Currency Live, up-to-the-second Daily / hourly snapshots, plus historical archive
Volume per query One or a few rows Millions of rows
Updates Constant (UPDATE, INSERT, DELETE) Rare; mostly bulk loads
Indexes Focused on point-lookup performance Focused on range and aggregation performance
Example Booking table for the Harbour Run club Booking_Summary warehouse for the same club’s 5-year revenue analysis

A useful framing: OLTP is “data in”, OLAP is “information out”. Both matter; they have different design priorities and usually live in separate systems.

The Six Properties of a Data Warehouse

Six characteristics distinguish a warehouse from an OLTP database:

Property What it means
Subject-oriented Organised around business subjects (Customer, Product, Sale) rather than around applications (HR system, billing system, web site)
Integrated Data from many source systems is cleaned, conformed to common formats, and joined into one consistent view
Time-variant Holds historical data so trends can be analysed (last 5 years of sales, not just today’s)
Non-volatile Once loaded, data does not change; the warehouse appends new data rather than overwriting old
Append-only Writes are bulk loads, normally during off-hours; no row-by-row updates from end users
Optimised for query performance Indexes, partitioning, materialised summaries, and a denormalised schema all geared towards reading huge slices of data fast

The first four are Inmon’s original definition (1992); the last two are explicit operational consequences worth calling out alongside the original four.

Why Build a Warehouse?

The benefits typically rewarded by mark schemes (“Explain the primary objectives of a data warehouse”):

  1. Single source of truth. Analysts query one place instead of stitching reports together from a dozen operational systems with inconsistent formats.
  2. Historical analysis. Years of cleaned, time-stamped data let you ask “how have sales evolved over five years?”, something the operational database, which is constantly overwritten, cannot answer.
  3. Performance. The warehouse is structured and indexed for big aggregating queries that would crush an OLTP system.
  4. Operational isolation. Heavy analytical queries do not interfere with the live transactional workload because they run in a separate system.
  5. Data integration. Different source systems are reconciled (currencies converted, customer IDs harmonised, dates normalised) once during loading rather than every time someone runs a report.
  6. Supports business intelligence (BI). Dashboards, KPIs, executive reports, and ad-hoc analysis all run against the warehouse.

The ETL Pipeline

A warehouse is fed by an ETL process: Extract, Transform, Load.

flowchart LR
    Sources["<b>Source systems</b><br/>OLTP DB, App logs,<br/>CRM, external feeds"]
    Staging["<b>Staging area</b><br/>raw, unconformed data"]
    Warehouse["<b>Data warehouse</b><br/>star schema,<br/>fact + dimension tables"]
    Sources -->|Extract| Staging
    Staging -->|Transform and Load: cleanse, conform, combine, write| Warehouse
  • Extract pulls raw data from each source. Usually overnight, sometimes streaming.
  • Transform cleans (fix bad data), conforms (same date format, same currency), enriches (add derived fields), and combines (join across sources).
  • Load writes the transformed rows into the warehouse, normally appending to fact tables.

A modern variant, ELT: loads raw data first and transforms inside the warehouse using its compute power. Same three steps, different order.

Schema Shape: Star and Snowflake

Warehouses use denormalised schemas because the analytical query workload prefers a wide flat structure over normalised JOINs. Two common shapes:

Star schema: one central fact table (sales, bookings, page views) surrounded by dimension tables (date, customer, product, store):

erDiagram
    FACT_SALES }o--|| DIM_CUSTOMER : "by customer"
    FACT_SALES }o--|| DIM_DATE     : "on date"
    FACT_SALES }o--|| DIM_PRODUCT  : "of product"
    FACT_SALES }o--|| DIM_STORE    : "at store"
    FACT_SALES {
        int sale_id PK
        int customer_id FK
        date date_id FK
        int product_id FK
        int store_id FK
        decimal amount
        int quantity
    }
    DIM_CUSTOMER {
        int customer_id PK
        string name
        string city
        string region
        string country
    }
    DIM_DATE {
        date date_id PK
        int year
        int quarter
        int month
        string day_of_week
    }
    DIM_PRODUCT {
        int product_id PK
        string name
        string category
        string brand
    }
    DIM_STORE {
        int store_id PK
        string name
        string city
        string region
    }

The fact table is large (one row per sale, billions of rows possible). Each dimension is small and contains denormalised attributes (DIM_CUSTOMER may include city, region, country, age bracket, all on one row).

Snowflake schema: a star schema where the dimension tables are themselves normalised into sub-tables (e.g. DIM_CUSTOMER -> DIM_CITY -> DIM_COUNTRY). Saves a bit of storage, costs an extra JOIN per query.

Star schemas dominate in practice because the simpler structure makes BI tools faster and easier to use.


Part 2: OLAP and Data Mining (A3.4.3)

OLAP, Online Analytical Processing

OLAP is the category of tooling that lets analysts interrogate a warehouse interactively, typically through a multi-dimensional “cube” view:

$$$ $$$ $$$ Product (categories) Geography (regions) Time (months)

A cube cell holds an aggregate, total sales of electronics in Asia in March 2026. Users slice and dice the cube interactively:

  • Slice: fix one dimension to a single value (just March 2026).
  • Dice: fix several dimensions to ranges (Asia and Europe; Q1 and Q2; electronics only).
  • Drill down: expand a dimension (March 2026 -> week by week -> day by day).
  • Roll up: collapse a dimension (Asia + Europe + Americas -> the world).
  • Pivot: swap which dimension is on which axis.

OLAP is interactive, exploratory, and aggregate-heavy. It is the technical backbone of business intelligence dashboards, the kind a CEO opens to see “where is revenue growing this quarter?” without writing SQL.

OLAP is sometimes contrasted with OLTP (operational transaction processing); these are the two ends of a database’s job, not a hierarchy. OLAP is what data warehouses are designed to support.

Data Mining, Finding Patterns

Data mining is the broader practice of automatically discovering useful patterns, structures, or anomalies in large datasets. Six techniques are worth knowing:

Technique What it does Example
Classification Assigns each record to one of several known categories Marking an email “spam” or “not spam” based on labelled training data
Clustering Groups records by similarity, without pre-existing labels Segmenting customers into natural buying-behaviour groups for marketing
Regression Predicts a numeric value from other attributes Forecasting next month’s sales from this month’s leading indicators
Association-rule discovery Finds rules like “people who buy X also buy Y” “Customers who buy nappies often buy beer” (the classic Walmart anecdote)
Sequential pattern discovery Finds events that tend to occur in a particular order “Customers who buy a phone often buy a case within 14 days”
Anomaly detection Identifies records that deviate significantly from the norm Flagging a fraudulent transaction; detecting an outbreak from sudden spikes in symptom searches

The first four also belong to the Machine Learning toolkit, this topic sits at the intersection of databases and ML. Anomaly detection is sometimes called deviation detection in older mark schemes; they are the same technique under different names.

Classification vs Clustering, the Key Contrast

A standard exam pattern is to compare classification and clustering. The headline differences:

Aspect Classification Clustering
Learning style Supervised: uses labelled training data Unsupervised: no labels needed
What you start with A set of pre-defined classes (“spam” / “not spam”) An unlabelled dataset; the groups emerge from similarity
Output A model that assigns new records to known classes A grouping of records into clusters whose meaning you interpret afterwards
Typical use Email spam filters, fraud detection, image labelling Customer segmentation, market research, exploratory data analysis
Validation Compare predicted vs actual labels on held-out data Look at the cluster shapes; harder to validate objectively

Both produce groups; the difference is whether the groups are pre-defined (classification) or emergent (clustering).

Data Mining in Practice: Three Worked Scenarios

Customer segmentation (clustering). A clothing retailer wants to identify natural customer groups. They feed purchase history, demographics, and browsing patterns into a clustering algorithm. The output is something like “5 distinct groups: bargain hunters, occasional gift buyers, premium loyalists, athletic-wear focused, seasonal browsers.” The marketing team then designs targeted campaigns per segment.

Fraud detection (anomaly detection). A bank’s fraud system models normal spending patterns for each cardholder. A transaction that deviates sharply (different country, much larger amount, unusual merchant category) is flagged for additional verification. The system does not pre-label transactions as “fraud” or “legitimate”, it learns each customer’s normal and alerts on outliers.

Disease outbreak prediction (sequential patterns + anomaly detection). A public health agency mines emergency-room visit data, looking for unusual spikes in specific symptom combinations within geographic regions. A sudden cluster of patients in one city presenting with fever and cough may indicate the start of an outbreak before any single hospital notices.

Business Intelligence: Putting It Together

Business intelligence (BI) is the umbrella term for using data to support decisions. A typical BI architecture combines all three pieces from this page:

  1. Data warehouse as the source of clean, integrated, historical data.
  2. OLAP as the interactive aggregation engine that lets analysts explore.
  3. Data mining algorithms that find patterns analysts would miss by eye.

The output goes into dashboards (executive scorecards), reports (monthly KPI documents), and predictive models (next-quarter forecasts) that feed real business decisions.


Worked Examples

Example 1: OLTP vs OLAP for the Same Business

The Harbour Run sports club has both kinds of database:

OLTP database (operational):

  • The Booking table from the earlier pages, holding live bookings.
  • Used by the website to take new bookings and by staff to look up “who is in the next class?”
  • Highly normalised (3NF); foreign keys enforced; ACID transactions; up-to-the-second.

OLAP warehouse (analytical):

  • A FACT_BOOKING table holding every booking ever made, now several years’ worth, joined with denormalised dimension tables for DIM_MEMBER, DIM_CLASS, DIM_DATE.
  • Used by the club’s managers to answer “what was our busiest week last year?”, “which coaches retain members best?”, “how has Premium membership grown month over month?”
  • Loaded overnight from the OLTP database; rows are never updated, only appended.
  • Star-schema layout; aggressively indexed for range and aggregation queries.

The same business needs both. Trying to run the historical analysis directly against the live booking system would slow down day-to-day operations; trying to run live bookings against the warehouse would not be possible (it has stale data and no constraints).

Example 2: Clustering vs Classification

A library wants to better understand its readers.

Classification example: the library already labels members as “fiction lovers”, “non-fiction lovers”, or “balanced readers” based on a survey. A classification model learns the pattern from labelled members, then assigns each new member to one of those three categories based on their first borrowings. The labels were known up front.

Clustering example: the library does not pre-classify members at all. A clustering algorithm groups members by borrowing patterns and produces (say) five emergent clusters. The librarians look at each cluster and afterwards interpret them (“group 1 = children’s-section regulars; group 2 = academic-journal users; …”). The groups emerged from the data, not from a prior label.

Example 3: Anomaly Detection Pipeline

A national rail operator wants to detect signal failures before they cause incidents:

  1. Build a baseline. Aggregate normal signal-response times under different conditions (time of day, weather, train load) from years of warehoused data.
  2. Set thresholds. Decide what “abnormal” means, e.g. response times above the 99th percentile for the current conditions.
  3. Score new data in real time. As each signal cycles, compute the response time and compare to the threshold for current conditions.
  4. Alert on outliers. A response time significantly outside the threshold raises an alert for engineers.
  5. Feed back. Confirmed real failures become training examples that improve the model; false alarms are tagged to refine thresholds.

This is the canonical anomaly-detection workflow: baseline -> threshold -> score -> alert -> refine. The same pipeline shape covers fraud, intrusion detection, equipment failure prediction, and disease outbreak monitoring.


Quick Check

Q1. Which best describes a data warehouse?

Q2. Which of these is an OLAP workload?

Q3. Which of the following is NOT one of the canonical properties of a data warehouse?

Q4. What is the key difference between classification and clustering?

Q5. A credit-card company wants to flag transactions that deviate significantly from the customer's normal spending pattern. Which data-mining technique fits?

Q6. Which technology lets a business analyst interactively slice and drill into a multi-dimensional data cube?


Classify the Technique

For each scenario, identify the data-mining technique. Use one of: classification, clustering, regression, association, sequential, anomaly.

ScenarioTechnique
Predicting next quarter's revenue from the past five years of sales data.
Grouping streaming-service viewers into emergent taste profiles without prior labels.
Flagging a server log entry that is wildly different from the recent baseline of normal requests.
Discovering that supermarket shoppers who buy strawberries also buy cream in the same trip.
Sorting incoming emails into "spam" or "not spam" using a model trained on labelled examples.
Finding that customers who buy a phone often buy a case within two weeks.

Fill in the Blanks

Complete the description of the data-warehouse and BI ecosystem.

WAREHOUSE AND BUSINESS INTELLIGENCE
===================================
A data warehouse is a central store of integrated, time-variant,
 data optimised for analytical queries.

The pipeline that loads a warehouse from many sources is called
, Extract, Transform, Load.

The interactive tooling that lets analysts slice, dice, and drill
into the warehouse is called .

The broader practice of finding patterns automatically is called
.

The umbrella term for using data to support decisions is called
.

Spot the Error

A student wrote revision notes about data warehouses. One line is wrong. Click the line with the error, then choose the correct fix.

1A data warehouse is subject-oriented, integrated, time-variant, and non-volatile 2It is optimised for analytical queries rather than transactional updates 3It is normalised to 3NF to minimise redundancy 4The ETL pipeline loads it from operational source systems 5OLAP is the interactive layer used by business analysts to query the warehouse

Pick the correct fix for line 3:


Identify the Concept

A system holds 7 years of completed orders, is never updated row-by-row, accepts bulk loads only at 02:00 each morning, and is used by analysts to track long-term trends. Which type of system is being described?

Type the term (e.g. OLTP database).

A supermarket data team finds that 70% of customers who buy nappies also buy beer in the same trip. Which data-mining technique discovered this rule?

Type the technique name.


Practice Exercises

Core (HL)

  1. [Core] Warehouses and OLAP (HL) [2 marks] Define the term data warehouse.

  2. [Core] Warehouses and OLAP (HL) [4 marks] State the four canonical properties of a data warehouse and give a one-sentence summary of each.

  3. [Core] Warehouses and OLAP (HL) [4 marks] Distinguish between OLTP and OLAP workloads, giving one example of each.

Extension (HL)

  1. [Extension] Warehouses and OLAP (HL) [6 marks] Explain the primary objectives of a data warehouse for a chain of supermarkets that operates in 12 countries.

  2. [Extension] Warehouses and OLAP (HL) [4 marks] Compare classification and clustering as data-mining techniques. Refer to the type of input data, the kind of output, and a typical use case for each.

  3. [Extension] Warehouses and OLAP (HL) [4 marks] Describe the process of anomaly detection. Identify two real-world applications where it adds value.

Challenge (HL)

  1. [Challenge] Warehouses and OLAP (HL) [6 marks] Discuss how a data warehouse, OLAP tooling, and data mining work together to support business intelligence in a large hospital chain. Refer to specific examples of decisions the combined stack would enable.

  2. [Challenge] Warehouses and OLAP (HL) [6 marks] Evaluate whether a small charity with 3,000 supporters and a single fundraising platform should invest in building a data warehouse, or whether it can rely on querying its operational database directly. Refer to data volume, query complexity, cost, and value delivered.

Note for IB CS learners: A3.4.2 (warehouses) and A3.4.3 (OLAP + mining) are core HL extension topics. Vocabulary has shifted between syllabi: older sources say “forecasting” where the new syllabus says “regression”, and “deviation detection” where the new syllabus says “anomaly detection”. OLAP is new to the 2027 syllabus as a named term. “Link analysis” and “data matching” appear in older mark schemes but are removed from the 2027 list. Mark schemes for “Explain the objectives of a data warehouse” reward 1 mark per objective + 1 mark for the expansion.


Connections

  • Previous: Alternative Databases. The wider context of non-relational and specialised database approaches.
  • Related: Normalisation. Warehouses deliberately denormalise (the opposite direction); see also denormalisation evaluation in A3.2.7.
  • Related: Transactions and Views. Materialised views are a smaller-scale, single-database version of the same idea (precompute for fast reads).
  • Next (HL): Distributed Databases. Warehouses at the very largest scale are themselves distributed.
  • Machine Learning: Classification. Clustering, regression, and anomaly detection are the same techniques covered under A4 Machine Learning, this topic sits at the boundary between databases and ML.

© EduCS.me — A resource hub for Computer Science education

This site uses Just the Docs, a documentation theme for Jekyll.