Compression

IB Syllabus: A1.1.8 — Explain compression techniques

Key Concepts
Worked Examples
Quick Code Check
Trace Exercise
Spot the Error
Fill in the Blanks
Predict the Output
Practice Exercises
1. Core
2. Extension
3. Challenge
Connections

Key Concepts

Why Compress? (A1.1.8)

Compression is the process of reducing the number of bits needed to represent data. Every file you download, every video you stream, and every photo you share benefits from compression.

Save storage space — compressed files take up less room on disk
Faster transmission — smaller files transfer more quickly over networks
Enable streaming — real-time media (video calls, music streaming) would be impractical without compression
Compression ratio = original size / compressed size (e.g., 10 MB reduced to 2 MB = ratio of 5:1)

Lossy vs Lossless

The most important distinction in compression is whether any data is permanently lost during the process.

Lossless Compression

With lossless compression, the original data can be perfectly reconstructed from the compressed version. Not a single bit is lost.

Used when accuracy is critical: medical images, source code, legal documents, databases
Examples: PNG images, FLAC audio, ZIP archives, text files
Typically achieves lower compression ratios (2:1 to 3:1)

Lossy Compression

With lossy compression, some data is permanently discarded during compression. The original can never be perfectly recovered.

Used when small quality loss is acceptable: photos, music, video
Achieves much higher compression ratios (10:1 to 50:1 or more)
Examples: JPEG images, MP3/AAC audio, MP4/H.264 video

Comparison Table

Feature	Lossless	Lossy
Data loss	None	Some
Compression ratio	Lower (2:1 to 3:1 typical)	Higher (10:1 to 50:1+)
Reconstruct original?	Yes	No
Best for	Text, code, medical images	Photos, music, video
Examples	PNG, ZIP, FLAC	JPEG, MP3, MP4

Run-Length Encoding (RLE)

Run-Length Encoding is the simplest lossless compression technique. It replaces consecutive identical values with a count + value pair.

Example: AAABBBCCDDDDDD becomes 3A3B2C6D

RLE works well when data contains long runs of repeated values (e.g., simple graphics with large areas of the same colour). It works poorly with highly varied data — encoding ABCDEF as 1A1B1C1D1E1F actually increases the size.

Step-by-Step Algorithm

Read the input left to right
Count consecutive identical characters
Write the count followed by the character
Repeat until end of input

Transform Coding (DCT)

Transform coding is used in JPEG, MPEG, and audio compression formats. Rather than compressing the raw data directly, it transforms data into a different representation where compression is more effective.

The Discrete Cosine Transform (DCT) converts image blocks from the spatial domain (pixel values) into the frequency domain (frequency components)
Low-frequency components represent smooth gradients and broad shapes — these carry most of the visual information
High-frequency components represent fine detail and sharp edges — these can often be reduced with less visual impact
The quantization step reduces the precision of less important frequencies — this is the lossy step where data is discarded
Higher compression = more aggressive quantization = more visible quality loss

Enrichment: Huffman Encoding

This goes beyond the IB syllabus but helps build understanding.

Huffman encoding is a lossless technique that assigns variable-length codes to characters based on their frequency. More frequent characters receive shorter codes, reducing the overall size.

Build a frequency table — count how often each character appears
Create a binary tree — combine the two least frequent nodes repeatedly until one tree remains
Assign codes — left branches = 0, right branches = 1; read codes from root to leaf

Example: In the string AABBBCCCC:

C appears 4 times — gets the shortest code (e.g., 0)
B appears 3 times — gets a medium code (e.g., 10)
A appears 2 times — gets the longest code (e.g., 11)

Huffman encoding is always lossless — the original data can be decoded perfectly because no code is a prefix of another.

Enrichment: JPEG Compression Pipeline

This goes beyond the IB syllabus but helps build understanding.

JPEG compression uses a multi-step pipeline:

Divide the image into 8x8 pixel blocks
Apply DCT to each block — transforms pixel values into frequency coefficients
Quantize — reduce the precision of frequency coefficients (this is where data is lost)
Encode — apply lossless encoding (RLE + Huffman) to the quantized values

The “quality slider” in image editors controls the aggressiveness of step 3. Lower quality = more quantization = smaller file = more visible artifacts (blockiness, colour banding).

Enrichment: MPEG and Temporal Compression

This goes beyond the IB syllabus but helps build understanding.

Video compression exploits the fact that consecutive frames in a video are usually very similar. MPEG uses three types of frames:

I-frames (Intra-coded): complete images compressed like JPEG — serve as reference points
P-frames (Predicted): store only the differences from the previous frame
B-frames (Bi-directional): store differences from both the previous and next frames

This exploits temporal redundancy — instead of storing every pixel of every frame, only the changes between frames are stored. A typical sequence might be: I B B P B B P B B I …

Worked Examples

Example 1: Step-by-Step RLE Encoding

Problem: Apply RLE to the string WWWWBBBWWWWWBBB

Step 1: Read from the left. Count W’s: 4 consecutive W’s. Write 4W.

Step 2: Count B’s: 3 consecutive B’s. Write 3B.

Step 3: Count W’s: 5 consecutive W’s. Write 5W.

Step 4: Count B’s: 3 consecutive B’s. Write 3B.

Output: 4W3B5W3B

Compression ratio: 15 characters reduced to 8 characters = 15/8 = 1.88:1

Example 2: RLE Decoding

Problem: Decode the RLE string 2A5B1C3D

Step 1: 2A means 2 copies of A: AA

Step 2: 5B means 5 copies of B: BBBBB

Step 3: 1C means 1 copy of C: C

Step 4: 3D means 3 copies of D: DDD

Output: AABBBBBCDDD

Example 3: Choosing Lossy vs Lossless

Scenario A: A hospital needs to compress X-ray images for archival storage.

Answer: Must use lossless compression (e.g., PNG). Medical images require exact accuracy — even a small loss of detail could lead to misdiagnosis. The lower compression ratio is an acceptable trade-off for guaranteed accuracy.

Scenario B: A website needs to display photo thumbnails that load quickly.

Answer: Can use lossy compression (e.g., JPEG). Small quality loss is acceptable for web thumbnails, and the much higher compression ratio means faster page loads and less bandwidth usage.

Quick Code Check

Q1. Which type of compression allows the original data to be perfectly reconstructed?

Q2. What does RLE stand for?

Q3. JPEG is an example of which type of compression?

Q4. When would RLE perform poorly?

Q5. What is the purpose of the Discrete Cosine Transform (DCT) in compression?

Trace Exercise

Given the input string AAABBCCCCDDAB, trace the RLE encoding process step by step.

At each step, identify the current run of identical characters, count them, and build up the encoded output.

Step	Characters	Count	Encoded Output So Far
1	AAA
2	BB
3	CCCC
4	DD
5	A
6	B

Spot the Error

A student applied RLE to the string AABBBCC. They wrote the following steps. Click the line with the error, then pick the fix.

1Input: AABBBCC 2A appears 2 times → 2A 3B appears 3 times, C appears 3 times → 3B3C 4Result: 2A3B3C

Pick the correct fix for this line:

Fill in the Blanks

Complete the following statements about compression techniques:

Compression Types
=================
 compression allows the original data to be perfectly reconstructed.
 compression achieves higher ratios by permanently discarding some data.

Run-Length Encoding
===================
RLE replaces consecutive identical values with a  and  pair.

Transform Coding
================
JPEG uses  to convert image blocks from spatial to frequency domain.

File Format Classification
==========================
PNG is an example of  compression.

Predict the Output

Apply Run-Length Encoding to the following string:

Input: AAABBBCCDDDDDD

What is the compressed output?

Practice Exercises

Core

RLE Encode — Apply Run-Length Encoding to each of the following strings:
- RRRRGGGGBB
- AABBCC
- XXXXXXX
RLE Decode — Decode the following RLE string back to the original: 4A2B3C1D
Classify — For each of the following formats, state whether it uses lossy or lossless compression: ZIP, MP3, PNG, JPEG, FLAC, MP4

Extension

JPEG vs PNG — Compare JPEG and PNG compression. For each of the following scenarios, recommend which format to use and explain why:
- A photograph of a sunset for a travel blog
- A screenshot of source code for a programming tutorial
- A company logo with transparent background
Compression Ratio — Apply RLE to the string AAAAABBBCCCCCCCC (16 characters). Calculate the compressed length and the compression ratio.
Why RLE is Lossless — Explain why RLE is classified as lossless compression. How can you guarantee that the original data is perfectly recovered from the compressed version?

Challenge

Streaming Service Design — A streaming service needs to compress both audio and video for real-time playback. For each media type, evaluate which compression methods they should use, considering bandwidth constraints, quality requirements, and latency. Explain the trade-offs involved.
MPEG Frame Types — Explain why MPEG uses three different frame types (I-frames, P-frames, and B-frames) rather than compressing each frame independently like JPEG. How does this approach exploit temporal redundancy, and what are the implications for seeking (jumping to a specific time) in a video?

Connections

Prerequisites: None specific (standalone topic)
Related: Secondary Storage — compression reduces storage needs for all secondary storage media
Related: Cloud Computing — compression reduces bandwidth for cloud services and data transfer
Forward: Communication Layer — compression is essential for efficient network data transmission

Compression

Table of Contents

Key Concepts

Why Compress? (A1.1.8)

Lossy vs Lossless

Lossless Compression

Lossy Compression

Comparison Table

Run-Length Encoding (RLE)

Step-by-Step Algorithm

Transform Coding (DCT)

Enrichment: Huffman Encoding

Enrichment: JPEG Compression Pipeline

Enrichment: MPEG and Temporal Compression

Worked Examples

Example 1: Step-by-Step RLE Encoding

Example 2: RLE Decoding

Example 3: Choosing Lossy vs Lossless

Quick Code Check

Trace Exercise

Spot the Error

Fill in the Blanks

Predict the Output

Practice Exercises

Core

Extension

Challenge

Connections