Diffusion Models
SL and HL. This is the core technique everyone must know in depth. Booklet section: Diffusion models. Syllabus links: A4 (the model) and A1 (the CNN denoiser and computational resources).
Diffusion models are the technique behind most modern text-to-image tools, including DALL-E and Stable Diffusion. They are the one model in this case study that both SL and HL must understand thoroughly. The central idea is simple to state and worth saying out loud: a diffusion model builds an image by starting from random noise and removing that noise step by step until a clear image is left.
Table of Contents
- The big idea: from noise to image
- The neural denoiser is a CNN
- The training data decides the content
- DDPM: the framework behind it
- Strengths and costs
- Why this matters to Visionary Studios
- Quick check
- Practice exercises
- Connections
The big idea: from noise to image
A diffusion model does not draw an image in one go. It starts with noise injection: a field of random pixels with no structure. It then runs a repeated denoising process, removing a little noise on each pass and adding a little structure, until the random starting point has become a coherent image.
flowchart LR
N["Pure noise"] --> D1["denoise"] --> D2["denoise"] --> D3["denoise"] --> D4["..."] --> D5["denoise"] --> IMG["Clear image"]
A schematic of the denoising process. Each step removes a little noise and adds a little structure. (Original diagram, drawn for this page.)
The noise is random, so the same model can produce a different image from each new starting point. That is what gives a diffusion model its variety: change the noise and you change the output.
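The loop described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: `toy_denoise_step` is a hypothetical stand-in that smooths each pixel toward its neighbours, whereas a real diffusion model uses a trained neural denoiser. The names `generate` and `roughness` are also invented for this illustration. What the sketch does show is the shape of the process: start from random noise, apply the same step many times, and get a different result from each new starting noise.

```python
import numpy as np

def toy_denoise_step(x, strength=0.5):
    """Hypothetical stand-in for the trained denoiser (NOT a real model):
    blend each pixel toward the average of its four neighbours,
    wrapping around at the image edges."""
    neighbours = (
        np.roll(x, 1, axis=0) + np.roll(x, -1, axis=0)
        + np.roll(x, 1, axis=1) + np.roll(x, -1, axis=1)
    ) / 4.0
    return (1.0 - strength) * x + strength * neighbours

def roughness(x):
    """Mean absolute difference between horizontally adjacent pixels."""
    return float(np.mean(np.abs(np.diff(x, axis=1))))

def generate(seed, steps=20, size=16):
    """Start from pure random noise and apply the denoising step repeatedly."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((size, size))  # the random starting point
    for _ in range(steps):                 # one pass per denoising step
        x = toy_denoise_step(x)
    return x

image_a = generate(seed=0)
image_b = generate(seed=1)  # same "model", different starting noise
```

Running `generate` twice with the same seed gives the same array, while a different seed gives a different one, which mirrors the point that changing the noise changes the output.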
The neural denoiser is a CNN
The work of each step is done by a neural denoiser, which is typically a convolutional neural network (CNN). A CNN is a neural network designed for images: it slides small learnable filters across the picture to detect features such as edges, textures, and shapes, then builds those simple features into more complex ones.
During training the CNN is shown real images with noise added and learns to predict that noise, which is how it learns what real images look like. During generation it uses that learning to reconstruct plausible features out of the noise, a little more on each pass. This connects the case study directly to Theme A1: the denoiser is a concrete example of a CNN doing real work.
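To make "sliding a small filter across the picture" concrete, here is a minimal sketch of a single filter passing over a tiny image. The function name `slide_filter` and the hand-written edge filter are assumptions made for this illustration; in a real CNN the filter values are learned during training, and there are many filters stacked in layers.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide a small filter over the image and record, at each position,
    how strongly the patch under the filter matches it."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 5x5 image: dark (0) on the left, bright (1) on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written filter that responds to dark-to-bright vertical edges.
# Illustrative values only; a CNN would learn its own.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

feature_map = slide_filter(image, vertical_edge)
print(feature_map)
```

The output is large where the filter sits over the edge and zero where the patch is uniformly bright, which is the sense in which a filter "detects" a feature.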
The training data decides the content
A diffusion model can generate entirely new images that were not in its training set, but what kind of images it produces is decided by what it was trained on. A model trained on photographs of animals tends to denoise toward cats and dogs; one trained on cityscapes tends to denoise toward urban scenes. The model is not copying a stored picture; it is revealing the patterns it learned, shaped by the random noise it started from.
This is why dataset curation matters so much for diffusion models, a point the ethics page develops: the training data is the single biggest influence on what the model can and cannot make, and on whether its outputs are fair and legally clean.
DDPM: the framework behind it
Modern diffusion models are based on the denoising diffusion probabilistic model (DDPM). DDPM is the mathematical framework that formalises the two halves of the process: gradually adding noise to training images (the forward process) and gradually removing it during generation (the reverse process). You do not need the mathematics for the exam, but you should be able to name DDPM as the framework and say what it describes: noise added to training images according to a fixed, step-by-step schedule, and a denoiser trained to undo that noise one step at a time. That principled structure is what makes diffusion models reliable to train.
flowchart LR
IMG["Training image"] -->|"forward process: add noise"| NOISE["Random noise"]
NOISE -->|"reverse process: remove noise"| GEN["Generated image"]
DDPM formalises both directions: noise is added to training images (the forward process), and removed step by step to generate a new image (the reverse process). (Original diagram.)
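For readers who want to see the forward process in code, the sketch below is optional and goes beyond what the exam requires. It assumes a linear noise schedule with the values used in the original DDPM paper (1000 steps, noise levels rising from 0.0001 to 0.02); the function names `make_schedule` and `add_noise` are invented for this illustration. It shows only the forward direction: mixing a training image with random noise, lightly at early steps and almost completely at the last step.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed values, as in the original DDPM setup).
    alpha_bars[t] is the fraction of the original signal left at step t."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Forward process: jump straight to step t by mixing the clean image
    x0 with Gaussian noise in the proportions set by the schedule."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise  # the denoiser is trained to predict `noise` from xt

betas, alpha_bars = make_schedule()
rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                                  # stand-in "training image"
x_early, _ = add_noise(x0, 10, alpha_bars, rng)       # still mostly image
x_late, _ = add_noise(x0, len(alpha_bars) - 1, alpha_bars, rng)  # almost pure noise
```

The reverse process is the learned part: a trained denoiser estimates the noise at each step so that it can be removed, which this sketch does not attempt to implement.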
Strengths and costs
The headline strength of diffusion models is image quality. They can produce photorealistic results and are generally more stable to train than GANs, which is a large part of why they have become the default for high-quality image generation.
The headline cost is computation. Because generation is iterative, producing a single image means running the neural denoiser many times over. Each step is another full pass through the CNN, so one image can take dozens or even hundreds of passes. That makes diffusion computationally expensive, and it usually runs on hardware built for parallel work, such as GPUs. For a studio generating images at scale, this cost is a real constraint on time and budget, which is why optimising the process (fewer steps, cheaper passes) is so important in practice. The evaluating models page returns to this as “computational efficiency.”
Exam framing. A common question is why diffusion models are computationally expensive. The mark scheme wants the link spelled out: the process is iterative (many denoising steps), and each step runs the neural denoiser, so many repeated passes demand a lot of computation. State the point, then give the reason.
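The cost argument above is simple multiplication, and a short worked example makes the scale visible. All numbers here are hypothetical and chosen only for illustration (200 images, 50 or 20 denoising steps each, 100 milliseconds per pass); they are not measurements of any real model or of Visionary Studios.

```python
def generation_cost(num_images, steps_per_image, ms_per_pass):
    """Total denoiser passes and total time in seconds.
    Each denoising step is one full pass through the network."""
    passes = num_images * steps_per_image
    seconds = passes * ms_per_pass / 1000
    return passes, seconds

# Hypothetical campaign: 200 images at 50 steps each, 100 ms per pass.
passes, seconds = generation_cost(num_images=200, steps_per_image=50, ms_per_pass=100)
print(f"{passes} passes, about {seconds / 60:.1f} minutes")

# Same campaign with fewer steps per image (one common optimisation).
fewer_passes, fewer_seconds = generation_cost(num_images=200, steps_per_image=20, ms_per_pass=100)
print(f"{fewer_passes} passes, about {fewer_seconds / 60:.1f} minutes")
```

Because the step count multiplies every image, cutting the number of steps reduces the total cost in direct proportion, which is why it is such a common target for optimisation.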
Why this matters to Visionary Studios
For Visionary Studios, a diffusion model is a strong default: it gives the photorealistic, on-brief quality that advertising and concept art need, and it trains reliably. The studio has to weigh that against the compute cost of generating many images for a campaign, and against the ethics of the data the model was trained on. Those trade-offs are exactly what the extended-response question will ask you to argue.
Quick check
Q1. What does a diffusion model start from when it generates a new image?
Q2. The neural denoiser inside a diffusion model is typically which kind of network?
Q3. Why are diffusion models computationally expensive?
Q4. A diffusion model trained only on photographs of cities is asked to generate an image from fresh noise. What is it most likely to produce?
Q5. What does DDPM refer to in the context of diffusion models?
Practice exercises
Mark allocations and command terms match the case study’s exam style. Anchor every answer to Visionary Studios, not to “AI” in general.
Core
- Outline (2 marks) - Outline how a diffusion model generates an image.
- Describe (3 marks) - Describe the role of the convolutional neural network (CNN) inside a diffusion model.
Extension
- Explain (4 marks) - Explain why two images generated by the same diffusion model can look completely different. Refer to noise injection.
- Explain (4 marks) - Explain why diffusion models are computationally expensive, and why this matters to a studio generating images at scale. Write in prose, with no diagram.
Challenge
- Discuss (6 marks) - Visionary Studios needs photorealistic images for a national advertising campaign but has a limited compute budget. Discuss whether a diffusion model is the right choice, weighing image quality against computational cost, and reach a conclusion.
Connections
- Previous: Image generation techniques - the modes a diffusion model can run in.
- Next (HL): GANs - a different approach, with sharper images but harder training.
- Related: Evaluating generative AI models - where computational efficiency is compared.
- Course link: Hardware: GPUs and accelerators - why parallel hardware suits this work (A1).