Image Generation Techniques
SL and HL. Booklet section: Image generation techniques. Syllabus link: A4 (machine learning). These are the categories of generation that sit above the specific models in the rest of the case study.
Before comparing diffusion models, GANs, and hybrid models, you need the vocabulary for what kind of generation a tool is doing. Generative image models are usually grouped by how much control the user has over the output. Visionary Studios cares about this directly: some jobs need a precise result from a brief, others need free exploration of ideas.
Table of Contents
- Three ways to generate an image
- Text-to-image generation
- Conditional image generation
- Unconditional image generation
- How the categories compare
- Quick check
- Practice exercises
- Connections
Three ways to generate an image
The booklet names three categories. They differ in what you feed the model and therefore how much you can steer the result.
- Text-to-image: you describe the image in words.
- Conditional: you give the model a specific input to follow (a label, a sketch, a layout).
- Unconditional: you give no steering input and let the model produce something from what it has learned.
flowchart LR
C["Conditional: label, sketch,<br/>or segmentation map"] --> IMG["Generated image"]
P["Text-to-image: a written prompt"] --> IMG
U["Unconditional: no steering input"] --> IMG
The three modes feed the same kind of model; they differ in how much you steer the result. Conditional gives the most control, text-to-image less, unconditional the least. (Original diagram.)
Text-to-image generation
In text-to-image generation the input is a written prompt and the output is a matching image. Tools such as DALL-E and Stable Diffusion work this way. The model has learned to connect language with visual features, so the wording of the prompt steers what appears.
For Visionary Studios this is the fastest way to turn a creative brief into a draft. A designer can type “a minimalist poster of a city at dawn, muted colours” and get several concepts to react to, with no drawing required. The trade-off is control: the result depends heavily on prompt wording and the model can misread intent, so text-to-image is strong for early ideas and weaker when an exact, repeatable result is needed.
Conditional image generation
Conditional image generation produces an image from a specific input that constrains the output. The conditioning input is what makes the result predictable. The booklet gives two important kinds.
Class-conditional models generate an image of a chosen category. If the model is told “landscape” it produces a landscape; told “animal” it produces an animal. This is useful when you know the kind of image you want but not the exact details.
Image-to-image translation transforms one image into another while keeping its structure. Examples include turning a sketch into a realistic render, or colourising a black-and-white photo. Visionary Studios could feed in a designer’s line drawing and get back a finished product shot in the brand style, reusing the existing composition instead of starting from scratch.
A related conditioning input is a segmentation map, an image in which each region is labelled by what it represents (sky, building, product, logo). The model uses the map as a layout guide, so the right content appears in the right place. This gives precise spatial control, at the cost of having to prepare the map first.
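To make the idea of a layout guide concrete, here is a minimal sketch of a segmentation map as a grid of labels. It is a toy, not a generative model: the `render_from_map` function and the colour palette are invented for this example, and each region is simply painted one flat colour where a real model would synthesise detail.

```python
# Toy illustration only: a real model synthesises textures and detail;
# here each label is simply painted with one flat, made-up colour.

# A segmentation map: every cell names what belongs at that position.
SEGMENTATION_MAP = [
    ["sky",      "sky",      "sky",     "sky"],
    ["building", "building", "sky",     "sky"],
    ["building", "building", "product", "logo"],
]

# Hypothetical palette (RGB), chosen for this example, not from the booklet.
PALETTE = {
    "sky": (135, 206, 235),
    "building": (120, 120, 120),
    "product": (200, 30, 30),
    "logo": (255, 255, 255),
}


def render_from_map(seg_map, palette):
    """Return an image (rows of RGB tuples) laid out exactly as the map says."""
    return [[palette[label] for label in row] for row in seg_map]


image = render_from_map(SEGMENTATION_MAP, PALETTE)
for row in image:
    print(row)
```

Moving the "product" label to a different cell moves the product in the output. That is the spatial control the map buys, and preparing the grid is the cost.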
Unconditional image generation
Unconditional image generation uses no steering input at all. The model relies only on the patterns it learned during training, so each output is effectively a random sample of the kinds of image found in the training data. There is little control over any single result.
This sounds less useful, but it has two real uses. The first is creative exploration: generating a batch of varied, unexpected images to spark ideas. The second is making synthetic datasets, that is, generating artificial images to train other AI systems when real labelled data is scarce or expensive.
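A very small sketch can show what "no steering input" means in practice. The example below is a deliberately simple stand-in for a real model: it assumes a made-up training set of 3x3 black-and-white images, "learns" how often each pixel is switched on, and then samples new images using nothing but those learned frequencies and a random number generator. The function names are invented for illustration.

```python
import random

# Tiny made-up "training set": 3x3 black-and-white images (1 = ink, 0 = blank).
TRAINING_IMAGES = [
    [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
    [[1, 1, 1], [0, 1, 0], [0, 1, 1]],
    [[1, 1, 0], [0, 1, 0], [0, 1, 0]],
    [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
]


def learn_pixel_probabilities(images):
    """'Training': how often is each pixel switched on across the data?"""
    height, width = len(images[0]), len(images[0][0])
    return [
        [sum(img[r][c] for img in images) / len(images) for c in range(width)]
        for r in range(height)
    ]


def generate_unconditional(probabilities, rng):
    """Sample one image. The only inputs are learned patterns and randomness."""
    return [[1 if rng.random() < p else 0 for p in row] for row in probabilities]


probabilities = learn_pixel_probabilities(TRAINING_IMAGES)
rng = random.Random(0)  # fixed seed so the run is repeatable
synthetic_dataset = [generate_unconditional(probabilities, rng) for _ in range(5)]
for image in synthetic_dataset:
    print(image)
```

Notice that `generate_unconditional` takes no prompt, label, or sketch. Pixels that were always on or always off in the training data stay that way, and the rest vary from sample to sample. This is why the output reflects the training data yet cannot be steered towards one particular image.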
How the categories compare
| Category | What you provide | Control over the result | Good for |
|---|---|---|---|
| Text-to-image | A written prompt | Medium (steered by wording) | Fast concepts from a brief |
| Conditional | A label, sketch, or segmentation map | High | Predictable, on-brief results |
| Unconditional | Nothing | Low | Exploration, synthetic data |
The categories are not rival models; they are modes of generation. The same underlying model (a diffusion model, for example) can be used unconditionally or conditioned on text or an image. The point to take into the exam is that more steering input means more control and less surprise, and Visionary Studios will pick the mode that fits the job.
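The "same model, different mode" idea can be sketched with a toy generator. Everything here is an assumption made for illustration: an "image" is reduced to a single brightness value, the two classes and their typical brightness are invented, and `generate` is a hypothetical function, not part of any real tool. The point is only that one function can run with or without a steering input, and that steering narrows the range of results.

```python
import random
import statistics

# Made-up "learned" knowledge: typical brightness (0 = dark, 1 = light)
# for two image classes. A real model learns far richer patterns.
CLASS_BRIGHTNESS = {"night scene": 0.2, "day scene": 0.8}
VARIATION = 0.05  # how much individual images differ within a class


def generate(rng, label=None):
    """Return one 'image', reduced here to a single brightness value.

    With no label the model chooses the class itself (unconditional);
    with a label the caller steers the result (class-conditional).
    """
    if label is None:
        label = rng.choice(sorted(CLASS_BRIGHTNESS))
    return rng.gauss(CLASS_BRIGHTNESS[label], VARIATION)


rng = random.Random(42)  # fixed seed so the run is repeatable
unconditional = [generate(rng) for _ in range(1000)]
conditional = [generate(rng, label="day scene") for _ in range(1000)]

print("spread without steering:", round(statistics.pstdev(unconditional), 2))
print("spread with a label:    ", round(statistics.pstdev(conditional), 2))
```

The unsteered batch mixes dark and light results, so its spread is much larger. The labelled batch clusters around the brightness of the chosen class. That is "more control, less surprise" in miniature.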
Quick check
Q1. A designer types "a watercolour landscape at sunset" and receives several matching images. Which technique is being used?
Q2. Visionary Studios feeds a line drawing into a model and gets back a photorealistic product shot in the same composition. Which technique does this best describe?
Q3. Which technique gives the user the least control over the specific image produced?
Q4. What is a segmentation map used for in conditional generation?
Practice exercises
Mark allocations and command terms match the case study’s exam style. Use them to practise precise, scenario-anchored answers.
Core
- Distinguish (4 marks) - Distinguish between conditional and unconditional image generation, with one example of when Visionary Studios would use each.
- Outline (2 marks) - Outline one benefit of text-to-image generation for producing advertising concepts.
Extension
- Describe (3 marks) - Describe how image-to-image translation could let Visionary Studios reuse a designer’s existing sketches.
- Explain (4 marks) - Explain why “more steering input” generally means “more control but less surprise” in image generation. Write in prose, with no diagram.
Challenge
- Discuss (6 marks) - A junior designer argues that text-to-image generation makes the other techniques unnecessary. Discuss whether Visionary Studios should rely on text-to-image generation alone, considering control, consistency, and the different jobs the studio does.
Connections
- Next: Diffusion models - the core technique that powers most text-to-image tools.
- Related: Evaluating generative AI models - how the studio chooses between options.
- Course link: Ethics of Machine Learning - the wider A4 ethics that the case study draws on.