Training and Evaluating a Model

IB Syllabus: A4.2.1 data cleaning, A4.2.2 feature selection, A4.2.3 dimensionality reduction; A4.3.1 linear regression, A4.3.3 hyperparameter tuning and evaluation metrics (accuracy, precision, recall, F1; overfitting, underfitting), A4.3.9 CNNs, A4.3.10 model selection and comparison.

This page is HL only. It covers how a model is built from data and how you judge whether it is any good. The core overview of machine learning is on What Machine Learning Is.

Why This Page Exists
Stage 1: Preparing the Data
Stage 2: Training a Model, with Linear Regression (A4.3.1)
Stage 3: Is It Any Good? Evaluation (A4.3.3)
Stage 4: Deep Learning for Images, with CNNs (A4.3.9)
Stage 5: Choosing Between Models (A4.3.10)
Quick Check
Match the Concept
Practice Exercises
1. Core
2. Extension
3. Challenge
Connections

Why This Page Exists

A model is only as good as the process that built it. That process is a pipeline: prepare the data, train a model, evaluate it honestly, and choose between the options you have. Skip or fumble any stage and even a clever algorithm produces a model that fails quietly in the real world.

This page walks that pipeline. The theme running through all of it is honesty: being honest about the quality of your data, and honest about how good your model really is rather than how good it looks.

Try it yourself: the machine learning simulators include live demos for regression, overfitting, evaluation metrics, and CNNs. They make this page’s ideas visible instead of abstract.

Stage 1: Preparing the Data

Most of the real work in machine learning is not the model; it is getting the data into a state worth learning from. There is a reason practitioners say a model is only as good as its data.

Data cleaning (A4.2.1)

Real data is messy: values are missing, records are duplicated, some entries are plain errors, and some are extreme outliers that may be mistakes or may be real. Data cleaning is fixing these before training: filling in or removing missing values, deleting duplicates, correcting or discarding errors, and deciding what to do with outliers.

It matters because a model faithfully learns whatever is in its data, including the mistakes. “Garbage in, garbage out” undersells it: a model learns the garbage and applies it confidently to every future case. Cleaning is not glamorous, but it decides whether the rest is worth doing.
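As a minimal sketch of those cleaning steps, assuming a small list of house records with hypothetical `size` and `price` fields (the field names and the rules for what counts as an error are illustrative, not part of the syllabus):

```python
from statistics import median

def clean(rows):
    """Return a cleaned copy of rows; each row is a dict with 'size' and 'price'."""
    # 1. Delete exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = (row["size"], row["price"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the raw data is left untouched

    # 2. Discard plain errors. Assumption: a price must be present and positive.
    valid = [r for r in unique if r["price"] is not None and r["price"] > 0]

    # 3. Fill missing sizes with the median of the sizes that are known.
    fill = median(r["size"] for r in valid if r["size"] is not None)
    for r in valid:
        if r["size"] is None:
            r["size"] = fill
    return valid

raw = [
    {"size": 50, "price": 150},
    {"size": 50, "price": 150},    # duplicate record
    {"size": None, "price": 210},  # missing value
    {"size": 90, "price": -1},     # plain error
    {"size": 70, "price": 200},
]
cleaned = clean(raw)
print(cleaned)
```

Outliers are deliberately left alone here: whether an extreme value is a mistake or a real case is a judgement the code cannot make for you.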

Feature selection (A4.2.2)

A feature is one input the model can look at. More features are not automatically better: irrelevant or redundant ones add noise, slow training, and can make a model worse. Feature selection is choosing the features that actually help. There are three broad approaches:

Filter methods rank features by a statistic (such as how strongly each relates to the target) before training, independent of any model.
Wrapper methods try different subsets of features and evaluate each subset by actually training the model with it, then keep the best.
Embedded methods let the selection happen during training, as part of how the model itself is built.

Good feature selection often improves a model more than a fancier algorithm would. Choosing what to look at is frequently more important than choosing how to look.
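A filter method can be sketched in a few lines. The example below assumes the statistic is the absolute Pearson correlation between each feature and the target, which is one common choice rather than the only one; the feature names and numbers are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rank_features(features, target):
    """Rank feature names from most to least related to the target, with no model involved."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: house size should matter for price, the door number should not.
features = {
    "size": [50, 60, 70, 80, 90],
    "door_number": [7, 2, 9, 4, 5],
}
price = [150, 180, 205, 240, 270]

ranking = rank_features(features, price)
print(ranking)
```

A wrapper method would instead train the model once per candidate subset, which is slower but judges features by what the model actually does with them.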

Dimensionality reduction (A4.2.3)

When data has a very large number of features, dimensionality reduction shrinks it to fewer while keeping as much of the useful information as possible. This speeds up training, reduces noise, and helps avoid the problems that come with too many dimensions. Established techniques exist for this; you are not required to know their internal maths, only why reducing dimensions is useful: fewer, more meaningful inputs usually mean a faster, more reliable model.

Try it yourself: the bias simulator trains on a flawed dataset and shows an accurate-looking model treating two groups very differently, a direct reminder that data problems survive training and reappear as unfairness.

Stage 2: Training a Model, with Linear Regression (A4.3.1)

Linear regression is the clearest example of training. It predicts a continuous number (not a category) by fitting a straight line to the data: a line of the form y = mx + b, where the model learns the slope m and intercept b.

“Fitting” means adjusting the line to make its predictions as close as possible to the real values. The model measures its total error (how far off its predictions are, commonly as the mean of the squared differences), then repeatedly nudges the slope and intercept to reduce that error. This is often done by gradient descent (stepping downhill on the error). When the error stops shrinking, the line has settled into the best fit the model can find.
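The fitting loop can be sketched as follows. This assumes the error is the mean squared error and that the learning rate and step count (both chosen here for illustration) suit the data; it is a teaching sketch, not how a library would necessarily do it.

```python
def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Fit y = m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0  # start from a flat line through the origin
    n = len(xs)
    for _ in range(steps):
        # How far off is each prediction right now?
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to m and b.
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step downhill: move each parameter against its gradient.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Illustrative data generated from y = 2x + 1, so the true slope and intercept are known.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, b = fit_line(xs, ys)
print(round(m, 3), round(b, 3))
```

On this data the loop settles very close to a slope of 2 and an intercept of 1. Raising the learning rate to 0.1 makes each step overshoot, so the error grows instead of shrinking.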

Application: predicting a house price from its size, a temperature trend from past readings, or sales from advertising spend.

Try it yourself: the regression simulator fits a line by gradient descent step by step, with the error falling as the line settles. Set the learning rate too high and watch the training diverge instead of converge.

Stage 3: Is It Any Good? Evaluation (A4.3.3)

A model that looks perfect can be useless. Evaluation is how you find out the truth, and it starts with one rule.

Never judge a model on the data it trained on

You split your data: train the model on one part, then test it on data it has never seen. A model judged on its own training data is like a student marking their own homework with the answer sheet open. Only performance on unseen data tells you whether it actually learned the pattern or just memorised the examples.
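A minimal sketch of such a split, assuming a 25% test share and a fixed random seed (both are illustrative choices, not rules):

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows, then hold back a fraction as unseen test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(20))  # stand-in for 20 labelled examples
train, test = train_test_split(data)
print(len(train), len(test))
```

Shuffling first matters when the data is stored in some order (by date or by class, say), because otherwise the test set may not resemble the training set at all.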

Overfitting and underfitting

This gives us the two ways a model goes wrong:

Overfitting: the model learns the training data too well, including its noise and quirks. It scores brilliantly on training data and poorly on new data. It memorised instead of generalising.
Underfitting: the model is too simple to capture the real pattern. It does poorly on training data and new data.

The goal is the middle: complex enough to capture the real pattern, simple enough to generalise. Notably, simpler models often generalise better than complex ones, because they have less room to memorise noise.
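The contrast can be made concrete with a deliberately contrived sketch. It assumes the real pattern is "label 1 when x is at least 5", that two training labels are noisy, and that the memorising model is a nearest-neighbour lookup; the simple rule is written by hand here rather than learned.

```python
def accuracy(model, points):
    """Fraction of (x, label) points the model labels correctly."""
    return sum(model(x) == label for x, label in points) / len(points)

# Training data: the real pattern is label 1 when x >= 5,
# but the labels at x = 2 and x = 7 are noise.
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
         (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
# Unseen data that follows the real pattern with no noise.
test = [(x + 0.4, int(x + 0.4 >= 5)) for x in range(10)]

def memoriser(x):
    """Copy the label of the closest training point, noise included."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def simple_rule(x):
    """A single threshold: too simple to memorise individual points."""
    return int(x >= 5)

print("memoriser  ", accuracy(memoriser, train), accuracy(memoriser, test))
print("simple rule", accuracy(simple_rule, train), accuracy(simple_rule, test))
```

The memoriser is perfect on the training data and worse on the unseen data, because it reproduces the two noisy labels. The simple rule looks worse on training data but gets every unseen point right.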

Try it yourself: the fit simulator lets you slide model complexity from underfit through a good fit to overfit, watching training error and test error move apart. The same page’s confusion matrix demonstrates the metrics below.

Accuracy is not enough

Accuracy (the fraction of predictions that are correct) is the obvious metric, and it can badly mislead. If 99% of transactions are genuine, a model that labels everything “genuine” scores 99% accuracy while catching zero fraud. On imbalanced data, accuracy hides failure.
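The fraud example can be checked directly. The sketch below assumes 1,000 transactions of which 10 are fraud, and a "model" that simply labels everything genuine:

```python
# Assumed dataset: 10 fraudulent transactions among 1000 (99% genuine).
labels = ["fraud"] * 10 + ["genuine"] * 990

# A lazy model that never flags anything.
predictions = ["genuine"] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
fraud_caught = sum(p == "fraud" and y == "fraud" for p, y in zip(predictions, labels))

print(accuracy, fraud_caught)
```

The accuracy comes out at 0.99 while the number of fraud cases caught is 0, which is exactly the failure that accuracy hides.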

So we use finer metrics, built from four outcomes (a confusion matrix): true positives, false positives, true negatives, false negatives.

Metric	Question it answers	Rough definition
Accuracy	How often is the model right overall?	Correct predictions out of all predictions
Precision	When it says “yes”, how often is it right?	True positives out of all predicted positive
Recall	Of all the real “yes” cases, how many did it catch?	True positives out of all actual positive
F1 score	A single balance of precision and recall	Harmonic mean of precision and recall: high only when both are high

Precision and recall usually trade off: catching more real cases (higher recall) often means more false alarms (lower precision). Which matters more depends on the problem. In medical screening, missing a disease (a false negative, so low recall) is usually far worse than a false alarm; in fraud detection, flagging a genuine customer as fraud (a false positive, so low precision) has its own cost. F1 gives one number when you want both to be good.
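The metrics and the trade-off can be sketched together. This assumes a model that outputs a score between 0 and 1 and says "yes" when the score reaches a threshold; the scores, labels, and thresholds are made up for illustration.

```python
def precision_recall_f1(labels, predictions):
    """Compute precision, recall, and F1 from 0/1 labels and 0/1 predictions."""
    tp = sum(y == 1 and p == 1 for y, p in zip(labels, predictions))  # true positives
    fp = sum(y == 0 and p == 1 for y, p in zip(labels, predictions))  # false positives
    fn = sum(y == 1 and p == 0 for y, p in zip(labels, predictions))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1

# Hypothetical model scores and the true labels for eight cases.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

def predict(threshold):
    """Say 'yes' (1) whenever the score reaches the threshold."""
    return [int(s >= threshold) for s in scores]

strict = precision_recall_f1(labels, predict(0.85))   # cautious: few "yes" answers
lenient = precision_recall_f1(labels, predict(0.35))  # eager: many "yes" answers

print("strict ", strict)
print("lenient", lenient)
```

With the strict threshold every "yes" is right (precision 1.0) but half the real cases are missed (recall 0.5). With the lenient threshold every real case is caught (recall 1.0) at the price of two false alarms, so precision drops to about 0.67.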

Hyperparameter tuning

Some settings are not learned from data; you choose them: the k in k-NN, the depth of a decision tree, the learning rate in gradient descent. These are hyperparameters. Tuning them means trying different values and keeping those that perform best on held-out data the model did not train on. Ideally that is a separate validation set, so the final test set stays untouched for one honest last check. Good tuning is often the difference between a mediocre model and a strong one.
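A minimal tuning sketch, assuming a tiny one-feature k-NN classifier, three candidate values of k, and a held-out validation set used only for choosing k (the data is invented, with two noisy training labels):

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, points, k):
    """Fraction of (x, label) points that k-NN labels correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in points) / len(points)

# Training data: label 1 when x >= 5, except noisy labels at x = 2 and x = 7.
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
         (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
# Held-out validation data, never used for training.
validation = [(x + 0.4, int(x + 0.4 >= 5)) for x in range(10)]

# Try each candidate k and keep the one that does best on the validation data.
results = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)

print(results, best_k)
```

Here k = 1 copies the noisy labels, k = 5 blurs the boundary between the classes, and k = 3 sits between the two and scores best on the validation data.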

Stage 4: Deep Learning for Images, with CNNs (A4.3.9)

A convolutional neural network (CNN) is a deep-learning design built for images. It learns a spatial hierarchy of features: simple patterns first, then combinations of them, building up to whole objects. It has three kinds of layer:

Convolutional layers slide small filters across the image, each filter detecting a local pattern (an edge, a corner, a patch of colour) and producing a feature map of where that pattern appears.
Pooling layers shrink each feature map by summarising small regions (for example, keeping the strongest signal in each). This reduces size and makes the network less sensitive to exactly where a feature sits.
Fully connected layers at the end take the detected features and make the final decision (for example, “this is a cat”).

Stacking these lets early layers find edges, middle layers assemble them into shapes, and later layers recognise whole objects. That layered build-up of features is what makes CNNs so effective on images.
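Convolution and pooling can be sketched on a tiny image. This assumes a 5x5 greyscale image held as nested lists, a hand-written 2x2 filter that responds to vertical edges, and 2x2 max pooling; a real CNN learns its filter values during training rather than having them written in.

```python
def convolve(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) to build a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Keep the strongest signal in each non-overlapping size x size region."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# A 5x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written filter: responds where brightness rises from left to right.
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = convolve(image, vertical_edge)  # 4x4 map of where the edge is
pooled = max_pool(feature_map)                # shrunk to 2x2

print(feature_map)
print(pooled)
```

The feature map is non-zero only in the column where dark meets bright, and the pooled map still records that the edge is on the left half while being a quarter of the size.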

Try it yourself: the CNN simulator slides a filter across an image patch by patch to build a feature map, then applies pooling to shrink it, so you can see convolution and pooling actually happen.

Stage 5: Choosing Between Models (A4.3.10)

Usually you have several candidate models (different algorithms, or the same algorithm with different settings). Model selection is choosing between them, and done honestly it follows a few rules:

Compare on the same unseen test data. A fair race uses the same finish line; comparing one model’s training score with another’s test score is meaningless.
Use a metric that fits the problem. For imbalanced data, compare on precision, recall, or F1, not accuracy.
Prefer the simpler model when performance is close. It usually generalises better, runs faster, and is easier to understand and maintain.
Weigh more than the score. Interpretability, speed, cost, and fairness can matter as much as a fraction of a percent of accuracy.

The best model is rarely just the one with the highest single number. It is the one that performs well and fits the real constraints of where it will be used.
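Those rules can be sketched as a small selection function. It assumes every candidate was scored with F1 on the same unseen test data, that "close" means within 0.01, and that complexity can be ranked with a simple number; the model names and scores are hypothetical.

```python
def choose_model(candidates, tolerance=0.01):
    """Among models within `tolerance` of the best test F1, pick the simplest."""
    best_score = max(c["test_f1"] for c in candidates)
    close = [c for c in candidates if best_score - c["test_f1"] <= tolerance]
    return min(close, key=lambda c: c["complexity"])

# Hypothetical results, all measured on the same held-out test set.
candidates = [
    {"name": "deep network", "test_f1": 0.912, "complexity": 3},
    {"name": "decision tree", "test_f1": 0.905, "complexity": 1},
    {"name": "k-NN", "test_f1": 0.840, "complexity": 2},
]

chosen = choose_model(candidates)
print(chosen["name"])
```

The deep network has the highest score, but the decision tree is within the tolerance and simpler, so it is the one chosen. Interpretability, cost, and fairness are not captured by this function and still need a human judgement.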

Quick Check

Q1. What does data cleaning involve?

Q2. A model scores 99% on its training data but only 60% on new data it has never seen. What is this called?

Q3. A fraud detector reaches 99% accuracy, but 99% of transactions are genuine and it never flags any fraud. What does this show?

Q4. What kind of outcome does linear regression predict?

Q5. In a CNN, what do the convolutional layers do?

Q6. What is the fair way to choose between two candidate models?

Match the Concept

Name the concept each description points to: overfitting, underfitting, precision, recall, or feature selection.

Fill in the blanks.

// Great on training data, poor on new data: it memorised the noise
// Concept: 

// Too simple to capture the pattern; poor on training and new data alike
// Concept: 

// Of all the real positive cases, the fraction the model actually caught
// Concept: 

// When the model says "yes", the fraction of times it is actually right
// Concept: 

// Choosing which inputs to keep so the model learns from what matters
// Concept:

Practice Exercises

Note for IB CS learners: these A4.2 and A4.3 outcomes are examined with Describe and Explain. The exercises below practise that. At least one asks for a full prose response. All content here is HL.

Core

Describe (4 marks) – Describe two things data cleaning does, and explain why cleaning matters before training.
Explain (4 marks) – Explain the difference between overfitting and underfitting, using how each performs on training and unseen data.
Explain (4 marks) – Explain how linear regression predicts a continuous value, referring to fitting a line and reducing error.

Extension

Explain (6 marks) – Explain why accuracy alone can be misleading, and how precision and recall give a fuller picture, using an original imbalanced example.
Describe (6 marks) – Describe the roles of convolutional, pooling, and fully connected layers in a CNN, and how they build up a spatial hierarchy of features.

Challenge

Discuss (8 marks) – Two models are proposed for diagnosing a rare disease from scans: one with higher overall accuracy, one with higher recall. Discuss which you would choose and why, referring to the cost of false negatives and the limits of accuracy. Reach a reasoned conclusion. (Write in prose.)

Connections

Prerequisite: What Machine Learning Is – the vocabulary and types this page builds on
Prerequisite: Types of Learning – the algorithms that get trained and evaluated here
Related: Ethics of Machine Learning – biased data and misleading metrics are where fairness fails
Related: Testing Strategy – evaluating a model is the machine-learning form of testing
