Accuracy Is Lying to You: How to Evaluate ML Models Correctly

One of the most important lessons I learned today is that building a machine learning model is only half the job. The real challenge is knowing how well the model performs, whether it generalizes to new data, and whether it actually helps the business.

As a QA/SDET, I found this very familiar: just like software, ML models need structured evaluation at every stage.

Training, Validation, and Test Sets

Machine learning models are built and evaluated using three separate splits of the data, each with a distinct purpose.

Training Set

Used to train the model
Model learns patterns from this data

Validation Set

Used to tune hyperparameters
Helps detect overfitting early

Test Set

Used only once, at the end
Gives an unbiased evaluation of model performance

QA analogy:
Training = writing code
Validation = internal testing
Test set = final regression before release
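
The three-way split above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the function name `split_dataset`, the 70/15/15 ratios, and the fixed seed are my own assumptions.

```python
import random

def split_dataset(rows, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle rows and cut them into train, validation, and test lists.

    The fractions and seed are illustrative assumptions, not fixed rules.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # everything left over is held out
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The key property is that the three lists never overlap, so the test set stays unseen until the final check.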

Overfitting, Underfitting, and Balanced Models

A good model should generalize well — not memorize data.

Underfitting

Model is too simple
Misses important patterns
High bias, low variance

Overfitting

Model memorizes training data
Performs poorly on new data
Low bias, high variance

Balanced Model

Learns meaningful patterns
Performs well on unseen data
Low bias and low variance

Bias and Variance (Simple Explanation)

Bias → error from overly simple or wrong assumptions in the model
Variance → error from sensitivity to small changes in the training data

A good model balances both.

Model Type	Bias	Variance
Underfitting	High	Low
Overfitting	Low	High
Balanced	Low	Low
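
One rough way to tell the three cases in the table apart is to compare the error on the training set with the error on the validation set. The sketch below assumes error rates between 0.0 and 1.0. The function name `diagnose_fit` and both thresholds are hypothetical values I picked for illustration, not standard cutoffs.

```python
def diagnose_fit(train_error, val_error, high_error=0.30, max_gap=0.10):
    """Label a model from its training and validation error rates.

    high_error and max_gap are illustrative assumptions; tune them per project.
    """
    if train_error >= high_error:
        # The model cannot fit even the data it was trained on: high bias.
        return "underfitting"
    if val_error - train_error > max_gap:
        # Good on training data, much worse on unseen data: high variance.
        return "overfitting"
    return "balanced"

print(diagnose_fit(train_error=0.35, val_error=0.37))  # underfitting
print(diagnose_fit(train_error=0.02, val_error=0.25))  # overfitting
print(diagnose_fit(train_error=0.08, val_error=0.10))  # balanced
```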

Classification vs Regression

Classification

Output is a category
Examples:
- Spam vs Not Spam
- Pass vs Fail
- Defect vs No Defect

Regression

Output is a number
Examples:
- Predict test execution time
- Predict defect count
- Predict system load

Classification Metrics (With Simple Example)

Let’s say we are predicting Defect vs No Defect.

Actual \ Predicted	Defect	No Defect
Defect	40 (TP)	10 (FN)
No Defect	5 (FP)	45 (TN)

Accuracy

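How often the model is correct: (TP + TN) / all predictions

On imbalanced data, accuracy can look high even when the model misses most defects, which is why it should not be read alone.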

Precision

Of all predicted defects, how many were actually defects: TP / (TP + FP)

Useful when false positives are costly

Recall

Of all actual defects, how many were detected: TP / (TP + FN)

Useful when missing defects is risky
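
Here is a small sketch that works the three metrics out for the Defect vs No Defect table above. The helper name `classification_metrics` is my own, and returning 0.0 when a denominator is zero is an assumed convention. The second call uses hypothetical imbalanced numbers (5 defects in 100 items, model always predicts No Defect) to show how accuracy can mislead.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Assumed convention: report 0.0 when the metric is undefined (0 / 0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Counts from the table in this post.
acc, prec, rec = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
# accuracy=0.85 precision=0.89 recall=0.80

# Hypothetical imbalanced case: the model never predicts "Defect".
acc, prec, rec = classification_metrics(tp=0, fn=5, fp=0, tn=95)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
# accuracy=0.95 precision=0.00 recall=0.00
```

In the second case the model looks 95% accurate while catching zero defects.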

Regression Metrics

Mean Squared Error (MSE)

Measures average squared difference between predicted and actual values.

Lower MSE = better model
Penalizes large errors heavily
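
A minimal sketch of MSE in plain Python, using hypothetical test execution times in minutes. Both sets of predictions are off by 4 minutes in total, but the one with a single large miss gets a much higher MSE.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [10.0, 12.0, 15.0, 20.0]          # hypothetical execution times
small_misses = [11.0, 11.0, 16.0, 19.0]    # off by 1 minute on every item
one_big_miss = [10.0, 12.0, 15.0, 24.0]    # off by 4 minutes on one item

print(mean_squared_error(actual, small_misses))  # 1.0
print(mean_squared_error(actual, one_big_miss))  # 4.0
```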

R‑Squared (R²)

Measures how well the model explains variance in data.

R² = 1 → perfect model
R² = 0 → no better than always predicting the average of the actual values
R² < 0 → worse than that average baseline

Easy way to explain model effectiveness to business stakeholders
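
The three R² cases above can be reproduced with a short sketch. It uses the usual definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The sample numbers are hypothetical.

```python
def r_squared(actual, predicted):
    """R² = 1 - SS_res / SS_tot, where SS_tot is measured around the mean."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot

actual = [2.0, 4.0, 6.0, 8.0]                    # hypothetical values, mean is 5.0
print(r_squared(actual, actual))                 # 1.0  (perfect predictions)
print(r_squared(actual, [5.0, 5.0, 5.0, 5.0]))   # 0.0  (always predicts the mean)
print(r_squared(actual, [8.0, 6.0, 4.0, 2.0]))   # -3.0 (worse than the mean)
```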

Business Metrics: Does the Model Help the Business?

Technical accuracy alone is not enough.
Models must be evaluated against business goals.

A/B Testing

Compare Model A vs Model B
Measure business impact (speed, cost, quality)
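
This post does not prescribe how to compare the two models. One common choice, which I am assuming here, is a pooled two-proportion z-test on a success rate, such as the share of defects caught. The function name and the counts below are hypothetical.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Return (rate_a, rate_b, z) for a pooled two-proportion z-test."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("cannot compare: pooled rate is exactly 0 or 1")
    return rate_a, rate_b, (rate_b - rate_a) / std_err

# Hypothetical experiment: each model handled 1000 items.
rate_a, rate_b, z = two_proportion_z(400, 1000, 460, 1000)
print(f"A={rate_a:.2f} B={rate_b:.2f} z={z:.2f}")

# |z| above roughly 1.96 corresponds to significance at the 5% level.
print("B differs from A" if abs(z) > 1.96 else "no clear difference")
```

The same comparison applies to any business metric that can be counted as a success or failure per item. Continuous metrics such as cost or speed would need a different test.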

Canary Deployment

Release model to a small group
Monitor performance and risks
Gradually roll out to all users

Very similar to progressive rollout testing in QA
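
The three canary steps can be sketched as two small functions: one that routes a stable slice of users to the new model, and one that decides whether to widen the rollout. This is an illustration under my own assumptions. The names, the stage percentages, and the error-rate tolerance are all hypothetical.

```python
import hashlib

def in_canary(user_id, percent):
    """Deterministically place a user in the canary group for a rollout percent."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100      # stable bucket from 0 to 99 per user
    return bucket < percent

def next_rollout_percent(current_percent, canary_error_rate, baseline_error_rate,
                         tolerance=0.02, stages=(5, 25, 50, 100)):
    """Move to the next stage, or roll back to 0 if the canary looks worse."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0                        # roll back: canary exceeds the tolerance
    for stage in stages:
        if stage > current_percent:
            return stage
    return current_percent              # already fully rolled out

print(next_rollout_percent(5, canary_error_rate=0.03, baseline_error_rate=0.03))   # 25
print(next_rollout_percent(25, canary_error_rate=0.10, baseline_error_rate=0.03))  # 0
```

Hashing the user ID keeps each user in the same bucket across requests, so a user who saw the new model at 5% still sees it at 25%.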

Key Takeaways

ML models need structured evaluation, just like software
Balanced models have low bias and low variance
Classification and regression use different metrics
Accuracy alone can be misleading, so precision and recall matter
Regression models are evaluated with MSE and R²
Business success is validated using A/B testing and canary deployments
QA mindset fits naturally into ML evaluation

Final Thoughts

Learning how to evaluate ML models made me realize that quality engineering principles apply perfectly to AI systems. We may be testing models instead of features, but the goal remains the same — build reliable, trustworthy systems that deliver real value.

See you in the next learning update
— Hema
