Evaluating ML Models Like a QA Engineer (Not a Data Scientist)

A practical, QA‑friendly guide to metrics, bias vs variance, and business impact

I’m Hema Nambiradje, a Senior Quality Engineer who loves digging into problems, improving systems, and helping teams ship reliable, user‑focused products. I care a lot about clean processes, thoughtful testing, and building things that actually hold up in the real world. I’m always exploring new tools, learning something nerdy, and sharing what I discover along the way.

One of the most important lessons I learned today is that building a machine learning model is only half the job. The real challenge is knowing how well the model performs, whether it generalizes, and whether it actually helps the business.

As a QA/SDET, this felt very familiar — just like testing software, ML models need structured evaluation at every stage.


Training, Validation, and Test Sets

Machine learning models are evaluated using three different datasets, each with a distinct purpose.

Training Set

  • Used to train the model

  • Model learns patterns from this data

Validation Set

  • Used to tune hyperparameters

  • Helps detect overfitting early

Test Set

  • Used only once, at the end

  • Gives an unbiased evaluation of model performance

QA analogy:
Training = writing code
Validation = internal testing
Test set = final regression before release
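To make the three-way split concrete, here is a minimal sketch in plain Python. The function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are my own assumptions for illustration; real projects often use a library helper and different ratios.

```python
import random


def split_dataset(rows, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test lists.

    The fractions are illustrative defaults, not a recommendation.
    Whatever is left after train and validation becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")

    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # held back until the very end
    return train, val, test


rows = list(range(100))                    # stand-in for 100 labeled examples
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

The key property to check, QA style, is that no example appears in more than one set. Leakage between sets makes the final test result look better than it really is.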


Overfitting, Underfitting, and Balanced Models

A good model should generalize well — not memorize data.

Underfitting

  • Model is too simple

  • Misses important patterns

  • High bias, low variance

Overfitting

  • Model memorizes training data

  • Performs poorly on new data

  • Low bias, high variance

Balanced Model

  • Learns meaningful patterns

  • Performs well on unseen data

  • Low bias and low variance


Bias and Variance (Simple Explanation)

  • Bias → error due to wrong assumptions

  • Variance → error due to sensitivity to data changes

A good model balances both.

Model Type   | Bias | Variance
Underfitting | High | Low
Overfitting  | Low  | High
Balanced     | Low  | Low
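In practice, a common way to spot these three cases is to compare the error on the training set with the error on the validation set. The sketch below is a rough heuristic only: the function name `diagnose_fit` and both thresholds are hypothetical values I picked for illustration, and sensible cutoffs depend on the problem.

```python
def diagnose_fit(train_error, val_error, high_error=0.20, max_gap=0.05):
    """Label a model from its training and validation error rates.

    high_error and max_gap are illustrative thresholds, not standard values.
    """
    if train_error > high_error:
        # Struggles even on data it has seen: a sign of high bias.
        return "underfitting"
    if val_error - train_error > max_gap:
        # Good on seen data, clearly worse on unseen data: a sign of high variance.
        return "overfitting"
    return "balanced"


print(diagnose_fit(train_error=0.30, val_error=0.32))  # underfitting
print(diagnose_fit(train_error=0.02, val_error=0.25))  # overfitting
print(diagnose_fit(train_error=0.08, val_error=0.10))  # balanced
```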

Classification vs Regression

Classification

  • Output is a category

  • Examples:

    • Spam vs Not Spam

    • Pass vs Fail

    • Defect vs No Defect

Regression

  • Output is a number

  • Examples:

    • Predict test execution time

    • Predict defect count

    • Predict system load


Classification Metrics (With Simple Example)

Let’s say we are predicting Defect vs No Defect.

Actual \ Predicted | Defect  | No Defect
Defect             | 40 (TP) | 10 (FN)
No Defect          | 5 (FP)  | 45 (TN)

Accuracy

How often the model is correct


Precision

Of all predicted defects, how many were actually defects

Useful when false positives are costly


Recall

Of all actual defects, how many were detected

Useful when missing defects is risky
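Here is a small sketch that applies the standard definitions of accuracy, precision, and recall to the Defect vs No Defect counts above. The function name `classification_metrics` is my own; the counts come straight from the confusion matrix.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Avoid a ZeroDivisionError when a metric is undefined,
        # for example precision when nothing was predicted as a defect.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),  # correct / all predictions
        "precision": ratio(tp, tp + fp),                # real defects / predicted defects
        "recall": ratio(tp, tp + fn),                   # found defects / actual defects
    }


metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.850
# precision: 0.889
# recall: 0.800
```

So the model is right 85% of the time, about 89% of its defect flags are real, and it catches 80% of the actual defects. The 10 missed defects (FN) are what recall exposes.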


Regression Metrics

Mean Squared Error (MSE)

Measures the average squared difference between predicted and actual values.

  • Lower MSE = better model

  • Penalizes large errors heavily
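A quick sketch shows why large errors are penalized so heavily. The execution times below are made-up numbers for illustration. Both sets of predictions are off by 4 minutes in total, but one spreads the error evenly and the other puts it all in a single miss.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [10, 12, 15, 20]          # hypothetical test execution times (minutes)
small_misses = [11, 11, 16, 19]    # off by 1 minute on every prediction
one_big_miss = [10, 12, 15, 24]    # perfect three times, off by 4 minutes once

print(mean_squared_error(actual, small_misses))  # 1.0
print(mean_squared_error(actual, one_big_miss))  # 4.0
```

The total absolute error is the same in both cases, yet the single 4-minute miss gives an MSE four times higher because the error is squared before averaging.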


R‑Squared (R²)

Measures how well the model explains variance in data.

  • R² = 1 → perfect model

  • R² = 0 → no better than always predicting the mean of the actual values

  • R² < 0 → worse than that mean baseline

Easy way to explain model effectiveness to business stakeholders
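The three R² cases can be reproduced with a few lines of Python using the usual definition, 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. The data values and the name `r_squared` are illustrative assumptions.

```python
def r_squared(actual, predicted):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


actual = [10, 12, 15, 20]                  # hypothetical values, mean is 14.25

perfect = list(actual)                     # predicts every value exactly
mean_baseline = [14.25] * len(actual)      # always predicts the mean
reversed_guess = [20, 15, 12, 10]          # gets the trend backwards

print(r_squared(actual, perfect))          # 1.0
print(r_squared(actual, mean_baseline))    # 0.0
print(r_squared(actual, reversed_guess))   # negative: worse than the mean baseline
```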


Business Metrics: Does the Model Help the Business?

Technical accuracy alone is not enough.
Models must be evaluated against business goals.

A/B Testing

  • Compare Model A vs Model B

  • Measure business impact (speed, cost, quality)
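As a rough sketch of what "measure business impact" can look like, the snippet below compares one business metric across two model variants. Everything here is hypothetical: the metric (minutes spent triaging defects per build), the sample values, and the function name `compare_variants`. A real A/B test would also need enough samples and a significance check before drawing conclusions.

```python
from statistics import mean


def compare_variants(metric_a, metric_b):
    """Summarize a business metric for Model A vs Model B.

    relative_change is (B - A) / A, so a negative value means B's
    average is lower than A's.
    """
    mean_a = mean(metric_a)
    mean_b = mean(metric_b)
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "relative_change": (mean_b - mean_a) / mean_a,
    }


# Hypothetical triage time in minutes per build under each model.
triage_minutes_a = [30, 28, 35, 31]
triage_minutes_b = [25, 27, 24, 26]

result = compare_variants(triage_minutes_a, triage_minutes_b)
print(f"A: {result['mean_a']} min, B: {result['mean_b']} min, "
      f"change: {result['relative_change']:.1%}")
```

For a metric like triage time, lower is better, so a negative change favors Model B. For a metric like defects caught, the reading would be reversed.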


Canary Deployment

  • Release model to a small group

  • Monitor performance and risks

  • Gradually roll out to all users

Very similar to progressive rollout testing in QA
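The canary steps above can be sketched as a simple gate: widen the rollout only while the monitored error rate stays under a limit, otherwise roll back. The traffic percentages, the 2% error limit, and the name `next_rollout_step` are all assumptions for illustration, not values from any particular deployment tool.

```python
ROLLOUT_STEPS = [5, 25, 50, 100]   # hypothetical percent of users at each stage


def next_rollout_step(current_percent, observed_error_rate,
                      max_error_rate=0.02, steps=ROLLOUT_STEPS):
    """Return the next traffic percentage, or 0 to signal a rollback."""
    if observed_error_rate > max_error_rate:
        return 0                    # monitoring failed the gate: roll back
    for step in steps:
        if step > current_percent:
            return step             # healthy: move to the next larger stage
    return current_percent          # already at full rollout


print(next_rollout_step(5, observed_error_rate=0.01))    # 25
print(next_rollout_step(25, observed_error_rate=0.05))   # 0
print(next_rollout_step(100, observed_error_rate=0.00))  # 100
```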


Key Takeaways

  • ML models need structured evaluation, just like software

  • Balanced models have low bias and low variance

  • Classification and regression use different metrics

  • Accuracy alone can be misleading, especially with imbalanced classes, so precision and recall matter too

  • Regression models use MSE and R²

  • Business success is validated using A/B testing and canary deployments

  • QA mindset fits naturally into ML evaluation


Final Thoughts

Learning how to evaluate ML models made me realize that quality engineering principles apply perfectly to AI systems. We may be testing models instead of features, but the goal remains the same — build reliable, trustworthy systems that deliver real value.

See you in the next learning update
Hema
