Designing value-adding ML systems

A step-by-step guide from business need to production

George Pavlidis

Data Engineer

9 min

29 Apr 2026

Machine learning (ML) is often treated as a modelling exercise. Pick an algorithm, train it, evaluate the metrics, deploy. In reality, the algorithm is one of the least important decisions you’ll make.

This article is for you if you're a Data Scientist, ML Engineer, or Analytics Lead who has built models that perform well in testing but has struggled to turn them into real-world results. If you want to understand the challenges of an end-to-end ML project and how to overcome them in production, this guide gives you the structure to get from business question to working system.

What separates ML projects that deliver business value from those that stall in a notebook is system design: the structured process of translating a business need into a production-ready decision system. The difference becomes clear when you compare the two side by side:

[Figure: The differences between ML modelling and ML system design]

What is ML system design?

ML system design is the discipline of translating a business question into a production-ready system that reliably supports decision-making using data and machine learning. It covers everything from problem framing and data strategy to model development, evaluation, and operational integration.

This article in short

This article walks through the six stages of ML system design: business translation, problem framing, data decisions, model selection, evaluation, and business integration. At each step, we use a SaaS customer churn case to show what these decisions look like in practice.

Step 1: Translate the business question

Goal: Define what decision will change, under which constraints, and how success is measured in business terms.

The first step is also the most underestimated. Before any data work begins, you need to understand what the client actually needs, not what they literally asked for. Clients speak in outcomes (“reduce churn”), not in ML terms (“train a binary classifier with a 30-day prediction window”). Your job is to bridge that gap.

Clarify the desired outcome. What decision will be made differently because this model exists? Clients rarely want a score. They want to know who to contact, when, and with what offer.

Identify operational constraints. How many customers can the team realistically reach? What’s the budget per intervention? What’s the minimum lead time?

Define success metrics. Model accuracy is almost never the right success metric. Business KPIs like revenue retained, churn rate reduction, and campaign ROI are what matter.

Specify actionability. Every prediction must map to a concrete intervention. If there’s no action tied to a prediction, the model is an expensive dashboard decoration.

💡 Churn example

The client asks: “Can we reduce churn?” The translated requirement: identify which customers are likely to cancel within 30 days so the retention team can reach out before it happens. The team can contact 500 customers per week at €10 per intervention, and outreach must happen at least 7 days before the renewal date. Success means monthly churn dropping from 5% to 3%.
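The constraints in this example already imply some simple arithmetic, sketched below. The capacity and cost per intervention come from the example. The save rate and the value of a retained customer are hypothetical placeholders, not figures from the case, and the function names are illustrative.

```python
def weekly_outreach_cost(capacity: int, cost_per_contact: float) -> float:
    """Total weekly spend if the team uses its full contact capacity."""
    return capacity * cost_per_contact


def break_even_precision(cost_per_contact: float,
                         save_rate: float,
                         value_per_save: float) -> float:
    """Minimum share of contacted customers who must be true churners
    for the expected retained value to cover the outreach cost.

    Expected value per contact = precision * save_rate * value_per_save - cost,
    so break-even precision = cost / (save_rate * value_per_save).
    """
    return cost_per_contact / (save_rate * value_per_save)


# From the example: 500 contacts per week at EUR 10 each.
cost = weekly_outreach_cost(capacity=500, cost_per_contact=10.0)

# Hypothetical assumptions: 20% of contacted churners are saved,
# and each saved customer is worth EUR 400.
min_precision = break_even_precision(10.0, save_rate=0.20, value_per_save=400.0)

print(f"Weekly outreach cost: EUR {cost:.0f}")
print(f"Break-even precision: {min_precision:.1%}")
```

Under these assumed numbers, the model only has to be right about one in eight flagged customers for outreach to pay for itself. This is the kind of business-level target that is worth agreeing on before any modelling starts.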

Step 2: Frame the ML problem

Goal: Turn the business need into a precise prediction target with a defined horizon, scope, and integration path.

Once you understand the business need, you translate it into a formal ML problem. This means making explicit decisions that are often left vague, and that’s where projects go wrong.

Define the prediction target. What exactly are you predicting? The label must have a precise, unambiguous definition. Does it include partial disengagement, or only complete exits?

Choose the prediction horizon. How far ahead does the model need to predict? Too short and there’s no time to intervene. Too long and predictions become unreliable.

Set the model scope. Which entities does the model cover? Not every segment deserves the same model. High-value accounts may need different treatment than self-serve users.

Ensure integration alignment. How will predictions be consumed? If the output needs to feed a CRM workflow, that shapes everything from output format to latency requirements.

💡 Churn example

Churn is defined as account cancellation within 30 days. Downgrades are excluded, and trial users are modelled separately. The model covers paying subscribers with 3+ months of tenure. Output integrates directly into the CRM to trigger the retention workflow automatically.
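A label definition like this is precise enough to write down as code. The sketch below is one possible reading of it, not the project's actual implementation: the `Account` fields are hypothetical, and "3+ months of tenure" is approximated as 90 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

HORIZON_DAYS = 30     # prediction horizon from the example
MIN_TENURE_DAYS = 90  # "3+ months of tenure", approximated as 90 days (assumption)


@dataclass
class Account:
    # Hypothetical schema, for illustration only.
    signup_date: date
    is_trial: bool
    cancelled_on: Optional[date] = None  # None if the account never cancelled


def in_scope(acc: Account, snapshot: date) -> bool:
    """Paying subscribers with enough tenure who are still active at the snapshot."""
    active = acc.cancelled_on is None or acc.cancelled_on > snapshot
    tenure_ok = (snapshot - acc.signup_date).days >= MIN_TENURE_DAYS
    return active and tenure_ok and not acc.is_trial


def churn_label(acc: Account, snapshot: date) -> Optional[int]:
    """1 if the account cancels within the horizon after the snapshot, else 0.

    Returns None for out-of-scope accounts. A downgrade does not set
    cancelled_on, so downgraded accounts are labelled 0 by construction.
    """
    if not in_scope(acc, snapshot):
        return None
    deadline = snapshot + timedelta(days=HORIZON_DAYS)
    return int(acc.cancelled_on is not None and acc.cancelled_on <= deadline)


snapshot = date(2026, 1, 1)
print(churn_label(Account(date(2025, 1, 1), False, date(2026, 1, 20)), snapshot))
```

Writing the label as a function of a snapshot date forces the horizon, scope, and exclusions to be explicit, and it makes the same definition reusable for training data and for later outcome tracking.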

Step 3: Make your data decisions

Goal: Identify the right sources, engineer features with domain knowledge, and guard against quality pitfalls that silently break models.

Data decisions form the backbone of any ML system. Start by mapping which internal systems (CRM, billing, support logs, product analytics) can provide predictive power. For each source, assess availability, freshness, and reliability.

Raw data rarely has predictive power on its own. Effective features emerge when you combine data with domain understanding. Three categories tend to matter most: rolling aggregates that capture behaviour over time windows (30, 60, 90 days), trend and slope features that reveal whether engagement is increasing or declining, and recency indicators that measure how recently key interactions occurred.
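The three feature categories can be sketched in a few lines of plain Python. This is a minimal illustration under assumed inputs (a list of event dates and a short series of periodic values per customer). The function names are hypothetical, and a real pipeline would more likely compute these in SQL or a dataframe library.

```python
from datetime import date
from typing import List, Optional


def rolling_count(event_dates: List[date], snapshot: date, window_days: int) -> int:
    """Rolling aggregate: events in the window_days before the snapshot."""
    return sum(0 < (snapshot - d).days <= window_days for d in event_dates)


def slope(values: List[float]) -> float:
    """Trend: least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    cov = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    var = sum((i - x_mean) ** 2 for i in range(n))
    return cov / var


def days_since_last(event_dates: List[date], snapshot: date) -> Optional[int]:
    """Recency: days between the snapshot and the latest earlier event."""
    past = [d for d in event_dates if d < snapshot]
    return (snapshot - max(past)).days if past else None


snapshot = date(2026, 3, 1)
logins = [date(2026, 2, 27), date(2026, 2, 10), date(2025, 12, 15)]

features = {
    "logins_30d": rolling_count(logins, snapshot, 30),
    "logins_90d": rolling_count(logins, snapshot, 90),
    "weekly_logins_slope": slope([10, 8, 6, 4]),  # hypothetical weekly counts
    "days_since_last_login": days_since_last(logins, snapshot),
}
print(features)
```

Every function only looks at events strictly before the snapshot date. Computing features as of a fixed point in time is a simple habit that helps keep information from the future out of the training data.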

Data quality is one of the most underestimated sources of ML project failure. Watch for data leakage (features computed after the label event), survivorship bias (if departed users are removed from your data, the model only learns from those who stayed), class imbalance (a naive model predicting “no” every time can reach 95%+ accuracy while being useless), and poor label quality (inconsistencies and incorrect measurements in historical data).

💡 Churn example

Data comes from four categories: behavioural (login frequency, product usage, support tickets), relational (customer tenure, NPS scores), financial (balance trends, fee complaints), and demographic (segment, acquisition channel). Key engineered features include the 30-day balance slope, login frequency trend, and days since last meaningful interaction. A feature like ‘account closed date’ is excluded to prevent leakage.

With only 5% of customers actually churning, the dataset is heavily imbalanced. A model trained without addressing this would simply predict "no churn" for everyone and achieve 95% accuracy while missing every customer that actually leaves.
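The effect is easy to reproduce. The sketch below uses a synthetic population of 1,000 customers with the 5% churn rate from the example and a "model" that always predicts no churn. The data is made up purely for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of actual churners (label 1) that the model flags."""
    positives = sum(y_true)
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return true_pos / positives if positives else 0.0


# Synthetic data: 50 churners (1) and 950 non-churners (0).
y_true = [1] * 50 + [0] * 950
# A naive "model" that predicts no churn for everyone.
y_pred = [0] * len(y_true)

naive_accuracy = accuracy(y_true, y_pred)
naive_recall = recall(y_true, y_pred)
print(f"accuracy={naive_accuracy:.2f}, recall={naive_recall:.2f}")
```

Accuracy looks excellent while recall is zero, which is why accuracy alone cannot be the success metric for a rare event.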

To counter this, the team applies rebalancing techniques such as oversampling the minority class (e.g. SMOTE to generate synthetic churn examples), undersampling the majority class, and assigning higher weights to churned customers during training so the model pays more attention to these cases.
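Of these techniques, class weighting is the simplest to illustrate. The sketch below computes inverse-frequency weights with the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for `class_weight="balanced"`. Whether the team used exactly this weighting is an assumption, and the helper name is hypothetical. SMOTE and undersampling are not shown here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Synthetic labels with the 95/5 split from the example.
labels = [1] * 50 + [0] * 950
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, the 50 churners carry the same total weight in the loss as the 950 non-churners, so a misclassified churner costs the model roughly 19 times more than a misclassified non-churner.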

After training, they also lower the classification threshold from the default 0.5 to 0.35, meaning customers with even a moderate risk score are flagged for outreach. This increases recall (catching more true churners) at the cost of slightly more false alerts, a trade-off the retention team is comfortable with since the cost of a missed churner far exceeds the cost of an unnecessary call.
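The trade-off can be made concrete with a toy example. The ten scores and labels below are invented for illustration. Only the two thresholds (0.5 and 0.35) come from the example.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and t == 1 for f, t in zip(flagged, y_true))
    fp = sum(f and t == 0 for f, t in zip(flagged, y_true))
    fn = sum((not f) and t == 1 for f, t in zip(flagged, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model scores: the first four customers actually churned.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.80, 0.55, 0.40, 0.20, 0.45, 0.30, 0.25, 0.10, 0.05, 0.02]

p_default, r_default = precision_recall(y_true, scores, threshold=0.50)
p_lowered, r_lowered = precision_recall(y_true, scores, threshold=0.35)

print(f"threshold 0.50: precision={p_default:.2f}, recall={r_default:.2f}")
print(f"threshold 0.35: precision={p_lowered:.2f}, recall={r_lowered:.2f}")
```

On this toy data, lowering the threshold catches one more churner (recall rises from 0.50 to 0.75) at the price of one unnecessary call (precision drops from 1.00 to 0.75). In practice the threshold would be chosen on a validation set, with the team's weekly contact capacity in mind.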

Step 4: Select and train your model

Goal: Find the model that best balances performance, interpretability, and maintenance cost for your specific context.

Model selection is a trade-off exercise, not a competition. A reliable approach follows three phases:

Establish a baseline. Start with a simple, interpretable model like logistic regression. It gives you a performance floor and initial insight into feature importance. If the baseline already meets the business need, you may not need anything more complex.

Explore complexity. Gradient boosting models (LightGBM, XGBoost) tend to perform well on tabular data and handle common data imperfections such as missing values, noisy features, and imbalanced classes through built-in parameters.

Tune deliberately. Use Bayesian hyperparameter search on stratified k-fold cross-validation, optimised for a metric that reflects the business objective (for example F1-score or AUC-ROC).
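Stratification matters here because of the class imbalance: with a plain random split, some folds could end up with very few churners. The sketch below shows the idea from scratch, assuming a simple round-robin assignment per class. In practice you would normally rely on a library implementation such as scikit-learn's `StratifiedKFold`. The function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, keeping class shares similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class across the folds in turn.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Synthetic labels with the 95/5 split from the churn example.
labels = [1] * 50 + [0] * 950
folds = stratified_folds(labels, k=5)

for i, fold in enumerate(folds):
    churners = sum(labels[idx] for idx in fold)
    print(f"fold {i}: {len(fold)} samples, {churners} churners")
```

Each fold then serves once as the validation set while the remaining folds are used for training, so every hyperparameter candidate is scored on data with a realistic churn rate.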

The critical question is not “which model scores highest?” but “does the marginal improvement justify the added complexity and maintenance burden?” A model that nobody can explain or maintain delivers no lasting value.

💡 Churn example

The baseline logistic regression reveals which features have predictive power. LightGBM handles the 95/5 class imbalance and outperforms the baseline significantly. Bayesian tuning, optimised for F1-score, squeezes out the final gains. The team decides that LightGBM offers the right trade-off: strong performance with manageable complexity.

Step 5: Evaluate, offline and online

Goal: Validate that the model works statistically and delivers measurable business impact in the real world.

Evaluation is where many ML projects create a false sense of confidence. A model can look excellent in a notebook and fail completely in production.

Choose metrics that reflect the business objective, not just statistical performance. When the target event is rare, accuracy is misleading. Metrics like precision, recall, F1-score, and AUC-ROC give a more honest picture. Explainability matters too: SHAP values allow you to show, per individual prediction, which factors were decisive. This builds trust with the client and helps catch unexpected model behaviour before deployment.

Offline metrics tell you whether the model is statistically sound. Online evaluation, typically A/B testing, tells you whether it actually works in the real world. Always relate model performance back to business KPIs. A 3% improvement in AUC means nothing to a CFO. A €200K increase in retained revenue does.

💡 Churn example

Because false negatives (missed churners) are more costly than false positives, the team optimises for recall without sacrificing too much precision. SHAP analysis reveals that balance decline slope and days since last login are the strongest drivers. An A/B test over 8 weeks shows the intervention group churns 35% less than the control group, validating real-world impact.
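A result like this is only convincing if it is unlikely to be noise. One common check is a two-proportion z-test, sketched below. The group sizes and churn counts are hypothetical and chosen only to reproduce a 35% relative reduction. The article does not state the real sample sizes or which test was used.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions (pooled variance)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 4,000 customers per group.
treated_churned, treated_n = 130, 4000   # 3.25% churn with outreach
control_churned, control_n = 200, 4000   # 5.00% churn without outreach

relative_reduction = 1 - (treated_churned / treated_n) / (control_churned / control_n)
z, p_value = two_proportion_z_test(treated_churned, treated_n,
                                   control_churned, control_n)

print(f"relative reduction: {relative_reduction:.0%}")
print(f"z = {z:.2f}, p = {p_value:.5f}")
```

With groups of this assumed size, the difference would be very unlikely to arise by chance. With a few hundred customers per group, the same 35% reduction could easily be noise, which is why the test duration and group sizes should be planned before the experiment starts.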

Step 6: Integrate, monitor, and maintain

Goal: Embed predictions into workflows, close the feedback loop, and detect drift before it erodes value.

A model that is not embedded in business workflows delivers nothing. Predictions must land where decisions are made, whether that’s a CRM, a marketing automation platform, or an operational dashboard. Define exactly what actions are triggered by which predictions and under what conditions.
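Such a rule can be written down explicitly. The sketch below reuses the constraints from the churn example (500 contacts per week, at least 7 days of lead time, a 0.35 risk threshold). The `Prediction` fields and the selection logic are assumptions for illustration, not the actual CRM integration.

```python
from dataclasses import dataclass
from typing import List

WEEKLY_CAPACITY = 500  # contacts per week, from the churn example
MIN_LEAD_DAYS = 7      # minimum days before renewal, from the churn example
RISK_THRESHOLD = 0.35  # classification threshold, from the churn example


@dataclass
class Prediction:
    # Hypothetical record shape for one scored customer.
    customer_id: str
    risk_score: float
    days_to_renewal: int


def select_for_outreach(predictions: List[Prediction],
                        capacity: int = WEEKLY_CAPACITY) -> List[Prediction]:
    """Return the highest-risk customers who can still be reached in time."""
    eligible = [
        p for p in predictions
        if p.risk_score >= RISK_THRESHOLD and p.days_to_renewal >= MIN_LEAD_DAYS
    ]
    eligible.sort(key=lambda p: p.risk_score, reverse=True)
    return eligible[:capacity]


batch = [
    Prediction("a", 0.90, 10),
    Prediction("b", 0.80, 3),   # too close to renewal to intervene
    Prediction("c", 0.40, 30),
    Prediction("d", 0.20, 30),  # below the risk threshold
]
print([p.customer_id for p in select_for_outreach(batch)])
```

Keeping this logic outside the model makes it easy to change capacity or lead time without retraining, and it documents exactly which prediction triggers which action.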

Once in production, feedback loops become essential. No amount of offline testing can guarantee real-world predictive value. Capturing actual outcomes (did the customer churn or not?) and comparing them against predictions is what allows you to measure true model performance, identify weaknesses, and continuously fine-tune the system.

ML systems are not static. Over time, the patterns a model learned can become outdated, a phenomenon broadly known as drift: data drift when the inputs change, concept drift when the relationship between inputs and outcome changes. Customer behaviour changes, new products launch, markets shift, and the data the model encounters in production gradually diverges from the data it was trained on. Without continuous monitoring, model performance degrades silently. Track prediction distributions, feature distributions, and business KPIs over time to catch these shifts early and trigger retraining before the damage shows up in business results.
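One widely used way to track a feature distribution is the Population Stability Index (PSI), which compares the share of values per bin between training data and recent production data. The sketch below is a minimal version under assumed bin edges and made-up login counts. It is one possible monitoring metric, not necessarily the one used in this project.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two samples of one feature.

    bin_edges are the upper bounds of all bins except the last one.
    """
    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges the value exceeds.
            counts[sum(v > edge for edge in bin_edges)] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical weekly login counts per customer.
training_logins = [1, 2, 3, 4, 5, 6, 7, 8]
production_logins = [1, 1, 2, 2, 3, 3, 4, 6]  # e.g. after a product redesign
edges = [2, 5]  # bins: <=2, 3-5, >5

no_drift = psi(training_logins, training_logins, edges)
drift = psi(training_logins, production_logins, edges)
print(f"PSI (same data): {no_drift:.3f}")
print(f"PSI (shifted):   {drift:.3f}")
```

A PSI of zero means identical distributions, and larger values mean a larger shift. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but any such cut-off is an assumption that should be validated against your own data before it triggers retraining.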

Finally, communicate transparently with the client: business teams need to understand what model outputs mean, how reliable they are, and where limitations exist.

💡 Churn example

The churn model integrates directly into the CRM. Each week, high-risk customers automatically enter the retention workflow: the account manager receives a task with the risk score, contributing factors, and a pre-approved discount offer. After three months, the team detects a drift in login patterns due to a product redesign and retrains the model with updated features.

The bigger picture

ML system design is not a technical exercise; it is a structured decision-making discipline. At every stage there are trade-offs: between simplicity and complexity, between accuracy and interpretability, between speed and robustness.

A question we often hear is: what is the difference between an ML model and an ML system? An ML model produces predictions. An ML system includes the business logic, data pipelines, integrations, monitoring, and feedback loops that turn those predictions into better decisions. The model is one component. The system is what delivers value.

This also explains why most ML projects fail to deliver impact. It is rarely because the algorithm was wrong. It is because predictions were never aligned to real business actions, constraints, and incentives. A model optimised for AUC in a notebook is not the same as a system that reduces churn in production.

The churn example illustrates this clearly: the model itself is almost incidental. What makes the system work is the precise problem definition, the carefully engineered features, the business-aligned evaluation, and the tight integration into the retention workflow. Take away any of those, and the algorithm alone delivers nothing.

This article was written by George Pavlidis

George has worked as a Data Engineer at Digital Power and has experience across Data Science, Machine Learning, and Data Engineering. He has contributed to projects in finance, renewable energy, and manufacturing, focusing on translating business challenges into scalable, production-ready solutions.
