What is machine learning operations (MLOps)?

Answering the 7 most frequently asked questions about MLOps

  • Article
  • Data Engineering
Philip Roeleveld
Machine Learning Engineer
7 min
19 Dec 2023

Bringing machine learning models to production has proven to be a complex task in practice. MLOps helps organisations that want to develop and maintain models themselves to ensure quality and continuity. Read this article and get answers to the most frequently asked questions on this topic.

What is the definition of machine learning operations?

Machine learning operations (MLOps) covers everything involved in the development and deployment of machine learning models, apart from the development of the models themselves. It is about facilitating the lifecycle of machine learning models. This includes helping Data Scientists and developers access their data and setting up environments where they can run experiments, allowing them to focus fully on developing their models. Various frameworks have been devised to describe that lifecycle, and although they emphasise different aspects, they generally agree on the main principles.

Machine learning operations is a relatively new term. It has emerged because deploying machine learning models in a way that is robust and scalable has proven to be a complex challenge in practice. This is because you are not only dealing with code (DevOps); data also plays a significant role. When deploying machine learning, both aspects must be considered: the code, as in any other branch of software development, and the data, because even when the code stays the same, a model must be re-checked when the data changes.
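The data side can be guarded with validation checks that run on every new batch, even when the code has not changed. A minimal sketch of such a check, where the column names and the null-rate threshold are illustrative assumptions rather than universal values:

```python
def validate_batch(rows, expected_columns, max_null_fraction=0.05):
    """Reject a data batch if its schema or null rate deviates,
    even though the model code itself is unchanged."""
    if not rows:
        return False, "empty batch"
    for col in expected_columns:
        if col not in rows[0]:
            return False, f"missing column: {col}"
    for col in expected_columns:
        nulls = sum(1 for r in rows if r.get(col) is None)
        if nulls / len(rows) > max_null_fraction:
            return False, f"too many nulls in {col}"
    return True, "ok"

# Half the temperature readings are missing, so this batch is rejected.
batch = [{"temperature": 21.5, "pressure": 1.01},
         {"temperature": None, "pressure": 0.99}]
ok, reason = validate_batch(batch, ["temperature", "pressure"])
```

In a real pipeline a check like this would run as a gate before training or inference, so data problems surface before they reach the model.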

When is machine learning operations relevant for an organisation?

MLOps becomes relevant right from the first model. During development, it ensures that the right data is available, provisional results are recorded, and models can train on suitable hardware. This significantly saves time and reduces manual work for Data Scientists developing the model. When that model is ready for production, MLOps continues to be valuable by automating the process and significantly reducing the risks of quality issues through monitoring.

Thus, MLOps can help save costs and time during the development of machine learning models by accelerating development, running models more efficiently, scaling better, and reducing risks.

The impact of MLOps is greatest when you are faced with significant fluctuations in the statistical properties of your data. In such cases you first need to detect those changes, followed by retraining an ML model to understand the altered data. By implementing MLOps best practices, you can stay ahead of these risks, and streamline and automate the retraining process.
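Detecting such shifts need not be complicated: comparing summary statistics of live data against the training data is often a reasonable first step. A sketch, where the z-score threshold is an illustrative assumption:

```python
import statistics

def drifted(train_values, live_values, z_threshold=3.0):
    """Flag drift when the live mean moves more than z_threshold
    standard errors away from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    se = sigma / (len(live_values) ** 0.5)
    z = abs(statistics.mean(live_values) - mu) / se
    return z > z_threshold

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
assert not drifted(train, [10.1, 10.3, 9.9])   # live data looks like training data
assert drifted(train, [25.0, 26.0, 24.5])      # clear shift: trigger retraining
```

A drift flag like this is typically what kicks off the automated retraining pipeline mentioned above.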

MLOps is also essential for scalability. If an organisation has many different machine learning models, it is no longer feasible to manually track which models perform well or need retraining. Within MLOps, there are tools and methods to continuously monitor all these models, as is done in Internet of Things (IoT) applications, for example. Additionally, there are opportunities to maintain various pipelines or infrastructure centrally, rather than for each model individually, resulting in significant time savings.

How would you apply MLOps in a company, and which tools and technologies would you use?

MLOps is not an all-or-nothing solution. The extent to which you apply it depends on the quantity and type of models in place. For the adoption of MLOps, discussions between the MLOps Engineer and the people actually creating the models are crucial to determine what the needs are. For instance, if three Data Scientists are each creating different models in their own way, a central environment is necessary for standardisation. If end-users struggle to understand a model's output, centralising data storage enables visualisation through dashboards.

An indispensable tool for an MLOps Engineer is MLflow, used for model tracking. When a Data Scientist develops a new model, experimentation often begins with a subset of the data. With MLflow you can save the model and all relevant context. The lineage of the model remains known, and deploying the model becomes easier on platforms like Databricks, Azure ML Studio, and AWS SageMaker, all of which are compatible with MLflow.
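MLflow provides this tracking out of the box; the toy class below is not MLflow's API but only illustrates the kind of context worth recording per training run (the field names are illustrative):

```python
import time

class RunTracker:
    """Toy stand-in for an experiment tracker: records params,
    metrics, and data lineage for every training run."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, data_version):
        run = {
            "run_id": len(self.runs) + 1,
            "timestamp": time.time(),
            "params": params,              # e.g. hyperparameters
            "metrics": metrics,            # e.g. validation scores
            "data_version": data_version,  # which data subset was used
        }
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric):
        """Find the run with the highest value for a metric."""
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = RunTracker()
tracker.log_run({"max_depth": 3}, {"auc": 0.81}, "sample_10pct")
tracker.log_run({"max_depth": 5}, {"auc": 0.86}, "sample_10pct")
best = tracker.best_run("auc")   # the run with max_depth=5
```

Because every run records which data it saw, the lineage question ("what was this model trained on?") can always be answered later.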

Further needs depend on the scale. Besides MLflow, the next step often involves setting up CI/CD (continuous integration, continuous delivery) pipelines to simplify training. Deployment pipelines are also built to bring models to production.

One of our clients had numerous small sensors for monitoring and maintaining water pipes. There was a separate model for each sensor, resulting in thousands of small models. Manually checking that all these models were performing adequately was not feasible. Instead, we utilised model and data quality monitoring to automatically accept models after training, or reject them and fall back on a 'backup' algorithm without machine learning. When dealing with a large number of models in production, it is not only important to scale infrastructure. It is also imperative to keep monitoring the output, verified based on business logic.
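The accept-or-fallback step from the sensor case above can be sketched as a simple gate after each training run. This is not the client's actual code; the metric names and tolerance are illustrative assumptions:

```python
def simple_baseline(history):
    """Non-ML fallback: predict the mean of recent observations."""
    return sum(history) / len(history)

def promote_or_fallback(candidate_error, baseline_error, tolerance=0.95):
    """Accept the retrained model only if it beats the non-ML
    baseline by a margin; otherwise keep the backup algorithm."""
    if candidate_error < baseline_error * tolerance:
        return "accept_model"
    return "use_baseline"

# A clear improvement over the baseline gets promoted...
assert promote_or_fallback(candidate_error=0.8, baseline_error=1.0) == "accept_model"
# ...a marginal one does not, and the backup algorithm stays active.
assert promote_or_fallback(candidate_error=0.99, baseline_error=1.0) == "use_baseline"
```

Run automatically for thousands of models, a gate like this replaces the manual checking that was not feasible at that scale.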

Who is involved in machine learning operations projects?

Machine learning operations projects are typically carried out by a team of IT specialists. An MLOps Engineer, or ML Engineer, specialises in deploying machine learning models to production and setting up monitoring for running models. Additionally, roles such as Architects, DevOps Engineers, Platform Engineers, and Data Engineers are often involved. An Architect designs technical solutions, for MLOps typically in the cloud. The other mentioned profiles are developers, each with a different focus. Not all of these roles are necessary in every setup. Their inclusion depends on the team, environment, technical choices, and more.

In broad terms, two approaches are possible for these roles. A team can be formed with both model developers and MLOps engineers, or a specialised MLOps team can be established to support various other teams of Data Scientists. The latter is more common in larger organisations. In any case, the MLOps Engineer or MLOps team collaborates closely with the Data Scientists responsible for model development. The Data Scientist is the only profile occupied with modeling the data.

Once machine learning models are successfully deployed to production, the output can be consumed by other teams within an organisation. For example, a web team to display recommendations in an online store, or a service team for predictive maintenance on hardware. These teams are involved as stakeholders because they have a significant interest not only in the performance of the models but also in ensuring that the models remain available in production.

From a product management perspective, a key value of MLOps is to save on both cost and time in the long run. As such, the business is also involved by investing in MLOps, which pays off by accelerating model development, optimising production, and scaling the management of machine learning products.

What tips do you have for managing machine learning pipelines?

1. Ensure clarity on who is responsible for the models and their management (governance). Can someone from the team simply replace a model in production with a new version, or is approval required? And if yes, who should give it? In practice, a four-eyes principle is almost always applied, and there are Product Owners for the models. Additionally, it is of course essential to keep the users of the model informed.

2. Keep track of different versions of your model and their performance (versioning). Is a new version not performing up to par? Then you can revert to the previous one. 

3. Store extensive metrics, logs, graphs, and other artifacts for every single model version during training. This gives you confidence that a new version is working well, or lets you quickly identify any issues.

4. Utilise a data catalog. A data catalog allows you to reference data as assets. For an ML model, this enables specifying which data it needs and keeping track of what data it was trained on. This also allows for automatic loading of data, eliminating the need to implement this in each model separately.

5. Continued dialogue with stakeholders is paramount. Just because a model is in production doesn't mean the work is done. New data sources may become available, or effects of a model might not be apparent from monitoring alone. By keeping communication open, you can identify and address such changes or opportunities.
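Tips 2 and 3 combine naturally: keep every version together with its metrics, so a rollback is one lookup away. A minimal sketch, not tied to any specific registry product:

```python
class ModelRegistry:
    """Minimal version store: each version keeps its metrics,
    so a bad release can be rolled back immediately."""
    def __init__(self):
        self.versions = []       # registered versions, in order
        self.production = None   # version number currently serving

    def register(self, metrics):
        version = len(self.versions) + 1
        self.versions.append({"version": version, "metrics": metrics})
        return version

    def promote(self, version):
        self.production = version

    def rollback(self):
        """Revert production to the previously registered version."""
        if self.production and self.production > 1:
            self.production -= 1
        return self.production

registry = ModelRegistry()
v1 = registry.register({"rmse": 0.20})
v2 = registry.register({"rmse": 0.35})   # performs worse than v1
registry.promote(v2)
registry.rollback()                      # production serves v1 again
```

Real registries (such as the one built into MLflow) add artifact storage and stage labels on top, but the core idea is the same: versions plus metrics make reverting safe and fast.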

How do you address ethical issues that may arise in the application of machine learning?

In practice, this usually concerns models making autonomous decisions that impact people. In such situations, the machine should never determine anything without human intervention. However, this is often still not enough because people can quickly shift from a deciding role, where the model is just one factor, to a checking role where the model is leading. In such cases, it is essential for the operation of the model to be transparent, allowing conclusions to be explained and justified.

One solution for this is the application of explainable AI. In this approach, the computer explains how the output is calculated in a human-understandable way. Depending on the type of model this can be very challenging, as larger models, especially deep learning models, often become black boxes during training.

Machine learning operations can facilitate explainable AI by effectively storing the history of a model. The predictions of a model depend not only on the input data but also on the data on which the model was trained. With MLOps, you know which version of the model was used to generate the output, and in which period it was trained on what data. Through proper documentation, you can explain why a model behaves a certain way. For example, consider a model trained during a heatwave or a strike: With MLOps, you can determine that the normal data pattern deviates due to these circumstances. Subsequently, you can decide to exclude that data.
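Storing lineage makes the heatwave example above answerable: for any prediction, you can look up which model version produced it and what period it was trained on. A sketch with illustrative, made-up fields:

```python
# Hypothetical lineage records, as an experiment tracker might store them.
lineage = {
    "v7": {"trained_on": ("2023-06-01", "2023-06-30"),
           "notes": "includes heatwave period; temperatures atypical"},
    "v8": {"trained_on": ("2023-07-01", "2023-07-31"),
           "notes": "heatwave data excluded after review"},
}

def explain(model_version):
    """Trace a prediction back to its training context."""
    info = lineage[model_version]
    start, end = info["trained_on"]
    return (f"Model {model_version} was trained on data "
            f"from {start} to {end}. Note: {info['notes']}")

print(explain("v7"))
```

With a record like this, the decision to exclude the anomalous data and retrain becomes a documented, explainable step rather than a guess.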

What are common pitfalls when deploying machine learning to production?

1. Data quality is an important factor in the success of a machine learning product. Is this not yet sufficient? Then it’s a good idea to direct focus there first.

2. Support from management. This is particularly relevant for larger organisations. Due to dependencies on other teams, data sources, and systems, gaining support is essential. Is the source data ready to connect? What does the collaboration look like with infrastructure and platform teams? Dependencies between teams need to be managed, and a top-down vision is crucial. Consider also who the end-user will be and what their roadmap looks like.

3. Time commitment from the users. When initially delivering a product, you will enter a testing phase where it is important to set up a feedback loop. Particularly in the initial phase, it is essential to have time for this back-and-forth. A good way to obtain this commitment is by making the user a co-owner of the project, so they share responsibility for the project's success.

4. Expecting the initial version of your model to perform perfectly. Successful machine learning products often require fine-tuning through several iterations. User feedback is crucial, and it can only be incorporated from the first iteration onwards.

5. Access to the correct data. This is again a concern chiefly for enterprise organisations. It closely relates to the maturity of the internal data and IT landscape. For machine learning, having access to high-quality data is essential. DTAP environments (development, test, acceptance, production) are traditionally "hard" separated: in a testing environment there is only test data. However, this is unworkable for machine learning. The testing environment for your model needs to tap into production data.

6. Creating a machine learning model without a baseline. If there is no pre-established approach, it becomes unclear what needs improvement and whether an expensive machine learning model is worth further development. A baseline could be a business rule, a simple forecast, or an estimate of operational costs without the model.
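Pitfall 6 can be made concrete: before investing further in a model, compute the error of a trivial baseline and require the model to beat it. A sketch using mean absolute error, with made-up numbers for illustration:

```python
def mae(predictions, actuals):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)

actuals = [100, 120, 90, 110]

# Baseline: a naive "same as last period" forecast (a simple business rule).
baseline_preds = [105, 100, 120, 90]

# Candidate ML model output (illustrative numbers).
model_preds = [98, 118, 93, 108]

baseline_error = mae(baseline_preds, actuals)   # 18.75
model_error = mae(model_preds, actuals)         # 2.25
worth_developing = model_error < baseline_error
```

If the model cannot clearly beat the baseline, the honest conclusion may be that the business rule is good enough, which is itself a valuable outcome.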

Getting started with machine learning operations? We can assist you!

Our team of 25+ Data Engineers consists of IT specialists with expertise in areas such as MLOps, DevOps, data warehousing, and infrastructure. Together, we guide you from model development to infrastructure building, and from deployment to maintenance and monitoring.

This is Philip

Philip is a Machine Learning Engineer with experience in Data Engineering and Data Science. He most enjoys working at the intersection where models, data, and infrastructure come together to make Machine Learning possible. He has worked with data and modeling across various sectors, with time series data being the throughline.

Philip Roeleveld
Machine Learning Engineer
philip.roeleveld@digital-power.com
