MLOps, short for Machine Learning Operations, is a set of practices and tools aimed at addressing the specific needs of engineers building models and moving them into production. Some organizations start off with a few homegrown tools that version datasets after each experiment and checkpoint models after every epoch of training. Others adopt a formal tool that offers experiment tracking, collaboration features, model serving capabilities, and even pipeline features for processing data and training models.
To make the best choice for your organization, you should understand all the capabilities available from the leading MLOps tools in the industry. If you go the homegrown route, you should understand the capabilities you are giving up. A homegrown approach is fine for small teams that need to move quickly and may not have time to evaluate a new tool. If you choose to implement a third-party tool, then you will need to pick the tool that best matches your organization's engineering workflow. This could be tricky because the top tools today vary significantly in their approach and capabilities. Regardless of your choice, you will need data infrastructure that can handle large volumes of data and serve training sets in a performant manner. Checkpointing models and versioning large datasets require scalable capacity, and if you are using expensive GPUs, you will need performant infrastructure to get the most out of your investment.
In this post, I will present a feature list that architects should consider regardless of the approach or tooling they choose. This feature list comes from my research and experiments with three of the top MLOps vendors today - Kubeflow, MLflow, and MLRun. For organizations that choose to start with a homegrown solution, I will present a data infrastructure that can scale and perform. (Spoiler alert - all you need here is MinIO.) For organizations that choose to adopt MLOps tooling, I will present a pattern I have noticed across the vendors I have researched and tie it back to our Modern Datalake Reference Architecture.
Before diving into features and infrastructure requirements, let’s better understand the importance of MLOps. To do this, it is helpful to compare model creation to conventional application development.
Conventional application development, like implementing a new microservice that adds a new feature to an application, starts with reviewing a specification. New data structures or changes to existing data structures are designed first. The design of the data should not change once coding begins. The service is then implemented, and coding is the main activity in this process. Unit tests and end-to-end tests are also coded. These tests prove that the code is not faulty and correctly implements the specification. They can be run automatically by a CI/CD pipeline before deploying the entire application.
Creating a model and training it is different. The first step is understanding the raw data and the prediction that is needed. ML engineers do have to write some code to implement their neural networks or set up an algorithm, but coding is not the dominant activity. The main activity is repeated experimentation. During experimentation, the design of the data, the design of the model, and the parameters used will all change. After every experiment, metrics are generated that show how the model performed as it trained, as well as how it performed against a validation set and a test set. These metrics are used to prove the quality of the model. You should save the model after every experiment, and every time you change your datasets, you should version them as well. Once a model is ready to be incorporated into an application, it must be packaged and deployed.
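To make this workflow concrete, below is a minimal sketch of an experiment loop in PyTorch. The model, data, and file names are all placeholders for illustration; the point is the rhythm of experimentation - train, record metrics, checkpoint, repeat.

```python
import json
import torch
from torch import nn

# Toy stand-ins for a real dataset and model; everything here is illustrative.
features = torch.randn(256, 10)
labels = torch.randn(256, 1)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

metrics = []
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()

    # Record per-epoch metrics so this experiment can be compared to others.
    metrics.append({"epoch": epoch, "train_loss": loss.item()})

    # Checkpoint the model after every epoch - a habit worth keeping
    # even before you adopt a formal MLOps tool.
    torch.save(
        {"epoch": epoch, "model_state_dict": model.state_dict()},
        f"checkpoint_epoch_{epoch}.pt",
    )

# Persist the metrics alongside the checkpoints.
with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```

In practice, the checkpoints and metrics would be written to shared storage rather than local disk so that experiments can be compared and reproduced across the team - which is exactly where the data infrastructure discussed below comes in.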
To summarize, MLOps is to machine learning what DevOps is to traditional software development. Both are sets of practices and principles aimed at improving collaboration between engineering teams (the Dev or ML) and IT operations (Ops) teams. The goal is to streamline the development lifecycle, from planning and development to deployment and operations, using automation. One of the primary benefits of these approaches is continuous improvement.
Let’s go a little deeper into MLOps and look at specific features to consider.
Experiment tracking and collaboration are the features most associated with MLOps, but modern MLOps tools can do much more. For example, some can provide a runtime environment for your experiments. Others can package and deploy models once they are ready to be integrated into an application.
Below is a superset of features found in MLOps tools today. This list also includes other things to consider, such as support and data integration.
After looking at the differences between traditional application development and machine learning, it should be clear that to be successful with machine learning, you need some form of MLOps and a data infrastructure that delivers both performance and scalable capacity.
Homegrown solutions are fine if you need to start a project quickly and do not have time to evaluate a formal MLOps tool. If you take this approach, the good news is that all you need for your data infrastructure is MinIO. MinIO is S3 compatible, so if you started with another tool and used an S3 interface to access your datasets, your code will just work. If you are starting from scratch, you can use our Python SDK, which is also S3 compatible. Consider using the enterprise version of MinIO, which has caching capabilities that can greatly speed up data access for training sets. Check out The Real Reasons Why AI is Built on Object Storage, where we dive into how and why MinIO is used to support MLOps. Organizations that choose a homegrown solution should still familiarize themselves with the ten features described above. You may eventually outgrow your homegrown solution, and the most efficient way forward is to adopt an MLOps tool.
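As a sketch of what the homegrown approach looks like, the snippet below uses the MinIO Python SDK to version a dataset and a checkpoint under experiment-specific prefixes. The endpoint, credentials, bucket, and object names are all placeholders - substitute your own.

```python
from minio import Minio

# Connect to a MinIO deployment. Endpoint and credentials are placeholders.
client = Minio(
    "minio.example.com:9000",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=True,
)

bucket = "mlops"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Version a dataset by writing it under an experiment-specific prefix ...
client.fput_object(bucket, "datasets/experiment-42/train.parquet", "train.parquet")

# ... and do the same for the checkpoint produced by each epoch.
client.fput_object(
    bucket, "checkpoints/experiment-42/epoch-4.pt", "checkpoint_epoch_4.pt"
)
```

Because the prefix encodes the experiment, every dataset version and checkpoint remains addressable later - a simple convention that buys you much of what formal experiment tracking provides.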
Adopting a third-party MLOps tool is the best way to go for large organizations with several AI/ML teams creating models of different types. The MLOps tool with the most features is not necessarily the best tool. Look at the features above and make note of the features you need, the features you already have as part of your existing CI/CD pipeline, and the features you do not want - this will help you find the best fit. MLOps tools have a voracious appetite for object storage, often petabytes of it: many of them automatically version your datasets with each experiment and automatically checkpoint your models after each epoch. Here again, MinIO can help, since capacity is not a problem. As with the homegrown solution, consider using the enterprise edition of MinIO. Its caching features work automatically once configured for a bucket, so even though the MLOps tool never requests the cache, MinIO will automatically cache frequently accessed objects, such as a training set.
Many of the MLOps tools on the market today use an open-source relational database to store the structured data generated during model training, usually metrics and hyperparameters. Unfortunately, this will be a new database that your organization has to support. Additionally, if an organization is moving toward a Modern Datalake (or Data Lakehouse), an additional relational database is not needed. It would be nice for the major MLOps vendors to consider using an open table format (OTF) based data warehouse to store their structured data.
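As a speculative sketch of what that could look like, the snippet below appends a row of experiment metrics to an Apache Iceberg table using PyIceberg, with MinIO as the object store. The catalog configuration, endpoints, and table name are all assumptions, not something today's MLOps tools actually do.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Hypothetical catalog configuration: a REST catalog backed by MinIO.
# Every endpoint, credential, and name here is an assumption.
catalog = load_catalog(
    "default",
    **{
        "uri": "http://iceberg-rest.example.com:8181",
        "s3.endpoint": "http://minio.example.com:9000",
        "s3.access-key-id": "YOUR_ACCESS_KEY",
        "s3.secret-access-key": "YOUR_SECRET_KEY",
    },
)

# Append one row of experiment metrics to an existing Iceberg table
# (assumed to already exist with a matching schema).
table = catalog.load_table("mlops.experiment_metrics")
table.append(
    pa.table(
        {
            "experiment_id": ["experiment-42"],
            "epoch": [4],
            "train_loss": [0.031],
            "learning_rate": [0.01],
        }
    )
)
```

Stored this way, metrics and hyperparameters would live in the same queryable warehouse as the rest of the organization's data, with no extra relational database to operate.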
All the major MLOps vendors use MinIO under the hood to store unstructured data. Unfortunately, this is generally deployed as a separate small instance installed as part of the overall MLOps tool installation. Additionally, it is usually an older version of MinIO, which goes against our ethos of always running the latest and greatest. For existing MinIO customers, it would be nice to allow the MLOps tool to use a bucket within an existing installation. For customers new to MinIO, the MLOps tool should support the latest version of MinIO. Once installed, MinIO can also be used beyond MLOps, anywhere within your organization that the strengths of object storage are needed.
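Where a tool already speaks S3, this is possible today. MLflow, for example, can point its artifact storage at any S3-compatible endpoint via the MLFLOW_S3_ENDPOINT_URL environment variable. Below is a minimal sketch; all endpoints, credentials, and names are placeholders.

```python
import os
import mlflow

# Point MLflow's artifact storage at an existing MinIO deployment instead of
# a small bundled instance. Endpoint and credentials are placeholders.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio.example.com:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_ACCESS_KEY"
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_SECRET_KEY"

# Assumes a tracking server started with an S3 artifact root, e.g.:
#   mlflow server --backend-store-uri sqlite:///mlflow.db \
#                 --default-artifact-root s3://mlflow-artifacts/
mlflow.set_tracking_uri("http://mlflow.example.com:5000")

with mlflow.start_run():
    mlflow.log_metric("train_loss", 0.031)
    mlflow.log_artifact("checkpoint_epoch_4.pt")  # lands in the MinIO bucket
```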
In this post, I presented an architect's guide to MLOps by investigating both MLOps features and the data infrastructure needed to support these features. At a high level, organizations can build a homegrown solution, or they can deploy a third-party solution. Regardless of the direction chosen, it is important to understand all the features available in the industry today. Homegrown solutions allow you to start a project quickly, but you may soon outgrow your solution. It is also important to understand your specific needs and how MLOps will work with an existing CI/CD pipeline. Many MLOps tools are feature-rich and contain features that you may never use or that you already have as part of your CI/CD pipeline.
To successfully implement MLOps, you need a data infrastructure that can support it. In this post, I presented a simple solution for those who choose a homegrown solution and described what to expect from third-party tools and the resources they require.
I concluded with a wish list for further development of MLOps tools that would help them to better integrate with the Modern Datalake.
For more information on using the Modern Datalake to support AI/ML workloads, check out AI/ML Within A Modern Datalake.
If you have any questions, be sure to reach out to us on Slack!