Model serving is a critical component of AI applications: responding to a user request with an inference from a trained model. Anyone who has worked on enterprise-grade machine learning applications knows that it is usually not one model providing inferences, but hundreds or even thousands of models running in tandem. Serving at that scale is computationally expensive because you cannot spin up a dedicated container for every request. This is a challenge for developers deploying large numbers of models across Kubernetes clusters, which impose limits such as the maximum number of pods and IP addresses as well as compute resource allocation.
IBM solved this challenge with its proprietary ModelMesh model-serving management layer for Watson products such as Watson Assistant, Watson Natural Language Understanding, and Watson Discovery. Since these models have been running in production environments for several years, ModelMesh has been thoroughly tested across a variety of scenarios. Now, IBM is contributing this management layer to open source, complete with controller components and model-serving runtimes.
ModelMesh enables developers to deploy AI models on Kubernetes at "extreme scale". It provides cache management and acts as a router that balances inference requests. Models are intelligently placed in pods and are resilient to temporary outages. ModelMesh deployments can be upgraded easily without any external orchestration mechanism: ModelMesh automatically ensures that a model has been fully updated and loaded before routing new requests to it.
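From a client's perspective, runtimes managed by ModelMesh serve requests that follow the KServe V2 inference protocol: a JSON body naming the input tensors, their shapes, datatypes, and data. The sketch below builds such a payload; the tensor name, datatype, and endpoint path in the comment are illustrative assumptions, not details from the article.

```python
import json

def build_v2_inference_request(inputs):
    """Build a KServe V2 inference protocol request body.

    `inputs` maps tensor names to flat lists of float values; the
    one-dimensional shape derived below is an illustrative assumption.
    """
    return {
        "inputs": [
            {
                "name": name,
                "shape": [len(values)],
                "datatype": "FP32",
                "data": values,
            }
            for name, values in inputs.items()
        ]
    }

# Hypothetical tensor name; a client would POST this body to the
# serving endpoint, e.g. /v2/models/<model-id>/infer, and ModelMesh
# routes the request to a pod where that model is loaded.
body = build_v2_inference_request({"input-0": [0.1, 0.2, 0.3]})
print(json.dumps(body))
```

Because routing is keyed on the model identifier in the request, the client does not need to know which pod currently holds the model in memory.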
Illustrating the scalability of ModelMesh with some statistics, IBM said:
One ModelMesh instance deployed on a single worker node 8vCPU x 64G cluster was able to pack 20K simple-string models. On top of the density test, we also load tested ModelMesh serving by sending thousands of concurrent inference requests to simulate a high-traffic holiday season scenario in which all loaded models respond with single-digit millisecond latency. Our experiment showed that the single worker node supports 20K models at up to 1,000 queries per second and responds to inference requests with single-digit millisecond latency.
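To put the density figure in perspective, packing 20,000 models onto one 64 GB worker leaves only a few megabytes of memory per model on average, which is why ModelMesh's cache management and intelligent placement matter. A back-of-the-envelope sketch (the 64 GB and 20,000 figures come from the quote above; the fraction reserved for the OS and serving runtimes is an illustrative assumption):

```python
def avg_memory_budget_mb(total_gb, num_models, reserved_fraction=0.25):
    """Average per-model memory budget on one node, after reserving a
    fraction for the OS, runtimes, and ModelMesh itself (the 25%
    reserve is an assumption, not a figure from IBM)."""
    usable_mb = total_gb * 1024 * (1 - reserved_fraction)
    return usable_mb / num_models

# 64 GB node and 20,000 models, per IBM's density test
print(round(avg_memory_budget_mb(64, 20_000), 2))  # → 2.46
```

A budget that tight only works for small models or when a cache keeps hot models in memory and loads cold ones from storage on demand, which is exactly the role the "puller" logic described below plays.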
IBM has contributed ModelMesh to the KServe GitHub organization, which was founded jointly by IBM, Google, Bloomberg, NVIDIA, and Seldon back in 2019. You can check out the ModelMesh implementation in the GitHub repositories mentioned below:
- modelmesh-serving - the model-serving controller
- modelmesh - the ModelMesh containers used for orchestrating model placement and routing
- modelmesh-runtime-adapter - the containers that run in each model-serving pod and act as an intermediary between ModelMesh and third-party model-server containers. It also incorporates the "puller" logic responsible for retrieving models from storage
- triton-inference-server - NVIDIA's Triton Inference Server
- seldon-mlserver - Seldon's Python MLServer, which is part of KFServing