
Model Serving: KServe

https://github.com/kserve/kserve/blob/master/install/v0.9.0/kserve-runtimes.yaml

Different serving images are provided for different types of ML frameworks.

Architecture

Control Plane

Responsible for reconciling the InferenceService custom resource.

Does the IngressClass control routing via headers?

Serverless mode: creates Knative serverless deployments for the predictor, transformer, and explainer to enable autoscaling.

Raw deployment mode: creates a Kubernetes Deployment, Service, Ingress, and HPA.

Components:

  • KServe Controller: responsible for creating the service, ingress resources, the model server container, and the model agent container.
  • Ingress Gateway: the gateway that routes external/internal requests.

In Serverless mode, additionally:

  • Knative Service Controller
  • Knative Activator
  • Knative Autoscaler


Data Plane

Components:

  • Component: each endpoint is composed of multiple components: "predictor", "explainer", and "transformer".
  • Predictor: a model server exposed at a network endpoint.
  • Explainer: provides model explanations in addition to predictions.
  • Transformer: defines pre- and post-processing steps around the prediction and explanation workflows.

Prediction v1 protocol

API        Verb   Path                              Payload
Readiness  GET    /v1/models/<model_name>           Response: {"name": <model_name>, "ready": true/false}
Predict    POST   /v1/models/<model_name>:predict   Request: {"instances": []}   Response: {"predictions": []}
Explain    POST   /v1/models/<model_name>:explain   Request: {"instances": []}   Response: {"predictions": [], "explanations": []}

Prediction v2 protocol

The Predict Protocol, version 2, is a set of HTTP/REST and gRPC APIs for inference/prediction servers.

Health:

GET v2/health/live

GET v2/health/ready

GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready

Server Metadata:

GET v2

Model Metadata:

GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]

Inference:

POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
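
For example, an inference request under the v2 protocol looks roughly like this (the tensor name, shape, and data below are placeholder values, and ${INGRESS_HOST}/${INGRESS_PORT} are assumed to point at the cluster's ingress):

curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer \
  -d '{"inputs": [{"name": "input-0", "shape": [1, 4], "datatype": "FP32", "data": [6.8, 2.8, 4.8, 1.4]}]}'

The response carries a matching "outputs" array of tensors.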

Installation

K8s deployment

Istio is not required in RawDeployment mode:

  • Routing is controlled through managed Ingress resources, so deploying Ingress-Nginx is sufficient.

Create an IngressClass

https://kserve.github.io/website/0.10/admin/kubernetes_deployment/

Taking Nginx as an example, see https://kubernetes.github.io/ingress-nginx/deploy/#quick-start; note the version compatibility with Kubernetes.

  • Install ingress-nginx:
helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace
  • ingress-nginx ships with an IngressClass (no need to create a new one):
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
    app.kubernetes.io/version: 1.5.1
  name: nginx
spec:
  controller: k8s.io/ingress-nginx

Install cert-manager (YAML)

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml

Install KServe (YAML)

kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.10.0/kserve.yaml

kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.10.0/kserve-runtimes.yaml

Modify inferenceservice-config

Set the deployment mode to RawDeployment:

kubectl patch configmap/inferenceservice-config -n kserve --type=strategic -p '{"data": {"deploy": "{\"defaultDeploymentMode\": \"RawDeployment\"}"}}'

Set ingressClassName to the IngressClass created above:

ingress: |-
{
    "ingressClassName" : "your-ingress-class"
}
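
The same change can be applied with kubectl patch. Note that a strategic merge replaces the whole ingress value, so any other ingress fields you rely on must be included too; a sketch, with "nginx" standing in for your IngressClass name:

kubectl patch configmap/inferenceservice-config -n kserve --type=strategic -p '{"data": {"ingress": "{\"ingressClassName\": \"nginx\"}"}}'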

Usage

Supported models

https://kserve.github.io/website/0.10/modelserving/v1beta1/serving_runtime/

Model Serving Runtime | Exported model | HTTP | gRPC | Default Serving Runtime Version | Supported Framework (Major) Version(s) | Examples
Custom ModelServer | -- | v1, v2 | v2 | -- | -- | Custom Model
LightGBM MLServer | Saved LightGBM Model | v2 | v2 | v1.0.0 (MLServer) | 3 | LightGBM Iris V2
LightGBM ModelServer | Saved LightGBM Model | v1 | -- | v0.10 (KServe) | 3 | LightGBM Iris
MLFlow ModelServer | Saved MLFlow Model | v2 | v2 | v1.0.0 (MLServer) | 1 | MLFlow wine-classifier
PMML ModelServer | PMML | v1 | -- | v0.10 (KServe) | 3, 4 (PMML4.4.1) | SKLearn PMML
SKLearn MLServer | Pickled Model | v2 | v2 | v1.0.0 (MLServer) | 1 | SKLearn Iris V2
SKLearn ModelServer | Pickled Model | v1 | -- | v0.10 (KServe) | 1 | SKLearn Iris
TFServing | TensorFlow SavedModel | v1 | *tensorflow | 2.6.2 (TFServing Versions) | 2 | TensorFlow flower
TorchServe | Eager Model/TorchScript | v1, v2, *torchserve | *torchserve | 0.7.0 (TorchServe) | 1 | TorchServe mnist
Triton Inference Server | TensorFlow, TorchScript, ONNX | v2 | v2 | 21.09-py3 (Triton) | 8 (TensorRT), 1, 2 (TensorFlow), 1 (PyTorch), 2 (Triton) Compatibility Matrix | TorchScript cifar
XGBoost MLServer | Saved Model | v2 | v2 | v1.0.0 (MLServer) | 1 | XGBoost Iris V2
XGBoost ModelServer | Saved Model | v1 | -- | v0.10 (KServe) | 1 | XGBoost Iris

Single-model deployment

Model serving runtimes: a separate CRD is provided for each framework (sklearn, tensorflow, pytorch, mlflow), each running its own serving image;

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flower-sample"
  namespace: default
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "s3://kfserving-examples/models/tensorflow/flowers"
      # override the serving runtime image version
      runtimeVersion: 2.7.1

After kubectl apply, check the status with kubectl get isvc flower-sample:

$ kubectl get isvc flower-sample
NAME            URL                                        READY   PREV   LATEST   PREVROLLEDOUTREVISION        LATESTREADYREVISION                     AGE
flower-sample   http://flower-sample.default.example.com   True           100       

Calling the API

MODEL_NAME=flower-sample
INPUT_PATH=@./input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
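# INGRESS_HOST and INGRESS_PORT are not set by the commands above; with ingress-nginx
# they can be derived from the controller Service (an assumption -- adjust to your cluster):
INGRESS_HOST=$(kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=80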

curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d $INPUT_PATH

The input payload format is:

{
    "instances": [
        {
            "image_bytes": {
                "b64": ...
            },
            "key": "1"
        }
    ]
}

Multi-model deployment

The default mode follows the 'one model, one server' paradigm; with a large number of models, the following limits appear:

  • Compute resource limitation: CPU/GPU resource limits.
  • Maximum pods limitation: the kubelet caps the number of pods per node (110 by default); exceeding 100 is not recommended.
  • Maximum IP address limitation: each pod in an InferenceService needs an independent IP.

Binary data

Examples

Basics

Create a PV and PVC for the model

apiVersion: v1
kind: PersistentVolume
metadata:
  name: kserve-demo-pv
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/home/liuzhiqiang/tensorflow-serving/serving/tensorflow_serving/servables/tensorflow/testdata"
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - ${NODE_NAME} 
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kserve-demo-pvc
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

Create the InferenceService

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "tensorflow-kserve"
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "pvc://kserve-demo-pvc/saved_model_half_plus_two_cpu"

Check the deployment status:

$ kubectl get isvc tensorflow-kserve
NAME                URL                                             READY   ...   AGE
tensorflow-kserve   http://tensorflow-kserve-default.example.com    True          2m15s

Ingress configuration

Annotations on the InferenceService are propagated to the generated Ingress, so adding annotations to the InferenceService is how you configure routing behavior of the Nginx Ingress Controller.
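
For example, a sketch that passes an ingress-nginx annotation through the InferenceService (the proxy-body-size annotation is illustrative, not from the original notes):

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flower-sample"
  annotations:
    # copied onto the generated Ingress, raising the request size limit
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "s3://kfserving-examples/models/tensorflow/flowers"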

Model storage

S3, PVC, and URI storage are supported.

A mutating webhook modifies the pod, injecting the initContainer kserve/storage-initializer.

  • Model data is passed from the initContainer to the serving container via an emptyDir volume.
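
For S3 storage, credentials are typically supplied through a Secret referenced by a ServiceAccount; a sketch following the pattern in the KServe docs (the names s3creds and sa, and the endpoint, are placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: s3creds
  annotations:
    serving.kserve.io/s3-endpoint: s3.amazonaws.com
    serving.kserve.io/s3-region: us-east-1
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: <access-key>
  AWS_SECRET_ACCESS_KEY: <secret-key>
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa
secrets:
- name: s3creds

The InferenceService then references it via spec.predictor.serviceAccountName: sa.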

Transformers

An InferenceService component that performs pre/post-processing alongside model inference.

  • the transformer service calls the predictor service

Performs pre-processing/post-processing on the input data.

kserve.Model defines three handlers, preprocess, predict, and postprocess, executed in order, with each handler's output passed to the next as input.

  • predict by default makes a REST/gRPC call to the host obtained from predict_host;
  • predict_host is passed in as an argument, and REST is the default call style;

Custom transformer example
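
A minimal sketch of the handler structure (the class name, preprocessing logic, and flag handling below are illustrative, loosely following the KServe Python SDK examples rather than copied from them):

import argparse

import kserve

class SampleTransformer(kserve.Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        # predict() forwards the preprocessed payload to this host (REST by default)
        self.predictor_host = predictor_host
        self.ready = True

    def preprocess(self, payload, headers=None):
        # runs before predict(): reshape raw instances into what the predictor expects
        return {"instances": [instance["data"] for instance in payload["instances"]]}

    def postprocess(self, infer_response, headers=None):
        # runs after predict(): here the predictor response is passed through unchanged
        return infer_response

if __name__ == "__main__":
    parser = argparse.ArgumentParser(parents=[kserve.model_server.parser])
    # these flags are supplied when the transformer is deployed as an InferenceService component
    parser.add_argument("--predictor_host", required=True)
    parser.add_argument("--model_name", default="model")
    args, _ = parser.parse_known_args()
    model = SampleTransformer(args.model_name, predictor_host=args.predictor_host)
    kserve.ModelServer().start([model])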

InferenceGraph

Consists of multiple models chained together to make a single prediction.

  • e.g. a face recognition pipeline may need to first locate faces in an image, then compute the features of the faces to match records in a database

Example

https://kserve.github.io/website/0.10/modelserving/inference_graph/image_pipeline/#deploy-inferencegraph
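
The linked example chains InferenceServices into a sequence; the resource looks roughly like this (the service names are placeholders for the face-recognition scenario above):

apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: face-pipeline
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
      - serviceName: face-detector     # first model: locate faces in the image
        name: detector
      - serviceName: face-recognizer   # second model: match face features
        name: recognizer
        data: $response                # feed the previous step's output in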

Model explanation