# Supported Model Servers
Any model server that conforms to the model server protocol is supported by the inference extension.
## Compatible Model Server Versions
| Model Server | Version | Commit | Notes |
| --- | --- | --- | --- |
| vLLM V0 | v0.6.4 and above | commit 0ad216f | |
| vLLM V1 | v0.8.0 and above | commit bc32bc7 | |
| Triton (TensorRT-LLM) | 25.03 and above | commit 15cb989 | LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in Triton yet. Feature request |
| SGLang | v0.4.0 and above | commit 1929c06 | Set `--enable-metrics` on the model server. LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in SGLang yet. |
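As noted above, SGLang needs `--enable-metrics` so it exposes the Prometheus metrics the endpoint picker scrapes. A minimal launch sketch, where the model path and port are placeholders rather than values from this guide:

```bash
# Start SGLang with metrics enabled so the endpoint picker can scrape it.
# The model path and port are placeholders; substitute your own.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --enable-metrics
```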
## vLLM
vLLM is configured as the default in the endpoint picker (EPP) extension. No further configuration is required.
## Triton with TensorRT-LLM Backend
Triton-specific metric names need to be specified when starting the EPP.
### Option 1: Use Helm
Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install the `inferencepool` via Helm. See the `inferencepool` helm guide for more details.
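For illustration, a Helm install with that value set could look like the following; the release name and chart reference here are placeholders, so check the `inferencepool` helm guide for the actual chart location and any other required values:

```bash
# Sketch only: the release name and chart reference are placeholders.
helm install my-inference-pool \
  --set inferencePool.modelServerType=triton-tensorrt-llm \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```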
### Option 2: Edit EPP deployment yaml
Add the following to the `args` of the EPP deployment:
```yaml
- --total-queued-requests-metric
- "nv_trt_llm_request_metrics{request_type=waiting}"
- --kv-cache-usage-percentage-metric
- "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
- --lora-info-metric
- "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
```
## SGLang
### Edit EPP deployment yaml
Add the following to the `args` of the [EPP deployment](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/manifests/inferencepool-resources.yaml#L32)
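The SGLang-specific values are not reproduced on this page. The sketch below mirrors the Triton flags above and assumes SGLang's `sglang:num_queue_reqs` (queued requests) and `sglang:token_usage` (KV cache utilization) gauges; verify these metric names against what your SGLang server actually exports before using them:

```yaml
# Sketch only: the SGLang metric names below are assumptions; confirm them
# against the metrics your SGLang server exposes.
- --total-queued-requests-metric
- "sglang:num_queue_reqs"
- --kv-cache-usage-percentage-metric
- "sglang:token_usage"
- --lora-info-metric
- "" # LoRA metrics are not available in SGLang yet, so scraping is disabled.
```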