# Supported Model Servers
Any model server that conforms to the model server protocol is supported by the inference extension.
## Compatible Model Server Versions
| Model Server | Version | Commit | Notes |
| --- | --- | --- | --- |
| vLLM V0 | v0.6.4 and above | commit 0ad216f | |
| vLLM V1 | v0.8.0 and above | commit bc32bc7 | |
| Triton (TensorRT-LLM) | 25.03 and above | commit 15cb989 | LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in Triton yet. Feature request |
| SGLang | v0.4.0 and above | commit 1929c06 | Set `--enable-metrics` on the model server. LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in SGLang yet. |
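As noted above, SGLang needs `--enable-metrics` so it exposes the Prometheus metrics the endpoint picker scrapes. A minimal launch sketch, where the model path and port are placeholders rather than values from this guide:

```bash
# Start SGLang with metrics enabled so the endpoint picker can scrape it.
# The model path and port are placeholders; substitute your own.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --enable-metrics
```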
## vLLM
vLLM is configured as the default in the endpoint picker (EPP) extension. No further configuration is required.
## Triton with TensorRT-LLM Backend
Triton-specific metric names need to be specified when starting the EPP.
### Option 1: Use Helm
Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install the `inferencepool` via Helm. See the `inferencepool` helm guide for more details.
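For illustration, a Helm install with that value set could look like the following; the release name and chart reference here are placeholders, so check the `inferencepool` helm guide for the actual chart location and any other required values:

```bash
# Sketch only: the release name and chart reference are placeholders.
helm install my-inference-pool \
  --set inferencePool.modelServerType=triton-tensorrt-llm \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```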
### Option 2: Edit EPP deployment yaml
Add the following to the `args` of the EPP deployment:
```yaml
- --total-queued-requests-metric
- "nv_trt_llm_request_metrics{request_type=waiting}"
- --kv-cache-usage-percentage-metric
- "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
- --lora-info-metric
- "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
```
## SGLang
### Edit EPP deployment yaml
Add the following to the `args` of the [EPP deployment](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/manifests/inferencepool-resources.yaml#L32)
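The SGLang-specific values are not reproduced on this page. The sketch below mirrors the Triton flags above and assumes SGLang's `sglang:num_queue_reqs` (queued requests) and `sglang:token_usage` (KV cache utilization) gauges; verify these metric names against what your SGLang server actually exports before using them:

```yaml
# Sketch only: the SGLang metric names below are assumptions; confirm them
# against the metrics your SGLang server exposes.
- --total-queued-requests-metric
- "sglang:num_queue_reqs"
- --kv-cache-usage-percentage-metric
- "sglang:token_usage"
- --lora-info-metric
- "" # LoRA metrics are not available in SGLang yet, so scraping is disabled.
```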