vLLM supports a variety of model implementations to enable efficient inference and flexibility. Users can work with models pre-integrated into vLLM or implement their own custom models.
vLLM provides native support for large language models, with implementations optimized for low-latency, high-throughput serving. Architectures such as GPT and LLaMA work out of the box, backed by optimizations like PagedAttention for efficient KV-cache management.
The platform integrates with Hugging Face Transformers, allowing seamless loading and usage of a wide range of pre-trained models that comply with the Transformers library format.
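As a rough sketch of this workflow, a Transformers-format checkpoint can be loaded by its Hub ID through vLLM's offline `LLM` API and used for generation. The model name and sampling settings below are only examples; any supported architecture can be substituted.

```python
# Minimal offline-inference sketch using vLLM's Python API.
# The model ID is illustrative; any supported Transformers-format
# checkpoint from the Hugging Face Hub can be used instead.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# Downloads (or reuses) the checkpoint and builds the inference engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Each result holds the prompt and its generated completions.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```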
Users may define custom models by extending vLLM’s base classes to tailor inference behavior to their specific needs and experiment with novel architectures.
Guidance is available for implementing custom models, focusing on the core methods a model must override to define its forward pass, token sampling, and other behaviors needed for integration with vLLM, as sketched below.
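The following is a hypothetical skeleton of such a custom model. The method names (`forward`, `compute_logits`, `load_weights`) follow the pattern described in vLLM's guidance for adding new models, but the exact base classes and signatures differ across vLLM versions, so treat this as an assumed shape rather than a drop-in implementation.

```python
# Hypothetical custom decoder-only model skeleton for vLLM.
# Method names follow vLLM's "adding a new model" pattern; exact
# signatures vary by version and are assumptions here.
import torch
import torch.nn as nn

from vllm import ModelRegistry


class MyCustomForCausalLM(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        self.config = config
        # Build embeddings, decoder layers, and the LM head here.

    def forward(self, input_ids: torch.Tensor, positions: torch.Tensor, **kwargs):
        # Run the decoder stack and return hidden states for the given tokens.
        raise NotImplementedError

    def compute_logits(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        # Project hidden states to vocabulary logits for sampling.
        raise NotImplementedError

    def load_weights(self, weights) -> None:
        # Map checkpoint tensors onto this module's parameters.
        raise NotImplementedError


# Register the architecture name so vLLM can resolve it from a config.json.
ModelRegistry.register_model("MyCustomForCausalLM", MyCustomForCausalLM)
```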
vLLM supports loading models directly from Hugging Face model repositories or from local checkpoints, making it straightforward to start inference without complex setup.
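To illustrate the two loading paths, the sketch below points vLLM at a Hub repository and, alternatively, at a local directory; both the model ID and the path are placeholders.

```python
# Sketch of the two loading paths; the Hub ID and the local path
# below are placeholders.
from vllm import LLM

# From the Hugging Face Hub, by repository ID:
llm = LLM(model="facebook/opt-125m")

# From a local directory holding a Transformers-format checkpoint
# (config.json, tokenizer files, and weights):
# llm = LLM(model="/path/to/local/checkpoint")
```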
vLLM combines native model implementations with Transformers-based support, offering an adaptable framework for deploying both built-in and custom large language models with efficient inference.