
How to Build a vLLM Container Image?

As large language models (LLMs) power more and more advanced AI systems, inference performance becomes increasingly critical. Whether you are deploying chatbots, summarization tools, or other LLM-powered applications, faster inference means a better user experience and lower costs.

vLLM is an open-source LLM inference engine that is highly memory efficient and delivers high throughput. In this guide, we walk through the complete step-by-step process of creating a container image for vLLM. Let’s get started!

What is vLLM?

vLLM (Virtual Large Language Model) is a highly optimized inference engine for large language models. Its main features are:

  • Maximizes throughput through efficient request scheduling.
  • Minimizes memory consumption with PagedAttention, a novel memory-management technique for attention key/value caches.
  • Supports continuous (dynamic) batching to serve many concurrent users with minimal latency.

vLLM also supports popular models such as LLaMA, GPT-J, Falcon, and many others, and it can serve them through a Python API or a RESTful API, making it a practical backend for online LLM applications.
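For example, once vLLM is installed with pip, you can expose a model over HTTP using its bundled OpenAI-compatible server. This is only a minimal sketch; facebook/opt-125m is used purely as a small example model:

pip install vllm
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --port 8000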

Why Containerize vLLM?

Here are the main reasons to containerize vLLM:

  • Portability: The same container can run in dev, staging, and production.
  • Reproducibility: No more “it works on my machine” problems.
  • Scalability: Deploying to a cloud platform, a Kubernetes cluster, or an on-prem GPU server is straightforward.
  • Simplicity: All dependencies, configuration, and scripts live in one package that you can reuse.

Pre-requisites for Building a vLLM Container

Before starting to build your vLLM container, there are some concepts you should understand.

Docker Basics: Get a basic understanding of Dockerfiles and how to build and run containers, images, volumes, and networks.

Python & ML Environments: Be comfortable with Python virtual environments, installing packages with pip, and the basics of machine learning frameworks such as PyTorch.
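As a quick refresher, a typical isolated Python environment is set up as sketched below; the directory name .venv and the torch install are only examples:

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install torch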

System Requirements

Since vLLM is built for high-performance inference, it should run on a machine with:

  • An NVIDIA GPU with CUDA support.
  • The CUDA Toolkit installed on the host (the host CUDA version must be compatible with the version used inside the container).
  • NVIDIA drivers.
  • Ubuntu or another Linux distribution.
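You can quickly confirm that the GPU, driver, and CUDA Toolkit are visible on the host with the standard NVIDIA tools (exact output will vary by system):

nvidia-smi
nvcc --version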

Steps to Build a vLLM Container Image

Let’s have a look at the step-by-step tutorial to build a vLLM container image!

Before getting started, you will need to:

  • Access the server using SSH.
  • Start the Docker service:

sudo systemctl start docker
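If you plan to build the GPU image, it is also worth confirming that Docker can access the GPU through the NVIDIA Container Toolkit. A minimal check, assuming the toolkit is installed and using a public CUDA base image tag purely as an example:

docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi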

Server Setup Configurations

Step 1: Create a new directory you will use to store your vLLM project files.

mkdir vllm-project

Step 2: Change directory to the new directory.

cd vllm-project

Step 3: Clone the vLLM project with Git.

git clone https://github.com/vllm-project/vllm/

Step 4: List the files and verify that a new vllm directory has been created.

ls

Step 5: Change directory to the vLLM project directory.

cd vllm

Next, list the directory files and verify you have the necessary Dockerfile resources.

ls

The desired output is:

benchmarks collect_env.py Dockerfile docs LICENSE pyproject.toml requirements-common.txt requirements-dev.txt rocm_patch vllm
cmake CONTRIBUTING.md Dockerfile.cpu examples MANIFEST.in README.md requirements-cpu.txt requirements-neuron.txt setup.py
CMakeLists.txt csrc Dockerfile.rocm

The vLLM project directory consists of the following Dockerfile resources:

Dockerfile: The main build context for vLLM on NVIDIA GPU (CUDA) systems.
Dockerfile.cpu: The build context for CPU-only systems.
Dockerfile.rocm: The build context for AMD GPU (ROCm) systems.

You can now use whichever of these resources matches your hardware to create a GPU- or CPU-based container image.
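For NVIDIA GPU systems, for example, you would build from the main Dockerfile. A minimal sketch, where vllm-gpu-image is just an example tag and the trailing dot is the build context:

docker build -f Dockerfile -t vllm-gpu-image .

To run it, you would typically pass the GPU through and publish the API port (this assumes the image’s entrypoint starts vLLM’s API server, as in the upstream Dockerfile, and that the NVIDIA Container Toolkit is installed on the host):

docker run --gpus all -p 8000:8000 vllm-gpu-image --model facebook/opt-1.3b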

For CPU Systems

To build a new vLLM container image using Dockerfile.cpu, follow the steps below; this Dockerfile includes all of the relevant packages and dependencies for CPU-based systems.

Step 1: Build a new container image using Dockerfile.cpu with all the files in the project working directory. Replace vllm-cpu-image with your preferred image name.

docker build -f Dockerfile.cpu -t vllm-cpu-image .

Step 2: View all the Docker images on the server and confirm that the new vllm-cpu-image is available.

docker images

The output is:

REPOSITORY        TAG       IMAGE ID       CREATED       SIZE
vllm-cpu-image    latest    bf92416d18b4   8 hours ago   8.88GB
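You can then start a container from this image. The command below is only a sketch: it assumes the image’s entrypoint launches vLLM’s API server (entrypoints differ between versions of Dockerfile.cpu, so check the one you built from) and uses facebook/opt-125m purely as an example model:

docker run -p 8000:8000 vllm-cpu-image --model facebook/opt-125m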

Conclusion

Containerizing vLLM is an effective and efficient way to deploy large language models when you need portability and consistency across environments. To recap, the workflow is:

  • Configure a CUDA-enabled Docker environment.
  • Install the system and Python dependencies that vLLM needs.
  • Build and run a Docker image with GPU (or CPU) support for serving LLMs.
  • Test and validate the deployment of your large language model via real API calls, as shown below.
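A quick validation call against a running container might look like the sketch below. It assumes the container exposes vLLM’s OpenAI-compatible API on port 8000 and that facebook/opt-1.3b is the model being served; adjust both to match your setup:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-1.3b", "prompt": "Hello, my name is", "max_tokens": 32}'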

This foundational image can now serve as the basis of your LLM inference infrastructure whether you choose to deploy it on a local server, in a cloud instance, or within a container orchestration platform like Kubernetes.

Frequently Asked Questions

What models work with vLLM?
vLLM is compatible with popular model families such as:

  • LLaMA / LLaMA 2
  • Falcon
  • MPT
  • GPT-J / GPT-NeoX
  • OPT

Can I run vLLM on a CPU-only machine?
vLLM is optimized for GPU inference and performs best with a CUDA-compatible NVIDIA GPU. The project does provide Dockerfile.cpu for CPU-only builds (used in the steps above), but throughput will be much lower than on a GPU.

How do I change the model that is being served in the container?
Update the command in your Dockerfile, or simply override the command when starting the container (vllm-container is a placeholder image name):

docker run --gpus all -p 8000:8000 vllm-container \
    python -m vllm.entrypoints.api_server --model facebook/opt-1.3b

You could also utilize environment variables or a shell script as the entrypoint.
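As a rough sketch of the entrypoint-script approach, the script below reads the model name from an environment variable; MODEL_NAME is a hypothetical variable chosen here, not something vLLM defines:

#!/bin/sh
# entrypoint.sh - start the vLLM API server with a model chosen at run time
exec python -m vllm.entrypoints.api_server --model "${MODEL_NAME:-facebook/opt-1.3b}"

You would then pass the variable at run time, e.g. docker run --gpus all -p 8000:8000 -e MODEL_NAME=facebook/opt-1.3b vllm-container.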

Is it possible to run multiple models in the same container?
vLLM is designed to serve one model per process, but you can:

  • Run multiple containers, each serving a different model (see the sketch below).
  • Use a load balancer or API gateway to route requests to the right model.
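A minimal sketch of the multi-container approach, with illustrative image names, ports, and models, assuming the image’s entrypoint starts vLLM’s API server:

docker run -d --gpus all -p 8000:8000 vllm-gpu-image --model facebook/opt-1.3b
docker run -d --gpus all -p 8001:8000 vllm-gpu-image --model tiiuae/falcon-7b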
June 4, 2025