Emmett Fear

Finding the Best Docker Image for vLLM Inference on CUDA 12.4 GPUs

vLLM is a high-throughput, memory-efficient library for large language model (LLM) inference and serving. When running vLLM on GPUs with CUDA 12.4, it’s crucial to pick a suitable Docker image and environment. This guide provides an in-depth look at available Docker image sources, system requirements, a sample Dockerfile for vLLM, common compatibility issues, and deployment steps on RunPod (with GPU access). The goal is to help you achieve fast, reliable vLLM inference on CUDA 12.4, with references for further reading.

Available Image Sources

Several sources provide Docker images for vLLM inference on NVIDIA GPUs. The key options include NVIDIA NGC containers, official vLLM Docker images, and RunPod’s pre-built templates. Each has advantages in deployment speed and reliability:

  • NVIDIA NGC Containers (Triton + vLLM) – NVIDIA offers a pre-built Triton Inference Server container with vLLM support. For example, pulling nvcr.io/nvidia/tritonserver:23.10-vllm-python-py3 gives a Triton server (23.10 release) with the vLLM backend pre-installed. This image is optimized by NVIDIA for production, supporting multiple models through Triton’s standard HTTP/gRPC inference APIs. It’s kept up to date with each Triton release (e.g. 23.10 corresponds to October 2023). These NGC images are thoroughly tested and reliable, though they include the full Triton server (which adds some overhead if you only need a single-model server).
  • Official vLLM Docker Image (Docker Hub) – The vLLM project provides an official Docker image on Docker Hub, named vllm/vllm-openai. This image is built from the vLLM GitHub repository’s Dockerfile and is designed to run an OpenAI-compatible LLM serving endpoint out-of-the-box. For instance, you can pull a tagged version like vllm/vllm-openai:v0.6.0 (around 4.8 GB). This image comes with vLLM’s optimizations (e.g. PagedAttention) enabled and is ready to serve Hugging Face models via an OpenAI-style API. It’s a good default choice for quick deployment and is maintained by the vLLM team for consistency with each release.
  • RunPod vLLM Worker Images – RunPod.io offers pre-built vLLM worker Docker images optimized for their serverless GPU platform. These images are configured for high performance (including vLLM’s continuous batching and PagedAttention) and are cached on all RunPod machines for near-instant startup. For example, the stable vLLM worker image v2.5.0 (as of early 2024) is built with CUDA 12.1 and available as runpod/worker-v1-vllm:v2.5.0stable-cuda12.1.0. (Development variants exist too.) Even though it’s CUDA 12.1 based, it runs on CUDA 12.4 GPUs due to driver compatibility. RunPod’s images are tuned for reliability – they include the vLLM engine pre-configured with PagedAttention for efficient memory use and provide an OpenAI-compatible endpoint. Using these templates means less manual setup; you can deploy a model by setting a few environment variables (like model name) without building a custom image.

Which is best for speed and reliability? If you are on RunPod, their official vLLM template is highly optimized for quick deployments (image pre-cached, minimal cold start). For self-hosting or other cloud providers, the official vllm/vllm-openai image is a great starting point since it’s purpose-built for vLLM serving. NVIDIA’s Triton+vLLM container is enterprise-grade and allows multi-model serving, but if you only need one model with vLLM, the lighter vLLM-specific images may be simpler. Many users report excellent throughput improvements with vLLM compared to plain Hugging Face servers, so any of these image sources – when properly configured – should offer fast and reliable inference.
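Whichever image you choose, the serving surface is the same OpenAI-style HTTP API. As a quick illustration (a sketch that assumes a container is already serving facebook/opt-125m on localhost:8000 and was started with --api-key LOCAL-API-KEY), you can query it with the standard OpenAI Python client:

from openai import OpenAI

# Point the standard OpenAI client at the locally running vLLM container.
# Assumptions: server on localhost:8000, started with --api-key LOCAL-API-KEY.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="LOCAL-API-KEY")

completion = client.completions.create(
    model="facebook/opt-125m",            # must match the model the server loaded
    prompt="Docker images for vLLM are",
    max_tokens=32,
)
print(completion.choices[0].text)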

(Authoritative source: the vLLM GitHub repository provides the Dockerfile and documentation for the official image, and RunPod’s documentation confirms availability of optimized vLLM worker images.)

Comparison of vLLM Docker Image Sources

Source | Image (example) | Features | Optimized for
NVIDIA NGC Triton | nvcr.io/nvidia/tritonserver:23.10-vllm-python-py3 | Triton Inference Server + vLLM backend. Multi-model, Triton HTTP/gRPC APIs, NVIDIA support. | Enterprise deployment; multi-model serving with minimal custom code.
Official vLLM | vllm/vllm-openai:<version> | Standalone vLLM server image. OpenAI-compatible REST API, Hugging Face integration, ~5 GB size. | Single-model LLM serving with maximum throughput (vLLM’s native server).
RunPod Template | runpod/worker-v1-vllm:v2.5.0stable-cuda12.1.0 | Pre-loaded vLLM worker. Configurable via env vars, PagedAttention enabled, image cached on platform. | RunPod serverless endpoints (fast startup, easy scaling); minimal ops overhead.

Each of these images is compatible with CUDA 12.4 GPUs (given an appropriate NVIDIA driver and runtime, as discussed next). Choose based on your environment: for RunPod users, the built-in template is easiest; for others, the official vLLM image or NVIDIA’s container are robust choices.

Minimum Hardware and Software Dependencies

Running vLLM with CUDA 12.4 requires certain minimum hardware and software setup. Below are the key dependencies you should have in place:

  • GPU Hardware: An NVIDIA GPU with Compute Capability ≥ 7.0 is required. In practice, this means Volta (V100) or newer (Volta 7.0, Turing 7.5 like T4/RTX 20xx, Ampere 8.0/8.6 like A100/RTX 30xx, Ada Lovelace 8.9 like L4, Hopper 9.0 like H100, etc.). vLLM’s authors specifically list V100, T4, RTX 20-series, A100, L4, H100 as supported examples. Older GPUs (Pascal or Maxwell) with <7.0 capability are not supported by vLLM’s precompiled binaries. For best performance, Ampere or Hopper GPUs are recommended, but any GPU with SM 7.0 or newer should work.
  • GPU Memory: Sufficient VRAM to hold your model weights and KV cache. As a rule of thumb, half-precision weights take about 2 bytes per parameter, so a 7B model needs roughly 14 GB for the weights alone, a 13B model roughly 26 GB, and larger models (30B, 70B) typically require multiple GPUs or high-memory GPUs (e.g. 80 GB A100) with tensor parallelism. Plan on at least ~8 GB even for small models so there is headroom for the KV cache and you avoid out-of-memory errors. Also ensure ample system RAM: vLLM’s PagedAttention manages the KV cache in GPU memory, and it can additionally swap cache blocks to CPU memory if you configure swap space.
  • CUDA Toolkit / Drivers: The Docker image or host should have CUDA 12.4 runtime libraries and your NVIDIA driver must support CUDA 12.4. NVIDIA’s release notes indicate Linux driver version ≥ 550.54.14 is required for CUDA 12.4 (Linux). In other words, use the R550 driver branch or newer. Ensure nvidia-smi on the host shows driver 550+ to avoid “CUDA driver too old” errors. If you use an official CUDA base image (e.g. nvidia/cuda:12.4.0-base-ubuntu22.04), it will expect the host driver to be compatible. The CUDA 12.4 toolkit includes support for newer GPU architectures and improvements, so matching driver/toolkit is crucial.
  • cuDNN and Libraries: Most deep learning containers include cuDNN and other libraries. For CUDA 12.x, use cuDNN 8.9 or higher (the CUDA 12.4 containers bundle cuDNN 9.x as well). If building your own image, install the corresponding cuDNN package for CUDA 12.4. In practice, if you install PyTorch via pip (which vLLM uses), it will bring its own CUDA and cuDNN binaries. Just ensure that any PyTorch build is compiled for CUDA 12.4 or is compatible (more on this in the issues section).
  • Operating System: Linux (Ubuntu 20.04 or 22.04) is the typical environment. Official images are based on Ubuntu 22.04 (Jammy) for CUDA 12.x. Ensure the host OS or Docker base is 64-bit Linux with GLIBC ≥ 2.31 (which Ubuntu 22.04 provides). Windows is not supported for vLLM inference (and vLLM’s official Docker is Linux-only).
  • Python Version: Python 3.9 to 3.12 is supported by vLLM. Containers usually have Python 3.10 or 3.11. Make sure your Docker image has a compatible Python – vLLM’s PyPI requires >=3.9,<3.13. Ubuntu 22.04 ships with Python 3.10 by default, which works well.
  • NVIDIA Container Toolkit: If running Docker on your own host (not on a service like RunPod), install the NVIDIA Container Toolkit to enable --gpus all flag. This ensures the Docker container can access the GPU. (RunPod’s platform handles this under the hood for you.)

In summary, use an NVIDIA Volta/Turing/Ampere GPU or newer with a 550+ driver, CUDA 12.4 runtime, and Python 3.10+ in a Linux container. Following these baseline requirements will avoid most compatibility issues when running vLLM.
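Before building anything, it is worth verifying these requirements from inside a candidate container. A minimal check (assuming PyTorch is installed, as it is once vLLM is) might look like this:

import torch

# Sanity-check the baseline requirements: a visible CUDA device with
# compute capability >= 7.0, and a CUDA-enabled PyTorch build.
assert torch.cuda.is_available(), "No CUDA device visible (check --gpus all and the driver)"
print("CUDA runtime reported by PyTorch:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, SM {props.major}.{props.minor}, "
          f"{props.total_memory / 1024**3:.1f} GB VRAM")
    assert (props.major, props.minor) >= (7, 0), "vLLM requires compute capability >= 7.0"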

Example Dockerfile and Runtime Config for Inference

To illustrate how to set up a Docker image for vLLM on CUDA 12.4, below is an example Dockerfile. This Dockerfile uses an official CUDA 12.4 base image (Ubuntu 22.04), installs Python and vLLM, and sets up an entrypoint to serve a model. It’s optimized for RunPod compatibility and general use:

# Base image with CUDA 12.4, CUDA runtime and cuDNN on Ubuntu 22.04
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
# Install Python 3 and pip
RUN apt-get update && apt-get install -y python3 python3-pip && \
   rm -rf /var/lib/apt/lists/*

# Upgrade pip and install vLLM (which will install PyTorch and other deps)
RUN python3 -m pip install --upgrade pip && \
   pip install vllm

# (Optional) Install specific transformers or other libraries if needed
# RUN pip install transformers==4.xx.x

# Expose port for the vLLM OpenAI API server (default 8000)
EXPOSE 8000

# Define entrypoint to run the vLLM server. Using "vllm serve" for OpenAI API.
ENTRYPOINT ["vllm", "serve"]

# By default, serve a small model (can be overridden at runtime via args or env)
CMD ["facebook/opt-125m", "--host", "0.0.0.0", "--port", "8000", "--api-key", "LOCAL-API-KEY"]

A few notes on this Dockerfile:

  • We chose the CUDA 12.4 runtime base image (nvidia/cuda:12.4.0-runtime-ubuntu22.04). This image comes with the CUDA 12.4 runtime libraries (cuBLAS and friends) but not the compiler, which keeps the image smaller; the GPU driver itself is injected from the host by the NVIDIA Container Toolkit. Ubuntu 22.04 is used as per NVIDIA’s support matrix and vLLM’s recommendations. If you need the compiler (to build from source), you could use the -devel image, but here we rely on pip wheels.
  • We install system Python 3 and pip, then pip install vllm. The vLLM PyPI package will pull in a compatible version of PyTorch. Important: As of vLLM v0.5.x, the pip wheel is built against CUDA 12.1 by default. This means it may install a PyTorch version with CUDA 12.1 support. On a CUDA 12.4 system, that’s okay as long as the NVIDIA driver is new enough (which it is, by assumption), because CUDA 12.x minor versions are forward-compatible. However, if you want to pin a specific CUDA build of PyTorch, install it before vLLM, e.g. pip install torch --index-url https://download.pytorch.org/whl/cu121 (switch to the cu124 index once those wheels are available). In most cases, pip install vllm on Linux gives you a GPU-enabled PyTorch, since the default PyPI torch wheels ship with CUDA support. Always verify by running python3 -c "import torch; print(torch.cuda.is_available())" inside the container.
  • The Dockerfile sets up an entrypoint that uses vllm serve. The vLLM library provides this CLI command to launch an OpenAI-compatible HTTP server. The ENTRYPOINT ["vllm", "serve"] means whenever the container starts, it will run vllm serve .... We then provide a default CMD specifying a model (facebook/opt-125m as an example small model), host binding, port, and an API key. The --api-key flag is optional, but once set, clients must send it to authenticate (use a dummy value like LOCAL-API-KEY for testing, or override it). We also bind to 0.0.0.0 so the server is accessible externally on port 8000. In practice, you will likely override these CMD parameters when running the container (or bake your desired model into the image). For instance, on RunPod you might ignore the CMD and instead use environment variables (MODEL_NAME, etc.) to control which model to load.
  • We exposed port 8000, which is vLLM’s default for the OpenAI-style endpoint. If you use a different interface or a custom API, expose the relevant port.

This Dockerfile is a starting point. You can modify it to include downloading the model weights in advance (to avoid downloading at runtime). For example, you could add a RUN step that pre-downloads the model with huggingface-cli download or a small huggingface_hub script (see the sketch below), so that the image contains the model (this makes the image quite large, but eliminates the cold-start download). For RunPod Serverless, the recommended approach is actually to bake the model into the image for production use (to speed up cold start), or use their volume caching mechanism.
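As a minimal sketch of that pre-download step, the following script (which you could invoke from a RUN python3 ... line in the Dockerfile) fetches a repository into the Hugging Face cache at build time; the repo ID is a placeholder for your model:

from huggingface_hub import snapshot_download

# Pre-fetch model weights into the image's Hugging Face cache so the container
# does not download them on first start. Replace repo_id with your model.
snapshot_download(repo_id="facebook/opt-125m")
# Files land under ~/.cache/huggingface by default; set HF_HOME in the
# Dockerfile if you want them somewhere else (e.g. on a mounted volume).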

Finally, note that vLLM can serve quantized models (e.g. AWQ or GPTQ checkpoints), which you might configure at runtime via the --quantization flag. The Dockerfile above focuses on getting the base environment ready for inference.

Runtime Configuration: When running this container for inference, you would typically provide any necessary arguments or environment variables. For example:

  • On a local machine: docker run -d --gpus all -p 8000:8000 my-vllm-image:latest <model> --host 0.0.0.0 --port 8000 --api-key YOURKEY. The --gpus all flag (with Docker Engine 19+ and NVIDIA Container Toolkit) passes the GPU into the container. Replace <model> with a Hugging Face model ID or path. If you baked the model, you might not need to specify it again.
  • On RunPod: you wouldn’t run the docker run manually; instead, you’d deploy it as a Serverless Endpoint (see below). RunPod’s UI or API would handle passing the GPU and port. You’d configure MODEL_NAME (and possibly API_KEY) as environment variables through their interface, which the vLLM worker entrypoint will read (the RunPod base images read those env vars and construct the serve command accordingly).

This example should serve as a blueprint for building a CUDA 12.4-compatible vLLM inference container. It installs the necessary components and starts the vLLM server to listen for requests.
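Once the container is up, a quick smoke test confirms the server is reachable and the key is accepted. This sketch assumes the container built from the Dockerfile above is published on localhost:8000 with the default LOCAL-API-KEY:

import requests

# List the models the server has loaded; a 200 response means the API is up
# and the API key is accepted.
resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer LOCAL-API-KEY"},
    timeout=30,
)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])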

Common Issues with CUDA 12.4 Compatibility

Running cutting-edge CUDA 12.4 for LLM inference can introduce some compatibility challenges. Here are common issues and their solutions:

  • PyTorch – CUDA Version Mismatch: vLLM depends on PyTorch under the hood. If the PyTorch version installed in the container isn’t built for CUDA 12.4, you might see errors or suboptimal performance. As noted, the vLLM pip wheel is built with CUDA 12.1 in mind. This means it may install a PyTorch that expects CUDA 12.1 libraries. On a system with CUDA 12.4 drivers, that usually works (because newer drivers run older CUDA toolkit code) via compatibility mode. However, if you encounter an error like “CUDA driver mismatch” or “unspecified launch failure”, it could be a sign that the PyTorch binary isn’t aligning with the environment. Solution: Install a PyTorch binary specifically for CUDA 12.4 (if available, e.g. a nightly build for CUDA 12.4) before installing vLLM. Alternatively, use pip install vllm inside an official PyTorch CUDA 12.4 container to ensure consistency. In the worst case, build vLLM from source against your environment. The vLLM docs state that binary incompatibility can occur if you use a different CUDA/PyTorch than expected, and recommend building from source in that case.
  • NVIDIA Driver Out of Date: This is a very common issue. If your host’s NVIDIA driver is older than required, the container will not be able to initialize the GPU for CUDA 12.4. The error might be "CUDA driver version is insufficient for CUDA runtime" or it might silently fall back to CPU. Double-check the driver with nvidia-smi. As mentioned, you need driver 550.54.14 or newer for CUDA 12.4. Solution: Update the NVIDIA driver on the host to a production version from the R550 branch or newer (on Windows with WSL2, use the corresponding 551+ driver). On services like RunPod, the driver is managed for you (their GPUs will have appropriate drivers to support advertised CUDA versions).
  • cuDNN or Library Compatibility: If you build your own image and mix library versions, issues can arise (e.g., mismatched cuDNN causing errors in convolution ops). To avoid this, base your image on NVIDIA’s official CUDA 12.4 images which include the proper cuDNN (or explicitly install the matching libcudnn8 deb for CUDA 12.4). If using pip wheels, ensure that the wheel’s cuDNN is compatible. Usually, PyTorch wheels bundle a compatible cuDNN, so this isn’t often a separate issue. But if you see errors like “CUDNN_STATUS_NOT_SUPPORTED” it could indicate a version mismatch. Solution: Stick to known good combinations (such as the ones in official images) or use the nvidia/cuda:<tag>-cudnn-runtime images that include cuDNN for you.
  • Memory and Performance Tuning: On CUDA 12.x you get access to newer performance features, but you might also run into memory fragmentation or allocation issues for large models. For example, if using multiple GPUs, ensure your model fits in each GPU’s memory or use vLLM’s tensor parallelism flags correctly. If you hit out-of-memory (OOM) errors, you may need to lower --max-model-len (context length) or use smaller batch sizes. vLLM’s continuous batching will try to maximize GPU utilization, but extremely long prompts can still OOM a GPU if it runs out of memory for the KV cache. Solution: Reduce context or use a GPU with more memory, or pass --max-model-len (e.g. 16384) to cap it; a rough KV-cache sizing sketch follows this list. Also monitor memory usage over time – vLLM should recycle the KV cache effectively, but if you keep the server running long with varied lengths, watch for any memory growth.
  • Environment Variable Effects: Some environment variables can inadvertently cause issues if not set properly. For instance, TOKENIZERS_PARALLELISM (from Hugging Face Tokenizers) – if this is left at default and the code forks processes, you might see a warning or even a deadlock. The warning “The current process just got forked after multi-threaded tokenization…” is common. Solution: Set TOKENIZERS_PARALLELISM=false in the environment to disable parallel tokenization in HuggingFace, as recommended, to avoid that issue. We’ll discuss more about such env vars in the FAQ.
  • Docker Runtime Settings: If you forget to run the container with GPU access, the server will run on CPU which is extremely slow for LLMs. Always use --runtime=nvidia (old) or --gpus all (newer Docker) when starting the container. On Kubernetes, use the NVIDIA device plugin. On RunPod’s serverless, this is taken care of by the platform (you select a GPU type in the UI, and it ensures the container sees a GPU). If you accidentally deploy a vLLM container on a CPU instance, it will technically run but at a fraction of expected performance. Solution: double-check that the deployment target has GPUs and that no environment variable like CUDA_VISIBLE_DEVICES is unintentionally set to blank (hiding GPUs).
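To put rough numbers on the KV-cache point above, here is a back-of-the-envelope sizing sketch; the layer and head figures are Llama-2-7B-style placeholders rather than measured values, so substitute your model’s config:

# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes per value
num_layers = 32        # placeholder: Llama-2-7B-like config
num_kv_heads = 32
head_dim = 128
dtype_bytes = 2        # fp16 / bf16

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
max_model_len = 4096
concurrent_seqs = 8

total_gb = kv_bytes_per_token * max_model_len * concurrent_seqs / 1024**3
print(f"~{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")
print(f"~{total_gb:.1f} GB for {concurrent_seqs} sequences of {max_model_len} tokens")

Halving max_model_len halves this budget, which is why capping the context length is often the quickest OOM fix.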

In summary, keep your drivers and CUDA in sync, ensure the Python packages (vLLM, PyTorch) align with CUDA 12.4, and set recommended environment variables. When in doubt, consult vLLM’s GitHub issues for any specific incompatibilities – by 2025, many early issues with CUDA 12.x have been ironed out, but staying updated with vLLM releases (which often track new CUDA and PyTorch versions) is wise.

How to Deploy on RunPod with GPU Access

Deploying a vLLM-powered Docker container on RunPod.io is straightforward using their Serverless Endpoints with GPU instances. RunPod provides a UI and API that let you select a vLLM template and launch it on NVIDIA GPUs in the cloud. Here’s a step-by-step walkthrough to get your vLLM inference service running on RunPod:

Step 1: Choose Your Model and Check Requirements – First, decide which Hugging Face model (or custom model) you want to serve with vLLM. Ensure the model is supported by vLLM (most transformers models are; see the vLLM docs for supported architectures). For example, you might use openchat/openchat-3.5-0106 (a chat model) as in RunPod’s tutorial, or any other HF model ID. Note the model’s size to choose an appropriate GPU (e.g., don’t try a 70B model on a 16 GB GPU).

Step 2: Create a Serverless vLLM Endpoint via RunPod Console – Log in to your RunPod account and navigate to the Serverless section. Click “Deploy Endpoint” and find the vLLM worker option (RunPod often lists a “Quick Deploy: Serverless vLLM” template). Click Configure to set up your endpoint. In the configuration modal, you’ll do the following:

  • Select vLLM Image Version: Choose the latest stable vLLM worker image (by default, RunPod will show the latest stable tag, e.g. v2.5.0). This corresponds to a specific CUDA version (at the time of writing, CUDA 12.1 was the latest stable on RunPod; in the future, they may list a CUDA 12.4-based image if available). If you want to ensure CUDA 12.4, you might choose a “dev” image tag if provided, or simply use the stable one since it runs on 12.4 GPUs regardless.
  • Select Model: Under “Hugging Face Models,” enter the model name or repository ID (for example, openchat/openchat-3.5-0106 or facebook/opt-1.3b). This tells the vLLM container which model to download or load. If the model is gated or private, also input your Hugging Face token (there’s a field for it).
  • vLLM Settings: You may be presented with some vLLM-specific settings. For instance, “Max Model Length” (context length) can be set – e.g. 8192 tokens for a long-context model. If unsure, you can leave defaults, but it’s good to match this to your model’s max sequence length to avoid allocating excessive memory. Other settings like batch size, parallelism, etc., usually have safe defaults.
  • Endpoint Compute Configuration: Next, choose your hardware. Select a GPU type available (RunPod provides various GPU types – e.g., A10G (24 GB), A100 40GB or 80GB, etc. – the availability depends on region and pricing). For many LLMs, an A100 40GB is a solid choice. Set GPU Count to 1 for most cases. You can set GPU Count >1 if you plan to use multiple GPUs for one worker (vLLM supports tensor parallel inference across GPUs). For example, for a 70B model, you might set GPU Count = 2 and select two 40GB GPUs; vLLM will automatically use both (it detects multiple GPUs and can distribute the model). If unsure, stick to 1. Also configure Active Workers vs Max Workers: Active Workers = 1 means one container will be kept running at all times (ensuring low-latency responses), whereas 0 means it will scale to zero when idle (cheaper, but with cold-start delay). Max Workers defines how many containers can scale out concurrently under load – for example, Max 2 means at most two instances of your endpoint will run to handle traffic spikes. These settings let you balance cost and performance.
  • Deployment: Once the options are set, click Deploy. RunPod will now spin up the endpoint. Behind the scenes, it’s pulling the vLLM Docker image and launching the container with the parameters you provided. If you chose a model that is not baked into the image, the worker will download the model weights on startup (you’ll see logs of it doing so). The endpoint will show an “Initializing” status during this process, which can take a few minutes for large models (note: to avoid this download time, you could build a custom image with the model, but for development the quick deploy is fine).

Step 3: Testing the Endpoint – Once the endpoint status is “Running,” you’ll get an Endpoint ID or URL. In the RunPod console, click on your endpoint to see details. You can test it right from the UI: go to the Requests tab, where RunPod provides a form or some example input. For OpenAI-compatible endpoints, there might be a default test prompt (e.g., a JSON for a chat completion with "prompt": "Hello World" as in the docs). Click Run to send a test request. The request will be processed by the vLLM server in your container. The UI will display the output and some metadata like latency once completed.
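If you prefer code over the console, the same test can be sent to RunPod’s serverless API with plain HTTP. The sketch below is illustrative: the endpoint ID and API key are placeholders, and the exact input schema should be confirmed against the vLLM worker docs for your image version.

import requests

ENDPOINT_ID = "your-endpoint-id"    # placeholder: shown on the endpoint detail page
API_KEY = "your-runpod-api-key"     # placeholder: your RunPod API key

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello World", "sampling_params": {"max_tokens": 64}}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())   # generated text plus metadata such as executionTime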

Alternatively, you can test using code: RunPod endpoints have a URL and an API key. For OpenAI-compatible vLLM endpoints, you’d typically use the OpenAI client library and point its base URL (the base_url argument, or the OPENAI_BASE_URL environment variable) at your RunPod endpoint, using the provided API key (RunPod will show an API key, or you might use your user API key depending on how they implement authentication). Then a normal OpenAI API call (like client.chat.completions.create) will hit your vLLM instance. This makes integration easy – you can swap out OpenAI’s endpoint for your own vLLM endpoint.
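Concretely, it is the same client pattern as the local example earlier; only the base URL and key change. The /openai/v1 path below follows the pattern used by RunPod’s vLLM workers, but confirm the exact URL on your endpoint’s detail page.

from openai import OpenAI

# Placeholders: substitute your endpoint ID and RunPod API key.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/your-endpoint-id/openai/v1",
    api_key="your-runpod-api-key",
)
out = client.chat.completions.create(
    model="openchat/openchat-3.5-0106",   # the model you configured for the endpoint
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(out.choices[0].message.content)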

Step 4: Monitoring and Scaling – On RunPod, you can monitor your endpoint’s performance. The endpoint detail page will show logs (e.g., model download progress, any errors, requests received, etc.). You should see logs confirming that vLLM started and is listening on port 8000. If there are issues (like the model failed to load), the logs will help diagnose (for example, missing HF token for a gated model). You can also see metrics like how many instances are running (if you allowed scaling) and the last request time. Currently, RunPod’s UI might not show real-time GPU utilization graphs, but you can always exec into the container (using RunPod’s shell access) and run nvidia-smi to see GPU memory usage and utilization. The platform ensures the container has full GPU access, so nvidia-smi inside will reflect usage.

If you anticipate higher load, you can increase Max Workers to allow the endpoint to autoscale horizontally (each worker on a separate GPU). RunPod’s serverless will automatically route requests to multiple instances if you have more than one active. Conversely, for cost control, you might keep Active Workers = 0 so that when no requests come, it scales down (you then pay only storage costs, and scale up on demand).

Step 5: (Optional) Customize or Update – If you need to change models or environment variables, you can edit the endpoint configuration in RunPod. For example, to switch to a different model, you might update the MODEL_NAME variable or redeploy with a new model ID. If you built your own Docker image (from the Dockerfile above), you can also deploy that on RunPod by providing the image name in the endpoint config instead of their pre-built ones (you’d have to push your image to a registry first). RunPod’s system is flexible: you can bring your own image as long as it listens on the expected port and uses the same OpenAI API schema (or uses RunPod’s serverless handling, which usually expects an HTTP server responding to either OpenAI or RunPod’s generic format).

For more details, check RunPod’s official docs on vLLM workers. RunPod provides an official tutorial on deploying a vLLM worker with screenshots for each step. They emphasize that quick-deployed workers download models at startup and suggest packaging the model in the image for production to reduce cold starts. Additionally, their vLLM worker overview highlights features like OpenAI API compatibility and environment variable configurations, which we’ve leveraged in this guide.

By following these steps, you can get a vLLM inference service running on RunPod’s GPU cloud with relative ease. The combination of vLLM’s efficiency and RunPod’s scalable infrastructure means you can serve LLM responses with high throughput and lower latency, paying only for the GPU time you actually need.

FAQ

Q1: I installed vllm but it’s not using my GPU (or I get an error about CUDA). What did I do wrong?

A1: This usually means either the GPU wasn’t visible or the installation fell back to CPU. First, ensure you run the container with GPU access (--gpus all). Inside the container, run nvidia-smi to verify the GPU is recognized. If vLLM is running but says CUDA is not available, it might be that the PyTorch installed is CPU-only. This can happen if pip resolved a torch build without CUDA support. The fix is to install the correct GPU-enabled PyTorch. One approach: in your Dockerfile (or environment), explicitly install torch with CUDA (for example, pip install torch --index-url https://download.pytorch.org/whl/cu121 for CUDA 12.1, which will work on 12.4 drivers) before installing vLLM. Alternatively, use a PyTorch nightly wheel for CUDA 12.4 if available. In summary, make sure a CUDA-enabled PyTorch is present. Also, check your environment variable CUDA_VISIBLE_DEVICES – it should list the GPUs you want to use (or be unset). If it’s set to an empty string, no GPUs will be visible. In Docker, it’s normally not set at all, which means all provided GPUs are visible (e.g., CUDA_VISIBLE_DEVICES=0 would restrict to GPU 0 only). On RunPod, this is managed for you, and the vLLM worker will automatically use the GPU provided.

Q2: What environment variables should I consider setting for optimal vLLM performance?

A2: There are a few environment variables that can tweak performance or suppress warnings:

  • CUDA_VISIBLE_DEVICES: As mentioned, this variable controls which GPU(s) the process can see. In single-GPU scenarios, it can be left alone (defaults to all available GPUs). In multi-GPU, vLLM will use all visible GPUs for parallel inference by default. If you want to pin a container to a specific GPU in a multi-GPU node, you could set CUDA_VISIBLE_DEVICES=1 (for example) to only use GPU1. On RunPod serverless, each worker is isolated to the GPU you requested, so you typically don’t need to set this manually.
  • TOKENIZERS_PARALLELISM: Set this to false (or 0). This prevents the Hugging Face tokenizers library from using multiple threads in a way that conflicts with forking. vLLM’s server forks worker processes, and if tokenizers had launched threads, you get a warning or potential deadlock. By disabling parallelism, you ensure tokenization is single-threaded per process, avoiding that issue. It might slightly reduce tokenization speed, but that is usually negligible compared to the model inference time, and it ensures stability.
  • OMP_NUM_THREADS / MKL_NUM_THREADS: These control the number of CPU threads used by PyTorch (for CPU ops or data loading). If you are running one model per GPU and not doing heavy CPU compute, you might not need to set these. However, if you observe high CPU usage, you can limit these to, say, 1 or 2 to avoid contention. This is more relevant for multi-container deployments on one machine.
  • HF_TOKEN: If you’re loading a private or gated Hugging Face model, set HF_TOKEN to your HuggingFace access token so that vLLM can download the model. On RunPod, this is provided in the UI if needed.
  • MAX_MODEL_LEN, MAX_BATCH_SIZE, etc.: vLLM allows configuration of maximum context length and batch sizes via either CLI args or env vars. For example, in RunPod’s env config, you might see MAX_MODEL_LEN and you could set it to 8192 or 4096 depending on your model’s limit. Tuning these can help manage memory usage (lower max length = lower memory overhead).

In general, vLLM is designed to work out-of-the-box without many env tweaks, but the above can be useful. Always set TOKENIZERS_PARALLELISM=false to keep things smooth. RunPod’s official worker images already set this internally (their Dockerfiles have ENV TOKENIZERS_PARALLELISM=false by default), so you might not need to worry about it there.
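If you run vLLM yourself (rather than through RunPod’s worker), these variables can also be set in code before vLLM and the tokenizer libraries are imported. The snippet below is a sketch with placeholder values, using vLLM’s offline LLM API simply to show the settings taking effect:

import os

# Set these before importing vllm / transformers so they take effect.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")   # avoid fork-related tokenizer warnings
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")         # pin to GPU 0 on a multi-GPU host
# os.environ["HF_TOKEN"] = "hf_..."                        # placeholder; only for gated/private models

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", max_model_len=2048)   # small model, capped context
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)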

Q3: How can I monitor GPU utilization and performance of the vLLM server?

A3: For basic monitoring, use nvidia-smi. If you have shell access to the running container (for example, using docker exec -it <container> bash on a self-hosted setup, or the console access on RunPod), run watch -n1 nvidia-smi to see real-time GPU memory usage and compute utilization. You should see your process’s memory footprint (which will include the model and cache) and the GPU utilization percentage when serving requests. A near-100% utilization during generation indicates the GPU is fully engaged (which is good for throughput). Low utilization might indicate the model is waiting on I/O or batching – vLLM’s continuous batching tries to keep it high by grouping requests.
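If you would rather poll programmatically than watch nvidia-smi, a small loop with the nvidia-ml-py bindings (imported as pynvml) reports the same numbers; this assumes that package is installed in the container:

import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0; adjust for multi-GPU workers
for _ in range(10):                             # sample once per second for 10 seconds
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU util {util.gpu}%  mem {mem.used / 1024**3:.1f}/{mem.total / 1024**3:.1f} GB")
    time.sleep(1)
pynvml.nvmlShutdown()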

For more detailed performance metrics, vLLM has some built-in logging and metrics. If launched in debug mode or with certain flags, it can output latency and throughput stats. Additionally, if you use Triton’s vLLM backend (NGC container), Triton provides Prometheus metrics for things like infer count, queue latency, etc. On RunPod, the endpoint detail page will show you the execution time for each request (as in the example, you see executionTime: the time spent processing). You can use that to gauge how fast responses are and track any changes after tuning.

If you need to profile, you can also run the server locally with torch.cuda.profiler or similar, but that’s advanced. In most cases, watching nvidia-smi and using RunPod’s logs for request timings is sufficient to ensure the GPU is being well utilized.

Q4: I get “Binary not compiled with GPU support” or similar error when running inside the container. How do I fix it?

A4: This suggests the vLLM or PyTorch binary thinks there’s no GPU support. It might happen if the container image lacked CUDA libraries or if pip installed a CPU-only binary. To fix this, ensure you are using the correct base image (with CUDA). The example Dockerfile uses nvidia/cuda:12.4.0-runtime-ubuntu22.04, which has the necessary CUDA libs. If you accidentally used a slim Linux base with no CUDA, the PyTorch installed might default to CPU. Always start from an NVIDIA CUDA image or an official PyTorch image that includes CUDA. Another tip: after installing vLLM, run python3 -c "import vllm; import torch; print(torch.cuda.is_available())" – note that during a docker build there is usually no GPU attached, so treat this as a check to run in the started container rather than in a build step. If it prints False at runtime, the environment is not set up correctly for CUDA; install the correct wheels or drivers in the image. In short, use the right base and the right pip wheels. If using conda, use a CUDA-enabled conda environment (e.g., install pytorch with pytorch-cuda=12.1 from the pytorch and nvidia channels). In our guide, we stick to pip and the NVIDIA base, which should avoid this issue.

Q5: Are there any known issues specific to CUDA 12.4 and vLLM I should be aware of?

A5: As of early 2025, most issues have been around building/packaging rather than functionality. One thing to note is that CUDA 12 introduced some changes that improved performance (as seen in some PyTorch benchmarks, CUDA 12.1/12.2 gave 5-10% speedups in certain transformer operations). vLLM can benefit from these improvements. However, if you use very new CUDA 12.4, make sure all dependencies (like FlashAttention, if used, or other kernels) are compiled for it. vLLM uses its own optimized kernels and integrates with FlashAttention. If you install additional packages like xformers or flash-attn, ensure you get versions compiled for CUDA 12.x; otherwise, you might face compilation errors. Also, keep an eye on driver bugs – occasionally a newest driver might have a bug affecting certain operations. Upgrading to the latest patch (e.g., if using 550.54, try 550.66 if available) can resolve weird behaviors. The community forums (GitHub issues in vLLM or PyTorch) are a good place to check if any regression is observed with a specific CUDA version. At the time of writing, no major vLLM-specific bug on CUDA 12.4 has been reported publicly; the main work was ensuring support, which has been done.

Q6: How do I update my vLLM deployment to a new version (or different model)?

A6: If you used the RunPod approach, updating is as simple as deploying a new endpoint or updating the image tag/model in the existing endpoint. For example, when vLLM releases a new version with enhancements, RunPod might add a new image tag (say v2.6.0stable-cuda12.4.0). You can edit your endpoint and select the new image, or redeploy with it. If you manage your own Docker, you would rebuild your Dockerfile with the updated pip install vllm==<new_version> or pull the updated official image tag. To change the model, on RunPod just change the MODEL_NAME env or reconfigure the endpoint. If running locally, you might need to restart the container with a different CMD (or if you built the model into the image, you’d build a new image or mount new weights). The process is similar to initial deployment. Always test the new setup in a staging environment if possible, to ensure everything is compatible (especially when jumping major vLLM versions or CUDA versions).

By following this guide, you should have a solid understanding of how to find or build the best Docker image for vLLM inference on CUDA 12.4 GPUs, and how to deploy it efficiently on RunPod. The combination of vLLM’s advanced optimizations (like PagedAttention to minimize memory waste) and powerful GPUs will enable high-throughput LLM serving – vLLM boasts up to 24× higher throughput than naive HF Transformers serving by reducing wasted memory and better batching. With the right Docker image and configuration, you can tap into this performance for your own applications, whether it’s an OpenAI-style API or a bespoke inference service.

Sources:

  • vLLM GitHub – vLLM: A high-throughput and memory-efficient LLM inference library
  • NVIDIA Docs – Triton Inference Server with vLLM Backend (NGC container info)
  • RunPod Documentation – vLLM Worker overview and deployment guide
  • vLLM Documentation – Installation requirements and compatibility notes
  • NVIDIA CUDA Release Notes – Driver requirement for CUDA 12.4
  • RunPod GitHub – vLLM worker image tags and caching info
  • RunPod Blog – Intro to vLLM & PagedAttention (performance benefits)
