Emmett Fear

How to Deploy a Custom LLM in the Cloud Using Docker

Deploying a custom Large Language Model (LLM) in the cloud has never been more accessible, thanks to containerization technologies like Docker and cloud GPU providers like RunPod. Whether you're fine-tuning an open-source LLM or building a private inference API, this guide walks you through deploying your model using Docker — from building your container to exposing endpoints for production use.

In this tutorial, we’ll cover:

  • Building a Docker container with your LLM, tokenizer, and inference server
  • Configuring GPU runtime for efficient inference
  • Exposing your LLM via HTTP endpoints
  • Deploying the container on RunPod
  • FAQs on GPU selection, scaling, and container limits

Let’s get started!

Prerequisites

Before we begin, ensure you have the following:

  • A trained or fine-tuned LLM (e.g., from Hugging Face or a local model checkpoint)
  • Docker installed on your local machine
  • A RunPod account (Sign up for free here)
  • Basic familiarity with Python and Docker

💡 Want to skip ahead? Explore our Docker-based LLM templates to jumpstart your deployment.

Step 1: Choose an Inference Server

To serve your model efficiently, you’ll need an inference server optimized for LLMs. Two popular options are:

  • vLLM: High-throughput inference server built for transformer models
  • Hugging Face’s text-generation-inference: Production-grade inference server used to power Hugging Face APIs

In this guide, we’ll use Hugging Face’s text-generation-inference, but the process is similar for vLLM.

Step 2: Prepare Your Model Files

Gather the following components:

  • Model weights (e.g., pytorch_model.bin, model.safetensors)
  • config.json
  • tokenizer.json or tokenizer folder

You can download these from the Hugging Face Hub using the huggingface-cli tool or git lfs:

git lfs install
git clone https://huggingface.co/your-username/your-model

Place these files in a directory called model/ inside your Docker project.
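
If you'd rather script the download, the huggingface_hub Python library can pull the same files straight into that directory. A minimal sketch (the repo ID is a placeholder for your own model, and private repos will also need an access token):

# download_model.py - fetch weights, config, and tokenizer from the Hugging Face Hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-username/your-model",  # placeholder - replace with your model repo
    local_dir="model",                   # the model/ directory referenced below
)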

Step 3: Write a Dockerfile

Your Dockerfile should install required dependencies, copy your model files, and start the inference server.

Here’s an example Dockerfile using Hugging Face’s TGI:

FROM ghcr.io/huggingface/text-generation-inference:1.2

# Copy your model into the container
COPY ./model /data

# Set environment variables (optional)
ENV MODEL_ID=/data
ENV MAX_INPUT_LENGTH=1024
ENV MAX_TOTAL_TOKENS=2048

# Expose the inference port (the base image's entrypoint, text-generation-launcher, serves on this port)
EXPOSE 80

🛠 You can find more Dockerfile examples in our LLM deployment templates.

Build your Docker image:

docker build -t my-custom-llm .

Run it locally (for testing):

docker run --gpus all -p 8080:80 my-custom-llm

The server is now listening on http://localhost:8080. Test your model by sending POST requests to http://localhost:8080/generate.
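
For example, a quick smoke test from Python (assuming the container is running with the port mapping above and the requests library is installed):

# test_local.py - send a test prompt to the locally running container
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Hello, my name is", "parameters": {"max_new_tokens": 50}},
    timeout=60,
)
print(response.json())  # JSON response containing the generated text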

Step 4: Push to a Container Registry

To deploy on RunPod, your container must be accessible from a registry like Docker Hub or GitHub Container Registry.

Tag and push your image:

docker tag my-custom-llm your-dockerhub-username/my-custom-llm
docker push your-dockerhub-username/my-custom-llm

Step 5: Deploy on RunPod

  1. Log into your RunPod dashboard
  2. Click “Deploy” > “Custom Container”
  3. Under Container Image, enter your Docker image URL (docker.io/your-dockerhub-username/my-custom-llm)
  4. Enable GPU Access and select a GPU type (e.g., A100, RTX 3090, 4090)
  5. Under Container Ports, expose port 80
  6. Optionally mount volumes for logging or persistent model data
  7. Click “Deploy Pod”

Your LLM is now running in the cloud with GPU acceleration!

🚀 Want to automate deployments? Use our RunPod API to spin up pods programmatically.
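
As a rough sketch, the runpod Python SDK exposes a create_pod helper for this. The values below (pod name, image, GPU type string, port spec) are illustrative assumptions to adapt to your account and SDK version:

# deploy_pod.py - sketch of programmatic deployment via the runpod Python SDK
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"  # placeholder - create a key in the RunPod dashboard

pod = runpod.create_pod(
    name="my-custom-llm",
    image_name="docker.io/your-dockerhub-username/my-custom-llm",
    gpu_type_id="NVIDIA GeForce RTX 4090",  # illustrative GPU type identifier
    ports="80/http",                        # expose the inference port
)
print(pod)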

Step 6: Access Your Model via API

Once deployed, RunPod provides an endpoint like:

https://<your-pod-id>.runpod.io/generate

Send a POST request to generate text:

curl -X POST https://<your-pod-id>.runpod.io/generate \
   -H "Content-Type: application/json" \
   -d '{"inputs":"Hello, my name is"}'

You’ll receive a JSON response with the generated output.
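
The same call from Python, extracting the completion from TGI's generated_text field (the endpoint is the placeholder pod URL above):

# client.py - call the deployed endpoint and return the completion text
import requests

ENDPOINT = "https://<your-pod-id>.runpod.io/generate"  # placeholder pod URL

def generate(prompt: str, max_new_tokens: int = 100) -> str:
    response = requests.post(
        ENDPOINT,
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["generated_text"]

print(generate("Hello, my name is"))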

Optional: Scaling and Load Balancing

For production use, consider auto-scaling via multiple pods. RunPod supports:

  • Horizontal scaling (multiple pods)
  • Load balancing via custom endpoints or external reverse proxies
  • Auto-shutdown and restart options to save costs

Check out our inference scaling guide to build robust deployments.
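
If you're not ready to put a reverse proxy in front, here's a minimal illustration of spreading requests across several pods from the client side (the pod URLs are placeholders); for production traffic you'd normally let a load balancer or a dedicated endpoint handle this instead:

# round_robin.py - naive client-side rotation across multiple pods (illustration only)
import itertools
import requests

# Placeholder pod URLs for two identical deployments
PODS = itertools.cycle([
    "https://<pod-id-1>.runpod.io/generate",
    "https://<pod-id-2>.runpod.io/generate",
])

def generate(prompt: str) -> dict:
    url = next(PODS)  # rotate to the next pod on each call
    response = requests.post(url, json={"inputs": prompt}, timeout=120)
    response.raise_for_status()
    return response.json()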

Common Questions

What GPU should I choose?

  • RTX 3090 / 4090: Great for medium-sized models (e.g., LLaMA 7B, Mistral)
  • A100 / H100: Best for large models (13B+ parameters) and multi-GPU or high-throughput inference
  • T4: Cost-effective for low-latency or smaller models

Check our GPU comparison guide for performance benchmarks.

Are there container size limits?

RunPod supports containers up to 10GB in size via registry pull. For larger models, mount a volume with your model files or use model downloading logic in your container.
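
A sketch of that download-at-startup pattern, using huggingface_hub to populate a mounted volume only when the model isn't already there (the paths and repo ID are placeholders):

# fetch_model.py - run at container start to fill a mounted volume with model files
import os
from huggingface_hub import snapshot_download

MODEL_DIR = "/data"                   # mount your RunPod volume at this path
REPO_ID = "your-username/your-model"  # placeholder model repo

# Skip the download if the volume already contains the model
if not os.path.exists(os.path.join(MODEL_DIR, "config.json")):
    snapshot_download(repo_id=REPO_ID, local_dir=MODEL_DIR)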

Can I scale inference across multiple GPUs?

Yes! Use an inference server with multi-GPU support, such as vLLM or DeepSpeed. RunPod supports multi-GPU pods, which you can configure via environment variables or launch scripts.

Final Thoughts

Deploying your own LLM in the cloud doesn’t have to be complex. With Docker and RunPod, you can containerize your model, expose it via a secure API, and scale on demand — all while benefiting from high-performance GPUs at a fraction of traditional cloud costs.

🔧 Ready to bring your model to life? Sign up for RunPod and deploy your first LLM today.
📦 Need a head start? Explore our prebuilt LLM templates and modify them to suit your needs.

Happy deploying! 🚀
