Deploying a custom Large Language Model (LLM) in the cloud has never been more accessible, thanks to containerization technologies like Docker and cloud GPU providers like RunPod. Whether you're fine-tuning an open-source LLM or building a private inference API, this guide walks you through deploying your model using Docker — from building your container to exposing endpoints for production use.
In this tutorial, we’ll cover:
- Building a Docker container with your LLM, tokenizer, and inference server
- Configuring GPU runtime for efficient inference
- Exposing your LLM via HTTP endpoints
- Deploying the container on RunPod
- FAQs on GPU selection, scaling, and container limits
Let’s get started!
Prerequisites
Before we begin, ensure you have the following:
- A trained or fine-tuned LLM (e.g., from Hugging Face or a local model checkpoint)
- Docker installed on your local machine
- A RunPod account (Sign up for free here)
- Basic familiarity with Python and Docker
💡 Want to skip ahead? Explore our Docker-based LLM templates to jumpstart your deployment.
Step 1: Choose an Inference Server
To serve your model efficiently, you’ll need an inference server optimized for LLMs. Two popular options are:
- vLLM: High-throughput inference server built for transformer models
- Hugging Face’s text-generation-inference: Production-grade inference server used to power Hugging Face APIs
In this guide, we’ll use Hugging Face’s text-generation-inference (TGI), but the process is similar for vLLM.
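Both servers ship prebuilt Docker images, so a quick way to evaluate them is to pull an image directly. A minimal sketch (tags change frequently, and vllm/vllm-openai is assumed to be the published vLLM image — verify names and versions in the registries):
# TGI image used later in this guide
docker pull ghcr.io/huggingface/text-generation-inference:1.2
# vLLM's OpenAI-compatible server image (name/tag assumed; check Docker Hub)
docker pull vllm/vllm-openai:latest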
Step 2: Prepare Your Model Files
Gather the following components:
- Model weights (e.g., pytorch_model.bin or model.safetensors)
- config.json
- tokenizer.json or the tokenizer folder
You can download these from Hugging Face using the transformers CLI or git lfs:
git lfs install
git clone https://huggingface.co/your-username/your-model
Place these files in a directory called model/ inside your Docker project.
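If you only want the model files without the full git history, a lighter-weight alternative is the huggingface_hub CLI (this assumes a recent huggingface_hub release that provides the download command; your-username/your-model is a placeholder):
# Download the repository contents directly into ./model
huggingface-cli download your-username/your-model --local-dir model/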
Step 3: Write a Dockerfile
Your Dockerfile should install required dependencies, copy your model files, and start the inference server.
Here’s an example Dockerfile using Hugging Face’s TGI:
FROM ghcr.io/huggingface/text-generation-inference:1.2
# Copy your model into the container
COPY ./model /data
# Point the server at the copied model and cap request sizes (the MAX_* values are optional)
ENV MODEL_ID=/data
ENV MAX_INPUT_LENGTH=1024
ENV MAX_TOTAL_TOKENS=2048
# Expose the inference port
EXPOSE 80
🛠 You can find more Dockerfile examples in our LLM deployment templates.
Build your Docker image:
docker build -t my-custom-llm .
Run it locally (for testing):
docker run --gpus all -p 8080:80 my-custom-llm
Now test your model by sending POST requests to http://localhost:8080/generate.
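For example, a minimal local test request (the body follows TGI's /generate schema; the prompt is just a placeholder):
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs":"Hello, my name is","parameters":{"max_new_tokens":50}}'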
Step 4: Push to a Container Registry
To deploy on RunPod, your container must be accessible from a registry like Docker Hub or GitHub Container Registry.
Tag and push your image:
docker tag my-custom-llm your-dockerhub-username/my-custom-llm
docker push your-dockerhub-username/my-custom-llm
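If you prefer GitHub Container Registry over Docker Hub, the flow is the same. A sketch, assuming a personal access token with package write scope is stored in GITHUB_TOKEN (the username is a placeholder):
# Authenticate, retag, and push to GHCR
echo "$GITHUB_TOKEN" | docker login ghcr.io -u your-github-username --password-stdin
docker tag my-custom-llm ghcr.io/your-github-username/my-custom-llm:latest
docker push ghcr.io/your-github-username/my-custom-llm:latest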
Step 5: Deploy on RunPod
- Log into your RunPod dashboard
- Click “Deploy” > “Custom Container”
- Under Container Image, enter your Docker image URL (docker.io/your-dockerhub-username/my-custom-llm)
- Enable GPU Access and select a GPU type (e.g., A100, RTX 3090, or RTX 4090)
- Under Container Ports, expose port 80
- Optionally mount volumes for logging or persistent model data
- Click “Deploy Pod”
Your LLM is now running in the cloud with GPU acceleration!
🚀 Want to automate deployments? Use our RunPod API to spin up pods programmatically.
Step 6: Access Your Model via API
Once deployed, RunPod provides an endpoint like:
https://<your-pod-id>.runpod.io/generate
Send a POST request to generate text:
curl -X POST https://<your-pod-id>.runpod.io/generate \
-H "Content-Type: application/json" \
-d '{"inputs":"Hello, my name is"}'
You’ll receive a JSON response with the generated output.
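You can also pass generation parameters in the same request; for example (parameter names follow TGI's /generate schema, and the values are only illustrative):
curl -X POST https://<your-pod-id>.runpod.io/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs":"Hello, my name is","parameters":{"max_new_tokens":64,"temperature":0.7}}'
With TGI, the response is a JSON object whose generated_text field contains the completion.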
Optional: Scaling and Load Balancing
For production use, consider auto-scaling via multiple pods. RunPod supports:
- Horizontal scaling (multiple pods)
- Load balancing via custom endpoints or external reverse proxies
- Auto-shutdown and restart options to save costs
Check out our inference scaling guide to build robust deployments.
Common Questions
What GPU should I choose?
- RTX 3090 / 4090: Great for medium-sized models (e.g., LLaMA 7B, Mistral)
- A100 / H100: Best for large models (13B+) and high-throughput or multi-GPU inference
- T4: Cost-effective for low-latency or smaller models
Check our GPU comparison guide for performance benchmarks.
Are there container size limits?
RunPod supports containers up to 10GB in size via registry pull. For larger models, mount a volume with your model files or use model downloading logic in your container.
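One pattern for keeping the image small is to fetch weights at container startup instead of baking them in. A minimal sketch of such an entrypoint script (a hypothetical entrypoint.sh; it assumes huggingface-cli is installed in the image and that you define a MODEL_REPO environment variable on the pod):
#!/bin/sh
# entrypoint.sh: download weights on first start, then hand off to the TGI launcher
set -e
if [ ! -f /data/config.json ]; then
  huggingface-cli download "$MODEL_REPO" --local-dir /data
fi
exec text-generation-launcher --model-id /data "$@"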
Can I scale inference across multiple GPUs?
Yes! Use an inference server with multi-GPU support, such as vLLM or DeepSpeed. RunPod offers multi-GPU pods, and the server's parallelism can be configured via environment variables or launch scripts.
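With TGI, for example, tensor parallelism is controlled by the launcher's --num-shard flag, which you can append to docker run because the image's entrypoint is the launcher itself (this sketch assumes a pod with two GPUs; --shm-size gives NCCL enough shared memory):
docker run --gpus all --shm-size 1g -p 8080:80 my-custom-llm --num-shard 2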
Final Thoughts
Deploying your own LLM in the cloud doesn’t have to be complex. With Docker and RunPod, you can containerize your model, expose it via a secure API, and scale on demand — all while benefiting from high-performance GPUs at a fraction of traditional cloud costs.
🔧 Ready to bring your model to life? Sign up for RunPod and deploy your first LLM today.
📦 Need a head start? Explore our prebuilt LLM templates and modify them to suit your needs.
Happy deploying! 🚀