Container Guide

How to package your ML models as containers for deployment on VectorLay.

Overview

VectorLay deploys your workloads as containers with dedicated GPU access via VFIO passthrough. Your container gets full, bare-metal GPU performance with no virtualization overhead.

Option 1: Use a pre-built image

The fastest way to get started is to use an existing inference server image. VectorLay works with any Docker image that exposes an HTTP port:

  • vllm/vllm-openai:latest — vLLM with OpenAI-compatible API
  • ghcr.io/huggingface/text-generation-inference — HuggingFace TGI
  • nvcr.io/nvidia/tritonserver — NVIDIA Triton

Dockerfile
FROM vllm/vllm-openai:latest

# That's it - vLLM handles everything
# Configure via environment variables at deploy time

Option 2: Build a custom image

For custom inference logic, build your own image based on NVIDIA CUDA:

Dockerfile
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install your inference framework (vLLM pins a compatible torch itself)
RUN pip3 install vllm

# Copy model serving code
COPY serve.py /app/serve.py

# Expose the inference port
EXPOSE 8000

# Start the server
CMD ["python3", "/app/serve.py"]

Container requirements

  • Expose an HTTP port — Your container must listen on a port (default: 8000). Configure via the container_port field.
  • Health endpoint — Implement a GET /health endpoint that returns 200 when ready to serve traffic.
  • Use NVIDIA base images — Start from nvidia/cuda to ensure CUDA compatibility.
  • Keep images small — Use multi-stage builds and slim base images when possible.
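As one way to act on the last bullet, here is a hedged sketch of a multi-stage build that keeps the CUDA devel toolchain and pip build caches out of the final runtime image (package names are illustrative):

```Dockerfile
# Stage 1: build wheels with the full CUDA devel toolchain
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS build
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
RUN pip3 wheel --wheel-dir /wheels vllm

# Stage 2: install only the built wheels into the slim runtime image
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
COPY --from=build /wheels /wheels
RUN pip3 install --no-index --find-links /wheels vllm && rm -rf /wheels
COPY serve.py /app/serve.py
EXPOSE 8000
CMD ["python3", "/app/serve.py"]
```

The devel image is several gigabytes larger than the runtime image; only the second stage is shipped, so the compilers never reach the deployed container.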

Local testing

Test your container locally with Docker Compose before deploying:

docker-compose.yml
# docker-compose.yml for local testing
services:
  model:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - MODEL=meta-llama/Llama-3.1-8B-Instruct
      - MAX_MODEL_LEN=4096

Private registries

VectorLay supports pulling from private container registries. Add your registry credentials in your dashboard settings, then reference private images in your cluster config:

  • Docker Hub (private repos)
  • GitHub Container Registry (ghcr.io)
  • AWS ECR
  • Google Artifact Registry
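Once credentials are saved, a private image is referenced like any public one. As a hedged sketch of a cluster config fragment — only the container_port field appears in this guide; the other names here are illustrative, so check the deployment docs for the actual schema:

```yaml
# Hypothetical cluster config fragment: the image path and registry
# credential wiring are illustrative, not a documented VectorLay schema.
image: ghcr.io/your-org/your-model:latest   # private image, pulled with saved credentials
container_port: 8000
```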