Container Guide

Overview

VectorLay deploys your workloads as containers with dedicated GPU access via VFIO passthrough. Because the GPU is passed through directly rather than shared or emulated, your container gets close to bare-metal GPU performance.

Option 1: Use a pre-built image

The fastest way to get started is to use an existing inference server image. VectorLay works with any Docker image that exposes an HTTP port, for example:

vllm/vllm-openai:latest — vLLM with OpenAI-compatible API
ghcr.io/huggingface/text-generation-inference — HuggingFace TGI
nvcr.io/nvidia/tritonserver — NVIDIA Triton

Dockerfile

FROM vllm/vllm-openai:latest

# That's it - vLLM handles everything
# Configure via environment variables at deploy time

Option 2: Build a custom image

For custom inference logic, build your own image based on NVIDIA CUDA:

Dockerfile

FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install your inference framework (--no-cache-dir keeps pip's download cache out of the image)
RUN pip3 install --no-cache-dir vllm torch

# Copy model serving code
COPY serve.py /app/serve.py

# Expose the inference port
EXPOSE 8000

# Start the server
CMD ["python3", "/app/serve.py"]
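
The Dockerfile above copies a serve.py that is not shown. As one possible starting point, such a script could read its settings from environment variables, the same mechanism Option 1 uses for configuration at deploy time. The sketch below is an assumption, not a layout VectorLay requires: MODEL and MAX_MODEL_LEN mirror the names used in the Local testing example, while PORT, Settings, and load_settings are hypothetical names chosen for illustration.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model: str          # model identifier to load
    max_model_len: int  # maximum context length, in tokens
    port: int           # port the HTTP server listens on


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast on bad input."""
    model = env.get("MODEL")
    if not model:
        raise ValueError("MODEL must be set")
    return Settings(
        model=model,
        max_model_len=int(env.get("MAX_MODEL_LEN", "4096")),
        # PORT is a hypothetical variable; 8000 matches the EXPOSE line above.
        port=int(env.get("PORT", "8000")),
    )
```

Calling load_settings() with no arguments reads the process environment. Passing a plain dict instead makes the function easy to unit-test without touching real environment variables.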

Container requirements

Expose an HTTP port — Your container must listen on a port (default: 8000). Configure via the container_port field.
Health endpoint — Implement a GET /health endpoint that returns 200 when ready to serve traffic.
Use NVIDIA base images — Start from nvidia/cuda to ensure CUDA compatibility.
Keep images small — Use multi-stage builds and slim base images when possible.
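
The first two requirements can be met with very little code. The following is a minimal sketch using only the Python standard library, not a production server: it listens on port 8000 and answers GET /health with 200 once a readiness flag is set. Returning 503 before that point is an assumption, since the requirement above only specifies the 200 response. load_model is a hypothetical placeholder for your own model loading.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once the model has finished loading; /health reports 503 until then.
ready = threading.Event()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            is_ready = ready.is_set()
            status = 200 if is_ready else 503
            body = b"ok" if is_ready else b"loading"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def load_model():
    """Hypothetical placeholder: load your model here, then mark ready."""
    ready.set()


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    # Bind to 0.0.0.0 so the port is reachable from outside the container.
    return ThreadingHTTPServer(("0.0.0.0", port), Handler)


def main():
    server = make_server()
    # Load in the background so /health can answer while loading is in progress.
    threading.Thread(target=load_model, daemon=True).start()
    server.serve_forever()
```

To use this as the container entrypoint, call main() at the bottom of the file. Setting the flag only after loading completes keeps traffic away from a container that is still warming up.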

Local testing

Test your container locally with Docker Compose before deploying. The GPU reservation requires an NVIDIA GPU and the NVIDIA Container Toolkit on the host:

docker-compose.yml

# docker-compose.yml for local testing
services:
  model:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - MODEL=meta-llama/Llama-3.1-8B-Instruct
      - MAX_MODEL_LEN=4096
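
Once the Compose service is up, it can help to confirm that the health endpoint responds before deploying. The script below is a small sketch of such a check, assuming the container serves GET /health on localhost:8000 as in the example above. wait_for_health and its default timeout and interval are illustrative choices, not part of any VectorLay tooling.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll url until it returns HTTP 200 or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or still reporting not-ready
        time.sleep(interval)
    return False
```

For example, wait_for_health("http://localhost:8000/health") returns True as soon as the server reports ready, and False if it never does within the timeout. Large models can take minutes to load, which is why the default timeout is generous.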

Private registries

VectorLay can pull images from private container registries. Add your registry credentials in your dashboard settings, then reference private images in your cluster config. Supported registries:

Docker Hub (private repos)
GitHub Container Registry (ghcr.io)
AWS ECR
Google Artifact Registry