Overview
VectorLay deploys your workloads as containers with dedicated GPU access via VFIO passthrough. Your container gets full, bare-metal GPU performance with no virtualization overhead.
Option 1: Use a pre-built image
The fastest way to get started is to use an existing inference server image. VectorLay works with any Docker image that exposes an HTTP port:
- vllm/vllm-openai:latest — vLLM with OpenAI-compatible API
- ghcr.io/huggingface/text-generation-inference — HuggingFace TGI
- nvcr.io/nvidia/tritonserver — NVIDIA Triton
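For example, the vLLM image can be tried locally with a single docker run (the model name below is illustrative; substitute your own):

```bash
# Requires Docker with the NVIDIA Container Toolkit and a local GPU.
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```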
Dockerfile
FROM vllm/vllm-openai:latest
# That's it - vLLM handles everything
# Configure via environment variables at deploy time

Option 2: Build a custom image
For custom inference logic, build your own image based on NVIDIA CUDA:
Dockerfile
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
# Install your inference framework
RUN pip3 install vllm torch
# Copy model serving code
COPY serve.py /app/serve.py
# Expose the inference port
EXPOSE 8000
# Start the server
CMD ["python3", "/app/serve.py"]

Container requirements
- Expose an HTTP port — Your container must listen on a port (default: 8000). Configure this via the container_port field.
- Health endpoint — Implement a GET /health endpoint that returns 200 when the server is ready to serve traffic.
- Use NVIDIA base images — Start from nvidia/cuda to ensure CUDA compatibility.
- Keep images small — Use multi-stage builds and slim base images when possible.
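A minimal serve.py satisfying the first two requirements (an HTTP port and a GET /health route) could look like the sketch below. The /generate route and its echo response are placeholders; a real server would load a model and run inference there.

```python
# serve.py — minimal sketch of a container-ready inference server.
# Only the stdlib is used; swap in your framework of choice.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health endpoint: return 200 once the model is ready.
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def do_POST(self):
        # Placeholder inference route (hypothetical path and schema).
        if self.path == "/generate":
            length = int(self.headers.get("Content-Length", 0))
            prompt = json.loads(self.rfile.read(length)).get("prompt", "")
            body = json.dumps({"completion": f"echo: {prompt}"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    # Listen on the default container_port (8000).
    HTTPServer(("0.0.0.0", 8000), InferenceHandler).serve_forever()
```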
Local testing
Test your container locally with Docker Compose before deploying:
docker-compose.yml
# docker-compose.yml for local testing
services:
  model:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - MODEL=meta-llama/Llama-3.1-8B-Instruct
      - MAX_MODEL_LEN=4096

Private registries
VectorLay supports pulling from private container registries. Add your registry credentials in your dashboard settings, then reference private images in your cluster config:
- Docker Hub (private repos)
- GitHub Container Registry (ghcr.io)
- AWS ECR
- Google Artifact Registry
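As an illustration, referencing a private image in a cluster config might look like the fragment below. Only container_port is documented above; the other field names are hypothetical, so check your dashboard's config reference for the exact schema.

```yaml
# Hypothetical cluster config fragment — field names may differ.
image: ghcr.io/your-org/custom-llm:v1  # private image, pulled with saved registry credentials
container_port: 8000                   # port your server listens on (see Container requirements)
env:
  MODEL: meta-llama/Llama-3.1-8B-Instruct
```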