VectorLay deploys your workloads as containers with dedicated GPU access via VFIO passthrough. Your container gets full, bare-metal GPU performance with no virtualization overhead.
The fastest way to get started is to use an existing inference server image. VectorLay works with any Docker image that exposes an HTTP port:
- vllm/vllm-openai:latest - vLLM with OpenAI-compatible API
- ghcr.io/huggingface/text-generation-inference - HuggingFace TGI
- nvcr.io/nvidia/tritonserver - NVIDIA Triton

FROM vllm/vllm-openai:latest
# That's it - vLLM handles everything
# Configure via environment variables at deploy time

For custom inference logic, build your own image based on NVIDIA CUDA:

FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
# Install your inference framework
RUN pip3 install vllm torch
# Copy model serving code
COPY serve.py /app/serve.py
# Expose the inference port
EXPOSE 8000
# Start the server
CMD ["python3", "/app/serve.py"]

- Declare the port your server listens on in the container_port field.
- Provide a GET /health endpoint that returns 200 when ready to serve traffic.
- Base your image on nvidia/cuda to ensure CUDA compatibility.

Test your container locally with Docker Compose before deploying:
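A local test needs a serve.py that meets the requirements listed above. The following is only a minimal sketch built on Python's standard library, not VectorLay's reference implementation: it reads MODEL and MAX_MODEL_LEN from the environment, listens on port 8000, and answers GET /health with 200 once loading has finished (503 before that). The PORT variable, the load_model placeholder, the make_server helper, and the JSON payload shape are all assumptions made for illustration.

```python
import json
import os
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# MODEL and MAX_MODEL_LEN mirror the Compose example; PORT is a hypothetical
# variable introduced here so the port is not hard-coded.
MODEL = os.environ.get("MODEL", "meta-llama/Llama-3.1-8B-Instruct")
MAX_MODEL_LEN = int(os.environ.get("MAX_MODEL_LEN", "4096"))
PORT = int(os.environ.get("PORT", "8000"))

# Set once the model is loaded; /health reports 503 until then.
ready = threading.Event()


def load_model() -> None:
    """Placeholder: replace with your framework's model-loading code."""
    ready.set()


class Handler(BaseHTTPRequestHandler):
    def _send_json(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        if self.path != "/health":
            self._send_json(404, {"error": "not found"})
        elif ready.is_set():
            self._send_json(
                200,
                {"status": "ok", "model": MODEL, "max_model_len": MAX_MODEL_LEN},
            )
        else:
            self._send_json(503, {"status": "loading"})

    def log_message(self, format, *args) -> None:
        # Suppress per-request logging to keep container logs quiet.
        pass


def make_server(port: int = PORT) -> ThreadingHTTPServer:
    # Bind to all interfaces so the port is reachable from outside the container.
    return ThreadingHTTPServer(("0.0.0.0", port), Handler)


if __name__ == "__main__":
    server = make_server()
    # Load in the background so /health can answer while the model loads.
    threading.Thread(target=load_model, daemon=True).start()
    server.serve_forever()
```

Starting the listener before the model loads means a health probe gets a quick 503 during startup instead of a refused connection. Whether the platform treats those two cases differently is not stated here, so treat that as a design choice to verify. With a serve.py along these lines in place, the Compose file for local testing follows.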
# docker-compose.yml for local testing
services:
  model:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - MODEL=meta-llama/Llama-3.1-8B-Instruct
      - MAX_MODEL_LEN=4096

VectorLay supports pulling from private container registries. Add your registry credentials in your dashboard settings, then reference private images in your cluster config: