Scaling & Autoscaling

Scale your GPU inference clusters manually or automatically based on traffic demand.

Manual Scaling

Adjust the number of replicas in your cluster at any time. New replicas are provisioned on available GPUs and added to the load balancer automatically.

terminal
curl -X PATCH https://api.vectorlay.com/v1/clusters/cl_abc123 \
  -H "Authorization: Bearer vl_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "replicas": 5
  }'

Autoscaling

Enable autoscaling to automatically adjust replica count based on incoming request volume. VectorLay monitors requests per replica and scales up or down to meet your target.

terminal
curl -X POST https://api.vectorlay.com/v1/clusters \
  -H "Authorization: Bearer vl_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production-llm",
    "gpu_type": "rtx-4090",
    "container": "vllm/vllm-openai:latest",
    "replicas": 2,
    "autoscaling": {
      "min_replicas": 2,
      "max_replicas": 10,
      "target_requests_per_replica": 50
    }
  }'

How autoscaling works

  1. Monitor: VectorLay tracks active requests per replica over a 60-second sliding window.
  2. Evaluate: If requests per replica exceed the target, a scale-up is triggered. If requests per replica stay below 50% of the target for 5 minutes, a scale-down is triggered.
  3. Provision: New replicas are provisioned on available GPUs across the network.
  4. Route: Traffic is automatically distributed to healthy replicas via the load balancer.
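The evaluate step above can be sketched as a simple decision function. This is an illustrative model only, not VectorLay's actual scaling algorithm; the function name `desired_replicas` and the proportional scale-up heuristic are assumptions, while the target, the 50% threshold, and the 5-minute window come from the steps above.

```python
import math

def desired_replicas(avg_requests_per_replica, current, min_replicas,
                     max_replicas, target, seconds_below_half_target):
    """Illustrative autoscaling decision (not the real implementation):
    scale up when load exceeds the target, scale down one replica after
    load has stayed under 50% of the target for 5 minutes."""
    if avg_requests_per_replica > target:
        # Scale up roughly in proportion to the observed load,
        # capped at max_replicas.
        needed = math.ceil(avg_requests_per_replica * current / target)
        return min(needed, max_replicas)
    if avg_requests_per_replica < 0.5 * target and seconds_below_half_target >= 300:
        return max(current - 1, min_replicas)
    return current
```

For example, with a target of 50 and 4 replicas each seeing 80 requests, this sketch would ask for 7 replicas; at 20 requests per replica it holds steady until the 5-minute window elapses, then removes one replica.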

Fault tolerance during scaling

VectorLay's distributed architecture means scaling is inherently fault-tolerant:

  • If a replica fails during scale-up, a replacement is provisioned automatically
  • Scale-down gracefully drains in-flight requests before terminating replicas
  • Replicas are spread across multiple nodes to minimize correlated failures
  • The load balancer health-checks all replicas and routes around unhealthy ones
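The graceful-drain behavior can be illustrated with a small sketch: a draining replica stops accepting new requests and waits for in-flight ones to finish before terminating. The `DrainingReplica` class and its methods are hypothetical names for illustration, not part of the VectorLay API.

```python
import threading

class DrainingReplica:
    """Illustrative graceful drain (assumed behavior, not VectorLay code):
    refuse new work once draining, wait until in-flight work completes."""
    def __init__(self):
        self.draining = False
        self.in_flight = 0
        self.lock = threading.Lock()
        self.idle = threading.Condition(self.lock)

    def try_accept(self):
        with self.lock:
            if self.draining:
                return False  # load balancer routes this request elsewhere
            self.in_flight += 1
            return True

    def finish(self):
        with self.lock:
            self.in_flight -= 1
            if self.in_flight == 0:
                self.idle.notify_all()

    def drain(self, timeout=None):
        """Returns True once all in-flight requests have completed."""
        with self.lock:
            self.draining = True
            while self.in_flight > 0:
                if not self.idle.wait(timeout):
                    break  # timed out with requests still in flight
            return self.in_flight == 0
```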

Scaling limits

  • Min replicas: 1 (at least one replica must always be running)
  • Max replicas: 100 (contact us for higher limits)
  • Scale-up cooldown: 30 seconds (minimum time between scale-up events)
  • Scale-down cooldown: 5 minutes (prevents flapping on traffic spikes)
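The two cooldowns can be sketched as a gate that permits a scaling action only after enough time has passed since the previous action in the same direction. The `CooldownGate` class is a hypothetical illustration; only the 30-second and 5-minute values come from the limits above.

```python
import time

SCALE_UP_COOLDOWN = 30     # seconds, per the scale-up cooldown above
SCALE_DOWN_COOLDOWN = 300  # 5 minutes, per the scale-down cooldown above

class CooldownGate:
    """Illustrative cooldown tracker (assumed design, not VectorLay code):
    a scaling action is allowed only if the matching cooldown has elapsed."""
    def __init__(self):
        self.last = {"up": float("-inf"), "down": float("-inf")}

    def allow(self, direction, now=None):
        now = time.monotonic() if now is None else now
        cooldown = SCALE_UP_COOLDOWN if direction == "up" else SCALE_DOWN_COOLDOWN
        if now - self.last[direction] >= cooldown:
            self.last[direction] = now
            return True
        return False
```

Keeping scale-down on a much longer cooldown than scale-up is what prevents flapping: capacity is added quickly on a spike but removed only after demand has demonstrably subsided.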