Scaling & Autoscaling
Scale your GPU inference clusters manually or automatically based on traffic demand.
Manual Scaling
Adjust the number of replicas in your cluster at any time. New replicas are provisioned on available GPUs and added to the load balancer automatically.
terminal
curl -X PATCH https://api.vectorlay.com/v1/clusters/cl_abc123 \
  -H "Authorization: Bearer vl_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "replicas": 5
  }'
Autoscaling
Enable autoscaling to automatically adjust replica count based on incoming request volume. VectorLay monitors requests per replica and scales up or down to meet your target.
terminal
curl -X POST https://api.vectorlay.com/v1/clusters \
  -H "Authorization: Bearer vl_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production-llm",
    "gpu_type": "rtx-4090",
    "container": "vllm/vllm-openai:latest",
    "replicas": 2,
    "autoscaling": {
      "min_replicas": 2,
      "max_replicas": 10,
      "target_requests_per_replica": 50
    }
  }'
How autoscaling works
- Monitor: VectorLay tracks active requests per replica over a 60-second sliding window.
- Evaluate: If requests per replica exceed the target, a scale-up is triggered. If they stay below 50% of the target for 5 minutes (fewer than 25 per replica for a target of 50), a scale-down is triggered.
- Provision: New replicas are provisioned on available GPUs across the network.
- Route: Traffic is automatically distributed to healthy replicas via the load balancer.
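The Evaluate step above can be sketched as a small shell function. This is an illustration only: autoscale_decision is a hypothetical helper, not part of the VectorLay API, and it assumes the per-replica load and the time spent below the 50% threshold have already been measured. It reports only the direction of the decision, since the number of replicas added or removed per event is not specified here.

```shell
# Hypothetical helper mirroring the Evaluate step; not part of the VectorLay API.
# $1: average active requests per replica over the 60-second window
# $2: target_requests_per_replica
# $3: seconds the load has stayed below 50% of the target
autoscale_decision() {
  if [ "$1" -gt "$2" ]; then
    # Load per replica exceeds the target.
    echo "scale_up"
  elif [ "$(( $1 * 2 ))" -lt "$2" ] && [ "$3" -ge 300 ]; then
    # Load has been below half the target for at least 5 minutes.
    echo "scale_down"
  else
    echo "hold"
  fi
}

autoscale_decision 65 50 0     # prints scale_up
autoscale_decision 20 50 300   # prints scale_down
autoscale_decision 30 50 600   # prints hold (30 is not below 50% of 50)
```

Comparing twice the load against the target keeps the 50% check in integer arithmetic, which is all that POSIX shell provides.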
Fault tolerance during scaling
VectorLay's distributed architecture means scaling is inherently fault-tolerant:
- If a replica fails during scale-up, a replacement is provisioned automatically.
- Scale-down gracefully drains in-flight requests before terminating replicas.
- Replicas are spread across multiple nodes to minimize correlated failures.
- The load balancer health-checks all replicas and routes around unhealthy ones.
Scaling limits
- Min replicas: 1. At least one replica must always be running.
- Max replicas: 100. Contact us for higher limits.
- Scale-up cooldown: 30 seconds. Minimum time between scale-up events.
- Scale-down cooldown: 5 minutes. Prevents flapping on traffic spikes.
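To make the limits above concrete, here is a minimal sketch of how a client-side script might apply them before sending a scaling request. Both functions are hypothetical helpers, not part of the VectorLay API. The sketch also assumes the scale-down cooldown, like the scale-up cooldown, is measured from the last event in the same direction.

```shell
# Hypothetical helpers; not part of the VectorLay API or any CLI.

# Clamp a requested replica count to the documented limits (1 to 100).
# $1: requested replicas
clamp_replicas() (
  min=1
  max=100
  if [ "$1" -lt "$min" ]; then
    echo "$min"
  elif [ "$1" -gt "$max" ]; then
    echo "$max"
  else
    echo "$1"
  fi
)

# Succeed (exit status 0) if the cooldown for the given direction has passed.
# $1: direction, "up" or "down"
# $2: seconds since the last scaling event in that direction (assumed semantics)
cooldown_elapsed() (
  case "$1" in
    up)   required=30 ;;    # scale-up cooldown: 30 seconds
    down) required=300 ;;   # scale-down cooldown: 5 minutes
    *)    echo "unknown direction: $1" >&2; exit 2 ;;
  esac
  [ "$2" -ge "$required" ]
)

clamp_replicas 250                                      # prints 100
cooldown_elapsed up 45 && echo "scale-up allowed"       # prints scale-up allowed
cooldown_elapsed down 45 || echo "scale-down deferred"  # prints scale-down deferred
```

The function bodies use parentheses rather than braces so that min, max, and required stay local to a subshell and do not leak into the calling script.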