Scale your GPU inference clusters manually or automatically based on traffic demand.
Adjust the number of replicas in your cluster at any time. New replicas are provisioned on available GPUs and added to the load balancer automatically.
curl -X PATCH https://api.vectorlay.com/v1/clusters/cl_abc123 \
  -H "Authorization: Bearer vl_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "replicas": 5
  }'
Enable autoscaling to automatically adjust replica count based on incoming request volume. VectorLay monitors requests per replica and scales up or down to meet your target.
curl -X POST https://api.vectorlay.com/v1/clusters \
  -H "Authorization: Bearer vl_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production-llm",
    "gpu_type": "rtx-4090",
    "container": "vllm/vllm-openai:latest",
    "replicas": 2,
    "autoscaling": {
      "min_replicas": 2,
      "max_replicas": 10,
      "target_requests_per_replica": 50
    }
  }'
VectorLay's distributed architecture means scaling is inherently fault-tolerant:
At least one replica must always be running
Contact us for higher limits
Minimum time between scale-up events
Prevents flapping on traffic spikes
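To make the autoscaling settings shown earlier more concrete, here is a rough sketch of how a target-tracking rule could turn a request load into a replica count. The formula (ceiling of total requests divided by target_requests_per_replica, clamped to min_replicas and max_replicas) is an assumption for illustration and is not confirmed as VectorLay's actual algorithm. The helper name desired_replicas is hypothetical, and the source does not say in what unit requests per replica are measured.

```shell
#!/bin/sh
# Hypothetical helper: estimate the replica count a target-tracking
# autoscaler might aim for. This is an assumed rule, not VectorLay's
# documented behavior.
#
# Arguments: total requests, target requests per replica,
#            min replicas, max replicas.
desired_replicas() {
  total=$1
  target=$2
  min=$3
  max=$4

  # Ceiling division: enough replicas that each one serves
  # at most $target requests.
  want=$(( (total + target - 1) / target ))

  # Clamp to the configured bounds.
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi

  echo "$want"
}

# Using the values from the autoscaling example
# (target 50, min 2, max 10):
desired_replicas 420 50 2 10   # prints 9
desired_replicas 30 50 2 10    # prints 2 (held at min_replicas)
desired_replicas 900 50 2 10   # prints 10 (capped at max_replicas)
```

Under this assumed rule, load beyond max_replicas × target_requests_per_replica (500 in the example) would not trigger further scale-up, so each replica would end up serving more than the target.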