Scaling & Autoscaling

Scale your GPU inference clusters manually or automatically based on traffic demand.

Manual Scaling

Adjust the number of replicas in your cluster at any time. New replicas are provisioned on available GPUs and added to the load balancer automatically.

terminal
curl -X PATCH https://api.vectorlay.com/v1/clusters/cl_abc123 \
  -H "Authorization: Bearer vl_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "replicas": 5
  }'

Autoscaling

Enable autoscaling to automatically adjust replica count based on incoming request volume. VectorLay monitors requests per replica and scales up or down to meet your target.

terminal
curl -X POST https://api.vectorlay.com/v1/clusters \
  -H "Authorization: Bearer vl_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production-llm",
    "gpu_type": "rtx-4090",
    "container": "vllm/vllm-openai:latest",
    "replicas": 2,
    "autoscaling": {
      "min_replicas": 2,
      "max_replicas": 10,
      "target_requests_per_replica": 50
    }
  }'

How autoscaling works

  1. Monitor: VectorLay tracks active requests per replica over a 60-second sliding window.
  2. Evaluate: If requests per replica exceed the target, a scale-up is triggered. If requests per replica stay below 50% of the target for 5 minutes, a scale-down is triggered.
  3. Provision: New replicas are provisioned on available GPUs across the network.
  4. Route: Traffic is automatically distributed to healthy replicas via the load balancer.
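The evaluate step above can be sketched as a simple decision function. This is an illustrative model only, not VectorLay's actual scaling algorithm; the function name `desired_replicas` and the proportional scale-up heuristic are assumptions, while the target, the 50% threshold, and the 5-minute window come from the steps above.

```python
import math

def desired_replicas(avg_requests_per_replica, current, min_replicas,
                     max_replicas, target, seconds_below_half_target):
    """Illustrative autoscaling decision (not the real implementation):
    scale up when load exceeds the target, scale down one replica after
    load has stayed under 50% of the target for 5 minutes."""
    if avg_requests_per_replica > target:
        # Scale up roughly in proportion to the observed load,
        # capped at max_replicas.
        needed = math.ceil(avg_requests_per_replica * current / target)
        return min(needed, max_replicas)
    if avg_requests_per_replica < 0.5 * target and seconds_below_half_target >= 300:
        return max(current - 1, min_replicas)
    return current
```

For example, with a target of 50 and 4 replicas each seeing 80 requests, this sketch would ask for 7 replicas; at 20 requests per replica it holds steady until the 5-minute window elapses, then removes one replica.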

Fault tolerance during scaling

VectorLay's distributed architecture means scaling is inherently fault-tolerant:

  • If a replica fails during scale-up, a replacement is provisioned automatically
  • Scale-down gracefully drains in-flight requests before terminating replicas
  • Replicas are spread across multiple nodes to minimize correlated failures
  • The load balancer health-checks all replicas and routes around unhealthy ones
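The graceful-drain behavior can be illustrated with a small sketch: a draining replica stops accepting new requests and waits for in-flight ones to finish before terminating. The `DrainingReplica` class and its methods are hypothetical names for illustration, not part of the VectorLay API.

```python
import threading

class DrainingReplica:
    """Illustrative graceful drain (assumed behavior, not VectorLay code):
    refuse new work once draining, wait until in-flight work completes."""
    def __init__(self):
        self.draining = False
        self.in_flight = 0
        self.lock = threading.Lock()
        self.idle = threading.Condition(self.lock)

    def try_accept(self):
        with self.lock:
            if self.draining:
                return False  # load balancer routes this request elsewhere
            self.in_flight += 1
            return True

    def finish(self):
        with self.lock:
            self.in_flight -= 1
            if self.in_flight == 0:
                self.idle.notify_all()

    def drain(self, timeout=None):
        """Returns True once all in-flight requests have completed."""
        with self.lock:
            self.draining = True
            while self.in_flight > 0:
                if not self.idle.wait(timeout):
                    break  # timed out with requests still in flight
            return self.in_flight == 0
```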

Scaling limits

  • Min replicas: 1 (at least one replica must always be running)
  • Max replicas: 100 (contact us for higher limits)
  • Scale-up cooldown: 30 seconds (minimum time between scale-up events)
  • Scale-down cooldown: 5 minutes (prevents flapping on traffic spikes)
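The two cooldowns can be sketched as a gate that permits a scaling action only after enough time has passed since the previous action in the same direction. The `CooldownGate` class is a hypothetical illustration; only the 30-second and 5-minute values come from the limits above.

```python
import time

SCALE_UP_COOLDOWN = 30     # seconds, per the scale-up cooldown above
SCALE_DOWN_COOLDOWN = 300  # 5 minutes, per the scale-down cooldown above

class CooldownGate:
    """Illustrative cooldown tracker (assumed design, not VectorLay code):
    a scaling action is allowed only if the matching cooldown has elapsed."""
    def __init__(self):
        self.last = {"up": float("-inf"), "down": float("-inf")}

    def allow(self, direction, now=None):
        now = time.monotonic() if now is None else now
        cooldown = SCALE_UP_COOLDOWN if direction == "up" else SCALE_DOWN_COOLDOWN
        if now - self.last[direction] >= cooldown:
            self.last[direction] = now
            return True
        return False
```

Keeping scale-down on a much longer cooldown than scale-up is what prevents flapping: capacity is added quickly on a spike but removed only after demand has demonstrably subsided.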