# Load Balancing
Load balancing ensures that AI API requests are efficiently distributed across multiple models within the same AI vendor, preventing any single model from being overloaded. WSO2 API Manager supports the following load balancing methods:
- Round Robin
- Weighted Round Robin
Note: You can only configure a single load balancing strategy at a given time for a given AI API.
## Round Robin
With the Round Robin policy, requests are distributed evenly across all configured AI models in a cyclic manner, ensuring equal request allocation over time.
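The behavior is easy to picture in code. The following is a minimal Python sketch of cyclic selection, assuming a hypothetical list of configured models; it illustrates the idea only and is not the gateway's actual implementation.

```python
from itertools import cycle

# Hypothetical models configured for the AI API (placeholders).
models = ["gpt-4o", "gpt-4o-mini", "o3-mini"]

# Round robin: cycle through the list so each model receives
# an equal share of requests over time.
rotation = cycle(models)

for request_id in range(6):
    print(f"request {request_id} -> {next(rotation)}")
```

Over six requests, each of the three models is selected exactly twice, which is the equal allocation the policy provides over time.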
### Configure Round Robin Routing
You can enforce round-robin load balancing for your AI API by attaching the Model Round Robin policy. Follow the steps below:
- Log in to the Publisher Portal (`https://<hostname>:9443/publisher`).
- Select the AI API for which you want to configure load balancing.
- Navigate to API Configurations and click Policies.
- Locate the Model Round Robin policy under the Common Policies section of the policy list, then drag and drop it onto the Request flow of the `/chat/completions` POST operation.
- Fill in the requested details and click Save.
| Section | Description |
|---------|-------------|
| Production/Sandbox | Add any number of models by clicking the Add Model button. For each model, select a model from the dropdown (if no models are listed, add the desired models from the Admin Portal), then select an endpoint from the dropdown (if no endpoints are listed, add the required endpoints under the Endpoints page of the Publisher Portal). |
| Suspend Duration | Suspend duration in seconds, used to suspend any failed model-endpoint pair. If not configured, knowledge about failed invocations is not persisted. |
- Finally, scroll to the bottom of the page and click Save and deploy.
For more information on how to work with API Policies, refer to the API Policies section.
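To make the Suspend Duration setting concrete, here is a hedged Python sketch of how a gateway might skip a failed model-endpoint pair for the configured number of seconds. The function names, pool entries, and bookkeeping are illustrative assumptions, not WSO2 internals.

```python
import time
from itertools import cycle

SUSPEND_DURATION = 30  # seconds; mirrors the policy's Suspend Duration field

# Hypothetical model-endpoint pairs (placeholders).
pool = ["gpt-4o@prod-endpoint", "gpt-4o-mini@prod-endpoint"]
rotation = cycle(pool)
suspended_until = {}  # pair -> epoch time when it becomes eligible again

def pick():
    """Return the next eligible pair in round-robin order."""
    for _ in range(len(pool)):
        pair = next(rotation)
        if suspended_until.get(pair, 0) <= time.time():
            return pair
    raise RuntimeError("all model-endpoint pairs are currently suspended")

def report_failure(pair):
    """Suspend a failed pair for SUSPEND_DURATION seconds."""
    suspended_until[pair] = time.time() + SUSPEND_DURATION
```

If Suspend Duration is left unset, no such bookkeeping takes place and a failing pair keeps receiving its turn in the rotation.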
### Try Out Round Robin Routing
- Navigate to the Developer Portal and try out the AI API.
- Invoke the `/chat/completions` POST resource after obtaining a Production key, since the round robin policy was configured only for production. You can use a payload such as the example shown below. Notice how the request was made to the `gpt-4o` model, but the response came from `gpt-4o-mini`.
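Since the exact payload is environment-specific, here is a hedged Python example of the invocation; the gateway URL, API context, version, and access token are placeholders to replace with your own values.

```python
import requests

GATEWAY = "https://localhost:8243"          # placeholder gateway URL
ACCESS_TOKEN = "<production-access-token>"  # placeholder Production key token

payload = {
    "model": "gpt-4o",  # requested model; the policy may route to another
    "messages": [{"role": "user", "content": "What is load balancing?"}],
}

response = requests.post(
    f"{GATEWAY}/<api-context>/<version>/chat/completions",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json=payload,
    verify=False,  # only for local trials with self-signed certificates
)

# The "model" field of the response body shows which model actually served it.
print(response.json().get("model"))
```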
## Weighted Round Robin
Requests are distributed based on predefined weight values assigned to each model. This gives probabilistic control over request distribution, allowing models with greater processing power or availability to receive a larger share of traffic.
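One simple way to realize this is weighted random selection, shown in the Python sketch below; the sampling approach, model names, and weights are illustrative assumptions rather than the gateway's exact algorithm.

```python
import random
from collections import Counter

# Hypothetical models and weights: with weights 3 and 1, gpt-4o should
# receive roughly 75% of requests and o3-mini roughly 25%.
models = ["gpt-4o", "o3-mini"]
weights = [3, 1]

def pick_model():
    """Weighted selection: probability proportional to the assigned weight."""
    return random.choices(models, weights=weights, k=1)[0]

# Over many requests the observed share converges to the weight ratio.
print(Counter(pick_model() for _ in range(10_000)))
```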
### Configure Weighted Round Robin Routing
You can enforce weighted-round-robin load balancing for your AI API by attaching the Model Weighted Round Robin policy. Follow the steps below:
- Log in to the Publisher Portal (`https://<hostname>:9443/publisher`).
- Select the AI API for which you want to configure load balancing.
- Navigate to API Configurations and click Policies.
- Locate the Model Weighted Round Robin policy under the Common Policies section of the policy list, then drag and drop it onto the Request flow of the `/chat/completions` POST operation.
- Fill in the requested details and click Save.
| Section | Description |
|---------|-------------|
| Production/Sandbox | Add any number of models by clicking the Add Model button. For each model, select a model from the dropdown (if no models are listed, add the desired models from the Admin Portal), then select an endpoint from the dropdown (if no endpoints are listed, add the required endpoints under the Endpoints page of the Publisher Portal). Finally, assign a weight to the selected model. |
| Suspend Duration | Suspend duration in seconds, used to suspend any failed model-endpoint pair. If not configured, knowledge about failed invocations is not persisted. |
- Finally, scroll to the bottom of the page and click Save and deploy.
For more information on how to work with API Policies, refer to the API Policies section.
### Try Out Weighted Round Robin Routing
- Navigate to the Developer Portal and try out the AI API.
- Invoke the `/chat/completions` POST resource after obtaining a Production key, since the weighted round robin policy was configured only for production. You can reuse the payload from the round robin example above. Notice how the request was made to the `gpt-4o` model, but the response came from `o3-mini`.
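To observe the weighted distribution in practice, you can repeat the invocation and tally which model served each response. This sketch reuses the placeholder gateway URL and token from the earlier example.

```python
from collections import Counter
import requests

GATEWAY = "https://localhost:8243"          # placeholder gateway URL
ACCESS_TOKEN = "<production-access-token>"  # placeholder Production key token

served_by = Counter()
for _ in range(20):
    response = requests.post(
        f"{GATEWAY}/<api-context>/<version>/chat/completions",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        json={"model": "gpt-4o",
              "messages": [{"role": "user", "content": "ping"}]},
        verify=False,  # only for local trials with self-signed certificates
    )
    served_by[response.json().get("model")] += 1

# The counts should roughly follow the configured weights.
print(served_by)
```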