# Rate Limiting
AI APIs often consume resources based on token usage, and excessive requests can lead to increased costs, throttling, or service disruptions. To manage this effectively, WSO2 API Manager provides rate limiting capabilities, allowing you to control the number of requests and tokens sent to an AI service within a given timeframe.
By enforcing rate limits, you can:
- Prevent unexpected cost spikes from excessive AI API usage.
- Optimize performance by ensuring fair resource distribution.
- Protect AI backends from overuse and service degradation.
WSO2 API Manager enables you to define rate limits at two levels, subscription and backend, giving you flexibility in managing API usage.
## Subscription-Level Rate Limiting
Subscription-level rate limiting applies different quotas based on business plans, allowing API providers to enforce request-based or token-based limits on API subscribers.
### Quotas Available for Subscription Policies
- Request Count - Limits the total number of requests an application can make. Once the limit is reached, further requests are throttled until the quota resets.
- Total Token Count - Defines the maximum number of tokens consumed by an application across all interactions with an AI API. When the limit is exceeded, throttling occurs.
- Prompt Token Count - Controls the number of tokens used specifically for AI prompt processing. This prevents excessive prompts from consuming API resources.
- Completion Token Count - Restricts the number of response tokens generated by an AI API. When exceeded, further completions are throttled.
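To make the four quota types concrete, the following Python sketch models how a policy window could track them for a single application. This illustrates the enforcement semantics only, not WSO2 API Manager's internal implementation; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AIQuota:
    """Hypothetical per-application counters for one policy window."""
    max_requests: int            # Request Count quota
    max_total_tokens: int        # Total Token Count quota
    max_prompt_tokens: int       # Prompt Token Count quota
    max_completion_tokens: int   # Completion Token Count quota
    requests: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def allow(self) -> bool:
        """True only if every quota still has headroom for the next request."""
        total = self.prompt_tokens + self.completion_tokens
        return (self.requests < self.max_requests
                and total < self.max_total_tokens
                and self.prompt_tokens < self.max_prompt_tokens
                and self.completion_tokens < self.max_completion_tokens)

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Charge one completed request against the counters."""
        self.requests += 1
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

# Example: a plan allowing 100 requests and 50k total tokens per window.
quota = AIQuota(max_requests=100, max_total_tokens=50_000,
                max_prompt_tokens=30_000, max_completion_tokens=20_000)
if quota.allow():
    quota.record(prompt_tokens=120, completion_tokens=350)
```

Note that token counts are only known after the backend responds, so the request that crosses a limit still completes; throttling applies from the next request onward, matching the "when the limit is exceeded, throttling occurs" behavior described above.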
Steps to add subscription throttling policies:
- Log in to the Admin Portal (https://<hostname>:9443/admin).
- Navigate to Rate Limiting Policies and click Subscription Policies.
- Click Add Policy, then click Add AI Policy to define a new AI policy.
- Configure quotas, including request limits and token-based limits.
- Click Save to apply the new policy.
For more details, see Configuring AI API Subscription Policies.
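If you prefer scripting over the portal UI, subscription policies can also be created through the Admin REST API. The sketch below assumes the APIM 4.x endpoint `/api/am/admin/v4/throttling/policies/subscription`; the `AIAPIQUOTALIMIT` type name and the `aiApiQuota` payload shape are assumptions inferred from the quota names above, so verify both against the Admin API reference for your APIM version.

```python
import requests

ADMIN_API = "https://localhost:9443/api/am/admin/v4"  # hostname assumed
TOKEN = "<access-token-with-admin-scope>"             # placeholder, not a real token

# Assumed payload shape for an AI subscription policy; field names are
# illustrative and must be checked against your version's Admin API schema.
policy = {
    "policyName": "AIGoldPlan",
    "displayName": "AI Gold Plan",
    "description": "100 requests and 50k total tokens per minute",
    "defaultLimit": {
        "type": "AIAPIQUOTALIMIT",  # assumed limit type for AI quotas
        "aiApiQuota": {
            "timeUnit": "min",
            "unitTime": 1,
            "requestCount": 100,
            "totalTokenCount": 50_000,
            "promptTokenCount": 30_000,
            "completionTokenCount": 20_000,
        },
    },
}

resp = requests.post(
    f"{ADMIN_API}/throttling/policies/subscription",
    json=policy,
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,  # default APIM certs are self-signed; use a CA bundle in production
)
resp.raise_for_status()
print(resp.json())
```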
## Backend Rate Limiting
Backend rate limiting ensures that AI APIs do not overload backend AI services by controlling token usage and request counts.
### Quotas Available for Backend Rate Limiting
- Request Count - The maximum number of requests the backend AI service can handle.
- Total Token Count - Limits the overall token consumption to prevent resource exhaustion.
- Prompt Token Count - Controls the number of input tokens sent to the AI model.
- Completion Token Count - Limits the number of tokens generated by the AI model as a response.
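Token quotas work because AI providers report usage in each response. As a rough illustration (not WSO2's actual code), the sketch below pulls the counts from an OpenAI-style `usage` object; this is the kind of extraction a gateway must perform before it can charge a request against backend token limits.

```python
def extract_usage(response_body: dict) -> tuple[int, int, int]:
    """Return (prompt, completion, total) token counts from a chat response.

    Field names follow the OpenAI API schema; other providers may differ.
    """
    usage = response_body.get("usage", {})
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    total = usage.get("total_tokens", prompt + completion)
    return prompt, completion, total

# Example response body trimmed to the relevant part.
body = {
    "choices": [{"message": {"role": "assistant", "content": "..."}}],
    "usage": {"prompt_tokens": 120, "completion_tokens": 350, "total_tokens": 470},
}
print(extract_usage(body))  # (120, 350, 470)
```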
Steps to configure backend throughput limits for AI APIs:
- Log in to the Publisher Portal (https://<hostname>:9443/publisher).
- Select the AI API for which you want to set the maximum backend throughput.
- Navigate to API Configurations and click Runtime.
- Under Backend Rate Limiting, choose Specify and set limits for:
    - Total token count
    - Prompt token count
    - Completion token count
    - Maximum request count
- Define separate limits for Production and Sandbox.
- Click Save to apply the changes.
For a step-by-step guide, refer to Backend Rate Limiting for AI APIs.
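Once any of these limits is exceeded, the gateway rejects further calls with an HTTP 429 response until the window resets. Below is a minimal client-side sketch of handling that, assuming a Retry-After header may or may not be present; the function and parameter names are illustrative.

```python
import time
import requests

def call_ai_api(url: str, payload: dict, token: str, max_retries: int = 3) -> dict:
    """POST to an AI API through the gateway, backing off on 429 throttles."""
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload,
                             headers={"Authorization": f"Bearer {token}"})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Honor Retry-After if the gateway sends it, else back off exponentially.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("Request still throttled after retries")
```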
Implementing rate limiting ensures cost control, backend protection, and fair API usage. By configuring subscription policies and backend throughput limits, you can optimize AI API performance while preventing overuse of AI resources.