To ensure fair use and stability of the service, Cinclus imposes rate limits on API usage. This includes limits on the number of requests per minute and the number of tokens processed. Additionally, tools are provided to monitor your API usage over time.
Rate Limits
- Requests Per Minute: By default, each API key is allowed a certain number of requests per minute. For example, an account might be limited to 60 requests per minute on the default plan. If you exceed this rate, further requests may be temporarily rejected until the rate window resets.
- Token Limits: There may also be limits on the number of tokens processed per minute, per day, or per month. For instance, your plan might include a quota of 1,000,000 tokens per month. Both prompt and completion tokens in each request count toward these limits.
- Concurrent Requests: The service might limit how many requests you can have in flight at the same time. If you send too many requests simultaneously, some might be queued or rejected with a rate limit error.
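One way to stay under a requests-per-minute limit is to throttle on the client before a request is ever sent. The sketch below is a minimal sliding-window limiter, not part of any Cinclus SDK: the class name SlidingWindowLimiter and its parameters are hypothetical, and the default of 60 requests per 60 seconds only mirrors the example figure above. Your actual limit may differ.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side limiter: at most `max_requests` per `window_seconds`.

    Hypothetical helper for illustration; not thread-safe as written.
    """

    def __init__(self, max_requests=60, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of requests inside the window

    def _evict(self, now):
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()

    def wait_time(self):
        """Seconds until another request may be sent (0.0 if allowed now)."""
        now = self._clock()
        self._evict(now)
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def acquire(self):
        """Block until a request slot is free, then record the request."""
        delay = self.wait_time()
        while delay > 0:
            self._sleep(delay)
            delay = self.wait_time()
        self._sent.append(self._clock())
```

Calling limiter.acquire() immediately before each API request keeps the client at or below the configured rate. If requests are sent from several threads, the limiter would need a lock around acquire().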
If you exceed a rate limit, the API will return an HTTP 429 Too Many Requests error. The response may include a Retry-After header indicating how many seconds to wait before retrying. If you encounter this, implement exponential backoff or throttling in your client to stay within allowed limits.
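Handling a 429 response with backoff might look like the following sketch. It assumes a hypothetical send callable that performs one request and returns a (status_code, headers, body) tuple, with headers as a plain dict keyed by "Retry-After". The function names, the 1-second base delay, and the 60-second cap are illustrative choices, not Cinclus defaults.

```python
import random
import time


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Return seconds to wait before retrying after failed attempt `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Prefer the server's hint when it is a number of seconds.
            return max(0.0, float(retry_after))
        except (TypeError, ValueError):
            pass  # Not numeric (e.g. an HTTP date); fall back to backoff.
    backoff = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
    return random.uniform(0, backoff)  # jitter spreads out retrying clients


def call_with_retries(send, max_attempts=5, sleep=time.sleep):
    """Call `send()` until it returns a non-429 status or attempts run out.

    `send` is a hypothetical zero-argument callable returning
    (status_code, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(retry_delay(attempt, headers.get("Retry-After")))
    raise RuntimeError("rate limited: retries exhausted")
```

The random jitter is a design choice: if many clients are rejected at the same moment, retrying after identical delays would make them collide again.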
Note: Rate limits can vary based on your subscription plan or whether you are on a shared vs. dedicated instance. Higher-tier plans often have higher limits, while free/trial accounts have more restrictive limits.
Monitoring Usage
Understanding your usage is key to managing costs and staying within limits. Cinclus provides multiple ways to monitor usage:
- Usage Dashboard: The Cinclus web dashboard includes a usage section where you can see your consumption over time. This typically displays metrics like total requests, tokens used, and current rate limit status for your account.
- API Usage Endpoint: (If available) There may be an endpoint to programmatically retrieve your usage data. For example, GET /v1/usage could return your current usage for the billing period or your remaining quota. (Check the Cinclus API reference for the exact details if this endpoint exists.)
- Response Metadata: Each API response includes a usage object (as shown in the examples for completions and embeddings) which details the tokens used for that request. You can log these and aggregate them to track how many tokens you're consuming over time.
- Alerts: You can set up alerts (via the dashboard or third-party monitoring tools) to notify you when you approach a certain usage threshold. For instance, you might want an email alert if you use 80% of your monthly token quota, or if you start receiving 429 errors frequently.
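Aggregating the per-response usage object and checking it against a threshold can be sketched as follows. The field names prompt_tokens, completion_tokens, and total_tokens are assumptions about the shape of the usage object, so check them against the completions and embeddings examples. UsageTracker is a hypothetical helper, and the quota and 80% threshold simply reuse the example figures from this page.

```python
class UsageTracker:
    """Accumulates token counts from API responses (hypothetical helper)."""

    def __init__(self, monthly_quota=1_000_000, alert_fraction=0.8):
        self.monthly_quota = monthly_quota
        self.alert_fraction = alert_fraction
        self.total_tokens = 0
        self.requests = 0

    def record(self, response):
        """Add one parsed response (a dict) to the running totals."""
        usage = response.get("usage", {})
        # Assumed field names; fall back to prompt + completion if
        # the response carries no total.
        prompt = usage.get("prompt_tokens", 0)
        completion = usage.get("completion_tokens", 0)
        self.total_tokens += usage.get("total_tokens", prompt + completion)
        self.requests += 1

    def fraction_used(self):
        return self.total_tokens / self.monthly_quota

    def should_alert(self):
        """True once usage reaches the alert threshold."""
        return self.fraction_used() >= self.alert_fraction
```

Calling tracker.record(response) after each request and checking tracker.should_alert() gives a simple in-process approximation of the alert described above. Totals kept this way are lost on restart and cover only one process, so the dashboard remains the source of truth for billing.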
Increasing Limits
If you consistently hit rate limits or require more capacity:
- Upgrade Plan: Consider upgrading to a plan with higher quotas or unlimited usage (if offered).
- Contact Support: Reach out to Cinclus support or sales to discuss higher rate limits or a custom plan. Often, limits can be adjusted for your use case, potentially with additional cost.
- Dedicated Instance: For heavy workloads, using a dedicated instance can provide higher throughput and exclusive resources, effectively bypassing the shared rate limits (see Offloading & Scaling for more on dedicated instances).
Always design your application with graceful handling of rate limit responses. This means catching HTTP 429 errors, respecting Retry-After headers, and retrying after a delay. Proper handling ensures a smooth user experience even when you approach the limits.