Offloading & Scaling

Cinclus LLM On-Demand is designed to scale with your needs, whether you're just starting out or have enterprise-level demands. This section explains how the service handles scaling, the difference between shared and dedicated instances, and how "offloading" to different zones ensures reliability and performance.

Shared vs. Dedicated Instances

Shared Instances: By default, when you use the Cinclus API, your requests run on shared infrastructure. This means the underlying LLM servers are serving multiple customers at the same time. Shared instances offer a cost-effective way to get started:

  • Resources are allocated on-demand, and you pay only for what you use (or what's included in your plan).
  • There are generally fixed rate limits on shared usage to ensure fair access for all users.
  • Minor variability in response times can occur if many users are using the service simultaneously, but the platform actively balances load to maintain performance.
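
Because shared usage is rate-limited, your client should handle HTTP 429 (rate limit) responses gracefully rather than treating them as hard failures. Below is a minimal retry-with-backoff sketch; the endpoint URL, header, and request fields are illustrative assumptions, not the documented Cinclus API.

    import time

    import requests

    API_URL = "https://api.cinclus.example/v1/completions"  # hypothetical endpoint
    API_KEY = "your-api-key"

    def complete_with_retry(prompt, max_retries=5):
        """Call the shared pool, backing off whenever a 429 rate-limit response is returned."""
        delay = 1.0
        for _ in range(max_retries):
            resp = requests.post(
                API_URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                json={"prompt": prompt},
                timeout=30,
            )
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp.json()
            # Rate limited on the shared pool: wait, then try again with a longer delay.
            time.sleep(delay)
            delay *= 2
        raise RuntimeError(f"Still rate-limited after {max_retries} attempts")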

Dedicated Instances: For users with high volume or special requirements, Cinclus offers dedicated instances:

  • A dedicated instance is an LLM server (or cluster) reserved solely for your use. No other customers' workloads run on your dedicated instance.
  • This provides more consistent performance, as the compute resources are not shared. It can handle your traffic up to the machine's capacity without interference.
  • Rate limits on a dedicated instance are typically higher or lifted, since your usage is isolated. You're mainly limited by the hardware capacity and your agreement with Cinclus.
  • Dedicated instances can be beneficial for data privacy as well — your data is processed on isolated infrastructure.

To set up a dedicated instance, you typically contact Cinclus support or sales and specify the model you need and the desired zone(s) for deployment. Cinclus then provisions a dedicated environment for you, which you access with your API key (often a special key tied to the dedicated instance).
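
Once provisioned, switching from the shared pool to a dedicated instance is usually just a matter of pointing your client at the credentials (and, where applicable, the endpoint) you receive during setup. A minimal sketch, with purely illustrative endpoint and key names:

    import os

    import requests

    # Illustrative assumptions: the actual URLs and key come from your Cinclus provisioning.
    SHARED_API_URL = "https://api.cinclus.example/v1/completions"
    DEDICATED_API_URL = "https://acme.dedicated.cinclus.example/v1/completions"
    API_KEY = os.environ.get("CINCLUS_API_KEY", "your-api-key")

    def complete(prompt, dedicated=False):
        """Send a completion request to the shared pool or to a dedicated instance."""
        url = DEDICATED_API_URL if dedicated else SHARED_API_URL
        resp = requests.post(
            url,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"prompt": prompt},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()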

In summary, use shared instances for development, testing, and moderate workloads. Consider a dedicated instance for production applications with high traffic or strict performance and privacy requirements.

Zone Offloading (Cross-Zone Scaling)

Zones: Cinclus operates in multiple geographic zones, for example:

  • NO – Norway
  • SW – Sweden
  • DK – Denmark
  • FI – Finland

When you make a request, you have the option to specify a zone. But what happens if the zone you selected is at capacity or temporarily under heavy load? This is where offloading comes into play.

Offloading: If a particular zone is busy or has reached its capacity, Cinclus can automatically route (offload) your request to another zone that has available capacity. The goal is to avoid failed requests or long delays due to regional capacity limits.

  • By default, if you do not specify a zone in your request, the system will route your request to an available zone automatically. This gives you the best chance of a fast response, as the load balancer will pick an optimal zone based on current load.
  • If you do specify a zone (for example, zone: "NO" to target Norway specifically), the system will prioritize using that zone. If Norway's capacity is fully utilized at that moment, the behavior can be:
    • Queue or Minor Delay: The request might wait briefly until resources free up in NO.
    • Automatic Fallback: If you have allowed fallback, Cinclus might forward your request to another zone (say, SW or DK) to get it processed without waiting. This ensures your application gets a timely response even if your preferred zone is busy.
    • Strict Zone Enforcement: If you require that data only stay in the specified zone (for compliance or policy reasons), you can disable cross-zone offloading. In that case, if the zone is at capacity, you may receive an error or a timeout instead of being served elsewhere.

Offloading is seamless: the response you receive looks the same regardless of which zone processed the request. There might be a small difference in latency if the alternate zone is farther from your users, but the functionality is otherwise identical.
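
As an illustration, a zone preference might be passed as a field on the request, with a separate flag controlling whether cross-zone fallback is allowed. The endpoint and field names below (including allow_offload) are assumptions made for this sketch; consult your Cinclus API reference for the exact parameters.

    import requests

    API_URL = "https://api.cinclus.example/v1/completions"  # hypothetical endpoint
    API_KEY = "your-api-key"

    def complete_in_zone(prompt, zone=None, allow_offload=True):
        """Request a completion, optionally pinning it to a specific zone."""
        # zone=None lets the load balancer pick any available zone.
        # allow_offload=False (a hypothetical flag) would request strict zone enforcement.
        payload = {"prompt": prompt, "allow_offload": allow_offload}
        if zone is not None:
            payload["zone"] = zone  # e.g. "NO", "SW", "DK", "FI"
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=payload,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    # Prefer Norway, but allow fallback to another zone if NO is at capacity:
    # complete_in_zone("Summarize this report ...", zone="NO", allow_offload=True)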

Example Scenario:

You normally send all requests with zone: "DK" (Denmark) for low latency to your European users. One day, the servers in Denmark come under unusually heavy load. If offloading is enabled (or if you omit the zone), Cinclus might process some of your requests in SW (Sweden), where there is free capacity. Your users continue to get responses, possibly a few milliseconds slower, but without errors or long waits. Once the load in Denmark normalizes, requests resume processing in DK as usual.
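
If you have disabled cross-zone offloading for compliance reasons but can still tolerate a specific set of zones, a similar fallback can be implemented on the client side. A rough sketch, reusing the hypothetical endpoint and fields from the earlier examples:

    import requests

    API_URL = "https://api.cinclus.example/v1/completions"  # hypothetical endpoint
    API_KEY = "your-api-key"

    def complete_with_zone_fallback(prompt, zones=("DK", "SW")):
        """Try each acceptable zone in order, moving on when one is at capacity."""
        last_error = None
        for zone in zones:
            try:
                resp = requests.post(
                    API_URL,
                    headers={"Authorization": f"Bearer {API_KEY}"},
                    # allow_offload is the hypothetical strict-enforcement flag from above.
                    json={"prompt": prompt, "zone": zone, "allow_offload": False},
                    timeout=30,
                )
                if resp.status_code == 429:
                    last_error = RuntimeError(f"Zone {zone} is at capacity")
                    continue  # try the next acceptable zone
                resp.raise_for_status()
                return resp.json()
            except requests.Timeout as exc:
                last_error = exc  # this zone did not respond in time; try the next one
        raise last_error or RuntimeError("No zones were tried")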

Scaling Best Practices

  • Test Multiple Zones: If latency matters, test your application with different zones to see which provides the best performance for your users (see the sketch after this list). You might find, for example, that "FI" yields faster responses for some regions than "DK".
  • Leverage Offloading for High Availability: Unless you have a strict requirement to pin to a single region, allowing the platform to auto-select zones (or fallback when needed) will improve your application's resilience. This way, a spike in one region won't bottleneck your service.
  • Monitoring and Alerts: Monitor your latency and error rates. If you notice frequent 429 errors or slow responses, it might indicate that you're hitting capacity limits. This could be a sign to enable offloading, upgrade your plan, or consider a dedicated instance.
  • Plan for Scale: If you anticipate rapid growth in usage, reach out to Cinclus ahead of time. They can help ensure capacity (through scaling the shared pool or setting up a dedicated instance) to meet your needs without interruption.
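
To act on the "Test Multiple Zones" recommendation, a quick comparison can be as simple as timing identical requests against each zone. A rough sketch, again using the hypothetical endpoint and zone field assumed in the earlier examples:

    import time

    import requests

    API_URL = "https://api.cinclus.example/v1/completions"  # hypothetical endpoint
    API_KEY = "your-api-key"
    ZONES = ["NO", "SW", "DK", "FI"]

    def measure_zone_latency(prompt, samples=5):
        """Time identical requests against each zone and return the average latency per zone."""
        results = {}
        for zone in ZONES:
            timings = []
            for _ in range(samples):
                start = time.perf_counter()
                resp = requests.post(
                    API_URL,
                    headers={"Authorization": f"Bearer {API_KEY}"},
                    json={"prompt": prompt, "zone": zone},
                    timeout=30,
                )
                resp.raise_for_status()
                timings.append(time.perf_counter() - start)
            results[zone] = sum(timings) / len(timings)
        return results

    # for zone, seconds in measure_zone_latency("ping").items():
    #     print(f"{zone}: {seconds * 1000:.0f} ms average")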

Cinclus is built to grow with you. Whether through the shared pool's elasticity or provisioning dedicated servers for your exclusive use, the platform aims to ensure your applications can always get the AI results they need quickly and reliably.