
Hello,

I would really like to experiment with the 10M token context window of the Llama 4 Scout model. Groq is one of the very few places that offer this model over an API.

However, the 300,000 TPM limit seems to defeat the purpose. As far as I understand, the API is stateless, so I'd have to send several megatokens in a single request to use the full context window, and the TPM limit will always reject such a request no matter how long I wait?

If there is a solution, please tell me (maybe there *is* a way to make incremental requests that would let me drip-feed the large context at 300,000 TPM?)

Alternatively, would it be possible to request removal of this TPM limit in exchange for a VERY strict RPM limit? I can make do with 1 RPM for my idea, which is basically "stuff the entire docset into the context window, then ask questions or generate code based on the docset". I will of course move to the paid tier if this request can be granted. My experiments are not expected to reach large scale anytime soon, though.
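To make the idea concrete, here is roughly what I have in mind (a rough sketch; the docs folder and the example question are just placeholders, and no actual API call is made here):

```python
from pathlib import Path

# Concatenate every markdown file under docs/ into one big system message,
# then ask a single question per request (so ~1 RPM would be enough for me).
docset = "\n\n".join(
    f"### {path}\n{path.read_text(encoding='utf-8')}"
    for path in sorted(Path("docs").rglob("*.md"))
)

messages = [
    {"role": "system", "content": f"Answer using only this documentation:\n\n{docset}"},
    {"role": "user", "content": "Write example code that uses the hypothetical frobnicate API."},
]

# Because the API is stateless, every request carries the full docset again,
# which is why even a single call can blow past a 300,000 TPM limit on its own.
```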

Unfortunately, Llama 4 Scout currently only runs with a 131k-token context window on Groq. In fact, I don't believe any provider runs it with the full 10M context window. If you'd still like to fit more full-context-window requests in per minute, consider using "service_tier": "flex" or requesting a rate limit increase via Chat With Us.
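For reference, here is a minimal sketch of a request using the flex service tier. The field name matches the option quoted above; the endpoint URL and model ID are assumptions you should double-check against the Groq docs before relying on them:

```python
import os
import requests

# Sketch only: one chat completion request with "service_tier": "flex" in the
# body. Verify the endpoint, model ID, and flex behaviour against the Groq docs.
resp = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    json={
        "model": "meta-llama/llama-4-scout-17b-16e-instruct",  # assumed model ID
        "service_tier": "flex",  # flex processing, as suggested above
        "messages": [
            {"role": "system", "content": "You answer questions about the attached docs."},
            {"role": "user", "content": "How do I configure X?"},
        ],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```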

