Prompt Caching

Caching is currently enabled for Kimi-K2 and gpt-oss-20b according to Prompt Caching - GroqDocs

Can you roll out support for all models please? We’re using qwen3-32b a lot, and caching would be a huge improvement

Yep, we’re rolling it out for more and more models!

Do you have a timeline for the rollout to more models? We’d love to get it on gpt-oss-120b ASAP too

We’re doing rolling testing and releasing to more models currently; it should be soon, but I don’t have an exact rollout timeline.

Caching isn’t currently working for gpt-oss-20b; it works fine on the Moonshot model:

import time

from groq import Groq          # official Groq Python SDK; reads GROQ_API_KEY from the environment
from lorem_text import lorem   # assuming the lorem-text package for lorem.words(); swap in your own filler text if needed

client = Groq()

def request():
    # Long system prompt (the cacheable prefix), rebuilt on every call
    system_prompt = f"""
    You are a legal expert AI assistant. Analyze the following legal document and provide detailed insights.\n\nLEGAL DOCUMENT:\n{lorem.words(1000)}
    """

    first_analysis = client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": "What are the key provisions regarding user account termination in this agreement?",
            },
        ],
        model="openai/gpt-oss-20b",
        max_tokens=1,
    )

    print("Usage:", first_analysis.usage)
    time.sleep(5)

if __name__ == "__main__":
    for _ in range(2):
        request()

This prints:

Usage: CompletionUsage(completion_tokens=1, prompt_tokens=1548, total_tokens=1549, completion_time=0.001701566, prompt_time=0.181845881, queue_time=0.046640008, total_time=0.183547447)
Usage: CompletionUsage(completion_tokens=1, prompt_tokens=1554, total_tokens=1555, completion_time=0.001011545, prompt_time=0.090511759, queue_time=0.044814801, total_time=0.091523304)

i.e. no cached tokens
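
For what it’s worth, this is roughly how I’m checking for a cache hit, assuming Groq mirrors the OpenAI-style usage.prompt_tokens_details.cached_tokens field (if hits are reported somewhere else, happy to be corrected):

def cached_tokens(usage) -> int:
    # Best-effort read of the cached-token count from a CompletionUsage object.
    # Assumes an OpenAI-style prompt_tokens_details.cached_tokens field; returns 0
    # if the field is missing or empty.
    details = getattr(usage, "prompt_tokens_details", None)
    if details is None:
        return 0
    return getattr(details, "cached_tokens", None) or 0

# e.g. inside request(): print("Cached tokens:", cached_tokens(first_analysis.usage))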


Hi!

We try our best to maximize cache hits, but caching isn’t guaranteed on subsequent requests because of our internal routing (which minimizes latency). This is especially true for smaller models: they run on more instances, caching isn’t shared between instances, so it’s likely you’ll hit a different instance between requests.

We’re constantly working on improving the cache hit rate, and we appreciate your feedback!
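
One thing that gives each request the best chance of a hit is building the long static prefix once and sending it byte-for-byte identical on every request, varying only the short user message at the end. A minimal sketch along those lines (reusing the same Groq client and lorem-text helper as in your repro; adjust for your setup):

import time

from groq import Groq
from lorem_text import lorem  # same filler-text helper as in the repro above

client = Groq()

# Generated once, outside the request loop, so the cacheable prefix never changes
SYSTEM_PROMPT = (
    "You are a legal expert AI assistant. Analyze the following legal document "
    "and provide detailed insights.\n\nLEGAL DOCUMENT:\n" + lorem.words(1000)
)

questions = [
    "What are the key provisions regarding user account termination?",
    "Which clauses cover limitation of liability?",
]

for question in questions:
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        model="openai/gpt-oss-20b",
        max_tokens=1,
    )
    print("Usage:", response.usage)
    time.sleep(5)

Even with an identical prefix a hit still isn’t guaranteed (routing may land you on a different instance), but it removes prompt drift as a variable.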