Qwen3-32B starts hallucinating as concurrent users (CCU) increase

Hi everyone,

I’d like to raise a concern about my experience using Qwen3-32B on Groq for an enterprise product. I’m hoping to get some insights from the experts here.

Setup

I’m sending requests to Groq from Python via the Groq SDK with the following parameters:

  • Message length (input tokens): ~3,000–4,000 tokens

  • Temperature: 0.0

  • Reasoning effort: default
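For reference, a request with these parameters looks roughly like the sketch below. It uses Groq's OpenAI-compatible HTTP endpoint with only the standard library so it stays dependency-free; the model ID `qwen/qwen3-32b` and the `reasoning_effort` field name are assumptions based on Groq's docs, so double-check them against your account before relying on this.

```python
# Sketch of one request with the parameters from the test above.
# ASSUMPTIONS: the model ID "qwen/qwen3-32b" and the "reasoning_effort"
# field name; verify both against Groq's API reference.
import json
import os
import urllib.request

API_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    """Request body matching the parameters used in the load test."""
    return {
        "model": "qwen/qwen3-32b",      # assumed model ID on Groq
        "temperature": 0.0,             # deterministic decoding
        "reasoning_effort": "default",  # assumed parameter name
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt: str) -> str:
    """Send one ~3-4k-token prompt and return the raw completion text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The official `groq` Python package wraps the same endpoint, so the payload shape carries over either way.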

Test Scenarios

I conducted a performance test with:

  • Total requests: 300

  • Concurrent users (VUs): scaling from 1 VU up to 10 VUs
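The ramp can be approximated without a dedicated load-testing tool. This sketch splits the 300 requests evenly across concurrency levels from 1 to 10 workers; `send_request` is a placeholder stub standing in for the real Groq call, not the actual client code:

```python
# Sketch of the VU ramp: run batches of requests at increasing concurrency
# (1..10 workers), tallying completed requests per level. `send_request`
# is a placeholder for the real chat-completion call.
from concurrent.futures import ThreadPoolExecutor

TOTAL_REQUESTS = 300
VU_LEVELS = range(1, 11)  # 1 VU up to 10 VUs

def send_request(i: int) -> str:
    # Placeholder; in the real test this issues one chat completion.
    return f"response-{i}"

def run_ramp(total: int = TOTAL_REQUESTS) -> dict:
    """Split `total` requests evenly across the VU levels and return the
    number of completed requests at each concurrency level."""
    per_level = total // len(VU_LEVELS)
    completed = {}
    for vus in VU_LEVELS:
        with ThreadPoolExecutor(max_workers=vus) as pool:
            results = list(pool.map(send_request, range(per_level)))
        completed[vus] = len(results)
    return completed
```

A k6 or Locust scenario with a stepped VU profile would exercise the same shape of load.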

Observations

  • With ≤ 4 VUs, the model output is consistent and follows the expected format:

    <think>...</think>
    <output>...</output>

  • However, as the load increases (VUs > 4), the model starts producing hallucinations and strange tokens. The output deviates significantly from the expected format. (I’ve attached some sample responses for reference.)

Concern

While the inference speed is impressive, the stability issue at higher concurrency makes it difficult to consider this setup production-ready.

I responded on Discord but will post here as well: this was related to an issue with prefix caching. We’ve fixed it for now and are taking steps to ensure it doesn’t happen again.