Hi everyone,
I’d like to raise a concern regarding my experience using Qwen3:32B on Groq for an enterprise product. I’m hoping to get some insights from the experts here.
Setup
I’m running requests to Groq in Python via the Groq Python SDK, with the following parameters:
- Message length (input tokens): ~3,000–4,000 tokens
- Temperature: 0.0
- Reasoning effort: default
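
For concreteness, here is a minimal sketch of the call I’m making. The model ID string and the prompt are placeholders (adjust to your deployment); the parameters match the list above.

```python
# Minimal sketch of the request shape. Model ID and prompt are
# placeholders; temperature and reasoning effort match my test setup.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

response = client.chat.completions.create(
    model="qwen/qwen3-32b",  # assumed ID for Qwen3 32B on Groq
    messages=[
        {"role": "user", "content": "..."},  # ~3,000–4,000 input tokens
    ],
    temperature=0.0,  # deterministic-as-possible decoding
    # reasoning effort left at its default, as in the tests
)
print(response.choices[0].message.content)
```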
Test Scenarios
I conducted a performance test with:
- Total requests: 300
- Concurrent users (VUs): scaling from 1 VU up to 10 VUs
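
My actual test used a dedicated load-testing tool, but the setup is roughly equivalent to this hypothetical Python harness, where each “VU” is a worker thread issuing requests against the same prompt:

```python
# Hypothetical load-test harness illustrating the test setup.
# Each "VU" is one worker thread repeatedly calling the endpoint.
from concurrent.futures import ThreadPoolExecutor, as_completed

from groq import Groq

client = Groq()
PROMPT = "..."  # stand-in for the ~3,000–4,000-token message

def one_request() -> str:
    resp = client.chat.completions.create(
        model="qwen/qwen3-32b",  # assumed model ID, as above
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

def run_at_concurrency(total_requests: int, vus: int) -> list[str]:
    # vus workers drain a shared queue of total_requests calls
    with ThreadPoolExecutor(max_workers=vus) as pool:
        futures = [pool.submit(one_request) for _ in range(total_requests)]
        return [f.result() for f in as_completed(futures)]

# Scale from 1 VU up to 10 VUs; the 300 requests in my test were
# spread across these levels (the exact split isn't the point here).
for vus in range(1, 11):
    outputs = run_at_concurrency(total_requests=30, vus=vus)
```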
Observations
- With ≤ 4 VUs, the model output is consistent and follows the expected format: `<think>...</think> <output>`
- However, as the load increases (VUs > 4), the model starts producing hallucinations and strange tokens. The output deviates significantly from the expected format. (I’ve attached some sample responses for reference.)
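
To classify responses as well-formed or broken, I use a simple format check along these lines; the pattern is my own approximation of the expected shape, not anything Groq or Qwen document:

```python
# Well-formedness check: one <think>...</think> block followed by
# non-empty output. The pattern is my approximation of the expected
# format described above.
import re

WELL_FORMED = re.compile(r"^\s*<think>.*?</think>\s*\S", re.DOTALL)

def is_well_formed(text: str) -> bool:
    return bool(WELL_FORMED.match(text))

# e.g. per concurrency level: count how many responses broke format
# malformed = sum(not is_well_formed(o) for o in outputs)
```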
Concern
While the inference speed is impressive, the stability issue at higher concurrency makes it difficult to consider this setup production-ready. With temperature 0.0 I would expect near-identical outputs regardless of load, so the degradation seems to point to something on the serving side rather than in my prompts.