Hi everyone,
I’d like to raise a concern regarding my experience using Qwen3:32B on Groq for an enterprise product. I’m hoping to get some insights from the experts here.
Setup
I’m running requests to Groq in Python via the Groq Python SDK, with the following parameters:
- Message length (input tokens): ~3,000–4,000 tokens
- Temperature: 0.0
- Reasoning effort: default
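
For concreteness, here is a minimal sketch of the call I’m making. The model ID string and the prompt are placeholders (adjust to your deployment); the parameters match the list above.

```python
# Minimal sketch of the request shape. Model ID and prompt are
# placeholders; temperature and reasoning effort match my test setup.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

response = client.chat.completions.create(
    model="qwen/qwen3-32b",  # assumed ID for Qwen3 32B on Groq
    messages=[
        {"role": "user", "content": "..."},  # ~3,000–4,000 input tokens
    ],
    temperature=0.0,  # deterministic-as-possible decoding
    # reasoning effort left at its default, as in the tests
)
print(response.choices[0].message.content)
```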
Test Scenarios
I conducted a performance test with:
- Total requests: 300
- Concurrent users (VUs): scaling from 1 VU up to 10 VUs
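
My actual test used a dedicated load-testing tool, but the setup is roughly equivalent to this hypothetical Python harness, where each “VU” is a worker thread issuing requests against the same prompt:

```python
# Hypothetical load-test harness illustrating the test setup.
# Each "VU" is one worker thread repeatedly calling the endpoint.
from concurrent.futures import ThreadPoolExecutor, as_completed

from groq import Groq

client = Groq()
PROMPT = "..."  # stand-in for the ~3,000–4,000-token message

def one_request() -> str:
    resp = client.chat.completions.create(
        model="qwen/qwen3-32b",  # assumed model ID, as above
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

def run_at_concurrency(total_requests: int, vus: int) -> list[str]:
    # vus workers drain a shared queue of total_requests calls
    with ThreadPoolExecutor(max_workers=vus) as pool:
        futures = [pool.submit(one_request) for _ in range(total_requests)]
        return [f.result() for f in as_completed(futures)]

# Scale from 1 VU up to 10 VUs; the 300 requests in my test were
# spread across these levels (the exact split isn't the point here).
for vus in range(1, 11):
    outputs = run_at_concurrency(total_requests=30, vus=vus)
```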
Observations
- With ≤ 4 VUs, the model output is consistent and follows the expected format: `<think>...</think> <output>`
- However, as the load increases (VUs > 4), the model starts producing hallucinations and strange tokens. The output deviates significantly from the expected format. (I’ve attached some sample responses for reference.)
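
To classify responses as well-formed or broken, I use a simple format check along these lines; the pattern is my own approximation of the expected shape, not anything Groq or Qwen document:

```python
# Well-formedness check: one <think>...</think> block followed by
# non-empty output. The pattern is my approximation of the expected
# format described above.
import re

WELL_FORMED = re.compile(r"^\s*<think>.*?</think>\s*\S", re.DOTALL)

def is_well_formed(text: str) -> bool:
    return bool(WELL_FORMED.match(text))

# e.g. per concurrency level: count how many responses broke format
# malformed = sum(not is_well_formed(o) for o in outputs)
```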
Concern
While the inference speed is impressive, the stability issue at higher concurrency makes it difficult to consider this setup production-ready. With temperature 0.0 I would expect near-identical outputs regardless of load, so the degradation seems to point to something on the serving side rather than in my prompts.