
Hi all,

Our team is building an AI workflow in Japanese. We are prioritizing response time, but LLaMA-3.1 8B, the smallest model available in the production environment, degrades output quality too much. As a result, we settled on LLaMA-3.3 70B to strike a balance between response time and quality.

In our experiments, the Qwen models outperform all the others, but they are not feasible to use in the production environment (we see a server error rate of roughly 50%). If the Qwen models become production-ready, we would definitely switch to them from the LLaMA models. Is there a clear roadmap for when these models will be available? Thanks in advance.

Hi Excalibar!

We’ve been continually tuning and improving the Qwen-32B model, making it faster and error-free. It should already be able to handle large workloads without failing.

Please run your evals again and check whether you’re still seeing the errors. From our observability, you shouldn’t be seeing error rates like that, but if the problem persists, please email me at jzheng@groq.com so I can help diagnose it with the engineers.
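In case it’s useful, here’s a minimal sketch of how you could measure the error rate from your side. It assumes the `groq` Python SDK with a GROQ_API_KEY set in the environment; the model ID below is a placeholder, so substitute whatever ID the console lists for Qwen-32B.

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment
MODEL = "qwen-32b"  # placeholder ID; use the one listed in your console
N_REQUESTS = 100

errors = 0
for i in range(N_REQUESTS):
    try:
        client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": "こんにちは、自己紹介してください。"}],
        )
    except Exception as exc:  # count any failed request as an error
        errors += 1
        print(f"request {i}: {type(exc).__name__}: {exc}")

print(f"error rate: {errors / N_REQUESTS:.0%} over {N_REQUESTS} requests")
```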

Hi Excalibar, since you're seeing strong quality from the Qwen models but a ~50% server error rate, your best course is to contact the Qwen model providers directly or monitor their official channels (e.g., Alibaba Cloud or ModelScope) for updates on production readiness. In the meantime, consider a hybrid setup: keep LLaMA-3.3 70B in production while using Qwen-32B for offline tasks or batch inference until its stability improves. A sketch of that hybrid routing follows below.
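Here is a minimal sketch of the hybrid routing, assuming the `groq` Python SDK; both model IDs are placeholders, so use the IDs your console actually lists.

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment
PRIMARY = "qwen-32b"                  # placeholder Qwen model ID
FALLBACK = "llama-3.3-70b-versatile"  # placeholder LLaMA-3.3 70B model ID

def complete(messages):
    """Try the primary (Qwen) model first; on any error, retry with the fallback."""
    try:
        return client.chat.completions.create(model=PRIMARY, messages=messages)
    except Exception:
        return client.chat.completions.create(model=FALLBACK, messages=messages)

resp = complete([{"role": "user", "content": "日本語で自己紹介してください。"}])
print(resp.choices[0].message.content)
```

This keeps latency low when Qwen is healthy and only pays the fallback cost on failed requests; you could also add a retry or circuit breaker in front of the fallback if error bursts are common.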

