Hi all,
Our team is building an AI workflow in Japanese. We are prioritizing response time, but LLaMA-3.1 8B, the smallest model available in the production environment, dramatically degrades output quality. As a compromise, we moved to LLaMA-3.3 70B to balance response time and quality.
In our experiments, the Qwen models clearly outperform all the other models, but they are not feasible to use in the production environment (we see roughly a 50% server error rate). If the Qwen models become available in production, we would definitely switch to them from LLaMA. Is there a clear roadmap for when these models will be available? Thanks in advance.