(Posted by Verdi on the Groq Discord: https://discord.com/channels/1207099205563457597/1377607663004811294)
We're looking to use Groq for writing our eval set. We intend to use an LLM-as-judge design in our CI/CD pipeline. However, one issue highlighted in this research (https://arxiv.org/pdf/2303.16634) is that when a model is asked for a score, it biases its answer toward a single value (almost always 3). A better technique for getting higher-quality evals is to use the probability distribution over the possible score tokens instead of the single sampled score.
The token probabilities needed for this are available when using OpenAI's API directly, but not for models on Groq. I would prefer to use Groq if possible since it's much faster!
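
For reference, here's a rough sketch of the scoring step I have in mind, assuming the API returns per-token log-probabilities for the position where the judge writes its score. The function name `expected_score`, the 1 to 5 scale, and the input shape (a dict mapping candidate token to logprob) are my own assumptions for illustration, not anything Groq or the paper specifies.

```python
import math


def expected_score(top_logprobs, valid_scores=(1, 2, 3, 4, 5)):
    """Probability-weighted score from candidate tokens at the score position.

    top_logprobs: hypothetical dict of token string -> log-probability.
    """
    lookup = {str(s): s for s in valid_scores}
    probs = {}
    for token, logprob in top_logprobs.items():
        score = lookup.get(token.strip())
        if score is None:
            continue  # ignore tokens that are not a valid score
        probs[score] = probs.get(score, 0.0) + math.exp(logprob)

    total = sum(probs.values())
    if total == 0:
        raise ValueError("no valid score tokens in top_logprobs")

    # Renormalize over the valid scores, then take the expectation.
    return sum(score * p for score, p in probs.items()) / total


# Made-up numbers: the judge would print "3", but the distribution leans higher.
example = {"3": math.log(0.6), "4": math.log(0.3), "2": math.log(0.1)}
print(round(expected_score(example), 2))  # 3.2
```

With plain sampling this example would just come back as 3. The weighted version gives a continuous score, which is what makes it useful for telling apart outputs that would otherwise all tie.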