Support for chaining multiple model requests into a single API call would be fantastic. Groq’s inference speed means the primary latency bottleneck is network I/O when making sequential model calls. Enabling request chaining would reduce network calls, making things faster and more efficient for workflows that need several model inferences in a row!
That’s a great feature request! I’ll bring that up to the engineering team!
What kind of chaining are you looking for? E.g., a first call to Scout extracts a summary, and a second call converts the first response to JSON or translates it to French?
Thanks, I appreciate that!
I'd imagine there are a whole host of use cases including the one you suggested. Mine is for building a conversation bot for personal use. Right now I need to do:
I speak -> call Whisper model -> send latency -> inference -> receive latency -> call LLM -> send latency -> inference -> receive latency -> call text-to-speech model -> send latency -> inference -> receive latency -> play output
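To make that concrete, here's roughly what one turn looks like for me today with the Groq Python SDK. This is a minimal sketch, not my exact code: the model names are examples, and the text-to-speech call and the file-writing helper are assumptions about the SDK that you'd swap for whatever you actually use. The point is that each `create(...)` call is its own HTTPS round trip.

```python
# Rough sketch of the current sequential pipeline: three separate network round trips per turn.
# Assumes the Groq Python SDK (pip install groq) and GROQ_API_KEY set in the environment.
from groq import Groq

client = Groq()

def voice_turn(audio_path: str) -> None:
    # Round trip 1: speech-to-text
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",  # example model name
        )

    # Round trip 2: LLM reply
    chat = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # example model name
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply = chat.choices[0].message.content

    # Round trip 3: text-to-speech (assumed endpoint and parameters; adjust to your TTS setup)
    speech = client.audio.speech.create(
        model="playai-tts",
        voice="Fritz-PlayAI",
        input=reply,
        response_format="wav",
    )
    # The exact helper for saving the binary response may differ by SDK version.
    speech.write_to_file("reply.wav")
```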
Internet I/O is the big bottleneck. What would be fantastic, and would remove that bottleneck, is:
I speak -> I call all three models in a chained API call -> send latency -> inference -> inference -> inference -> receive latency -> play output
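Just to make the shape of the ask concrete, the chained call could carry something like the payload below. This is an entirely hypothetical wire format I've invented for illustration: no such Groq endpoint or schema exists today, and every field name is made up. The only real point is that all three inference steps would share a single send/receive round trip, with only the final audio coming back over the network.

```python
# Entirely hypothetical request body for a chained call (invented for illustration;
# not a real Groq API). Each step consumes the previous step's output server-side.
chained_request = {
    "steps": [
        {"type": "transcription", "model": "whisper-large-v3", "input": "<uploaded audio>"},
        {"type": "chat", "model": "llama-3.3-70b-versatile",
         "messages": [{"role": "user", "content": "{{steps[0].text}}"}]},
        {"type": "speech", "model": "playai-tts", "voice": "Fritz-PlayAI",
         "input": "{{steps[1].message.content}}"},
    ],
    # Only the final step's audio needs to travel back to the client.
    "return": "steps[2].audio",
}
```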
There must be so many use cases that would benefit from this functionality!
Thanks for the detailed explanation — the engineers agree and are looking into adding chaining as a feature!
That's fantastic to hear, I appreciate that! Hopefully it will benefit a lot of workflows!