
Support for chaining multiple model requests into a single API call would be fantastic. Groq’s inference speed means the primary latency bottleneck is network I/O when making sequential model calls. Enabling request chaining would reduce network calls, making things faster and more efficient for workflows that need several model inferences in a row!

That’s a great feature request! I’ll bring that up to the engineering team!

What kind of chaining are you looking for? E.g., a first call to Scout extracts a summary, and a second call converts the first response to JSON or translates it into French?


Thanks, I appreciate that!

I'd imagine there are a whole host of use cases including the one you suggested. Mine is for building a conversation bot for personal use. Right now I need to do:

I speak -> call whisper model -> send latency -> inference -> receive latency -> call LLM model -> send latency -> inference -> receive latency -> call text to speech model -> send latency -> inference -> receive latency -> play output
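To make that concrete, here's a rough sketch of the current sequential flow using the Groq Python SDK's OpenAI-style endpoints. The model names, voice, and surrounding structure are just placeholders for my setup; the point is that each `client.*` call is its own network round trip:

```python
# Rough sketch of the current three-round-trip flow; model names and
# voice are placeholders, and error handling is omitted.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment


def respond(audio_path: str) -> None:
    # Round trip 1: speech-to-text
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",
        )

    # Round trip 2: LLM reply
    chat = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply_text = chat.choices[0].message.content

    # Round trip 3: text-to-speech
    speech = client.audio.speech.create(
        model="playai-tts",
        voice="Fritz-PlayAI",
        input=reply_text,
        response_format="wav",
    )
    speech.write_to_file("reply.wav")  # hand this off to the audio player
```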

Internet I/O is the big bottleneck. What would be fantastic and remove the bottleneck would be:

I speak -> I call all three models in a chained API call -> send latency -> inference -> inference -> inference -> receive latency -> play output 
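Purely to illustrate the idea, a chained request might look something like the payload below, where each step references the previous step's output so only one round trip is paid. This is entirely hypothetical: no such endpoint or schema exists today, and the field names are just made up for the example:

```python
# Hypothetical single chained request; the step/piping fields below are
# invented for illustration and are not part of any existing Groq API.
chained_request = {
    "steps": [
        {
            "id": "stt",
            "type": "audio.transcription",
            "model": "whisper-large-v3",
            "input": "<base64-encoded audio>",
        },
        {
            "id": "llm",
            "type": "chat.completion",
            "model": "llama-3.3-70b-versatile",
            "messages": [{"role": "user", "content": "{{stt.text}}"}],
        },
        {
            "id": "tts",
            "type": "audio.speech",
            "model": "playai-tts",
            "voice": "Fritz-PlayAI",
            "input": "{{llm.choices[0].message.content}}",
        },
    ],
    # Only the final step's audio output would come back over the wire.
    "return": "tts",
}
```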

There must be so many use cases that would benefit from this functionality! 


Thanks for the detailed explanation — the engineers agree and are looking into adding chaining as a feature!


That's fantastic to hear, I appreciate that! Hopefully it will benefit a lot of workflows!

