Support for chaining multiple model requests into a single API call would be fantastic. Groq’s inference speed means the primary latency bottleneck is network I/O when making sequential model calls. Enabling request chaining would reduce network calls, making things faster and more efficient for workflows that need several model inferences in a row!
That’s a great feature request! I’ll bring that up to the engineering team!
What kind of chaining are you looking for? E.g., a first call to Scout extracts a summary, and a second call converts the first response to JSON or translates it to French?
Thanks, I appreciate that!
I'd imagine there are a whole host of use cases including the one you suggested. Mine is for building a conversation bot for personal use. Right now I need to do:
I speak -> call Whisper model -> send latency -> inference -> receive latency -> call LLM -> send latency -> inference -> receive latency -> call text-to-speech model -> send latency -> inference -> receive latency -> play output
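To make that concrete, here's roughly what one turn looks like for me today with the Groq Python SDK. This is a minimal sketch, not my exact code: the model names are examples, and the text-to-speech call and the file-writing helper are assumptions about the SDK that you'd swap for whatever you actually use. The point is that each `create(...)` call is its own HTTPS round trip.

```python
# Rough sketch of the current sequential pipeline: three separate network round trips per turn.
# Assumes the Groq Python SDK (pip install groq) and GROQ_API_KEY set in the environment.
from groq import Groq

client = Groq()

def voice_turn(audio_path: str) -> None:
    # Round trip 1: speech-to-text
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",  # example model name
        )

    # Round trip 2: LLM reply
    chat = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # example model name
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply = chat.choices[0].message.content

    # Round trip 3: text-to-speech (assumed endpoint and parameters; adjust to your TTS setup)
    speech = client.audio.speech.create(
        model="playai-tts",
        voice="Fritz-PlayAI",
        input=reply,
        response_format="wav",
    )
    # The exact helper for saving the binary response may differ by SDK version.
    speech.write_to_file("reply.wav")
```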
Internet I/O is the big bottleneck. What would be fantastic, and would remove that bottleneck, is:
I speak -> I call all three models in a chained API call -> send latency -> inference -> inference -> inference -> receive latency -> play output
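Just to make the shape of the ask concrete, the chained call could carry something like the payload below. This is an entirely hypothetical wire format I've invented for illustration: no such Groq endpoint or schema exists today, and every field name is made up. The only real point is that all three inference steps would share a single send/receive round trip, with only the final audio coming back over the network.

```python
# Entirely hypothetical request body for a chained call (invented for illustration;
# not a real Groq API). Each step consumes the previous step's output server-side.
chained_request = {
    "steps": [
        {"type": "transcription", "model": "whisper-large-v3", "input": "<uploaded audio>"},
        {"type": "chat", "model": "llama-3.3-70b-versatile",
         "messages": [{"role": "user", "content": "{{steps[0].text}}"}]},
        {"type": "speech", "model": "playai-tts", "voice": "Fritz-PlayAI",
         "input": "{{steps[1].message.content}}"},
    ],
    # Only the final step's audio needs to travel back to the client.
    "return": "steps[2].audio",
}
```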
There must be so many use cases that would benefit from this functionality!
Thanks for the detailed explanation — the engineers agree and are looking into adding chaining as a feature!
That's fantastic to hear, I appreciate that! Hopefully it will benefit a lot of workflows!