Let’s consider the following example use case:

  1. We want to ask a model that supports tool use the following question: “What is the weather like today in the largest city in Japan that doesn’t contain a y? Make sure to explain your reasoning in detail before calling any tools.”
  2. We have two tools (rough definitions below):
    1. get_weather, which takes two parameters: location and date, and returns the corresponding weather forecast.
    2. get_current_date, which returns today’s date.
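Roughly, the tool definitions look like this (an OpenAI-style tools array; the exact schema here is illustrative, not my literal code):

```ts
// Sketch of the two tool declarations, assuming an OpenAI-compatible
// chat-completions API; the parameter shapes just mirror the description above.
const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Return the weather forecast for a location on a given date.",
      parameters: {
        type: "object",
        properties: {
          location: { type: "string", description: "City name, e.g. Osaka" },
          date: { type: "string", description: "ISO date, e.g. 2024-05-01" },
        },
        required: ["location", "date"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "get_current_date",
      description: "Return today's date.",
      parameters: { type: "object", properties: {} },
    },
  },
];
```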

I’d like the model to first determine the largest city in Japan without a ‘y’ - hence the chain-of-thought instruction in the prompt. Without it, it often decides to go with Tokyo. Then, it should call the tool to get the current date. Finally, once it has the date, it can call the weather tool.

In general, I’d like to have a loop of text generation, then a tool call, then perhaps more text generation, then another tool call, etc. And I'd like to be able to follow through and inspect what the model was thinking throughout.

The problem I’m having is that when I make an API call with the prompt and the two tools, the model generates some tokens (presumably doing the chain of thought), then returns a tool call asking for the current date, as expected. But I don’t actually get the generated text back, just the tool call.
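Roughly what the request and response look like (the endpoint, model id, and exact response fields are illustrative - I’m assuming an OpenAI-compatible shape here):

```ts
// Sketch of the first request; URL, model id, and API key are placeholders.
const prompt =
  "What is the weather like today in the largest city in Japan that doesn't contain a y? " +
  "Make sure to explain your reasoning in detail before calling any tools.";
const messages = [{ role: "user", content: prompt }];

const res = await fetch("https://api.example.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: "Bearer YOUR_API_KEY", // placeholder
  },
  body: JSON.stringify({
    model: "kimi-k2",   // placeholder model id
    messages,
    tools,              // the two tool definitions sketched earlier
  }),
});
const data = await res.json();

// The assistant message that comes back has a tool call for
// get_current_date but no text alongside it, roughly:
//
// data.choices[0].message === {
//   role: "assistant",
//   content: "",
//   tool_calls: [{
//     id: "call_1",
//     type: "function",
//     function: { name: "get_current_date", arguments: "{}" }
//   }]
// }
```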

So then I can execute the tool and add the result to the context as a “tool” message. But the chain-of-thought will no longer be in the context, so the model will redo it before finally asking for the weather in Osaka, only to drop it again.
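In code, that follow-up turn is something like this (same illustrative OpenAI-style fields, continuing from the sketch above):

```ts
// `data` is the parsed first response, `messages` the running history.
const assistantMsg = data.choices[0].message;
messages.push(assistantMsg); // just the tool call - there is no reasoning text to preserve
messages.push({
  role: "tool",
  tool_call_id: assistantMsg.tool_calls[0].id,
  content: JSON.stringify({ date: "2024-05-01" }), // hypothetical get_current_date result
});
// On the next request the chain-of-thought is gone from the context, so the
// model re-derives "largest city in Japan without a y" before calling get_weather.
```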

Am I doing something wrong here? Especially for more complex workflows that interleave tool calls with text generation, dropping all generated text after every tool call and having to re-generate it is very inefficient.

This is the perfect use case for a “ReAct Agent” workflow!

I have it implemented as part of a Google Maps chat function here (https://github.com/janzheng/sidenote - take a look at reactAgent.svelte.ts) but I think I’ll create a cleaner, isolated template and readme / walkthrough on it.

A couple of pointers that could help:

  • Preserve the entire context w/ all the reasoning by pushing it into the conversation history array
  • Force it into patterns with system prompts; e.g. have it output something like “Thought: (generated tokens)”, “Action: tool_name”, “Action Input: {json input}”, and “Final Answer: (answer)”. You can have it generate these as text (which you’ll parse w/ regex) or have it generate JSON. You can then detect, filter, and extract the outputs (see the prompt sketch below)
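For example, a system prompt along these lines (just a sketch - tune the wording and the tool list to your setup):

```ts
// Sketch of a ReAct-style system prompt; the exact wording is illustrative.
const SYSTEM_PROMPT = `You have access to the following tools:
- get_current_date(): returns today's date
- get_weather(location, date): returns the forecast for a location on a date

Always respond in exactly this format:
Thought: your step-by-step reasoning
Action: the name of the tool to call (omit if no tool is needed)
Action Input: the tool arguments as JSON
Final Answer: the answer to the user, once you have everything you need

After each Action you will receive an "Observation: ..." message with the
tool result. Continue from there with another Thought.`;
```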

You’d essentially use a while loop that generates a response, feeds it back into the model’s own context, and repeats; each iteration uses the same system prompt that asks it to think first, then figure out the action and its input, and then give the final answer once it has enough context to do so. It’ll fill in the blanks on its own. Once it’s generated a Final Answer you can stop and break out of the loop (you’d also want a max_iterations cap in case it loops forever).
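Putting it together, the loop could look roughly like this (SYSTEM_PROMPT is the sketch above; callModel and runTool are placeholders for your own completion call and tool dispatch - this isn’t the exact code from my repo):

```ts
// Placeholders - wire these up to your completion API and tool implementations.
declare function callModel(messages: { role: string; content: string }[]): Promise<string>;
declare function runTool(name: string, args: unknown): Promise<string>;

const MAX_ITERATIONS = 10; // safety cap so it can't loop forever

async function runAgent(question: string): Promise<string> {
  const messages = [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: question },
  ];

  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const text = await callModel(messages);
    messages.push({ role: "assistant", content: text }); // keep ALL the reasoning in context

    const final = text.match(/Final Answer:\s*([\s\S]+)/);
    if (final) return final[1].trim(); // done - break out of the loop

    const action = text.match(/Action:\s*(\S+)/);
    const input = text.match(/Action Input:\s*(\{[\s\S]*\})/); // naive JSON extraction
    if (action && input) {
      const observation = await runTool(action[1], JSON.parse(input[1]));
      messages.push({ role: "user", content: `Observation: ${observation}` }); // feed the tool result back in
    }
  }
  throw new Error("No Final Answer within MAX_ITERATIONS");
}
```

Because every Thought gets pushed back into the history, the reasoning stays in context between tool calls - which is exactly what the tools-field approach was dropping.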

Hopefully that makes sense! I should really put together a walkthrough of this!


Ah, so you don’t use the tools field in the API, but manually put the tools in the system message, along with instructions on how to use them, and then manually parse the AI response for tool calls.

But then you don’t get the exact same result that you would if you did use the tools field. In particular, the Kimi-K2 tokenizer template specifies one pattern for tool calls (<|tool_calls_section_begin|>, etc.), while your agent uses a different one (Action: tool_name).

Wouldn’t that hurt performance, though? I would assume the model was trained with the template pattern in mind - so would work best if agents stuck to it. Is that not so?


Yeah everything goes in the system message.

Mmm yes I opted for a really different way to do tool calling, and yes you’re right, it definitely affects performance a bit, for the reasons you mentioned. 

I built my system this way to give me a ton more prototyping flexibility, as I needed it to be able to go back and review what it’s done, and to have some kind of introspection. This looping pattern gives me more power to design for the agentic / reflexive behavior I needed, though. It’s more flexible but slower and more expensive (more loops, more tokens consumed). So it’s definitely a tradeoff.

