[Issue] Tool Calling Failures on Groq LLMs via Pydantic-AI (OpenAI baseline vs Groq)

Hi everyone,

I’m the creator of madin (github: kinyugo/madin), a library for preparing documents for agentic RAG. Recently, I’ve been experimenting with running it on Groq-hosted LLMs. While the output quality itself is good, I’ve run into serious inconsistencies with tool calling.

Here’s what I’ve observed so far:

  • Without explicitly passing tool_choice: required, Groq models almost always fail to call tools.
  • Even when tool_choice: required is set, the models sometimes attempt to call non-existent tools.
  • By contrast, OpenAI’s models handle tool calls reliably with the same setup (madin uses pydantic-ai under the hood).
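To make the failure modes concrete outside pydantic-ai, here is a minimal sketch of a request against Groq's OpenAI-compatible endpoint plus a guard for the hallucinated-tool case. The helper names `build_payload` and `is_known_tool` are illustrative, not part of any library, and the tool schema is a trimmed-down stand-in:

```python
# Sketch: an OpenAI-compatible Groq request that forces tool use, plus a
# guard against hallucinated tool names. The endpoint and model mirror the
# thread; build_payload and is_known_tool are illustrative helpers, not a
# real library API.

GROQ_CHAT_URL = "https://api.groq.com/openai/v1/chat/completions"

tools = [{
    "type": "function",
    "function": {
        "name": "get_json_from_data",
        "description": "Extract structured JSON from a short string of text",
        "parameters": {
            "type": "object",
            "properties": {"data": {"type": "string"}},
            "required": ["data"],
        },
    },
}]

def build_payload(model: str, user_content: str, tools: list) -> dict:
    """Assemble a chat request; without tool_choice the models often skip the tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
        "tools": tools,
        "tool_choice": "required",
    }

def is_known_tool(call_name: str, tools: list) -> bool:
    """Reject tool calls whose name was never declared (the hallucinated-tool case)."""
    return call_name in {t["function"]["name"] for t in tools}

payload = build_payload("openai/gpt-oss-120b",
                        "Extract structured data from: John Doe, age 30", tools)
```

Checking the returned `tool_calls[i].function.name` against `is_known_tool` before dispatching is one way to fail fast when the model invents a tool instead of silently erroring downstream.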

For reference, here are some example traces from Logfire:

  1. openai:gpt-4.1 (baseline): Works well, with errors unrelated to tool calling.
  2. groq:openai/gpt-oss-120b: Tool call fails but the model partially recovers. (Ran with tool_choice: required).

I’d really like to adopt Groq services long-term since their latency is excellent (and the GPT-OSS models perform well on other providers like Hugging Face). However, the tool calling experience is currently very unreliable, which makes it difficult to use in production.

Has anyone else encountered similar issues with Groq + tool calls in pydantic-ai? Any known workarounds?


As a follow-up to my earlier post, I wanted to share one more trace and a reproducible notebook.

Here’s another example from Logfire:

To make this easier to verify, I’ve also prepared a notebook that reproduces the runs I described.

Curious if anyone else has tested this model specifically, or if there are known workarounds to improve tool call reliability on Groq.


Same problem here. Groq calls fail with:
openai.APIError: Tool choice is none, but model called a tool

I see your tool is called `json`, but sometimes it’s useful to be really explicit with the function name — under the hood it’s gpt-oss trying to figure out what the tool names are. It’s not as powerful as something like sonnet-4-5, so you have to spoon-feed it a lot more information.

this for example works pretty well:

curl --request POST \
    --url https://api.groq.com/openai/v1/chat/completions \
    --header 'authorization: Bearer ID' \
    --header 'content-type: application/json' \
    --data '{
    "messages": [
        {
            "role": "user",
            "content": "Extract structured data from: '\''John Doe, age 30, lives in New York'\''"
        }
    ],
    "model": "openai/gpt-oss-120b",
    "temperature": 1,
    "max_completion_tokens": 8192,
    "top_p": 1,
    "stream": false,
    "stop": null,
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_json_from_data",
                "description": "Extract and return structured JSON data from a short string of unstructured text",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "data": {
                            "type": "string",
                            "description": "The raw text data to parse and extract information from"
                        },
                        "schema": {
                            "type": "object",
                            "description": "The expected JSON schema structure to extract",
                            "properties": {
                                "type": {
                                    "type": "string",
                                    "enum": ["object"]
                                },
                                "properties": {
                                    "type": "object",
                                    "description": "Field definitions for the extracted data"
                                },
                                "required": {
                                    "type": "array",
                                    "items": {
                                        "type": "string"
                                    },
                                    "description": "List of required fields"
                                }
                            }
                        }
                    },
                    "required": [
                        "data",
                        "schema"
                    ]
                }
            }
        }
    ]
}'

@yawnxyz The tool definitions are handled by pydantic-ai, so I don’t have the flexibility to change them. I have tested it with other small models such as gpt-5-small and I haven’t observed any issues with tool calls.

Tool calling fails with larger models like kimi-k2-instruct as well, so I am not sure it’s about model size.

oh I see! I’ll experiment a bit with pydantic-ai and report back (I’m mostly in JS)

how often does this fail, %-wise?

(I usually call tools vanilla, via fetch / curl, and at least on oss-120b I get almost 100% success, so it might be some kind of mismatch of prompt/tool/model/naming and pydantic-ai)

It seems that your services regressed because I never had issues with the old kimi-k2 and previous models.

Currently it fails about 95% of the time. The output is usually correct, but the tool calls fail even after model retries. I have had to shift to other providers.

The experience is quite consistent across Python libraries. I have tried instructor as well and I get the same tool-calling issue.
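To put a number on claims like "fails about 95% of the time", a tiny measurement harness along these lines can help. `run_once` here is a stub standing in for one real pydantic-ai/Groq agent call that returns True only when a valid tool call was made; the simulated outcomes are illustrative, not measured data:

```python
# Sketch of a tool-call reliability harness. run_once stands in for one real
# pydantic-ai / Groq agent call that returns True iff a valid tool call was
# made; the simulated outcomes below are illustrative, not measured data.

def tool_call_success_rate(run_once, n: int = 20) -> float:
    """Fraction of n runs in which the model produced a valid tool call."""
    return sum(1 for _ in range(n) if run_once()) / n

# Simulate a provider that calls the tool correctly in 1 run out of 20,
# roughly matching the ~95% failure rate reported above.
outcomes = iter([True] + [False] * 19)
rate = tool_call_success_rate(lambda: next(outcomes), n=20)
print(f"tool-call success rate: {rate:.0%}")  # prints "tool-call success rate: 5%"
```

Swapping the stub for a real agent invocation makes it easy to compare providers (or model versions) on the same prompt and tool set.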

oh interesting, I’ll bring this up to the Kimi team to fix. Thanks so much for flagging this!

You are welcome. I am happy to help and offer as much feedback as I can.

Groq is an amazing platform. However, the tool calling failures make it pretty unusable for any production workload. I hope this gets fixed soon.

Your repo is cool. We do lots of RAG pipelines where I work.

Have you tried Docling for unstructured data extraction? Docling is open-source and runs locally.

@Harry Thanks for the kind sentiments. Yes, I do primarily use docling for document extraction.

My lib is supposed to take the markdown output of tools such as docling and ensure that it has a hierarchical structure we can use for agentic RAG, similar to PageIndex. I aim to build a more flexible and customizable pipeline.


I was unfamiliar with PageIndex, but looks great!


I would also like to add that Kimi K2 0905 is awesome and fast; however, we get rampant tool-call issues.

The tool definitions are all very precise and exact, but about 80% of the time it fails to call the tool and just hallucinates an answer.

E.g. a get_date_and_time tool, with the prompt “what’s the year?”
Result: the model doesn’t call the tool and just responds with “2025”. The same thing happens for “what’s the day today?”: the model just hallucinates a wrong date.
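For clarity, a tool like that is trivial. Here it is as a plain function (the pydantic-ai registration wiring is omitted, and the function name simply mirrors the example above), which is why a wrong date in the response can only mean the model skipped the call and guessed:

```python
# The tool from the example above as a plain function (pydantic-ai wiring
# omitted). It is deterministic and trivially correct, so a wrong date in
# the response can only mean the model skipped the call and guessed.
from datetime import datetime, timezone

def get_date_and_time() -> str:
    """Return the current UTC date and time in ISO 8601 format."""
    return datetime.now(timezone.utc).isoformat()

print(get_date_and_time())
```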

We’ve also had issues with the gpt-oss models calling tools that don’t exist. I believe this to be a platform issue rather than the individual models themselves. I think this is a large blocker for teams like mine migrating to Groq from OpenAI.


Thank you for the reports, we’re working on rolling out some tool call updates/patches very soon!

did you solve the issue?

We’ve been rolling out more updates/patches under the hood; would love to hear if it’s solved your issues as well

Wait, what? Tell us more. Would love to run evals if you have patched stuff. Any insights would help, especially which models?

Oh, we’re improving our harness, but constrained decoding is juuust around the corner, and the team is working hard to get it out the door.