[Bug] GPT-OSS-120B: Reasoning tokens and gibberish output appearing in responses despite configuration to hide reasoning

Model: GPT OSS 120B
Endpoint: (e.g., /v1/chat/completions)

SDK: 0.32
params = { "temperature": 1.0, "max_tokens": 512, "top_p": 0.9, "reasoning_format": "hidden", "disable_tool_validation": True }

Frequency: 4/10 requests

Expected Behavior

  • When reasoning_format="hidden", no reasoning text should appear in the final user content.

  • The response should contain only valid, user-facing text or tool calls, never both mixed together.

Observed Behavior

  • The model generates gibberish / “thinking” text at the beginning of responses — e.g. starts with “First…” or random valid tokens.

  • Occasionally the response mixes tool calls and junk text in the same message.

  • This appears as if the model is leaking reasoning tokens, even though the configuration requests them to be hidden.

  • The issue started recently; the same setup previously behaved correctly.

Sample response:

Sure! Let’s talk… First—………… …

Got……… … ………… Sorry about that glitch


I’m trying to reproduce the error with this, but I can’t get it to spit out gibberish; are you still seeing errors on your end?

Oh, the other thing we added is prompt caching; so if the FIRST request spits out gibberish and you try the same exact prompt again, the gibberish might be cached.

Bust the cache by adding a timestamp or random value at the beginning of the message.
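A minimal sketch of that cache-busting trick (the helper name and nonce format are my own; the message shape matches the chat-completions payload in the curl below):

```python
import time
import uuid

def bust_cache(messages):
    """Return a copy of the messages list with a unique nonce prepended
    to the first user message, so the prompt never hits the prefix cache."""
    nonce = f"[{int(time.time() * 1000)}-{uuid.uuid4().hex[:8]}] "
    busted = [dict(m) for m in messages]
    for m in busted:
        if m.get("role") == "user":
            m["content"] = nonce + m["content"]
            break
    return busted

messages = [{"role": "user", "content": "Extract structured data from: 'John Doe, age 30'"}]
payload_messages = bust_cache(messages)
# send payload_messages instead of messages; the original list is untouched
```

Since the nonce sits at the very start of the prompt, even an exact-repeat request misses the prefix cache.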

curl --request POST \
    --url https://api.groq.com/openai/v1/chat/completions \
    --header 'authorization: Bearer ID' \
    --header 'content-type: application/json' \
    --data '{
    "messages": [
        {
            "role": "user",
            "content": "Extract structured data from: '\''John Doe, age 30, lives in New York'\''"
        }
    ],
    "model": "openai/gpt-oss-120b",
    "temperature": 1,
    "max_completion_tokens": 8192,
    "top_p": 1,
    "stream": false,
    "stop": null,
    "reasoning_format": "hidden",
    "disable_tool_validation": true,
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_json_from_data",
                "description": "Extract and return structured JSON data from a short string of unstructured text",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "data": {
                            "type": "string",
                            "description": "The raw text data to parse and extract information from"
                        },
                        "schema": {
                            "type": "object",
                            "description": "The expected JSON schema structure to extract",
                            "properties": {
                                "type": {
                                    "type": "string",
                                    "enum": ["object"]
                                },
                                "properties": {
                                    "type": "object",
                                    "description": "Field definitions for the extracted data"
                                },
                                "required": {
                                    "type": "array",
                                    "items": {
                                        "type": "string"
                                    },
                                    "description": "List of required fields"
                                }
                            }
                        }
                    },
                    "required": [
                        "data",
                        "schema"
                    ]
                }
            }
        }
    ]
}'

The issue occurs when using tool calling with 3 or more tools, where the model reasons for a while. It’s not easily reproducible. The pattern is content first, then a tool call in the same message. Unfortunately, I cannot share the actual data.

Were you able to solve this problem? I have the same problem — something like 9 tools, and it appears when the conversation becomes long.

Happens for me when the model has to reason a lot and then forgets to send a response. If the user just types “?”, nothing happens; but if I read the reasoning tokens and use the verb mentioned there to respond to the user, it works. Bizarre bug. The other option I’m considering is to provide a scratchpad, but that is a patch, not a fix.
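My fallback looks roughly like this (a sketch with a hypothetical helper; it assumes the reasoning text is exposed to you, e.g. via reasoning_format="parsed", which returns it in a separate field rather than hiding it):

```python
def pick_reply(message):
    """Fallback for when the model reasons but 'forgets' to answer:
    if the visible content is empty, fall back to the reasoning text
    so the user at least gets something actionable."""
    content = (message.get("content") or "").strip()
    if content:
        return content
    # Assumption: with reasoning_format="parsed" the reasoning may be
    # returned in a separate "reasoning" field on the message.
    reasoning = (message.get("reasoning") or "").strip()
    if reasoning:
        return reasoning
    return "Sorry, I didn't get a response. Could you rephrase?"

reply = pick_reply({"content": "", "reasoning": "First, check the account id."})
```

It is ugly, but it beats returning an empty message to the user.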

If you find yourself with more than 3-4 tools and really large contexts, it always helps to break large tasks into smaller ones; instead of “make Thanksgiving dinner plus all the potatoes and turkey, and for each dish you do this and that”…

Try breaking the massive task into much smaller tasks, like “make the potatoes” and “make the turkey” — each with its own specific sub-tasks and context. This decomposition of the main task into smaller units gives you a much sharper definition of the jobs you want to accomplish; you can test each of them separately, swap out system prompts and tools independently for A/B testing, and you avoid overloading the model with too much to do at once.
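In code, the decomposition can be as simple as one focused call per sub-task, each with its own system prompt and tool list (all names here are illustrative; `call_model` is a stand-in for your chat-completions call):

```python
# Each sub-task is small enough to test and A/B separately.
SUBTASKS = [
    {"name": "make_the_potatoes", "system": "You cook potato dishes.", "tools": ["timer"]},
    {"name": "make_the_turkey", "system": "You roast turkey.", "tools": ["timer", "thermometer"]},
]

def call_model(system: str, user: str, tools: list) -> str:
    """Stand-in for a real chat-completions call; returns a canned
    string so the decomposition loop runs without an API key."""
    return f"{system} | {user} | tools={','.join(tools)}"

def run_dinner(request: str) -> dict:
    # One focused call per sub-task instead of one giant prompt.
    return {t["name"]: call_model(t["system"], request, t["tools"]) for t in SUBTASKS}

plan = run_dinner("Plan Thanksgiving dinner")
```

Swapping a sub-task's system prompt or tool list now touches one entry in `SUBTASKS` instead of one giant prompt.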

For Scratchpad — that seems to work on Claude Sonnet 4.5, but it slows it down by a lot and increases costs by a lot. I’d recommend trying the “break tasks down into the smallest unit that doesn’t break” strategy first.

Logically this works, but it’s harder to do generically in code.

I’m working on a real-time assistant right now that ends up using a bunch of tools:

I’ll create a “router” agent that generally chooses what classes of tools need to be used (e.g. Salesforce, Web search) and then I’ll have it generate an array of commands (basically JSON configs) that it’ll send to other functions/classes, e.g. a Salesforce function, that will agentically figure out what needs to be done for that task.

It works surprisingly well! I like using the “jobs to be done” framework to think about how to decompose really big tasks into smaller ones, wrap functions around them, and have a router essentially create a queue of commands to fire off these smaller, task-focused functions.
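A minimal sketch of that router pattern (class and function names are illustrative, not from any specific SDK): the router picks which classes of tools apply and emits a queue of JSON-like command configs, each dispatched to a small, task-focused handler.

```python
from dataclasses import dataclass, field

@dataclass
class Command:
    target: str                      # e.g. "salesforce", "web_search"
    config: dict = field(default_factory=dict)

def route(user_request: str) -> list:
    """Stand-in for the router agent: in practice an LLM call that
    returns an array of commands; here a keyword-based placeholder."""
    commands = []
    if "account" in user_request.lower():
        commands.append(Command("salesforce", {"action": "lookup_account"}))
    if "latest" in user_request.lower():
        commands.append(Command("web_search", {"query": user_request}))
    return commands

# Task-focused handlers; each would agentically work its own sub-task.
HANDLERS = {
    "salesforce": lambda cfg: f"salesforce:{cfg['action']}",
    "web_search": lambda cfg: f"search:{cfg['query']}",
}

def run(user_request: str) -> list:
    # Fire each queued command at its handler, in order.
    return [HANDLERS[c.target](c.config) for c in route(user_request)]

results = run("Find the latest news about the Acme account")
```

Each handler sees only its own slim context and tool set, which is exactly what keeps the per-call tool count low.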

Sorry, but this happens even if there is just one tool. I’m already using the router mechanism and it still happens. If you need it, I can DM you a trace.

I’ll DM you with a curl and a trace