
In your documentation for Prefilling at https://console.groq.com/docs/prefilling, I noticed the following sentence:


> Note: For some models, adding a newline after the prefill assistant message leads to better results.


You can fix this by using “token healing”. The quality drops because a sequence of smaller tokens is less likely to be generated by the LLM than the single larger token that merges them. If the prefill stops in the middle of such a sequence, the larger, more likely token can no longer be generated, which confuses the model.


For example, consider the prefill

`def quicksort(values)`

Without token healing, you might get the completion

`def quicksort(values) -> list`

instead of the more common

`def quicksort(values):`

because the larger, more likely token `):` can no longer be generated: the token `)` is already part of the prompt and cannot be undone.
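To see the effect concretely, here is a small inspection script. It uses the Hugging Face `transformers` GPT-2 tokenizer purely as an illustrative stand-in; whether `):` actually merges into a single token depends on the vocabulary of the model you are prompting.

```python
# Inspect how a BPE tokenizer splits the prefill vs. the complete line.
# GPT-2 is only a stand-in here; the splits depend on the model's vocabulary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

prefill = "def quicksort(values)"
full_line = "def quicksort(values):"

print(tok.tokenize(prefill))    # how the prefill text gets split
print(tok.tokenize(full_line))  # shows whether "):" ends up as one merged token
```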


The solution is to chop off the last few tokens of the prefill and let the model continue from there, so it is free to generate the larger merged token again. Of course, make sure to zero out the probability of every token that does not match the chopped-off text.
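To illustrate the mechanics, here is a minimal, self-contained sketch over a made-up toy vocabulary. The token texts, ids, and the helper name are hypothetical; a real implementation would operate on the model's actual tokenizer and logits, as the llama.cpp PR below does.

```python
# Token-healing sketch over a toy vocabulary. Everything here (vocab, ids,
# function name) is invented for illustration.

TOY_VOCAB = {")": 0, "):": 1, ":": 2, " ->": 3, " list": 4, "\n": 5}
ID_TO_TEXT = {i: t for t, i in TOY_VOCAB.items()}


def heal_and_constrain(prompt_ids: list[int]) -> tuple[list[int], set[int]]:
    """Chop the last token off the prompt and return the shortened prompt
    plus the set of token ids allowed as the first generated token.

    Only tokens whose text starts with the removed text stay allowed, so the
    model can re-emit a larger merged token such as "):" instead of being
    stuck after a bare ")". Every other token's probability gets zeroed out
    before sampling.
    """
    removed_text = ID_TO_TEXT[prompt_ids[-1]]
    shortened = prompt_ids[:-1]
    allowed = {tid for text, tid in TOY_VOCAB.items() if text.startswith(removed_text)}
    return shortened, allowed


# The prefill "def quicksort(values)" ends in the token ")" (id 0 here).
shortened, allowed = heal_and_constrain([0])
print(sorted(ID_TO_TEXT[t] for t in allowed))  # [')', '):'] both remain possible
```

With that constraint in place, the first sampled token can again be the merged `):`, and decoding continues normally afterwards.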


Here is the corresponding PR in llama.cpp for reference: https://github.com/ggml-org/llama.cpp/pull/7187


And once that is in place, you can implement full-blown GBNF grammar support, which allows generating JSON with a specific schema, XML, YAML, syntactically correct programs, and anything else that can be expressed as a grammar: https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md

Thank you for the great idea, and the thoughtful post! I’ll forward this to the team.