If you are reading this, you are probably well aware of OpenAI’s large language models, such as GPT-3 and its dialogue interface, ChatGPT. You may also have heard concerns from teachers and employers about heavy reliance on large language models (LLMs), or about using them to generate content automatically.

The OpenAI Text Completion API uses a combination of machine learning models and various techniques to generate completions for a given prompt or text.

Presence Penalty and Frequency Penalty

The presence penalty and frequency penalty are two such techniques. Both are sampling parameters that lower the score of tokens the model has already produced, which reduces repetition in the completion.

Presence penalty applies a flat, one-time penalty to any token that has already appeared in the text so far, no matter how many times it has appeared. Positive values therefore nudge the model toward tokens it has not used yet, which increases its likelihood of moving on to new words and topics.

Frequency penalty applies a penalty that grows with the number of times a token has already appeared in the text so far. Positive values make each additional repetition less likely, which decreases the model’s tendency to repeat the same phrase or line verbatim.


By combining these two penalties, the OpenAI Text Completion API can generate completions that repeat themselves less and draw on a wider range of tokens than the model’s unpenalized output.
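To make the mechanics concrete, here is a minimal sketch of how the two penalties could adjust token scores before sampling. It assumes the adjustment described in OpenAI’s API documentation (subtract the frequency penalty once per prior occurrence, and the presence penalty once if the token has occurred at all). The function name `apply_penalties` and the toy vocabulary are hypothetical, chosen only for illustration.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of `logits` with presence and frequency penalties applied.

    logits: dict mapping token -> raw score (a toy vocabulary, for illustration)
    generated_tokens: list of tokens produced so far
    """
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, score in logits.items():
        count = counts.get(token, 0)
        # Frequency penalty scales with the number of prior occurrences.
        frequency_term = count * frequency_penalty
        # Presence penalty is a flat deduction once the token has appeared at all.
        presence_term = presence_penalty if count > 0 else 0.0
        adjusted[token] = score - frequency_term - presence_term
    return adjusted


# Hypothetical scores for three tokens, and the tokens generated so far.
logits = {"the": 2.0, "cat": 1.5, "sat": 1.0}
history = ["the", "cat", "the"]

adjusted = apply_penalties(logits, history, presence_penalty=0.5, frequency_penalty=0.25)
print(adjusted)
```

In this toy case "the" has appeared twice, so it is penalized more heavily than "cat", while "sat" has not appeared and keeps its original score.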

Methods for LLM Detection

Understanding Perplexity

Perplexity is a measure of how well a language model predicts a given text. In the context of large language models, such as OpenAI’s GPT models, perplexity is used to evaluate the quality of the model and to compare different models.

Perplexity is defined as the exponential of the cross-entropy loss between the true distribution of the text and the model’s predicted distribution. Cross-entropy loss measures the difference between the true distribution and the predicted distribution, and the exponential is used to provide a more interpretable measure of this difference.

Below we can see the formula for perplexity:

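\mathrm{PPL}(p, q) = 2^{H(p, q)}

H(p, q) = -\sum_{x} p(x) \log_2 q(x)

Here H(p, q) is the cross-entropy, measured in bits, between the true distribution p and the model’s predicted distribution q.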

A lower perplexity score indicates that the model is better at predicting the text, and therefore has a higher quality. Conversely, a higher perplexity score indicates that the model is worse at predicting the text, and therefore has a lower quality.

Perplexity is commonly used to evaluate language models on benchmark datasets, such as the Penn Treebank or the Wikipedia corpus. By comparing the perplexity of different models on these datasets, researchers can assess the quality of the models and determine which models are best suited for a particular task.
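As a small worked example, perplexity can be estimated from the probabilities a model assigned to each token that actually occurred in a text. This is only a sketch: the function name `perplexity` and the probability values are made up for illustration, and a real evaluation would take these probabilities from a model’s output.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence, given the model's probability for each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log2-probability per token: the empirical cross-entropy in bits.
    cross_entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** cross_entropy


# Hypothetical per-token probabilities for two four-token texts.
confident = perplexity([0.5, 0.5, 0.5, 0.5])          # model found each token fairly likely
uncertain = perplexity([0.125, 0.125, 0.125, 0.125])  # model found each token unlikely

print(confident, uncertain)
```

The text whose tokens the model considered more probable gets the lower perplexity, which matches the interpretation given above.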

Methods of LLM Detection

There has been quite a bit of research into detecting the output of large language models and, more importantly, into the theoretical limits of detection [1].

Using Penalties to Circumvent LLM Detection

Effectiveness of Commercial LLM Detection Software

Given what we’ve discussed above, let’s test some of the leading commercial software for detecting large language models and check its effectiveness at recognizing output from OpenAI’s text-davinci-003 model.

Here we explore the output of the base model with a very low temperature setting and the default frequency and presence penalty parameters:

Now let’s see how some commercial LLM detection software performs on this text:

The detection model clearly recognizes that this text is made up of very high-probability tokens, so it easily identifies it as LLM output. But what happens as we begin to increase the frequency and presence penalties on output tokens? Let’s keep the same prompt, but explore how adjusting these parameters changes the output while preserving the same overall “thesis”:
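For readers who want to reproduce this kind of comparison, the two configurations could be written as request parameters along the following lines. This is a sketch under stated assumptions: the helper `completion_params`, the placeholder prompt, the `max_tokens` value, and the specific penalty values are all hypothetical and are not the settings used for the outputs in this article. The parameter names `temperature`, `frequency_penalty`, and `presence_penalty` are those of the completions API, where both penalties default to 0 and accept values between -2.0 and 2.0.

```python
def completion_params(prompt, penalized=False):
    """Build request parameters for a baseline or a heavily penalized completion."""
    params = {
        "model": "text-davinci-003",
        "prompt": prompt,
        "temperature": 0.0,        # very low temperature (illustrative value)
        "max_tokens": 256,         # illustrative value
        "frequency_penalty": 0.0,  # API default
        "presence_penalty": 0.0,   # API default
    }
    if penalized:
        # Illustrative values only; not the settings used in this article.
        params["frequency_penalty"] = 1.5
        params["presence_penalty"] = 1.5
    return params


placeholder_prompt = "<your prompt here>"  # the article's actual prompt is not reproduced
baseline = completion_params(placeholder_prompt)
penalized = completion_params(placeholder_prompt, penalized=True)

print(baseline)
print(penalized)
```

Keeping the prompt and temperature fixed while changing only the two penalties makes it easier to attribute any change in a detector’s score to the penalties themselves.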

Notice how even the formatting changes. Since whitespace and punctuation count as tokens, we can assume this also affects LLM detection. If we pass this LLM-generated text to the detector, we can see that the detection capability completely breaks:

You will notice that the LLM detection software still believes there is a 1% chance that this text is AI-generated, but clearly the simple changes to the frequency and presence penalties have been effective at nullifying this detection model. Now, just for fun, let’s have the LLM sarcastically mock the software in a manner that would be obvious under human review, but not under automated review:

And now for the detection result:

Perhaps the model was right: it is an “almighty force to be reckoned with!”

Conclusion

If you are using LLM detection software, you should be aware that it can only perform well against the default settings of large language models. Currently, resources like ChatGPT do not allow users to edit temperature or the frequency and presence penalties, but it is very likely that these parameters will become user-editable in the future. That would cause most detection software to “break”: if detectors begin flagging highly penalized outputs, they will also produce many false positives on human-written text. In fact, we have already seen this behavior for human-written Wikipedia articles, because the training corpus for large language models almost always includes the full text of Wikipedia. In conclusion, make sure you know what you’re paying for if you subscribe to these LLM detection services!

Sources

[1] https://arxiv.org/pdf/2002.03438.pdf

Davison Robie