My one-word AI prompt to induce deeper reasoning and more accurate output from ChatGPT: “RUMINATE”

Slow down, genius: A simple hack for smarter AI responses

While generative AI dazzles us with its speed, it stumbles over the simplest tasks. By inviting these models to ‘ruminate,’ we nudge the speedsters to pause, ponder, and perform a little less like a rushed coworker trying to beat the clock.

Strawberries, it seems, are in season. Not only is it rumoured to be the top-secret name behind a project at OpenAI that may or may not be Artificial General Intelligence; it’s also become a riddle for prompt engineers. Why do current AI models struggle to count the number of Rs in “strawberry”?

The “How Many Rs in Strawberry?” problem is an unexpected challenge that has gained traction among users of large language models (LLMs). The problem is simple on the surface: count the number of “r” letters in the word “strawberry.” You’d think this is a straightforward task, but it turns out to be a fascinating stumbling block for advanced AI systems, including top models like GPT-4 and Claude.
The viral AI brainteaser. Many chatbots confidently provide the wrong answer, stating there are only two. (Image created with https://www.thewordfinder.com/wof-puzzle-generator/)

Before explaining how I solved the letter-counting conundrum in a way that will improve all of your prompts (thank you, thank you, please save your adulations for the comments section), I want to expound a little on AGI, and why I think “strawberry” has cropped up as the project’s name.

Berry Mysterious: The AGI Connection

A strawberry is one of those things that’s more than it seems. It’s not a real berry! It’s not even one fruit: it’s actually about 200, because the little seeds are the real fruit part (the juicy red flesh is just a ‘receptacle’ or false fruit).

Why “Strawberry” Is the Sweetest Name OpenAI Could Pick for Artificial General Intelligence

Ripe With Meaning: The Symbolism Behind “Project Strawberry”

medium.com

If there’s a hidden message there — and Altman and friends love clues — I think “project strawberry” means we’ll be getting more than we expected. Like a strawberry, AGI is going to be self-seeding, trained on its own data.

Ironically, the word “Strawberry” also shows the current limitations of AI.

Strawberry Stumble: AI’s Letter-Counting Conundrum

ChatGPT users have been taking delight in the wrong answers it will supply you in response to a relatively simple question: “How many letter ‘r’s are in the word strawberry?” To the internet’s glee, LLMs from ChatGPT to Claude to Copilot usually incorrectly say “there are two Rs in the word strawberry”.

https://www.youtube.com/watch?v=byajUNOOqNI

“2 Rs in Strawberry” has become the “AI hands” of Large Language Models — the amusing error we love to hate, and a common complaint on the OpenAI Developer Forum! Some people even see this as a case of hallucination; however, it’s actually more of a perceptual problem than confabulation.

This isn’t AI ‘making up’ a false answer; in fact, the consistency of the mistake bears that out. Rather, the letter-counting failure reveals a lot about how LLMs parse language (breaking it into tokens) and how they respond (rapidly).

We’ll get onto tokens/speed in a moment, along with how my prompt can help us approximate the high-level problem solving of Project Strawberry, improving the AI’s ability to work through challenges more like a human would. However, first I want to address a peeve of mine: arguing with AI.

Chat Etiquette: Stop Arguing with Robots, It’s Embarrassing

I’ve noticed that in most demonstrations of ChatGPT failing the “How many Rs in Strawberry?” letter-counting problem, the user queries the answer and is surprised when the AI doubles down on the mistake. That’s just not good chat hygiene. You see, current language models are auto-regressive, which means they use what they’ve previously generated to decide what to generate next. Far from correcting errors, this compounds them. It’s why people think AI is ‘argumentative’ when it gets things wrong.

The strawberry riddle has become something of an in-joke among AI enthusiasts. The idea that a sophisticated AI can stumble over counting Rs in a word as simple as “strawberry” is both amusing and a bit concerning. It underscores the gap between AI’s apparent intelligence and its actual performance.
Source: OpenAI Developer Forum

So pestering AI about a mistake, or trying to get it to change tack when it’s being as stubborn as a mule, really only demonstrates that its “lookback” is working.

Unless you can really shock it out of it (I recommend swearing!) you’ll only get more of the same. Arguing is an exercise in futility; it’s better to refresh the conversation and start again. By resetting the dialogue, you give the AI a chance to reevaluate the task without the baggage of its previous errors.
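If you work with the API directly, the mechanics are easy to see. Below is a minimal sketch using the OpenAI Python SDK: “arguing” means appending your protest to the same message history, so the model re-reads and defends its own wrong answer, while refreshing means starting a clean list. The model name and prompts here are illustrative, not a prescription.

```python
# Minimal sketch using the OpenAI Python SDK; the model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(messages):
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

# "Arguing": the follow-up is appended to the same history, so the model
# re-reads its own wrong answer and tends to defend it.
history = [{"role": "user", "content": "How many letter 'r's are in the word strawberry?"}]
history.append({"role": "assistant", "content": ask(history)})
history.append({"role": "user", "content": "Are you sure about that?"})
doubled_down = ask(history)  # usually a confident restatement of the same mistake

# "Refreshing": a brand-new message list carries none of that baggage.
fresh_start = ask([{"role": "user",
                    "content": "Count the letter 'r's in 'strawberry', checking each letter one at a time."}])
```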

Why AI Is Stubborn (It’s Done Dealing with You)

Doubling down is even worse in 4o than in earlier versions of ChatGPT, which makes me suspect it might be built into the “Personality V.2” settings of 4o. “Personality” dictates how Chat interacts with you; in my opinion, Personality V.2 is more belligerently confident in its own expertise, perhaps in an effort to save tokens (more on tokens later!). You can reveal the system prompts, i.e. the compulsory chat settings, by using this hack:

Unmasking AI. The system prompts behind everything

ChatGPT spills its guts about our digital interactions

medium.com

Ironically, it’s attempting to avoid user frustration by providing immediate, seemingly definitive answers. Apple Intelligence’s backend prompt tells it to prefer to reply in clauses, not complete sentences, in order to reduce token spend! It’s a bit like how some businesses prioritize speed over accuracy, rushing customers out the door without really addressing their concerns.

ChatGPT is obstinate that “There are two Rs” because it’s done talking with you. Next! Computational resources cost money, so it brooks no argument. However, this assertive approach often backfires when the system clings to inaccuracies, and means ChatGPT almost never admits when it’s mistaken. “Are you sure about that?” invites a positive reiteration. Of course it’s sure!

Future AGI will be able to learn from mistakes. But currently you can’t re-teach a pre-trained chatbot, as it operates in a temporary state specific to your interaction. It’s just going to repeat the same thing back to you again, unless you interrupt the cadence. It’s not really evaluating the error, and it’s a fool’s errand to try to correct it. At best, you establish a pattern for the chat to follow where the AI will assume this back-and-forth is what you want. In other words, by persistently challenging an AI within the same interaction, you might inadvertently reinforce a dynamic where the AI mimics the rhythm of the debate — rather than addressing the mistake.

You can of course edit your earlier request to try to redirect the chat down another path, but it draws on the recent history of outputs. Better refresh!

The Tokenization Trap

But why does ChatGPT fail so spectacularly at character counting? Well, an AI doesn’t read words the same way you and I do. Natural language models like GPT process text by breaking it into basic units, or “tokens”. In simple terms, tokens can be short words, parts of longer words, or punctuation marks, depending on how the model has been trained to segment text.

Here’s how “How many Rs in the word ‘STRAWBERRY’?” is tokenized:
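A minimal way to see the splits for yourself is the sketch below, which assumes the tiktoken library and the cl100k_base encoding (the exact chunks vary by model and tokenizer version):

```python
# Hedged sketch: inspect how a tokenizer chops up the question.
# Assumes tiktoken and the cl100k_base encoding; splits vary by model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
question = 'How many Rs in the word "STRAWBERRY"?'
pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
          for t in enc.encode(question)]
print(pieces)  # the word arrives as a handful of chunks, not as individual letters
```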

The error likely stems from the way LLMs handle tokenization and pattern recognition. Instead of reading words the way humans do, AIs break down inputs into tokens that may not correspond directly to individual letters. This can lead to errors in simple counting tasks, where context and pattern matter more than raw linguistic data.

The prevailing theory is that the error is due to how the word “Strawberry” is broken into those units; that it’s somehow compressing letters together. However, I believe there’s a subtler reason. When we try it with a similar but entirely made-up word, of the same length and with Rs in the same locations (3, 8, and 9 in a ten-letter word), we get identical tokenization:

If an AI can stumble over something as basic as counting letters, what does that mean for more complex tasks? This issue highlights potential limitations in current LLMs, suggesting that while they excel in natural language generation, their underlying understanding of language mechanics might be shallow.

However, the answer is correct! So ChatGPT can count the Rs in “coralberry”, but not in “strawberry”. Accordingly, the error can’t be due to the division of tokens alone.

While AIs often fumble with the number of Rs in “strawberry,” they typically have no such issue with “coralberry.” It doesn’t trigger the same cognitive bias or miscount that often occurs with its fruitier cousin. Despite similar tokenization, “coralberry” doesn’t present the same computational challenge.
“Coralberry” doesn’t work every time, but it is letter-counted correctly more often than “strawberry”.

While I agree the “2 Rs” error has to do with economizing computation, I hypothesize it has more to do with how the model can get away with superficially processing words. “Coralberry” — an uncommon word — makes it pay attention, rather than breezing through on a surface level with an easy one.
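You can run the same comparison yourself. The sketch below, again assuming tiktoken with the cl100k_base encoding, simply prints how each word is split; your exact chunks may differ by model and tokenizer version.

```python
# Compare how the two words are split; assumes tiktoken with cl100k_base,
# so the chunks you see may differ from the screenshots above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ("strawberry", "coralberry"):
    chunks = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
              for t in enc.encode(word)]
    print(word, "->", chunks)
```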

Thinking Too Fast: Why “Genius” AI Flunks Mensa Puzzles

I’ve seen issues similar to the Strawberry problem before. I found ChatGPT was flummoxed by Mensa word puzzles. “Mensa-level” sounds hard, but it should have been a breeze, considering that ChatGPT has a verbal IQ of 155 (putting it in the top 0.1% of the population; Mensa members are in the top 2%). The reason they confused the AI is that the brainteasers were written in such a way as to trip up automatic thinking and quick, obvious assumptions. The trick with these puzzles is to slow down and not go with your first impression. The AI happily blurted out an instant answer, but it was almost always incorrect.

Can ChatGPT Solve Mensa Puzzles?

Daily Challenges: AI vs. Mensa Calendar Puzzles

pub.towardsai.net

My theory — which we’re about to test out — is that the inaccuracies result from the spontaneity of generative responses. All an LLM does is intuit what word comes next in a sentence, vamping like a student who hasn’t done the reading and is making up a presentation on the spot.

Ironically, the tech industry often touts the speed of AI platforms, and their ability to respond in real time, as desirable features. But it turns out that users actually prefer dynamic delays in chat interactions. Greater latency would also allow the AI to get to the end of the sentence before supplying output. Researchers from the Stanford Graduate School of Business agree that, just like humans, LLMs perform better at complex tasks when they slow down.

The utility of NLP systems could be improved by giving them time to examine the task, compose, and reflect. Instead, AI chat is modelled after instant messaging — we see the words appear on screen in rapid succession, one after the other. The result is that it races ahead, like a stone skipping across water.

From Fast Talker to Deep Thinker

What we’re seeing is analogous to fast and slow thinking in humans, what Daniel Kahneman dubbed System 1 and System 2 thinking. It seems paradoxical, but we have to slow AI down to get better reasoning. In fact, some suggest that’s exactly what Project Strawberry is: a long-inference large language model, capable of analyzing things it would otherwise skim.

System 1’s rapid-response is great for generating fluid, conversational text, but it can stumble when precision is required — like counting the number of Rs in ‘strawberry’. By contrast, System 2 thinking, which is slower and more methodical, is what we need for tasks that involve critical thinking.

Currently LLMs like ChatGPT are System 1: speedy, yet prone to errors, relying on heuristics and pattern recognition rather than on deliberate reasoning (in humans, this would be instinctual, unconscious thought).

Attaining a true System 2 level of reasoning would let AGI more closely mimic human cognitive processes, in particular the intentionality of thought and the ability to reflect on challenges the way a person does.

You can see why Project Strawberry is a big deal. It shifts AI from word-calculators to something approximating phenomenological experience.

“Strawberry” is rumored to be a secret codename for an AGI project at OpenAI. Like the fruit, which is more complex than it seems, AGI may hold hidden depths — potentially seeding its own growth through self-learning.
Image created by Jim the AI Whisperer (2024)

But can we emulate this long inference with the models we currently have?

How I Tricked AI into Thinking Longer and Harder

We still need to consider token spend (although not the breaking up of the word “Strawberry”), because we’re compelling the model to predict more tokens in order to put more thought into its answers. I’ve previously had success convincing ChatGPT to use more tokens when troubleshooting code using my “golden prompt”. The key that time was asking ChatGPT:

“Please really think on this and come up with an accurate and simple solution that does not interfere with the rest of the code”.

I found that approach worked well (much better than arguing), as did:

“Can you describe what might be happening first, really try to understand it, then come up with a creative solution?”

By prompting System 2 style thinking, the output was longer and improved. This approach can be particularly effective when troubleshooting, or when we need the AI to break away from a pattern of mistakes it’s fallen into. The same strategy worked like a charm for solving the letter counting problem:

As you can see, the AI identified the problem, worked out a procedure, and then self-invigilated. I like that it not only spelled the word out, but checked each time whether a letter “was or was not an R”, then added them up. It’s a longer, more leisurely inference process, with a higher token spend than a brusque “There are 2 Rs in Strawberry”. And most importantly: it’s correct.
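For contrast, the same spell-it-out-and-check procedure takes only a few lines of ordinary code, and never needs persuading to do it properly:

```python
# The procedure the model talked itself through, as plain deterministic Python.
word = "strawberry"
total = 0
for position, letter in enumerate(word, start=1):
    is_r = letter.lower() == "r"
    print(f"{position}. {letter} {'is an R' if is_r else 'is not an R'}")
    total += is_r
print(f"Total Rs in '{word}': {total}")  # 3
```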

In another attempt, it even wrote a creative poem to give me the answer!

Start by counting the letters one by one.
Take your time, and have some fun.
Remember to focus on the “R”s you see,
And check them carefully, counting all three.
Wait! Did you spot one? Let’s move along,
Because we need to ensure nothing goes wrong.
Each “R” is important, so don’t miss a beat,
Review them closely; your task is complete.
Rejoice! You’ve counted every “R” here,
Yes! The total count is clear, my dear!

There’s something wonderfully humanizing about that — it reminds me of the mnemonics we’d recite in school to remember how to spell correctly. There’s literally a pace and a rhythm to working through the process.

We can also slow down ChatGPT through chain-of-thought prompting — i.e., breaking the task into a multistep request and guiding it to go through the problem methodically. But I wanted to find one single power word to do it. One well-chosen magic word makes all the difference in prompts! For instance, I swear by “Please” (and more recently, swear words with 4o).
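If you do want to spell the chain of thought out yourself rather than rely on a single word, a multistep prompt might look something like this (a hedged example, not a canonical recipe):

```python
# Illustrative chain-of-thought prompt: break the task into explicit steps.
chain_of_thought_prompt = """Let's work through this step by step:
1. Spell out the word 'strawberry' one letter at a time.
2. For each letter, say whether it is or is not an 'r'.
3. Keep a running tally as you go.
4. Only after finishing the whole word, report the final count."""
```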

Solving the “Strawberry Problem” With a Single Power Word

After a lot of experimentation, the word I ultimately decided worked best for inspiring deeper reasoning was RUMINATE. To ruminate means to think long and deeply, so it perfectly sums up prompting AI toward longer inference.

When you want ChatGPT to think more deeply, try the magic word: “RUMINATE.” This simple prompt can push AI from quick, superficial responses (System 1 thinking) to more thoughtful, accurate outputs (System 2 thinking).
Ruminate on this: How many R’s are in “STRAWBERRY”? Jim the AI Whisperer (2024)

The idea is to slow down the generation process, rather than its usual rapid-fire, System 1 response. By encouraging the AI to ‘ruminate,’ I nudge it closer to System 2 thinking, pushing the model to engage more deeply with the task:

AI That Actually Thinks Before It Speaks

Interestingly, when given an opportunity to ruminate on the problem, not only was it correct more often, but it also acknowledged the challenge of counting Rs, which gives us insight into what was going on (thus confirming my theory):

It was even able to handle my misspelling of “Strrawberrry”, which meant it was paying closer attention, and didn’t autocorrect to a familiar spelling:

It’s Not Just Token Segmentation After All

As you can see, comparing the basic prompt with the prefix “Ruminate on this” doesn’t alter the token breakup of the word “strawberry” (ST-RAW-B-ERRY in both cases), putting paid to the notion it’s a segmentation issue:

Instead of redistributing tokens (which is one of the methods other prompt engineers are trying), my subtle yet effective prompt tweak induces a shift in the AI’s attitude toward token expenditure, encouraging a more mindful analysis that mirrors the way we percolate ideas when we ponder a puzzle.

Sometimes, AI Just Doesn’t Listen

It’s important to note that “ruminate” doesn’t work all of the time; AI still occasionally hallucinates or gets on the wrong track, especially if it starts off with “There are 2 Rs” and then tries to justify this initial error at length.

You can usually tell if this is going to happen; there isn’t a pause before it launches into the answer. It’s like a gameshow contestant buzzing in before the host has finished asking the question. In my opinion, when this occurs, the instruction to “ruminate” hasn’t been taken in. In short, I think it’s a failure to process the prompt itself in full: ironically, the model has glossed over it!

You might have experienced a similar issue when you’ve asked ChatGPT to do an online search, but it responded without doing the research, because it’s overconfident. I don’t like to personify AI, but in both cases it seems like it’s taken one look at the request and said “Don’t tell me how to do my job”.

However, “ruminate” is still far more reliably correct on the Strawberry test than prompting without it, which almost always gets the wrong answer. In my estimation, with “ruminate” it’s correct roughly 80% of the time, versus roughly 20% without.
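Those are rough odds from my own testing, not a benchmark. If you want to check them for yourself, a small harness along these lines will do; it assumes the OpenAI Python SDK, and the model name, trial count, and crude answer check are all illustrative.

```python
# Rough harness to compare accuracy with and without the "ruminate" prefix.
# Assumes the OpenAI Python SDK; model, trial count and answer check are illustrative.
import re
from openai import OpenAI

client = OpenAI()
QUESTION = "How many letter 'r's are in the word strawberry?"
TRIALS = 20

def looks_correct(answer: str) -> bool:
    return bool(re.search(r"\b(3|three)\b", answer, re.IGNORECASE))  # crude check

for prefix in ("", "Ruminate on this: "):
    hits = 0
    for _ in range(TRIALS):
        reply = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prefix + QUESTION}],
        ).choices[0].message.content
        hits += looks_correct(reply)
    print(f"{'with' if prefix else 'without'} 'ruminate': {hits}/{TRIALS} correct")
```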

When AI Finally Does Its Homework

Another advantage over low-token output is that it shows the working. This means if we ask a question to which we don’t already know the answer, we can follow along and check it, which makes it easier to spot any mistakes.

My “ruminate” prompt works on other AI platforms too, such as Claude:
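Here’s roughly what that looks like through the Anthropic Python SDK; the model name is illustrative, so substitute whichever Claude model you have access to.

```python
# Hedged sketch of the same trick against Claude via the Anthropic Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative model name
    max_tokens=500,
    messages=[{"role": "user",
               "content": "Ruminate on this: how many letter 'r's are in 'strawberry'?"}],
)
print(message.content[0].text)
```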

Previously, Claude performed the worst of all on the Strawberry problem:

How to Sow “Ruminate” in Your Everyday AI Life

So, how can you apply this in your prompting techniques? Simple: preface your requests with a command to ‘ruminate on this’ and then clearly state the task. This encourages the LLM to slow down, take an extra moment to process the request and to generate a more accurate, thoughtful response.
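If you use the API rather than the chat interface, you can bake the power word into a tiny helper. This is a minimal sketch assuming the OpenAI Python SDK; the function name and model are my own illustrative choices.

```python
# Minimal helper that prefixes every task with the power word.
from openai import OpenAI

client = OpenAI()

def ruminate(task: str, model: str = "gpt-4o") -> str:
    """Preface the task with 'Ruminate on this:' and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Ruminate on this: {task}"}],
    )
    return response.choices[0].message.content

print(ruminate("How many letter 'r's are in the word strawberry?"))
```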

While current AI models are fast and fluent, they can sacrifice reasoning skills for speed. By tweaking your prompts to promote a bit of humanized ‘rumination,’ you can coax the AI into a more reflective mode, improving its reasoning, reducing those frustrating errors, and avoiding arguments.

In a few weeks, if OpenAI’s Project Strawberry bears fruit, we might see a model that combines fast and slow thinking, changing the field forever. Until then, remind it to stop and smell the strawberries once in a while.
