The Solution to Hallucinations in LLMs Will Likely Not Be Found Within

By Anthony Spaelti, Principal, Cota Capital

Generative AI has gained widespread adoption through chatbots like ChatGPT and Claude, with many people now using AI tools like these daily. They’re popular because they work. Well-trained large language models (LLMs) can consistently and quickly produce high-quality outputs when processing new, unseen inputs.

But LLMs don’t work 100% of the time. Occasionally, they generate content that’s fluent and syntactically correct but factually wrong. This is called a “hallucination.” A classic example is a grammatically perfect but nonsensical statement like, “A chair is a fruit.” Clearly, this is about as logical as it is edible.

One reason hallucinations frequently occur is that the model lacks knowledge about the input topic. As the model’s ability to relate the input to its existing knowledge diminishes, the probability of generating an accurate response dramatically decreases.

However, it’s not just missing knowledge in the model’s training data that can produce a hallucination. To understand this, let’s look at how LLMs actually work: An LLM’s architecture is designed to take a number of tokens (think of them as words) as input, then predict what token (or word) is most likely to follow that given input.

It’s important to understand that the token a model actually outputs isn’t necessarily the one with the highest probability. For LLMs to be creative and come up with ideas never seen before, they sometimes sample a lower-probability token for a more creative output – this is a feature, not a bug. The setting that controls how adventurous this sampling gets is called “temperature,” and it needs to be tuned depending on whether we want primarily factual information (low temperature) or creative ideas (higher temperature). But even with a low temperature, you can’t fully rely on the LLM not to make stuff up.
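To make that concrete, here is a minimal sketch of temperature sampling over a toy next-token distribution. The vocabulary and the logit values are invented for illustration; a real model scores tens of thousands of tokens at every step.

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float) -> str:
    """Scale logits by 1/temperature, apply softmax, then sample one token."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    tokens, weights = zip(*((tok, e / total) for tok, e in exps.items()))
    return random.choices(tokens, weights=weights, k=1)[0]

# Toy next-token scores for the prompt "A chair is ..."
logits = {"furniture": 4.0, "wooden": 2.5, "comfortable": 2.0, "fruit": 0.5}

print(sample_next_token(logits, temperature=0.2))  # almost always "furniture"
print(sample_next_token(logits, temperature=1.5))  # occasionally picks "fruit"
```

Even at a low temperature the choice remains a random draw, which is why a small residual chance of an odd completion never fully disappears.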

New methods try to address these hallucination challenges

The AI community has explored numerous strategies to mitigate LLM hallucinations, with two approaches getting significant media attention and substantial venture capital investment:

  1. Retrieval-augmented generation (RAG): This method supplies external knowledge to the model at query time, via its context window.
  2. Large Reasoning Models (LRMs): These models rely on advanced prompt-engineering techniques, or reasoning-augmented prompting, to enhance their reasoning capabilities.

While these approaches attempt to address hallucinations and make AI more reliable, they ultimately fail, especially in critical or complex applications. Their fundamental limitation is their failure to address the core nature of LLMs, which are, as discussed before, inherently probabilistic systems.

RAG and LRMs are add-on solutions that don’t modify the underlying probabilistic architecture of LLMs. They are supplementary approaches layered on top of the existing model rather than fundamental structural changes.

Let’s take a closer look at both of these approaches and their shortcomings.

RAG is still more art than science

RAG allows external knowledge to be dynamically added to the model’s context window before it processes an input. The context window is everything the model works with at a given time; it’s the combined input (the user’s prompt) and output (what the model generates).

The primary objectives of RAG are to reduce model hallucinations and introduce domain-specific knowledge. It aims to mitigate hallucinations by grounding the model in more accurate, up-to-date information.
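In practice, the flow looks roughly like the sketch below: rank a small document store against the user’s question, fold the best matches into the prompt, and send that augmented prompt to the model. The word-overlap scorer and the sample documents are invented stand-ins – a production system would typically use vector search over embeddings – and the final model call is omitted.

```python
def score(question: str, doc: str) -> int:
    """Toy relevance score: the number of words the question and document share."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

def build_rag_prompt(question: str, documents: list[str], top_k: int = 2) -> str:
    """Retrieve the top_k most relevant documents and fold them into the prompt."""
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    context = "\n".join(f"- {doc}" for doc in ranked[:top_k])
    return (
        "Answer using only the context below. If the answer is not in the "
        "context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

docs = [
    "The 2024 employee handbook grants 25 vacation days per year.",
    "Expense reports must be filed within 30 days of travel.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
]

# The augmented prompt is what would be sent to the LLM.
print(build_rag_prompt("How many vacation days do employees get?", docs))
```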

However, it ultimately falls short of providing a sustainable solution to hallucinations for three reasons:

First, RAG can only be as good as the external data it can search and access. And herein lies a challenge: how to search and retrieve information? We’ve all struggled to find what we’re looking for when googling something – it’s not much different for RAG. Building a search algorithm that can crawl through tons of unstructured PDFs and other corporate data and pull out the most relevant data – and only the most relevant data – is more art than science at this point.

Second, even if we find the right data, we only add it to the context window. That merely “hints” to the LLM that it should use this data – we can’t guarantee that it actually will.

Finally, as noted above, LLMs are probabilistic by design; our output will always have some level of variation.

LRMs are impressive but not always efficient

Large Reasoning Models fall into the broader category of reasoning-augmented prompting, a sophisticated approach to prompt engineering that aims to achieve better outputs by modifying the input. Prompt engineering is the “art” of modifying your request to an LLM so that its output becomes more reliable on the one hand and, on the other, does exactly what you need it to do – making the output more predictable. For example, instead of walking into a coffee shop and asking for “a drink,” you would ask for “a cup of black coffee, two sugars” – this is a form of prompt engineering in real life.
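Translated into prompts, the coffee-shop example looks something like the sketch below. Both prompts are invented for illustration; the point is that the engineered version pins down the source material, the length, and the output format, which narrows the range of outputs the model can produce.

```python
# A vague request leaves the model free to guess what we actually want.
vague_prompt = "Tell me about our refund policy."

# The engineered version constrains the role, the source of truth, and the format.
engineered_prompt = (
    "You are a customer-support agent. Using only the policy text below, "
    "answer the customer's question in at most two sentences and cite the "
    "relevant section number.\n\n"
    "Policy:\n{policy_text}\n\n"
    "Question: {question}"
)
```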

This field gained significant attention with the release of OpenAI’s o1 model and now with DeepSeek R1, because they use a sophisticated form of prompt engineering to emulate human reasoning.

But, critically, these models don’t actually reason scientifically—instead, they’re cleverly manipulated and tricked into “reasoning.” This is done by embedding specific instructions and intermediate steps within the prompts to guide the model’s inference process (that’s the model’s “thinking” process).

Two primary approaches have emerged to enhance model reasoning capabilities: chain of thought (CoT) and its more recent variation, tree of thoughts (ToT). While research continuously produces new techniques, these two ideas have established themselves as the key concepts.

In the CoT approach—utilized by OpenAI’s o1 model and DeepSeek R1—the model is instructed to avoid immediate answers and instead dissect the input into subtasks and solve them individually. While this method can improve output accuracy, it has a notable drawback: it significantly increases inference time, which is the time it takes to get from an input to an output.
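As a simple sketch of what that instruction looks like in practice, here is the same question posed directly and in a chain-of-thought style. The wording is invented for illustration; the trade-off is visible immediately, since the step-by-step version forces the model to generate many more tokens before the answer, which is where the extra inference time comes from.

```python
question = (
    "A train leaves at 09:40 and the journey takes 2 hours and 35 minutes. "
    "When does it arrive?"
)

# Direct prompting: ask for the answer straight away.
direct_prompt = f"{question}\nReply with the arrival time only."

# Chain-of-thought prompting: forbid an immediate answer and spell out subtasks.
cot_prompt = (
    f"{question}\n"
    "Do not answer immediately. First break the problem into subtasks "
    "(add the hours, add the minutes, carry over if needed), solve each "
    "subtask in order, and only then state the final arrival time on the "
    "last line."
)

print(cot_prompt)
```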

ToT seeks to emulate human-like reasoning by instructing the model to explore multiple potential reasoning paths simultaneously. The model then systematically self-evaluates them, progressively prunes less-promising paths, and narrows down the exploration until only the most promising path remains.

ToT has demonstrated impressive performance in complex cognitive tasks like mathematical problem-solving. But the method is not without drawbacks. Its comprehensive exploration of multiple reasoning paths demands significant computational resources and can be inefficient for simpler tasks that don’t require elaborate reasoning strategies.
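The search loop described above can be sketched roughly as follows. The two callables are hypothetical placeholders for model calls – in a real system, both proposing thoughts and scoring partial paths would be separate prompts sent to the LLM, which is why the cost grows so quickly with depth and branching.

```python
from typing import Callable

def tree_of_thoughts(
    problem: str,
    propose_thoughts: Callable[[str, list[str]], list[str]],  # hypothetical model call
    score_path: Callable[[str, list[str]], float],            # hypothetical model call
    depth: int = 3,
    beam_width: int = 2,
) -> list[str]:
    """Expand, score, and prune partial reasoning paths, keeping the best few."""
    paths: list[list[str]] = [[]]
    for _ in range(depth):
        candidates: list[list[str]] = []
        for path in paths:
            # Each expansion and each evaluation below would be another LLM call.
            for thought in propose_thoughts(problem, path):
                candidates.append(path + [thought])
        candidates.sort(key=lambda p: score_path(problem, p), reverse=True)
        paths = candidates[:beam_width]  # prune all but the most promising paths
    return paths[0]  # the single most promising chain of thoughts
```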

To truly avoid hallucinations, a new architecture is required

While current methods of reducing hallucinations have promising implications for many enterprise use cases, they fail to create truly reliable AI for high-stakes scenarios demanding absolute precision. Critical domains such as medical diagnostics and legal document analysis, and safety-critical systems such as autonomous vehicles and emergency response systems, cannot tolerate even marginal error rates.

These kinds of applications require genuine logical inference—capabilities that current LLMs can’t provide. The good news is that promising solutions are emerging. One of them is neuro-symbolic AI.

This decades-old technology combines the generative power of neural networks (the “neuro”) with the rigorous logical capabilities of symbolic systems (the “symbolic”). Because it’s capable of both creative generation and precise, reliable inference, neuro-symbolic AI offers the potential to eliminate hallucinations in modern AI applications altogether.
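As a deliberately simplified sketch of that division of labor (the knowledge base, relation names, and claims below are all invented for illustration): a generative component proposes a structured claim, and a deterministic symbolic layer checks it against explicit rules before anything is presented as fact.

```python
# Invented toy knowledge base: (subject, relation) -> value.
KNOWN_FACTS = {
    ("chair", "is_a"): "furniture",
    ("apple", "is_a"): "fruit",
}

def symbolic_check(subject: str, relation: str, value: str) -> bool:
    """Deterministic rule: accept a claim only if the knowledge base confirms it."""
    return KNOWN_FACTS.get((subject, relation)) == value

def respond(claim: tuple[str, str, str]) -> str:
    """Only verified claims are stated as fact; everything else is refused."""
    subject, relation, value = claim
    if symbolic_check(subject, relation, value):
        return f"Verified: {subject} {relation} {value}"
    return f"Refusing to state '{subject} {relation} {value}' - it cannot be verified."

# A hallucinated proposal ("a chair is a fruit") is caught by the symbolic layer,
# while the correct claim passes through.
print(respond(("chair", "is_a", "fruit")))
print(respond(("chair", "is_a", "furniture")))
```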

Stay tuned for our upcoming articles, in which we will delve more deeply into this fascinating topic of neuro-symbolic AI.

